Compiling "C" And "C++" Programs On Unix Systems - gcc/g++
Table Of Contents:
- Preface
- Compiling A Single-Source "C" Program
- Running The Resulting Program
- Creating Debug-Ready Code
- Creating Optimized Code
- Getting Extra Compiler Warnings
- Compiling A Single-Source "C++" Program
- Compiling A Multi-Source "C" Program
- Getting a Deeper Understanding - Compilation Steps
Preface - How To Read This Document
This document tries
to give the reader basic knowledge in compiling C and C++
programs on a Unix system. If you don't yet know how to
compile C programs under Unix (for instance, because you
have only done so on other operating systems), you'd
better read this tutorial first, and then write a few
programs before you try to get to gdb, makefiles or C
libraries.
If you're already familiar with that, it's recommended to
learn about makefiles, and then go and learn other C
programming topics and practice the usage of makefiles,
before going on to read about C libraries. This last issue
is only relevant to larger projects, while makefiles make
sense even for a small program composed of but a few
source files.
As a policy, we'll
stick with the basic features of programming tools
mentioned here, so that the information will apply to more
than a single tool version. This way, you might find the
information here useful, even if the system you're using
does not have the GNU tools installed.
In this lovely
tutorial, we'll deal with compilation of a C program,
using the compiler directly from the command line. It
might be that you'll eventually use a more sophisticated
interface (an IDE - Integrated Development Environment) of
some sort, but the common denominator you'll always find
is the plain command line interface. Furthermore, even if
you use an IDE, it could help you understand how things
work "behind the scenes". We'll see how to
compile a program, how to combine several source files
into a single program, how to add debug information and
how to optimize code.
Compiling A Single-Source "C" Program
The easiest case of
compilation is when you have all your source code set in a
single file. This removes any unnecessary steps of
synchronizing several files or thinking too much. Let's
assume there is a file named 'single_main.c'
that we want to compile. We will do so using a command
line similar to this:
cc single_main.c
Note that we assume the compiler is called "cc".
If you're using a GNU compiler, you'll write 'gcc'
instead. If you're using a Solaris system, you might use
'acc', and so on. Every compiler might show its messages
(errors, warnings, etc.) differently, but in all cases,
you'll get a file 'a.out' as a result, if the compilation
completed successfully. Note that some older systems (e.g.
SunOS) come with a C compiler that does not understand
ANSI-C, but rather the older 'K&R' C style. In such a
case, you'll need to use gcc (hopefully it is installed),
or learn the differences between ANSI-C and K&R C (not
recommended if you don't really have to), or move
to a different system.
You might complain
that 'a.out' is too generic a name (where does it come
from anyway? - well, that's a historical name, due to the
usage of something called "a.out format" for
programs compiled on older Unix systems). Suppose that you
want the resulting program to be called
"single_main". In that case, you could use the
following line to compile it:
cc single_main.c -o single_main
Every compiler I've met so far (including the glorious gcc)
recognized the '-o' flag as "name the resulting
executable file 'single_main'".
Running The Resulting Program
Once we created the
program, we wish to run it. This is usually done by simply
typing its name, as in:
single_main
However, this requires that the current directory be in
our PATH (which is a variable telling our Unix shell where
to look for programs we're trying to run). In many cases,
this directory is not placed in our PATH. Aha! - we say.
Then let's show this computer who is smarter, and thus we
try:
./single_main
This time we explicitly told our Unix shell that we want
to run the program from the current directory. If we're
lucky enough, this will suffice. However, yet one more
obstacle could block our path - file permission flags.
When a file is
created in the system, it is immediately given some access
permission flags. These flags tell the system who should
be given access to the file, and what kind of access will
be given to them. Traditional Unix systems use 3 kinds of
entities to which they grant (or deny) access: the user
who owns the file, the group that owns the file, and
everybody else. Each of these entities may be given access
to read the file ('r'), write to the file ('w') and
execute the file ('x').
Now, when the
compiler created the program file for us, we became owners
of the file. Normally, the compiler would make sure that
we get all permissions to the file - read, write and
execute. It might be, though, that something went wrong,
and the permissions are set differently. In that case, we
can set the permissions of the file properly (the owner of
a file can normally change the permission flags of the
file), using a command like this:
chmod u+rwx single_main
This means "the user ('u') should be given ('+')
permissions read ('r'), write ('w') and execute ('x') to
the file 'single_main'". Now we'll surely be able to run
our program. Again, normally you'll have no problem
running the file, but if you copy it to a different
directory, or transfer it to a different computer over the
network, it might lose its original permissions, and thus
you'll need to set them properly, as shown above. Note too
that you cannot just move the file to a different computer
and expect it to run - it has to be a computer with a
matching operating system (to understand the executable
file format), and matching CPU architecture (to understand
the machine-language code that the executable file
contains).
Finally, the
run-time environment has to match. For example, if we
compiled the program on an operating system with one
version of the standard C library, and we try to run it on
a system with an incompatible standard C library, the
program might crash, or complain that it cannot find the
relevant C library. This is especially true for systems
that evolve quickly (e.g. Linux with libc5 vs. Linux with
libc6), so beware.
Creating Debug-Ready Code
Normally, when we
write a program, we want to be able to debug it - that is,
test it using a debugger that allows running it step by
step, setting a break point before a given command is
executed, looking at contents of variables during program
execution, and so on. In order for the debugger to be able
to relate between the executable program and the original
source code, we need to tell the compiler to insert
information to the resulting executable program that'll
help the debugger. This information is called "debug
information". In order to add that to our program,
let's compile it differently:
cc -g single_main.c -o single_main
The '-g' flag tells the compiler to include debug info, and is
recognized by almost any compiler out there. You will note
that the resulting file is much larger than that created
without usage of the '-g' flag. The difference in size is
due to the debug information. We may still remove this
debug information using the strip command,
like this:
strip single_main
You'll note that the size of the file now is even smaller
than if we didn't use the '-g' flag in the first place.
This is because even a program compiled without the '-g'
flag contains some symbol information (function names, for
instance), that the strip command removes.
You may want to read strip's manual page (man
strip) to understand more about what this command does.
Creating Optimized Code
After we created a
program and debugged it properly, we normally want it to
compile into efficient code, and the resulting file to
be as small as possible. The compiler can help us by
optimizing the code, either for speed (to run faster), or
for space (to occupy a smaller space), or some combination
of the two. The basic way to create an optimized program
would be like this:
cc -O single_main.c -o single_main
The '-O' flag tells the compiler to optimize the code.
This also means the compilation will take longer, as the
compiler tries to apply various optimization algorithms to
the code. This optimization is supposed to be
conservative, in that it ensures us the code will still
perform the same functionality as it did when compiled
without optimization (well, unless there are bugs in our
compiler). Usually, we can define an optimization level by
adding a number to the '-O' flag. The higher the number -
the better optimized the resulting program will be, and
the slower the compiler will complete the compilation. One
should note that because optimization alters the code in
various ways, as we increase the optimization level of the
code, the chances are higher that an improper optimization
will actually change what our code does, as some of them
tend to be non-conservative, or are simply rather complex,
and contain bugs. For example, for a long time it was known
that using an optimization level higher than 2 (or was it
higher than 3?) with gcc results in bugs in the executable
program. After being warned, if we still want to use a
different optimization level (let's say 4), we can do it
this way:
cc -O4 single_main.c -o single_main
And we're done with it. If you'll read your compiler's
manual page, you'll soon notice that it supports an almost
infinite number of command line options dealing with
optimization. Using them properly requires thorough
understanding of compilation theory and source code
optimization theory, or you might damage your resulting
code. A good compilation theory course (preferably based
on "the Dragon Book" by Aho, Sethi and Ullman)
could do you good.
Getting Extra Compiler Warnings
Normally the
compiler only generates error messages about erroneous
code that does not comply with the C standard, and
warnings about things that usually tend to cause errors
during runtime. However, we can usually instruct the
compiler to give us even more warnings, which is useful to
improve the quality of our source code, and to expose bugs
that will really bug us later. With gcc, this is done
using the '-W' flag. For example, to get the compiler to
use all types of warnings it is familiar with, we'll use a
command line like this:
cc -Wall single_source.c -o single_source
At first this will annoy us - we'll get all sorts of warnings
that might seem irrelevant. However, it is better to
eliminate the warnings than to eliminate the usage of this
flag. Usually, this option will save us more time than it
will cause us to waste, and if used consistently, we will
get used to coding proper code without thinking too much
about it. One should also note that some code that works
on some architecture with one compiler, might break if we
use a different compiler, or a different system, to
compile the code on. When developing on the first system,
we'll never see these bugs, but when moving the code to a
different platform, the bug will suddenly appear. Also, in
many cases we eventually will want to move the code to a
new system, even if we had no such intentions initially.
Note that sometimes
'-Wall' will give you too many warnings, and then you could
try to use some less verbose warning level. Read the
compiler's manual to learn about the various '-W' options,
and use those that would give you the greatest benefit.
Initially they might sound too strange to make any sense,
but if you are (or when you will become) a more
experienced programmer, you will learn which could be of
good use to you.
Compiling A Single-Source "C++" Program
Now that we saw how
to compile C programs, the transition to C++ programs is
rather simple. All we need to do is use a C++ compiler, in
place of the C compiler we used so far. So, if our program
source is in a file named 'single_main.cc'
('cc' to denote C++ code. Some programmers prefer a suffix
of 'C' for C++ code), we will use a command such as the
following:
g++ single_main.cc -o single_main
Or on some systems you'll use "CC" instead of
"g++" (for example, with Sun's compiler for
Solaris), or "aCC" (HP's compiler), and so on.
You will note that with C++ compilers there is less
uniformity regarding command line options, partially
because until recently the language was evolving and had
no agreed standard. But still, at least with g++, you will
use "-g" for debug information in the code, and
"-O" for optimization.
Compiling A Multi-Source "C" Program
So you learned how
to compile a single-source program properly (hopefully by
now you played a little with the compiler and tried out a
few examples of your own). Yet, sooner or later you'll see
that having all the source in a single file is rather
limiting, for several reasons:
- As the file
grows, compilation time tends to grow, and for each
little change, the whole program has to be
re-compiled.
- It is very hard,
if not impossible, for several people to work on
the same project together in this manner.
- Managing your
code becomes harder. Backing out erroneous changes
becomes nearly impossible.
The solution to this
would be to split the source code into multiple files,
each containing a set of closely-related functions (or, in
C++, all the source code for a single class).
There are two
possible ways to compile a multi-source C program. The
first is to use a single command line to compile all the
files. Suppose that we have a program whose source is
found in files "main.c", "a.c" and "b.c"
(found in directory "multi-source"
of this tutorial). We could compile it this way:
cc main.c a.c b.c -o hello_world
This will cause the compiler to compile each of the given
files separately, and then link them all together to one
executable file named "hello_world". Two
comments about this program:
- If we define a
function (or a variable) in one file, and try to
access it from a second file, we need to declare
it as an external symbol in that second file. This is
done using the C "extern" keyword.
- The order of
presenting the source files on the command line may be
altered. The compiler (actually, the linker) will know
how to take the relevant code from each file into the
final program, even if the first source file tries to
use a function defined in the second or third source
file.
The problem with compiling
all the files in a single command is that even if we only
make a change in one of the source files, all of them will
be re-compiled when we run the compiler again.
In order to overcome
this limitation, we could divide the compilation process
into two phases - compiling, and linking. Let's first see
how this is done, and then explain:
cc -c main.c
cc -c a.c
cc -c b.c
cc main.o a.o b.o -o hello_world
The first 3 commands have each taken one source file, and
compiled it into something called an "object file",
with the same name, but with a ".o" suffix. It
is the "-c" flag that tells the compiler only to
create an object file, and not to generate a final
executable file just yet. The object file contains the
code for the source file in machine language, but with
some unresolved symbols. For example, the "main.o"
file refers to a symbol named "func_a", which is
a function defined in file "a.c". Surely we
cannot run the code like that. Thus, after creating the 3
object files, we use the 4th command to link the 3 object
files into one program. The linker (which is invoked by
the compiler now) takes all the symbols from the 3 object
files, and links them together - it makes sure that when
"func_a" is invoked from the code in object file
"main.o", the function code in object file
"a.o" gets executed. Furthermore, the linker
also links the standard C library into the program, in
this case, to resolve the "printf" symbol
properly.
To see why this
complexity actually helps us, we should note that normally
the link phase is much faster than the compilation phase.
This is especially true when doing optimizations, since
that step is done before linking. Now, let's assume we
change the source file "a.c", and we want to
re-compile the program. We'll now need only two commands:
cc -c a.c
cc main.o a.o b.o -o hello_world
In our small example, it's hard to notice the speed-up,
but in a case of having a few tens of files each containing
a few hundred lines of source-code, the time saving is
significant; not to mention even larger projects.
Getting a Deeper Understanding - Compilation Steps
Now that we've
learned that compilation is not just a simple process,
let's look at the complete list of steps taken
by the compiler in order to compile a C program.
- Driver
- what we invoked as "cc". This is actually
the "engine", that drives the whole set of
tools the compiler is made of. We invoke it, and it
begins to invoke the other tools one by one, passing
the output of each tool as an input to the next tool.
- C Pre-Processor - normally called "cpp".
It takes a C source file, and handles all the
pre-processor definitions (#include files, #define
macros, conditional source code inclusion with #ifdef,
etc.) You can invoke it separately on your program,
usually with a command like:
cc -E single_source.c
Try this and see what the resulting code looks like.
- The C Compiler - normally called
"cc1". This is the actual compiler, that
translates the input file into assembly language. As
you saw, we used the "-c" flag to invoke it,
along with the C Pre-Processor, (and possibly the
optimizer too, read on), and the assembler.
- Optimizer
- sometimes comes as a separate module and sometimes
as part of the compiler module. This one
handles the optimization on a representation of the
code that is language-neutral. This way, you can use
the same optimizer for compilers of different
programming languages.
- Assembler
- sometimes called "as". This takes the
assembly code generated by the compiler, and
translates it into machine language code kept in
object files. With gcc, you could tell the driver to
generate only the assembly code, by a command like:
cc -S single_source.c
- Linker-Loader
- This is the tool that takes all the object files
(and C libraries), and links them together, to form
one executable file, in a format the operating system
supports. A common format these days is known as
"ELF". On SunOS systems, and other older
systems, a format named "a.out" was used.
This format defines the internal structure of the
executable file - location of data segment, location
of code segment, location of debug information
and so on.
As you see, the
compilation is split into many different phases. Not all
compilers employ exactly the same phases, and sometimes
(e.g. for C++ compilers) the situation is even more
complex. But the basic idea is quite similar - split the
compiler into many different parts, to give the programmer
more flexibility, and to allow the compiler developers to
re-use as many modules as possible in different compilers
for different languages (by replacing the preprocessor and
compiler modules), or for different architectures (by
replacing the assembler and linker-loader parts).