
How is a C/C++ program built?

When talking about programs, programmers often say that "the program compiled successfully" (or not). But compiling isn't quite the same as creating an executable file for the program to run. In fact, creating an executable is a multistage process with three steps: preprocessing, compilation and linking. Even if every source file compiles fine, the program might still fail to build because of linking errors. This is why, although the whole process of turning C/C++ source code into an executable is usually driven by a single compiler command, I think it is better referred to as the build process.


So, how are these three building stages handled?

Preprocessing

  • During the first part of translation, the compiler invokes a program called the preprocessor. The preprocessor handles preprocessor directives such as #include and #define. It performs textual substitution and knows almost nothing about C/C++ syntax, which is why it must be used with care.

  • It works on one C/C++ source file at a time: it replaces #include directives with the content of the named files (usually just declarations), expands macros (#define), and selects different portions of text depending on #if, #ifdef and #ifndef directives. The result is a single stream of source text, called a translation unit, that is handed to the compiler proper.

  • This is also where #pragma directives are recognized. They provide additional, implementation-specific information to the compiler. A few, such as #pragma once, are handled by the preprocessor itself, while most are passed through to the compiler.


Compilation

This is where the money is at.

  • First, the compiler parses the preprocessed C/C++ source code (now free of preprocessor directives). During this stage it runs several analyses: lexical (tokenization), syntactic (parsing), and semantic (type checking, declaration before use, etc.). These are the source of what we know as compiler errors, such as syntax errors.

  • Next, the compiler runs optimization passes on the program logic, usually when asked to by flags such as -O2. It tries to reduce the program's execution time and memory footprint (and often its power consumption as well). This includes function inlining, hoisting invariants out of loops, etc. Today, a highly optimizing compiler lets developers write readable and maintainable source code with reasonable confidence that the compiler will generate an efficient binary, though not necessarily an optimal one.

  • The compiler then translates the code into assembly for the target machine and assembles it into machine code, either by invoking a separate assembler or with a built-in one, producing an actual binary file in some format.

  • The output is an object file with an .o or .obj extension, containing the compiled code (in binary form). The object file also includes a data structure called a symbol table, which maps the names of the symbols (variables, functions) in the object file to their locations (this will be useful to the linker).

  • Object files can refer to symbols that are not defined. This is the case when one uses a declaration without providing a definition for it in the same file. The compiler doesn't mind this, and will happily produce the object file as long as the source code is well-formed. In such a case, the compiler records the symbol as undefined (an "unresolved external symbol"), telling the linker to look it up in the symbol tables of other object files.

  • Compilers usually allow you to stop at this point and compile each source code file separately. This is useful because if you change a single file, you will not have to recompile all the files of your program. You may also choose to compile separate object files so you can put them in archives, called static libraries, for easier reuse later on.


Linking

  • The linker is what produces the final build output from the object files the compiler produced.

  • It links all the object files by replacing the references to undefined symbols with the correct addresses. Each of these symbols can be defined in other object files or in libraries. If they are defined in libraries other than the standard library, we need to tell the linker about them using linker flags in the build command (for example, -l and -L with GCC and Clang).

  • At this stage the most common errors are missing definitions or multiple definitions. The former means that a symbol's definition doesn't exist in any file or library, while the latter means that the same symbol was defined in two different object files or libraries.


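The output of the build process can be either an executable or a shared library (also called a shared object or dynamic library). On Linux and most other Unix-like systems, shared libraries have an .so extension and both kinds of output are in ELF (Executable and Linkable Format). On Windows, shared libraries have a .dll extension and use the PE (Portable Executable) format, while macOS uses Mach-O with a .dylib extension.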


© Copyright 2020, Roy Mattar.
