Compilation Stages in C

Compilation stages
What is a Preprocessor/Preprocessing?
  • The preprocessor takes C source code as input and produces C source code as output, with every preprocessor directive already resolved.
  • The preprocessor performs the following tasks.
    • Header inclusion: each #include line is replaced with the contents of the named header file.
    • Comment removal.
    • Macro expansion: each macro name is replaced with its value.
    • The output is "pure" C code because comments, preprocessor directives, and header file names are gone.
  • Let's understand this with the example below. The source file is named program.c.
#include<stdio.h>

#define KERNEL 2

//let's learn together
/*hello embeddedkernel*/

int main()
{
        printf("hello from embeddedkernel.com %d\n", KERNEL);
}
  • Run only the preprocessing stage on the above code with the command below in the Linux terminal.
    • gcc -E program.c -o program.i
  • The above command generates program.i, which is pure C code.
  • Below is the content of program.i
1) Header file #include<stdio.h> content.

# 1 "pre_processor.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 418 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "pre_processor.c" 2
# 1 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 1 3 4
# 64 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 3 4
# 1 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/_stdio.h" 1 3 4


__attribute__((__deprecated__("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of gets(3), it is highly recommended that you use fgets(3) instead.")))

char *gets(char *);

void perror(const char *) __attribute__((__cold__));
int printf(const char * restrict, ...) __attribute__((__format__ (__printf__, 1, 2))); //2)Printf declaration included from stdio.h
int putc(int, FILE *);
int putchar(int);
int puts(const char *);

extern int __vsprintf_chk (char * restrict, int, size_t,
      const char * restrict, va_list);



extern int __vsnprintf_chk (char * restrict, size_t, int, size_t,
       const char * restrict, va_list);
# 410 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 2 3 4
# 2 "pre_processor.c" 2


//3) main function
int main()
{
 printf("hello from embeddedkernel.com %d\n", 2); //4) KERNEL is replaced by it's value '2'
}
  • Four points are marked with numbered comments in the above output: 1) the header content, 2) the printf declaration, 3) the main function, and 4) the expanded macro. Search for these markers in the program.i output to locate them.
  • Only the relevant part of program.i is shown here.
  • 1) Header file #include<stdio.h> content.
    • The #include<stdio.h> line has been removed and replaced with the contents of the header. The header file contains declarations of library functions such as printf.
    • Everything before the main function (point 3) is content pulled in through stdio.h.
  • 2) printf declaration included from stdio.h
    • printf's declaration was added as part of the header file inclusion.
    • stdio.h has many declarations, so the full program.i is long.
  • 3) main function code.
  • 4) KERNEL is replaced by its value 2
    • Macro expansion happens in the preprocessing stage, so KERNEL has been replaced by its value 2.
  • Also observe that the comments from program.c are gone in program.i. (The numbered comments in the listing above are annotations added for this explanation.) It is pure C source code now.
What is a Translator or the Translating stage?
  • The translator is also called a compiler.
  • A translator performs the following tasks.
    • It checks the code for errors: a missing semicolon, comma, or bracket, a misused operator, an undeclared variable, a reserved keyword used as a variable name, an invalid variable name, or anything else that breaks the rules of the language.
    • It translates the C source code into assembly language.
  • Let's translate program.c or program.i with one of the following commands.
    • gcc -S program.c -o program.s (takes the .c file as input and translates it into assembly)
    • gcc -S program.i -o program.s (takes the .i file as input and translates it into assembly)
  • Either command generates the assembly file program.s shown below.
vim program.s

        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 14, 0     sdk_version 14, 4
        .globl  _main                           ; -- Begin function main
        .p2align        2
_main:                                  ; @main
        .cfi_startproc
; %bb.0:
        sub     sp, sp, #32
        .cfi_def_cfa_offset 32
        stp     x29, x30, [sp, #16]             ; 16-byte Folded Spill
        add     x29, sp, #16
        .cfi_def_cfa w29, 16
        .cfi_offset w30, -8
        .cfi_offset w29, -16
        mov     x9, sp
        mov     x8, #2
        str     x8, [x9]
        adrp    x0, l_.str@PAGE
        add     x0, x0, l_.str@PAGEOFF
        bl      _printf
        mov     w0, #0
        ldp     x29, x30, [sp, #16]             ; 16-byte Folded Reload
        add     sp, sp, #32
        ret
        .cfi_endproc
                                        ; -- End function
        .section        __TEXT,__cstring,cstring_literals
l_.str:                                 ; @.str
        .asciz  "hello from embeddedkernel.com %d\n"

.subsections_via_symbols
  • .section __TEXT,__text marks the code (instructions) section.
  • The code of the main function is under the label _main: ; @main
What is an Assembler or Assembling?
  • As the name suggests, the assembler converts assembly code into binary object code.
  • Each of the commands below produces the object file, starting from C source code, pure C code, or assembly code respectively.
    • gcc -c program.c -o program.o
    • gcc -c program.i -o program.o
    • gcc -c program.s -o program.o
  • The object file generated by the assembler contains machine code, but it is still not executable.
  • Below is the content of the object file.
vim program.o

Ïúíþ^L^@^@^A^@^@^@^@^A^@^@^@^D^@^@^@¸^A^@^@^@ ^@^@^@^@^@^@^Y^@^@^@8^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@x^@^@^@^@^@^@^@Ø^A^@^@^@^@^@^@x^@^@^@^@^@^@^@^G^@^@^@^G^@^@^@^C^@^@^@^@^@^@^@__text^@^@^@^@^@^@^@^@^@^@__TEXT^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^@Ø^A^@^@^B^@^@^@P^B^@^@^C^@^@^@^@^D^@<80>^@^@^@^@^@^@^@^@^@^@^@^@__cstring^@^@^@^@^@^@^@__TEXT^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^@"^@^@^@^@^@^@^@^L^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@__compact_unwind__LD^@^@^@^@^@^@^@^@^@^@^@^@X^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@0^B^@^@^C^@^@^@h^B^@^@^A^@^@^@^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@2^@^@^@^X^@^@^@^A^@^@^@^@^@^N^@^@^D^N^@^@^@^@^@^B^@^@^@^X^@^@^@p^B^@^@^F^@^@^@Ð^B^@^@(^@^@^@^K^@^@^@P^@^@^@^@^@^@^@^D^@^@^@^D^@^@^@^A^@^@^@^E^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ÿ<83>^@Ñý{^A©ýC^@<91>é^C^@<91>H^@<80>Ò(^A^@ù^@^@^@<90>^@^@^@<91>^@^@^@<94>^@^@<80>Rý{A©ÿ<83>^@<91>À^C_Öhello from embeddedkernel.com %d
^@^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^E^@^@-^\^@^@^@^A^@^@L^X^@^@^@^A^@^@=^@^@^@^@^A^@^@^F"^@^@^@^N^A^@^@^@^@^@^@^@^@^@^@^A^@^@^@^N^B^@^@4^@^@^@^@^@^@^@^\^@^@^@^N^B^@^@4^@^@^@^@^@^@^@^V^@^@^@^N^C^@^@X^@^@^@^@^@^@^@^H^@^@^@^O^A^@^@^@^@^@^@^@^@^@^@^N^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@l_.str^@_main^@_printf^@ltmp2^@ltmp1^@ltmp0^@
  • Machines can understand the above code, but humans cannot.
  • A few fragments are still readable, such as the string "hello from embeddedkernel.com" and the symbol names _main and _printf for the main and printf functions.
What is a Linker or Linking?
  • The linker connects each calling function to the function it calls.
  • It adds the startup code that contains the _start function, and _start leads to the call of main.
  • Each of the commands below produces the executable binary, starting from C source code, pure C code, assembly code, or object code respectively.
    • gcc program.c -o program.exe
    • gcc program.i -o program.exe
    • gcc program.s -o program.exe
    • gcc program.o -o program.exe
  • program.exe is an executable. Like an object file, it is a binary file containing machine code.
  • However, the object file cannot run on the machine because the calls between functions are not linked yet.
  • The executable has the function calls linked and a _start function where execution begins.

NOTE: Give intermediate files the extensions .i, .s, and .o. GCC decides how to treat each input file from its extension, so a file with the wrong extension is not handled as the intended kind of input.

What is a Disassembler?
  • Humans cannot read machine code.
  • To debug an object file or an executable file, we need to convert it back into assembly to understand it.
  • A disassembler is a tool that converts binary code into assembly code.
  • Linux has tools such as objdump that convert binary files into assembly listings.
What is the task of the linker? What is the difference between an object file and an executable? Why can an executable run on the machine while an object file cannot? Explain with an example.
  • Let's run objdump on the object file program.o.
ek@ek:~/ek/$ objdump -D program.o


program.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <main>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   be 02 00 00 00          mov    $0x2,%esi
   d:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 14 <main+0x14>
  14:   48 89 c7                mov    %rax,%rdi
  17:   b8 00 00 00 00          mov    $0x0,%eax
  1c:   e8 00 00 00 00          call   21 <main+0x21>
  21:   b8 00 00 00 00          mov    $0x0,%eax
  26:   5d                      pop    %rbp
  27:   c3                      ret
  • As shown above, the object file has been converted to assembly. Only the relevant portion of the output is pasted here.
  • The main function is present, but nothing calls it yet, and the call at offset 1c has an all-zero operand (e8 00 00 00 00): the target of main's call to printf has not been filled in yet.
  • Now, let's run objdump on the executable file program.exe.
program.exe:     file format elf64-x86-64

Disassembly of section .text:

0000000000001060 <_start>:
    1060:       f3 0f 1e fa             endbr64
    1064:       31 ed                   xor    %ebp,%ebp
    1066:       49 89 d1                mov    %rdx,%r9
    1069:       5e                      pop    %rsi
    106a:       48 89 e2                mov    %rsp,%rdx
    106d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
    1071:       50                      push   %rax
    1072:       54                      push   %rsp
    1073:       45 31 c0                xor    %r8d,%r8d
    1076:       31 c9                   xor    %ecx,%ecx
    1078:       48 8d 3d ca 00 00 00    lea    0xca(%rip),%rdi        # 1149 <main>
    107f:       ff 15 53 2f 00 00       call   *0x2f53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1085:       f4                      hlt
    1086:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    108d:       00 00 00



0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       be 02 00 00 00          mov    $0x2,%esi
    1156:       48 8d 05 ab 0e 00 00    lea    0xeab(%rip),%rax        # 2008 <_IO_stdin_used+0x8>
    115d:       48 89 c7                mov    %rax,%rdi
    1160:       b8 00 00 00 00          mov    $0x0,%eax
    1165:       e8 e6 fe ff ff          call   1050 <printf@plt>
    116a:       b8 00 00 00 00          mov    $0x0,%eax
    116f:       5d                      pop    %rbp
    1170:       c3                      ret

Disassembly of section .plt.sec:

0000000000001050 <printf@plt>:
    1050:       f3 0f 1e fa             endbr64
    1054:       f2 ff 25 75 2f 00 00    bnd jmp *0x2f75(%rip)        # 3fd0 <printf@GLIBC_2.2.5>
    105b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  • As shown in the objdump output of the executable, _start at 0000000000001060 loads the address of main at address 1078 and passes it to __libc_start_main, which in turn calls main.
  • main calls printf at address 1165.
  • So, the calls between functions are mapped in the program.exe file.
  • Conclusion: The object file is a binary file that the machine can understand, but the mappings between functions are not done and the entry point _start is missing, so the object file cannot run. The executable has all the mappings between functions and an entry point set to _start, which leads to main. So, the executable can run on the machine.


What is _start in C? What is the entry point of the executable?
  • As shown in the above example, the linked executable contains a _start function.
  • Usually _start is the entry point of a C program. _start leads to main, and from there execution follows the program flow as written.

NOTE: No matter in what order the functions appear in the program or in the executable, the startup code behind _start calls main first. When main() returns a value, the startup code collects it and then carries out a proper exit procedure. _start is the entry point, so when the loader loads the application into RAM, execution starts from the _start function.

Can we write a program without a main? Will it compile?
  • Yes, we can write a C program without main.
  • A normal build of such a program fails at the link step.
  • We need to pass the -nostartfiles flag in the compilation command.
//vim program.c

void fun2();
void fun1()
{
        fun2();
}
void fun2()
{

}

gcc program.c -o program.exe

/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
collect2: error: ld returned 1 exit status
  • A normal build gives an error because _start refers to main, and no main is defined.
  • Let's compile it using the -nostartfiles flag.
//vim program.c

void fun2();
void fun1()
{
        fun2();
}
void fun2()
{

}

 gcc -nostartfiles program.c -o program.exe 
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
  • The build succeeds here.
  • The linker warns that the _start symbol was not found and that it is defaulting to address 0000000000001000 as the starting point.
  • Let's disassemble the above program.exe.
Disassembly of section .text:

0000000000001000 <fun1>:
    1000:       f3 0f 1e fa             endbr64
    1004:       55                      push   %rbp
    1005:       48 89 e5                mov    %rsp,%rbp
    1008:       b8 00 00 00 00          mov    $0x0,%eax
    100d:       e8 03 00 00 00          call   1015 <fun2>
    1012:       90                      nop
    1013:       5d                      pop    %rbp
    1014:       c3                      ret
0000000000001015 <fun2>:
    1045:       f3 0f 1e fa             endbr64
    1049:       55                      push   %rbp
    104a:       48 89 e5                mov    %rsp,%rbp
    104d:       48 8d 05 ac 0f 00 00    lea    0xfac(%rip),%rax        # 2000 <f2+0xfbb>
    1054:       48 89 c7                mov    %rax,%rdi
    1057:       e8 c4 ff ff ff          call   1020 <puts@plt>
    105c:       90                      nop
    105d:       5d                      pop    %rbp
    105e:       c3                      ret
  • Observe that the default starting address chosen by the linker and the address of fun1 are the same.
  • fun1 then calls fun2, as shown in the above code.
  • So, when there is no main function and the program is compiled with the -nostartfiles option, execution begins at the first function in the executable. In our program, that is fun1.
  • Let's run the code.
ankit@antikIN:~/Antik/ek/code/compiler$ ./program.exe 
Segmentation fault (core dumped)
  • As mentioned earlier, _start is the starting point and it also provides a proper exit procedure.
  • Our executable has no _start function and follows no exit procedure, so it crashed.
  • We need to call some exit function in fun1 so that the program exits normally.
  • So, the conclusion is that writing a program without main is possible, but we must handle the program sequence ourselves: what the entry point is and how the program exits. If main is written, the starting flow is fixed and there is no need to provide an exit procedure.

NOTE: Writing a main() function is not mandatory; as we have just seen, a program can be written and built without it. But if main() is not present, execution starts from the first function available in the executable, so the programmer has to keep every function in the right order, which becomes tiresome in a long program. With main() present, the startup code behind _start jumps to main no matter where main sits in the program, so we do not have to worry about the order of the functions.
