Compilation stages
What is a Preprocessor/Preprocessing?
- The preprocessor takes C source code as input and produces pure C source code as output.
- The preprocessor performs the following tasks.
- Replaces each #include directive with the content of the named header file.
- Removes comments.
- Expands macros, replacing each macro name with its value.
- The output is called pure C code because comments, preprocessor directives, and header file names have been removed.
- Let’s understand this with the example below. The source file is named program.c.
#include<stdio.h>
#define KERNEL 2
//let's learn together
/*hello embeddedkernel*/
int main()
{
printf("hello from embeddedkernel.com %d\n", KERNEL);
}
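- Preprocess the above code with the following command in a Linux terminal.
- gcc -E program.c -o program.i
- The above command generates program.i, which is pure C code.
- Below is the content of program.i. (The listing was captured on macOS from a file named pre_processor.c, so the paths in it differ from what a Linux run prints.)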
1) Header file #include<stdio.h> content.
# 1 "pre_processor.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 418 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "pre_processor.c" 2
# 1 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 1 3 4
# 64 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 3 4
# 1 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/_stdio.h" 1 3 4
__attribute__((__deprecated__("This function is provided for compatibility reasons only. Due to security concerns inherent in the design of gets(3), it is highly recommended that you use fgets(3) instead.")))
char *gets(char *);
void perror(const char *) __attribute__((__cold__));
int printf(const char * restrict, ...) __attribute__((__format__ (__printf__, 1, 2))); //2)Printf declaration included from stdio.h
int putc(int, FILE *);
int putchar(int);
int puts(const char *);
extern int __vsprintf_chk (char * restrict, int, size_t,
const char * restrict, va_list);
extern int __vsnprintf_chk (char * restrict, size_t, int, size_t,
const char * restrict, va_list);
# 410 "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h" 2 3 4
# 2 "pre_processor.c" 2
//3) main function
int main()
{
printf("hello from embeddedkernel.com %d\n", 2); //4) KERNEL is replaced by it's value '2'
}
- Four points are marked with the numbers 1) to 4) in the above output. Search for them in the program.i listing to follow the explanation below.
- Only the relevant part of program.i is shown. The numbered // notes in the listing are annotations added for this explanation, not preprocessor output.
- 1) Header file #include<stdio.h> content.
- The #include<stdio.h> line has been removed and replaced with the content of the header. The header contains declarations of library functions such as printf.
- Everything before the main function (point 3) is content of the header file stdio.h.
- 2) printf declaration included from stdio.h
- The declaration of printf was added as part of the header file inclusion.
- stdio.h has many declarations, so the full program.i is long.
- 3) main function code.
- 4) KERNEL is replaced by its value ‘2’
- Macro expansion happens in the preprocessing stage, where KERNEL is replaced by its value 2.
- Also, observe that the comments from program.c are removed in program.i. It is pure C source code now.
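- To make the macro expansion step more concrete, here is a minimal sketch. KERNEL is the macro from program.c; DOUBLE_OF, DOUBLE_OF_SAFE, and the two functions are hypothetical names introduced only for this illustration. It suggests that the preprocessor substitutes text and does not evaluate anything.

```c
#define KERNEL 2                          /* same macro as in program.c */
#define DOUBLE_OF(x) x + x                /* hypothetical, deliberately unparenthesized */
#define DOUBLE_OF_SAFE(x) ((x) + (x))     /* hypothetical, parenthesized variant */

/* After preprocessing, the return statement reads: return 2 + 2 * 10; */
int expanded_value(void)
{
    return DOUBLE_OF(KERNEL) * 10;
}

/* After preprocessing, the return statement reads: return ((2) + (2)) * 10; */
int expanded_value_safe(void)
{
    return DOUBLE_OF_SAFE(KERNEL) * 10;
}
```

- Running gcc -E on a file with these lines should show the substituted text in place of the macro names, which is why the first function returns 22 and the second returns 40.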
What is a Translator or the Translating stage?
- The translator is also called a compiler.
- A translator performs the following tasks.
- It checks the code for syntax errors, such as a missing semicolon, bracket, or comma, an improperly used operator, an undeclared variable, a reserved keyword used as a variable name, an invalid variable name, or anything else that violates the rules of the programming language.
- It translates the C source code to assembly language.
- Let’s translate program.c or program.i with “gcc -S program.c -o program.s” (takes the .c file as input) or “gcc -S program.i -o program.s” (takes the .i file as input). Both commands translate the input into assembly.
- The command generates the assembly file program.s shown below.
vim program.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0 sdk_version 14, 4
.globl _main ; -- Begin function main
.p2align 2
_main: ; @main
.cfi_startproc
; %bb.0:
sub sp, sp, #32
.cfi_def_cfa_offset 32
stp x29, x30, [sp, #16] ; 16-byte Folded Spill
add x29, sp, #16
.cfi_def_cfa w29, 16
.cfi_offset w30, -8
.cfi_offset w29, -16
mov x9, sp
mov x8, #2
str x8, [x9]
adrp x0, l_.str@PAGE
add x0, x0, l_.str@PAGEOFF
bl _printf
mov w0, #0
ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
add sp, sp, #32
ret
.cfi_endproc
; -- End function
.section __TEXT,__cstring,cstring_literals
l_.str: ; @.str
.asciz "hello from embeddedkernel.com %d\n"
.subsections_via_symbols
- .section __TEXT,__text means the code (instructions) section.
- The code of the main function is under the label _main: ; @main
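- The checks listed for the translator can be sketched with a small file. The function add and the commented-out lines are hypothetical examples, not part of program.c. The function should translate cleanly, while each commented-out line is the kind of mistake the translator is expected to reject.

```c
/* Translates cleanly: every name is declared and every statement is terminated. */
int add(int a, int b)
{
    int sum = a + b;
    return sum;
}

/*
 * Each definition below, if uncommented, should be rejected at the translating stage:
 *
 *   int missing = 1                              missing semicolon
 *   int while = 3;                               reserved keyword used as a variable name
 *   int 2fast = 4;                               invalid variable name (starts with a digit)
 *   int broken(void) { return undeclared; }      undeclared variable
 */
```

- Trying gcc -S on a copy with one of those lines uncommented should stop with an error instead of producing a .s file.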
What is an Assembler or Assembling?
- As the name suggests, the assembler converts assembly code into binary object code.
- The commands below produce an object file from C source code, pure C code, or assembly code.
- gcc -c program.c -o program.o
- gcc -c program.i -o program.o
- gcc -c program.s -o program.o
- The object file generated by the assembler contains machine code, but it is still not executable.
- Below is the content of the object file.
vim program.o
Ïúíþ^L^@^@^A^@^@^@^@^A^@^@^@^D^@^@^@¸^A^@^@^@ ^@^@^@^@^@^@^Y^@^@^@8^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@x^@^@^@^@^@^@^@Ø^A^@^@^@^@^@^@x^@^@^@^@^@^@^@^G^@^@^@^G^@^@^@^C^@^@^@^@^@^@^@__text^@^@^@^@^@^@^@^@^@^@__TEXT^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^@Ø^A^@^@^B^@^@^@P^B^@^@^C^@^@^@^@^D^@<80>^@^@^@^@^@^@^@^@^@^@^@^@__cstring^@^@^@^@^@^@^@__TEXT^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^@"^@^@^@^@^@^@^@^L^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@__compact_unwind__LD^@^@^@^@^@^@^@^@^@^@^@^@X^@^@^@^@^@^@^@ ^@^@^@^@^@^@^@0^B^@^@^C^@^@^@h^B^@^@^A^@^@^@^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@2^@^@^@^X^@^@^@^A^@^@^@^@^@^N^@^@^D^N^@^@^@^@^@^B^@^@^@^X^@^@^@p^B^@^@^F^@^@^@Ð^B^@^@(^@^@^@^K^@^@^@P^@^@^@^@^@^@^@^D^@^@^@^D^@^@^@^A^@^@^@^E^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ÿ<83>^@Ñý{^A©ýC^@<91>é^C^@<91>H^@<80>Ò(^A^@ù^@^@^@<90>^@^@^@<91>^@^@^@<94>^@^@<80>Rý{A©ÿ<83>^@<91>À^C_Öhello from embeddedkernel.com %d
^@^@^@^@^@^@^@^@^@^@^@4^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^E^@^@-^\^@^@^@^A^@^@L^X^@^@^@^A^@^@=^@^@^@^@^A^@^@^F"^@^@^@^N^A^@^@^@^@^@^@^@^@^@^@^A^@^@^@^N^B^@^@4^@^@^@^@^@^@^@^\^@^@^@^N^B^@^@4^@^@^@^@^@^@^@^V^@^@^@^N^C^@^@X^@^@^@^@^@^@^@^H^@^@^@^O^A^@^@^@^@^@^@^@^@^@^@^N^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@l_.str^@_main^@_printf^@ltmp2^@ltmp1^@ltmp0^@
- Machines can understand the above content, but humans cannot.
- A few parts are still readable, such as the string “hello from embeddedkernel.com” and the symbol names _main and _printf for the main and printf functions.
What is a Linker or Linking?
- It links each function call to the called function.
- It links in the startup code that provides the _start function, and _start calls the main function.
- The commands below build the executable binary from C source code, pure C code, assembly code, or object code.
- gcc program.c -o program.exe
- gcc program.i -o program.exe
- gcc program.s -o program.exe
- gcc program.o -o program.exe
- program.exe is an executable. Like an object file, it is a binary file containing machine code.
- However, an object file cannot run on the machine because the links between functions have not been made yet.
- An executable has the links between functions and a _start function where execution begins.
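- The idea of linking a calling function to a called function can be sketched as follows. The names square and sum_of_squares are hypothetical. Both functions sit in one file here so the sketch stays self-contained; the comments describe what would happen if the definition lived in a separate file.

```c
/* Declaration only: this is all the translator needs to emit a call to square. */
int square(int n);

int sum_of_squares(int a, int b)
{
    /* If square were defined in a different .c file, the object file for this
       code would hold unresolved references to square at these two calls. */
    return square(a) + square(b);
}

/* Definition: in a larger project this could live in another .c file,
   and the linker would connect the calls above to it. */
int square(int n)
{
    return n * n;
}
```

- With the definition moved to a second file, each file should still compile to its own .o with gcc -c, and only the final link step would join the two.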
NOTE: Intermediate files need the proper extensions (.i, .s, .o) because GCC uses the extension of each input file to decide how to process it. A file with a different extension is not treated as the kind of input you intended.
What is a Disassembler?
- Humans cannot read machine code.
- To debug an object file or an executable, we need to convert it back into assembly to understand it.
- A disassembler is a tool that converts binary code to assembly code.
- Linux has tools such as objdump that can convert binary files to assembly.
What is the task of the linker? What is the difference between an object file and an executable? Why can an executable run on the machine while an object file cannot? Explain with an example.
- Let’s run objdump on the object file program.o.
ek@ek:~/ek/$ objdump -D program.o
program.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: f3 0f 1e fa endbr64
4: 55 push %rbp
5: 48 89 e5 mov %rsp,%rbp
8: be 02 00 00 00 mov $0x2,%esi
d: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 14 <main+0x14>
14: 48 89 c7 mov %rax,%rdi
17: b8 00 00 00 00 mov $0x0,%eax
1c: e8 00 00 00 00 call 21 <main+0x21>
21: b8 00 00 00 00 mov $0x0,%eax
26: 5d pop %rbp
27: c3 ret
- As shown above, the object file is converted to assembly. Only the relevant portion of the output is pasted here.
- The main function is there, but which function calls main and which function main calls are not mapped yet. For example, the call at offset 1c has all-zero operand bytes (e8 00 00 00 00) and points at main+0x21 instead of printf.
- Now, let’s run objdump on the executable file program.exe.
program.exe: file format elf64-x86-64
Disassembly of section .text:
0000000000001060 <_start>:
1060: f3 0f 1e fa endbr64
1064: 31 ed xor %ebp,%ebp
1066: 49 89 d1 mov %rdx,%r9
1069: 5e pop %rsi
106a: 48 89 e2 mov %rsp,%rdx
106d: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
1071: 50 push %rax
1072: 54 push %rsp
1073: 45 31 c0 xor %r8d,%r8d
1076: 31 c9 xor %ecx,%ecx
1078: 48 8d 3d ca 00 00 00 lea 0xca(%rip),%rdi # 1149 <main>
107f: ff 15 53 2f 00 00 call *0x2f53(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1085: f4 hlt
1086: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
108d: 00 00 00
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: be 02 00 00 00 mov $0x2,%esi
1156: 48 8d 05 ab 0e 00 00 lea 0xeab(%rip),%rax # 2008 <_IO_stdin_used+0x8>
115d: 48 89 c7 mov %rax,%rdi
1160: b8 00 00 00 00 mov $0x0,%eax
1165: e8 e6 fe ff ff call 1050 <printf@plt>
116a: b8 00 00 00 00 mov $0x0,%eax
116f: 5d pop %rbp
1170: c3 ret
Disassembly of section .plt.sec:
0000000000001050 <printf@plt>:
1050: f3 0f 1e fa endbr64
1054: f2 ff 25 75 2f 00 00 bnd jmp *0x2f75(%rip) # 3fd0 <printf@GLIBC_2.2.5>
105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
- As shown in the objdump output of the executable, _start at address 0000000000001060 loads the address of main at 1078 and passes it to __libc_start_main at 107f, which in turn calls main.
- main calls printf at address 1165.
- So, the functions are mapped to each other in the program.exe file.
- Conclusion: An object file is a binary file that contains machine code, but the mappings between functions are not done and the entry point _start is not there, so it cannot run. The executable has all the mappings between functions and an entry point set to _start, which leads to main. So, the executable can run on the machine.
What is the _start in C? What is the entry point of the executable?
- As shown in the above example, the linked executable contains a _start function. It comes from the C runtime startup code that the linker adds to the executable.
- Usually _start is the entry point of a C program. _start calls main, and from there execution follows the flow written in the program.
NOTE: No matter in what order the functions appear in the program or in the executable, the _start function in the executable leads to main first. When main() returns, the startup code collects its return value and runs the proper exit procedure. _start is the entry point, so when the loader loads the application into RAM, execution starts from the _start function.
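- The relationship between the startup code and main can be sketched in plain C. This is only a conceptual model, not the real startup code: sketch_start, demo_main, and entry_fn are hypothetical names, and a real _start would hand the status to the exit procedure instead of returning it.

```c
/* Hypothetical type for a main-like entry function. */
typedef int (*entry_fn)(int argc, char **argv);

/* Hypothetical stand-in for a program's main function. */
int demo_main(int argc, char **argv)
{
    (void)argv;                    /* unused in this sketch */
    return argc == 1 ? 0 : 3;      /* 0 when run without arguments, 3 otherwise */
}

/* Conceptual model of the startup code: call the entry function and
   collect its return value as the exit status of the program. */
int sketch_start(entry_fn entry, int argc, char **argv)
{
    int status = entry(argc, argv);
    return status;                 /* a real _start would run the exit procedure here */
}
```

- Calling sketch_start(demo_main, 1, argv) yields 0, which mirrors how the value returned from main becomes the exit status of the process.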
Can we write a program without a main? Will it compile?
- Yes, we can write a C program without main.
- A normal build will fail at the link stage.
- We need to build it with the -nostartfiles flag in the compilation command.
//vim program.c
void fun2();
void fun1()
{
fun2();
}
void fun2()
{
}
gcc program.c -o program.exe
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
collect2: error: ld returned 1 exit status
- A normal build gives an error because _start refers to main, and main is not defined.
- Let’s compile it using the -nostartfiles flag.
//vim program.c
void fun2();
void fun1()
{
fun2();
}
void fun2()
{
}
gcc -nostartfiles program.c -o program.exe
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
- The build succeeds here.
- It gives a warning that the _start symbol is not found, and the linker defaults to address “0000000000001000” as the entry point.
- Let’s disassemble the above program.exe.
Disassembly of section .text:
0000000000001000 <fun1>:
1000: f3 0f 1e fa endbr64
1004: 55 push %rbp
1005: 48 89 e5 mov %rsp,%rbp
1008: b8 00 00 00 00 mov $0x0,%eax
100d: e8 03 00 00 00 call 1015 <fun2>
1012: 90 nop
1013: 5d pop %rbp
1014: c3 ret
0000000000001015 <fun2>:
1045: f3 0f 1e fa endbr64
1049: 55 push %rbp
104a: 48 89 e5 mov %rsp,%rbp
104d: 48 8d 05 ac 0f 00 00 lea 0xfac(%rip),%rax # 2000 <f2+0xfbb>
1054: 48 89 c7 mov %rax,%rdi
1057: e8 c4 ff ff ff call 1020 <puts@plt>
105c: 90 nop
105d: 5d pop %rbp
105e: c3 ret
- Observe that the entry address chosen by the linker and the address of fun1 are the same.
- fun1 then calls fun2, as shown in the above code.
- So, when there is no main function and the program is built with the -nostartfiles option, the linker falls back to the start of the code section as the entry point. In our program, the function placed there is fun1, the first function written in the program.
- Let’s run the code.
ankit@antikIN:~/Antik/ek/code/compiler$ ./program.exe
Segmentation fault (core dumped)
- As mentioned earlier, _start is the starting point and it also provides a proper exit procedure.
- In our code, there is no _start function and no exit procedure is followed. When fun1 returns, there is no caller to return to, so the program crashes.
- We need to call an exit function in fun1 so that the program exits normally.
- So, the conclusion is that writing a program without main is possible, but we must handle the program sequence ourselves: what the entry point is and how the program exits. If main is written, the starting flow is fixed and there is no need to provide an exit procedure.
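- A minimal sketch of adding an exit call to fun1 is shown below. It assumes an x86-64 Linux system with glibc and the same gcc -nostartfiles command as above. The counter fun2_calls is a hypothetical addition so that the call to fun2 is observable, and _exit from unistd.h is one possible choice of exit function. Whether it exits cleanly may depend on the toolchain and platform.

```c
#include <unistd.h>    /* _exit */

int fun2_calls = 0;    /* hypothetical counter, added only for illustration */

void fun2(void);

/* Placed first so that it is expected to land at the default entry address. */
void fun1(void)
{
    fun2();
    _exit(0);          /* leave the process explicitly; fun1 has no caller to return to */
}

void fun2(void)
{
    fun2_calls++;
}
```

- With this change, running ./program.exe is expected to end with exit status 0 instead of a segmentation fault, because fun1 never executes its return instruction.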
NOTE: Writing a main() function is not mandatory; as we have just seen, a program can be written and built without it. But if there is no main() and no startup code, execution starts from whatever function is placed at the entry address of the executable. So without main() the programmer has to control the order of functions and the exit procedure, which becomes tiresome for a lengthy program. That is why we write programs with a main() function: _start jumps to main no matter where main is located in the program, so we do not have to worry about the order of the functions.