Project 1: "Hello, Toolchain" — Build Pipeline Explorer
Build a CLI tool that reveals every transformation your C code undergoes from source to running process.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C (Alternatives: Rust, Zig, C++) |
| Prerequisites | Basic C, comfort with build tools, basic debugging |
| Key Topics | Compilation pipeline, object files, symbols, linking, process layout |
| CS:APP Chapters | 1, 7 |
1. Learning Objectives
By completing this project, you will:
- Trace the complete compilation pipeline: Explain what happens at each stage from .c to running process
- Read object file metadata: Parse and interpret ELF section headers, symbol tables, and relocation entries
- Understand static vs dynamic linking: Predict how symbol resolution differs and demonstrate the tradeoffs
- Map runtime memory layout: Connect debugger output to the theoretical process address space model
- Debug using binary artifacts: Given a crash address, trace back through the pipeline to find the source
- Use professional tooling fluently: Master gcc, objdump, readelf, nm, ldd, and debuggers
2. Theoretical Foundation
2.1 The Compilation Pipeline
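When you type gcc hello.c -o hello, you’re invoking not one tool but a driver that runs four distinct stages, each performing a specific transformation (modern GCC folds preprocessing into cc1 unless you ask for the intermediate file):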
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ hello.c │────▶│ hello.i │────▶│ hello.s │────▶│ hello.o │────▶│ hello │
│ (source) │ cpp │(preprocessed)│ cc1 │ (assembly) │ as │ (object) │ ld │(executable) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Text Text Text Binary Binary
Stage 1: Preprocessing (cpp)
The C preprocessor handles all directives starting with #:
// Before preprocessing
#include <stdio.h>
#define MAX 100
int main() {
printf("Max is %d\n", MAX);
}
// After preprocessing (hello.i) - simplified
// ... thousands of lines from stdio.h ...
typedef struct _IO_FILE FILE;
extern int printf(const char *__format, ...);
// ... more declarations ...
int main() {
printf("Max is %d\n", 100); // MAX replaced with 100
}
Key insight: The preprocessor is a text-to-text transformation. It knows nothing about C syntax—it just does text substitution and file inclusion.
Run: gcc -E hello.c -o hello.i
Stage 2: Compilation (cc1)
The compiler proper transforms C into assembly. This is where:
- Syntax is parsed into an Abstract Syntax Tree
- Type checking occurs
- Optimizations are applied
- Code is generated for the target architecture
# hello.s (simplified x86-64)
.section .rodata
.LC0:
.string "Max is %d\n"
.text
.globl main
main:
pushq %rbp
movq %rsp, %rbp
movl $100, %esi # Second argument (MAX value)
leaq .LC0(%rip), %rdi # First argument (format string)
movl $0, %eax # varargs convention
call printf@PLT
movl $0, %eax # return 0
popq %rbp
ret
Run: gcc -S hello.c -o hello.s
Stage 3: Assembly (as)
The assembler converts human-readable assembly to machine code, producing an object file (.o). This is the first binary stage.
Object files contain:
- Machine code (the .text section)
- Data (.data, .rodata, .bss sections)
- Symbol table (what symbols are defined/referenced)
- Relocation entries (addresses to fix up during linking)
Run: gcc -c hello.c -o hello.o
Stage 4: Linking (ld)
The linker combines object files and libraries into an executable:
- Symbol resolution: Match each symbol reference to exactly one definition
- Relocation: Fix up addresses now that final layout is known
- Layout: Organize sections into a loadable format
┌─────────────────────────────────────────────────────────────┐
│ Linker's Job │
├─────────────────────────────────────────────────────────────┤
│ Input: hello.o + libc.so (or libc.a) │
│ │
│ 1. Find definition for 'printf' → in libc │
│ 2. Calculate final addresses for all symbols │
│ 3. Patch call printf@PLT with actual jump target │
│ 4. Create program headers for OS loader │
│ 5. Output: Executable ELF file │
└─────────────────────────────────────────────────────────────┘
2.2 ELF Object File Format
ELF (Executable and Linkable Format) is the standard binary format on Linux/Unix. Understanding it is crucial:
┌─────────────────────────────────────┐
│ ELF Header │ Magic number, architecture, entry point
├─────────────────────────────────────┤
│ Program Headers │ How to load into memory (for executables)
├─────────────────────────────────────┤
│ .text │ Executable code
├─────────────────────────────────────┤
│ .rodata │ Read-only data (string literals, constants)
├─────────────────────────────────────┤
│ .data │ Initialized global/static variables
├─────────────────────────────────────┤
│ .bss │ Uninitialized global/static (zeroed at load)
├─────────────────────────────────────┤
│ .symtab │ Symbol table
├─────────────────────────────────────┤
│ .strtab │ String table (names for symbols)
├─────────────────────────────────────┤
│ .rela.text │ Relocation entries for .text (.rel.text on targets that use REL entries, such as 32-bit x86)
├─────────────────────────────────────┤
│ Section Headers │ Describes all sections
└─────────────────────────────────────┘
Symbol Types
| Type | Meaning | Example |
|---|---|---|
| T | Text (code) symbol, globally visible | main |
| t | Text symbol, local to file | helper_func with static |
| D | Data symbol, globally visible | int global_var = 5; |
| B | BSS symbol (uninitialized) | int uninit_global; |
| U | Undefined (needs to be resolved) | Reference to printf |
2.3 Static vs Dynamic Linking
Static Linking (gcc -static):
- All library code is copied into the executable
- Larger binary, but self-contained
- No runtime dependencies
- Symbol resolution happens entirely at link time
Dynamic Linking (default):
- Library code stays in shared libraries (.so files)
- Smaller binaries, shared memory for common libraries
- Resolution happens partly at load time
- PLT (Procedure Linkage Table) and GOT (Global Offset Table) enable lazy binding
Static: hello (executable) contains printf code
Dynamic: hello (executable) contains PLT stub → jumps to libc.so at runtime
2.4 Process Memory Layout
When the OS loads your program, it creates an address space:
High addresses (0x7fff...)
┌─────────────────────────────────────┐
│ Kernel Space │ Not accessible to user code
├─────────────────────────────────────┤
│ Stack │ ↓ grows down (local vars, return addrs)
│ │ │
│ ▼ │
├─────────────────────────────────────┤
│ │ (unmapped region)
├─────────────────────────────────────┤
│ ▲ │
│ │ │
│ Heap │ ↑ grows up (malloc'd memory)
├─────────────────────────────────────┤
│ .bss (uninitialized) │
├─────────────────────────────────────┤
│ .data (initialized) │
├─────────────────────────────────────┤
│ .rodata (read-only) │
├─────────────────────────────────────┤
│ .text (code) │
└─────────────────────────────────────┘
Low addresses (0x0...)
2.5 Why This Matters
Understanding the compilation pipeline enables you to:
- Debug effectively: A crash at address 0x401234 means something: you can trace it to the exact instruction and source line
- Optimize intelligently: Understanding what the compiler does helps you write code it can optimize well
- Write portable code: Knowing about linking helps you manage symbol visibility and avoid conflicts
- Security analysis: Buffer overflows, return-oriented programming, and other attacks exploit this layout
- Embedded systems: Resource-constrained systems require understanding exactly what ends up in the binary
2.6 Common Misconceptions
Misconception 1: “The compiler converts C directly to machine code” Reality: There are four distinct stages, each with its own tools and artifacts.
Misconception 2: “Object files are executables” Reality: Object files contain machine code but can’t run—they have unresolved symbols and no load information.
Misconception 3: “Static linking is always better for deployment” Reality: Dynamic linking saves memory (shared libraries), enables library updates without relinking the application, and is required or simpler for some features (plugins loaded at runtime, LGPL compliance).
Misconception 4: “The stack grows up” Reality: On x86/x86-64, the stack grows DOWN (toward lower addresses). This is crucial for understanding buffer overflows.
3. Project Specification
3.1 What You Will Build
A command-line “pipeline explainer” that takes a C source file and produces a structured report showing:
- Every artifact produced at each compilation stage
- Symbol and section analysis of the object file
- Linking information (resolved symbols, required libraries)
- Runtime process layout (via controlled execution with debugger)
3.2 Functional Requirements
- Pipeline Capture (--pipeline):
  - Save preprocessed output (.i)
  - Save assembly output (.s)
  - Save object file (.o)
  - Save final executable
- Object Analysis (--analyze):
  - List all sections with sizes and types
  - List symbols by category (defined, undefined, local, global)
  - Show relocation entries
- Link Analysis (--link):
  - Identify required shared libraries
  - Compare static vs dynamic link outputs
  - Show symbol resolution decisions
- Runtime Inspection (--runtime):
  - Launch under debugger, pause at main
  - Report stack and heap boundaries
  - Show loaded shared libraries
  - Demonstrate memory regions
- Crash Analysis (--crash <address>):
  - Given a crash address, trace back to source line
  - Show the relevant assembly
  - Explain what stage produced that code
3.3 Non-Functional Requirements
- Determinism: Same input produces same output (modulo runtime addresses)
- Portability: Works on Linux x86-64 (primary) and macOS (stretch goal)
- Educational: Output explains “why” not just “what”
- Clean: Uses proper error handling, no crashes on malformed input
3.4 Example Usage / Output
$ ./pipeline-explorer hello.c
=== PIPELINE ANALYSIS for hello.c ===
[Stage 1: Preprocessing]
Input: hello.c (234 bytes)
Output: hello.i (42,891 bytes)
Included: stdio.h, stdlib.h, limits.h, ...
Macros expanded: 3 (MAX, DEBUG, VERSION)
[Stage 2: Compilation]
Input: hello.i
Output: hello.s (1,247 bytes)
Functions generated: 2 (main, helper)
Optimization level: -O0
[Stage 3: Assembly]
Input: hello.s
Output: hello.o (2,456 bytes)
Sections:
.text : 168 bytes (executable code)
.rodata : 24 bytes (string literals)
.data : 0 bytes
.bss : 0 bytes
Symbol Table:
DEFINED: main (T, global), helper (t, local)
UNDEFINED: printf (U), exit (U)
Relocations: 4 entries, including:
- printf @ offset 0x24 (PC-relative call)
- .rodata @ offset 0x18 (data reference)
[Stage 4: Linking]
Input: hello.o + libc.so.6
Output: hello (16,432 bytes)
Symbol Resolution:
printf -> libc.so.6 (0x7f... at runtime)
exit -> libc.so.6 (0x7f... at runtime)
Dependencies: libc.so.6, ld-linux-x86-64.so.2
[Runtime Layout] (paused at main)
Text: 0x401000 - 0x401234
ROData: 0x402000 - 0x402100
Data: 0x404000 - 0x404008
BSS: 0x404008 - 0x404010
Heap: 0x405000 - (grows up)
Stack: 0x7fff... - (grows down)
Stack at main():
Return address: 0x7f4a3c... (__libc_start_main+234)
Saved RBP: 0x0
Local vars: argc @ rbp-4, argv @ rbp-16
=== END REPORT ===
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────────────┐
│ pipeline-explorer │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Compiler │──▶│ Object │──▶│ Link │──▶│ Runtime │ │
│ │ Invoker │ │ Analyzer │ │ Analyzer │ │ Inspector │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐│
│ │ Report Generator ││
│ └──────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ [Formatted Output] │
└─────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Compiler Invoker | Runs gcc with flags to produce each stage artifact | Uses -save-temps or explicit stage flags; captures stderr for warnings |
| Object Analyzer | Parses ELF object files | Can use readelf/objdump output OR parse ELF directly (stretch goal) |
| Link Analyzer | Compares static/dynamic linking, identifies dependencies | Uses ldd, nm, compares sizes |
| Runtime Inspector | Launches executable under debugger, queries state | Uses GDB in batch mode with scripted commands |
| Report Generator | Formats analysis into readable output | Structured text, optionally JSON for machine consumption |
4.3 Data Structures
// Core structures for the analyzer
#include <stddef.h>  // size_t
#include <stdint.h>  // uint32_t, uint64_t
typedef struct {
char *name;
size_t size;
uint64_t address;
char type; // 'T', 'D', 'B', 'U', etc.
int is_global;
} Symbol;
typedef struct {
char *name;
size_t size;
uint64_t address;
uint32_t flags; // SHF_* flags
} Section;
typedef struct {
uint64_t offset;
char *symbol_name;
int type; // R_X86_64_* relocation type
} Relocation;
typedef struct {
Section *sections;
size_t section_count;
Symbol *symbols;
size_t symbol_count;
Relocation *relocations;
size_t relocation_count;
} ObjectAnalysis;
typedef struct {
uint64_t text_start, text_end;
uint64_t data_start, data_end;
uint64_t heap_start;
uint64_t stack_top;
char **loaded_libraries;
size_t library_count;
} RuntimeLayout;
4.4 Algorithm Overview
Main Algorithm Flow:
- Parse command line → determine source file and requested analyses
- Invoke pipeline stages:
  - Run gcc -E → capture preprocessed output
  - Run gcc -S → capture assembly
  - Run gcc -c → create object file
  - Run gcc → create executable (both static and dynamic versions)
- Analyze object file:
  - Run readelf -h → header info
  - Run readelf -S → section table
  - Run readelf -s → symbol table
  - Run readelf -r → relocations
  - Parse outputs into data structures
- Analyze linking:
  - Compare file sizes
  - Run ldd on dynamic version
  - Run nm to see resolved symbols
- Runtime inspection:
  - Launch with gdb -batch -ex "..." executable
  - Query memory mappings (info proc mappings)
  - Query stack (info frame)
  - Parse GDB output
- Generate report → format and output all collected data
Complexity Analysis:
- Time: O(n) where n is source file size, dominated by gcc invocations
- Space: O(m) where m is size of largest artifact (typically the executable)
5. Implementation Guide
5.1 Development Environment Setup
# Required tools
sudo apt-get install gcc gdb binutils build-essential
# Verify installations
gcc --version
gdb --version
readelf --version
objdump --version
# Create project structure
mkdir -p pipeline-explorer/{src,tests,artifacts}
cd pipeline-explorer
5.2 Project Structure
pipeline-explorer/
├── src/
│ ├── main.c # Entry point, CLI parsing
│ ├── compiler.c # Invokes gcc stages
│ ├── object_analyzer.c # Parses ELF via readelf/objdump
│ ├── link_analyzer.c # Analyzes linking
│ ├── runtime_inspector.c # GDB automation
│ ├── report.c # Output formatting
│ └── util.c # String handling, process execution
├── include/
│ ├── pipeline.h # Shared data structures
│ └── util.h # Utility declarations
├── tests/
│ ├── simple.c # Basic test case
│ ├── multifile/ # Multi-file test
│ └── expected/ # Expected outputs
├── Makefile
└── README.md
5.3 Implementation Phases
Phase 1: Foundation (Days 1-3)
Goals:
- Set up project structure and build system
- Implement basic pipeline invocation
- Capture all stage artifacts
Tasks:
- Create Makefile that builds your tool
- Implement run_command() utility that executes shell commands and captures output
- Implement invoke_preprocessor(), invoke_compiler(), invoke_assembler(), invoke_linker()
- Verify artifacts are created in a temp directory
Checkpoint: ./pipeline-explorer hello.c produces hello.i, hello.s, hello.o, and hello in an artifacts directory.
Phase 2: Object File Analysis (Days 4-7)
Goals:
- Parse readelf output for sections, symbols, relocations
- Build internal data structures
- Format readable output
Tasks:
- Run readelf -S and parse section table output
- Run readelf -s and parse symbol table (handling local/global, types)
- Run readelf -r and parse relocation entries
- Connect symbols to their sections
- Format and output the analysis
Checkpoint: Running on hello.o shows all sections with sizes, categorizes symbols correctly, and lists relocations.
Phase 3: Link Analysis (Days 8-10)
Goals:
- Compare static vs dynamic linking
- Identify all dependencies
- Show symbol resolution
Tasks:
- Create both static (-static) and dynamic versions
- Use ldd to list shared library dependencies
- Use nm on final executable to show resolved symbols
- Calculate and compare file sizes
- Explain the tradeoffs in output
Checkpoint: Output shows library dependencies, size comparison, and where undefined symbols were resolved.
Phase 4: Runtime Inspection (Days 11-13)
Goals:
- Automate GDB to pause and inspect running process
- Extract memory layout
- Show stack state at entry
Tasks:
- Create GDB command script that sets breakpoint at main, runs, queries state
- Parse info proc mappings output for memory regions
- Parse info frame output for stack information
- Handle GDB output parsing robustly
- Integrate into final report
Checkpoint: Running on an executable shows correct memory layout matching expected regions.
Phase 5: Polish & Integration (Day 14)
Goals:
- Error handling for all edge cases
- Clean output formatting
- Testing on various inputs
Tasks:
- Handle compilation errors gracefully
- Test with multi-file programs
- Test with C++ if time permits
- Write documentation
- Clean up code, remove debug output
Checkpoint: Tool handles malformed input gracefully, produces clean reports for variety of test cases.
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| ELF Parsing | Parse directly vs use readelf | Use readelf | Simpler, ELF parsing is a project unto itself |
| GDB Control | Batch command scripts vs Python API | Batch command scripts (gdb -batch -x) | More portable, simpler to implement |
| Output Format | Text only vs JSON option | Text with JSON flag | Text for humans, JSON for tooling integration |
| Temp Files | System temp vs local dir | Local artifacts/ dir | Easier debugging, user can inspect |
| Static Linking Analysis | Required vs optional | Optional, enabled with a --static flag | Shows an important tradeoff, but static libraries are not always installed |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test parsing functions in isolation | Symbol parser handles all types correctly |
| Integration Tests | Test full pipeline on sample programs | hello.c produces expected output |
| Regression Tests | Verify fixes don’t break existing | Compare output to known-good baselines |
| Edge Cases | Handle unusual inputs | Empty file, no symbols, huge files |
6.2 Critical Test Cases
- Minimal Program:
  int main() { return 0; }
  Expected: No undefined symbols in the object file (_start comes from the C runtime startup files at link time), minimal sections
- External Dependencies:
  #include <stdio.h>
  #include <math.h>
  int main() { printf("%f\n", sin(1.0)); return 0; }
  Expected: Shows printf, sin as undefined → resolved from libc, libm (link with -lm; GCC may fold sin(1.0) into a constant, so pass a runtime value if sin does not appear)
- Global Variables:
  int initialized = 42;
  int uninitialized;
  const char *message = "hello";
  int main() { return initialized + uninitialized; }
  Expected: Shows .data, .bss, .rodata usage correctly
- Multi-file:
  // main.c
  extern int helper(int);
  int main() { return helper(5); }
  // helper.c
  int helper(int x) { return x * 2; }
  Expected: Shows symbol resolution between files
- Static vs Dynamic: Same program linked both ways, compare sizes and dependencies
6.3 Test Data
# Create test suite
mkdir -p tests
# Test 1: Minimal
echo 'int main() { return 0; }' > tests/minimal.c
# Test 2: Printf (a quoted heredoc keeps \n literal in every shell)
cat > tests/printf.c <<'EOF'
#include <stdio.h>
int main() { printf("hello\n"); return 0; }
EOF
# Test 3: Global data
echo 'int x = 1; int y; const char *s = "hi";
int main() { return x + y; }' > tests/globals.c
# Expected output patterns to grep for
# minimal: no .data, minimal .text
# printf: printf in undefined symbols
# globals: all three sections populated
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Not handling spaces in paths | Tool fails on “/home/user/My Programs/test.c” | Quote all paths in shell commands |
| Parsing readelf wrong | Missing symbols or wrong types | Test parser on diverse binaries first |
| GDB version differences | Commands fail on different systems | Test GDB commands standalone first |
| Forgetting -g for debug info | No source line mapping | Always compile with -g for debug builds |
| Ignoring stderr | Silent failures | Capture and check stderr from all tools |
| Hardcoded paths | Works on your machine only | Resolve tools through PATH (e.g., command -v in the shell, execvp in C) |
7.2 Debugging Strategies
- Print intermediate states: After each parsing step, dump the data structure
- Test tools independently: Before integrating, verify readelf, objdump, gdb produce expected output
- Use simple inputs first: Get minimal.c working perfectly before complex cases
- Compare with manual: Run tools manually and compare to your parsed output
- Binary diff for determinism: Same input should produce byte-identical artifacts (ignoring timestamps)
7.3 Performance Traps
- Spawning too many processes: Batch related queries (one readelf call with multiple flags)
- Reading huge outputs: For large binaries, process output line-by-line, don’t load all into memory
- GDB startup time: GDB is slow to start; for multiple queries, use one session with multiple commands
8. Extensions & Challenges
8.1 Beginner Extensions
- Add JSON output: --format json for machine-readable output
- Colorized output: Highlight different symbol types, sections
- Verbose mode: Show actual readelf/objdump commands being run
8.2 Intermediate Extensions
- Direct ELF parsing: Read ELF format directly without external tools
- Disassembly integration: Show assembly for specific functions
- Diff mode: Compare two binaries’ symbols and sections
- Cross-compilation support: Analyze ARM/RISC-V binaries
8.3 Advanced Extensions
- Debug info parsing: Read DWARF to show full source mapping
- Dynamic analysis: Use ptrace to trace actual symbol resolutions at runtime
- Optimization analysis: Show what changes between -O0, -O1, -O2, -O3
- Security audit: Check for common vulnerabilities (no stack canary, RELRO, etc.)
9. Real-World Connections
9.1 Industry Applications
- Build systems (Bazel, CMake): Understand these pipelines to debug build failures
- Package managers (apt, rpm): Know how shared libraries are managed
- Containerization (Docker): Static linking simplifies container images
- Embedded systems: Every byte counts; understanding sections is crucial
- Reverse engineering: This is step 1 of analyzing any binary
9.2 Related Open Source Projects
- binutils: The tools you’re wrapping (readelf, objdump, nm)
- LLVM: Alternative compiler with similar pipeline concepts
- pwntools: Python library for similar analysis (CTF-oriented)
- Ghidra/radare2: Advanced binary analysis tools
- libelf: Library for direct ELF parsing
9.3 Interview Relevance
This project prepares you for questions like:
- “Walk me through what happens when you compile and run a C program”
- “What’s the difference between static and dynamic linking?”
- “How would you debug a segfault at address X?”
- “Explain the process memory layout”
- “What are symbols and relocations in object files?”
10. Resources
10.1 Essential Reading
- CS:APP Chapter 1: “A Tour of Computer Systems” - Overview of the pipeline
- CS:APP Chapter 7: “Linking” - Deep dive into object files and linking
- “Linkers and Loaders” by John Levine: The definitive book on linking
- System V ABI: Official specification for ELF format and calling conventions
10.2 Video Resources
- MIT 6.004 lectures on compilation and linking
- “How do compilers work?” on Computerphile YouTube channel
- LiveOverflow series on binary exploitation (covers ELF in depth)
10.3 Tools & Documentation
- readelf(1): man readelf - ELF file display
- objdump(1): man objdump - Object file disassembly
- nm(1): man nm - Symbol listing
- ldd(1): man ldd - Shared library dependencies
- ld(1): man ld - Linker documentation
10.4 Related Projects in This Series
- Previous: None (this is the foundation project)
- Next: P2 (Bitwise Data Inspector) builds on your understanding of how data is represented; P10 (ELF Link Map) goes deeper into linking
11. Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- I can explain each of the 4 pipeline stages without looking at notes
- I can list the main ELF sections and what goes in each
- I can explain what relocations are and why they’re needed
- I understand the difference between static and dynamic linking
- I can draw the process memory layout and explain each region
Implementation
- Tool correctly identifies all pipeline artifacts
- Symbol parsing correctly categorizes defined/undefined, local/global
- Section sizes match what readelf reports
- Runtime inspection shows correct memory regions
- Error handling works for malformed input
Growth
- I debugged at least one issue by examining the actual binary
- I can now read assembly output and connect it to C source
- I’m comfortable using readelf, objdump, nm, ldd, gdb
12. Submission / Completion Criteria
Minimum Viable Completion:
- Pipeline stage artifacts are captured and reported
- Object file sections and symbols are analyzed
- Basic output formatting works
- Runs without crashing on valid input
Full Completion:
- All analysis modes work (pipeline, object, link, runtime)
- Static vs dynamic comparison works
- GDB integration shows memory layout
- Clean error handling
- Tested on multiple input files
Excellence (Going Above & Beyond):
- Direct ELF parsing (no readelf dependency)
- Cross-platform support (macOS)
- JSON output format
- Crash address analysis mode
- DWARF debug info parsing
13. Real World Outcome
When you complete this project, running your tool should produce output along these lines (sizes, timings, and addresses will vary by system):
$ ./pipeline-explorer hello.c
=== PREPROCESSING STAGE ===
Input: hello.c (42 lines)
Output: hello.i (1,847 lines)
Time: 0.023s
Macros expanded:
MAX → 100
DEBUG → 1
VERSION → "1.0.0"
Headers included:
stdio.h → /usr/include/stdio.h (925 lines)
stdlib.h → /usr/include/stdlib.h (743 lines)
limits.h → /usr/include/limits.h (179 lines)
=== COMPILATION STAGE ===
Input: hello.i (1,847 lines)
Output: hello.s (89 lines)
Time: 0.156s
Functions generated:
main @ .text (32 instructions)
helper @ .text (18 instructions)
String literals found:
.LC0: "Hello, %s!\n"
.LC1: "Max value is %d"
=== ASSEMBLY STAGE ===
Input: hello.s (89 lines)
Output: hello.o (2,456 bytes)
Time: 0.012s
ELF Sections:
.text : 168 bytes (executable code)
.rodata : 42 bytes (read-only strings)
.data : 8 bytes (initialized globals)
.bss : 16 bytes (uninitialized globals)
.symtab : 312 bytes (symbol table)
.strtab : 89 bytes (string table)
.rela.text: 72 bytes (relocations)
Symbol Table:
GLOBAL DEFINED:
main T 0x0000000000000000 (32 bytes)
global_count D 0x0000000000000000 (8 bytes)
LOCAL DEFINED:
helper t 0x0000000000000020 (18 bytes)
.LC0 r 0x0000000000000000 (string)
UNDEFINED (need linking):
printf U (from libc)
exit U (from libc)
malloc U (from libc)
Relocations (4 entries):
Offset Type Symbol
0x00000015 R_X86_64_PLT32 printf
0x00000024 R_X86_64_PC32 .LC0
0x0000002f R_X86_64_PLT32 exit
0x0000003a R_X86_64_PLT32 malloc
=== LINKING STAGE ===
Input: hello.o + libc.so.6
Output: hello (16,432 bytes)
Time: 0.089s
Static vs Dynamic Comparison:
Dynamic: 16,432 bytes (uses shared libc)
Static: 823,456 bytes (libc copied in)
Ratio: 50x larger when static!
Shared Library Dependencies:
libc.so.6 → /lib/x86_64-linux-gnu/libc.so.6
ld-linux-x86-64.so.2 → /lib64/ld-linux-x86-64.so.2
Symbol Resolution:
printf → libc.so.6::printf (PLT stub at 0x401030)
exit → libc.so.6::exit (PLT stub at 0x401040)
malloc → libc.so.6::malloc (PLT stub at 0x401050)
=== RUNTIME LAYOUT (paused at main) ===
Memory Map:
0x00401000 - 0x00402000 r-xp .text (code)
0x00402000 - 0x00403000 r--p .rodata
0x00404000 - 0x00405000 rw-p .data + .bss
0x00405000 - 0x00426000 rw-p [heap]
0x7ffffffde000 - 0x7ffffffff000 rw-p [stack]
Stack at main():
Address Value Meaning
0x7fffffffdde8 0x00007ffff7a2d840 return addr → __libc_start_main+234
0x7fffffffdde0 0x00007fffffffde00 saved %rbp
0x7fffffffddd8 0x0000000000000001 argc = 1
0x7fffffffddd0 0x00007fffffffdeb8 argv pointer
=== PIPELINE COMPLETE ===
Total time: 0.280s
All artifacts saved to: ./artifacts/hello/
14. The Core Question You’re Answering
“When I type gcc hello.c -o hello and press Enter, what ACTUALLY happens inside my computer before I can run ./hello?”
This project transforms “the compiler does stuff” into a complete mental model of four distinct transformations, each with its own tools, formats, and purpose. You’ll understand why linking errors are different from compile errors, why symbols matter, and how your code becomes bytes that the CPU executes.
15. Concepts You Must Understand First
Before starting this project, ensure you understand these concepts:
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| C preprocessor directives (#include, #define) | You’ll trace how these expand | CS:APP 1.2, any C book Ch. 1 |
| Basic assembly syntax (mov, call, ret) | You’ll read compiler output | CS:APP 3.1-3.4 |
| What a function call does at the machine level | You’ll see call/ret in action | CS:APP 3.7 |
| Hexadecimal notation | Object files are full of hex | CS:APP 2.1 |
| What “address” means for code and data | Linking assigns addresses | CS:APP 1.4, 7.1 |
| Difference between source, object, and executable files | Core of this project | CS:APP 1.2, Chapter 7 intro |
16. Questions to Guide Your Design
Work through these questions BEFORE writing code:
- Input/Output: What does your tool take as input? What outputs should it produce? Where should artifacts be saved?
- Stage Separation: How will you invoke each stage of gcc separately? What flags control this?
- Output Parsing: Tools like readelf and objdump produce text output. How will you parse this reliably? What about different versions?
- Error Handling: What if the C file doesn’t compile? What if readelf isn’t installed? How do you report errors helpfully?
- GDB Automation: How can you script GDB to pause at main and query process state? What’s the output format?
- Static Linking: On modern systems, static linking may not work or may require special packages. How will you handle this?
- Report Format: How will you structure the output? Text? JSON? Should sections be optional?
17. Thinking Exercise
Before writing any code, trace through this by hand:
Take this simple C program:
#include <stdio.h>
#define MSG "Hello"
int count = 0;
int main() {
printf("%s: %d\n", MSG, count);
return 0;
}
Exercise: On paper, answer:
- Preprocessing: What does the file look like after #include and #define are processed? How many lines? What happened to MSG?
- Compilation: How many functions will be in the assembly? What strings will be in .rodata? Where is count stored?
- Assembly: What sections will the object file have? What symbols are defined? What symbols are undefined?
- Linking: Where does the definition of printf come from? What happens if we link statically vs. dynamically?
- Runtime: When the process starts, where in memory is count? Where is the string “Hello”? Where is printf’s code?
Verify your answers by running the actual commands (gcc -E, gcc -S, objdump, etc.) before implementing your tool.
18. The Interview Questions They’ll Ask
After completing this project, you’ll be ready for these common interview questions:
- “Walk me through what happens when you compile a C program.”
- Expected: Explain all 4 stages with specific tools (cpp, cc1, as, ld)
- Bonus: Mention intermediate file formats and what changes at each stage
- “What’s the difference between a compile error and a link error?”
- Expected: Compile errors are syntax/type issues; link errors are missing symbols
- Bonus: Give example of each (“undeclared identifier” vs “undefined reference”)
- “Explain static vs. dynamic linking. When would you use each?”
- Expected: Trade-offs of binary size, dependencies, update flexibility
- Bonus: Mention PLT/GOT, security implications (ASLR), and LGPL considerations
- “What’s in an object file?”
- Expected: Machine code, data, symbol table, relocation entries
- Bonus: Explain why relocations exist (addresses aren’t known until linking)
- “How would you debug a ‘segfault at address 0x401234’?”
- Expected: Use objdump/readelf to find what’s at that address, check if it’s in .text
- Bonus: Explain how to map back to source line using debug info
- “What does ‘position-independent code’ mean?”
- Expected: Code that works regardless of where it’s loaded in memory
- Bonus: Explain how this relates to shared libraries and security (ASLR)
19. Hints in Layers
If you’re stuck, reveal hints one at a time:
Hint 1: Getting Started
Start with just capturing the pipeline artifacts. Before parsing anything, make sure you can run:
gcc -E source.c -o source.i
gcc -S source.c -o source.s
gcc -c source.c -o source.o
gcc source.c -o executable
And verify each file exists. Your tool is basically automating this and adding analysis.
Hint 2: Parsing readelf Output
Don’t try to parse the raw binary ELF format yourself (that’s a project unto itself). Use readelf with flags:
- readelf -h file.o - ELF header
- readelf -S file.o - Section headers
- readelf -s file.o - Symbol table
- readelf -r file.o - Relocations
Parse the text output line by line. Look for consistent column positions or use regex.
Hint 3: GDB Scripting
Create a file gdb_script.txt:
set pagination off
break main
run
info proc mappings
info registers
info frame
quit
Run with: gdb -batch -x gdb_script.txt ./executable
The output is text you can capture and parse.
Hint 4: Static Linking Issues
On Ubuntu/Debian: sudo apt-get install libc6-dev
On modern systems, fully static linking may fail. Make this an optional feature. Your tool should gracefully report “static linking unavailable” rather than crashing.
To attempt static: gcc -static source.c -o executable_static
20. Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Overview of compilation | CS:APP 3rd Ed | Chapter 1.2 “Programs Are Translated by Other Programs into Different Forms” |
| Detailed compilation stages | CS:APP 3rd Ed | Chapter 1.3 “It Pays to Understand How Compilation Systems Work” |
| Object file formats | CS:APP 3rd Ed | Chapter 7.3-7.5 “Object Files”, “Relocatable Object Files”, “Symbols and Symbol Tables” |
| Symbol resolution | CS:APP 3rd Ed | Chapter 7.6 “Symbol Resolution” |
| Relocation | CS:APP 3rd Ed | Chapter 7.7 “Relocation” |
| Static vs Dynamic linking | CS:APP 3rd Ed | Chapter 7.10-7.11 “Dynamic Linking with Shared Libraries”, “Loading and Linking Shared Libraries from Applications” |
| Process memory layout | CS:APP 3rd Ed | Chapter 7.9 “Loading Executable Object Files” |
| Linker internals (deep dive) | Linkers & Loaders by John Levine | Chapters 1-4 |
| ELF format details | System V ABI | Chapter 4-5 (ELF specification) |
| GNU toolchain | GCC Manual | Section on “Overall Options” for -E, -S, -c |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.