Project 1: "Hello, Toolchain" — Build Pipeline Explorer

Build a CLI tool that reveals every transformation your C code undergoes from source to running process.

Quick Reference

Attribute	Value
Difficulty	Intermediate
Time Estimate	1-2 weeks
Language	C (Alternatives: Rust, Zig, C++)
Prerequisites	Basic C, comfort with build tools, basic debugging
Key Topics	Compilation pipeline, object files, symbols, linking, process layout
CS:APP Chapters	1, 7

1. Learning Objectives

By completing this project, you will:

Trace the complete compilation pipeline: Explain what happens at each stage from .c to running process
Read object file metadata: Parse and interpret ELF section headers, symbol tables, and relocation entries
Understand static vs dynamic linking: Predict how symbol resolution differs and demonstrate the tradeoffs
Map runtime memory layout: Connect debugger output to the theoretical process address space model
Debug using binary artifacts: Given a crash address, trace back through the pipeline to find the source
Use professional tooling fluently: Master gcc, objdump, readelf, nm, ldd, and debuggers

2. Theoretical Foundation

2.1 The Compilation Pipeline

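When you type gcc hello.c -o hello, the gcc driver runs four distinct stages, each performing a specific transformation (modern GCC folds preprocessing into cc1, but the stage remains logically separate and can still be run on its own):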

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   hello.c   │────▶│   hello.i   │────▶│   hello.s   │────▶│   hello.o   │────▶│   hello     │
│   (source)  │ cpp │(preprocessed)│ cc1 │ (assembly)  │ as  │  (object)   │ ld  │(executable) │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
     Text              Text               Text              Binary             Binary

Stage 1: Preprocessing (cpp)

The C preprocessor handles all directives starting with #:

// Before preprocessing
#include <stdio.h>
#define MAX 100

int main() {
    printf("Max is %d\n", MAX);
}

// After preprocessing (hello.i) - simplified
// ... thousands of lines from stdio.h ...
typedef struct _IO_FILE FILE;
extern int printf(const char *__format, ...);
// ... more declarations ...

int main() {
    printf("Max is %d\n", 100);  // MAX replaced with 100
}

Key insight: The preprocessor is a text-to-text transformation. It knows nothing about C syntax—it just does text substitution and file inclusion.

Run: gcc -E hello.c -o hello.i
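
Because the substitution is purely textual, a macro argument is pasted in without any regard for operator precedence. The following is a minimal sketch of that effect; the macro and function names are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical macros for illustration only. */
#define SQUARE_NAIVE(x) x * x          /* argument pasted in as-is      */
#define SQUARE_SAFE(x)  ((x) * (x))    /* parentheses survive the paste */

/* SQUARE_NAIVE(1 + 2) expands to 1 + 2 * 1 + 2, parsed as 1 + (2 * 1) + 2. */
static_assert(SQUARE_NAIVE(1 + 2) == 5, "textual expansion ignores precedence");

/* SQUARE_SAFE(1 + 2) expands to ((1 + 2) * (1 + 2)). */
static_assert(SQUARE_SAFE(1 + 2) == 9, "parenthesized expansion behaves like a function");

int naive_square_of_sum(void) { return SQUARE_NAIVE(1 + 2); }
int safe_square_of_sum(void)  { return SQUARE_SAFE(1 + 2); }
```

Running gcc -E on a file like this shows the expanded expressions directly, which is a quick way to confirm what the compiler proper actually receives.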

Stage 2: Compilation (cc1)

The compiler proper transforms C into assembly. This is where:

Syntax is parsed into an Abstract Syntax Tree
Type checking occurs
Optimizations are applied
Code is generated for the target architecture

# hello.s (simplified x86-64)
    .section .rodata
.LC0:
    .string "Max is %d\n"

    .text
    .globl main
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $100, %esi        # Second argument (MAX value)
    leaq    .LC0(%rip), %rdi  # First argument (format string)
    movl    $0, %eax          # %al = 0: no vector registers used (variadic call)
    call    printf@PLT
    movl    $0, %eax          # return 0
    popq    %rbp
    ret

Run: gcc -S hello.c -o hello.s

Stage 3: Assembly (as)

The assembler converts human-readable assembly to machine code, producing an object file (.o). This is the first binary stage.

Object files contain:

Machine code (the .text section)
Data (.data, .rodata, .bss sections)
Symbol table (what symbols are defined/referenced)
Relocation entries (addresses to fix up during linking)

Run: gcc -c hello.c -o hello.o

Stage 4: Linking (ld)

The linker combines object files and libraries into an executable:

Symbol resolution: Match each symbol reference to exactly one definition
Relocation: Fix up addresses now that final layout is known
Layout: Organize sections into a loadable format

┌─────────────────────────────────────────────────────────────┐
│                    Linker's Job                             │
├─────────────────────────────────────────────────────────────┤
│  Input: hello.o + libc.so (or libc.a)                       │
│                                                             │
│  1. Find definition for 'printf' → in libc                  │
│  2. Calculate final addresses for all symbols               │
│  3. Patch call printf@PLT with actual jump target           │
│  4. Create program headers for OS loader                    │
│  5. Output: Executable ELF file                             │
└─────────────────────────────────────────────────────────────┘

2.2 ELF Object File Format

ELF (Executable and Linkable Format) is the standard binary format on Linux/Unix. Understanding it is crucial:

┌─────────────────────────────────────┐
│           ELF Header                │  Magic number, architecture, entry point
├─────────────────────────────────────┤
│       Program Headers               │  How to load into memory (for executables)
├─────────────────────────────────────┤
│         .text                       │  Executable code
├─────────────────────────────────────┤
│         .rodata                     │  Read-only data (string literals, constants)
├─────────────────────────────────────┤
│         .data                       │  Initialized global/static variables
├─────────────────────────────────────┤
│         .bss                        │  Uninitialized global/static (zeroed at load)
├─────────────────────────────────────┤
│         .symtab                     │  Symbol table
├─────────────────────────────────────┤
│         .strtab                     │  String table (names for symbols)
├─────────────────────────────────────┤
│         .rela.text                  │  Relocation entries for .text (.rel.text on some other targets)
├─────────────────────────────────────┤
│       Section Headers               │  Describes all sections
└─────────────────────────────────────┘

Symbol Types

Type	Meaning	Example
`T`	Text (code) symbol, globally visible	`main`
`t`	Text symbol, local to file	`helper_func` with `static`
`D`	Data symbol, globally visible	`int global_var = 5;`
`B`	BSS symbol (uninitialized)	`int uninit_global;`
`U`	Undefined (needs to be resolved)	Reference to `printf`
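
The table can be reproduced with a single translation unit. The sketch below uses hypothetical names, and the letters in the comments are what nm typically prints for the object file at -O0 with a recent GCC; older compilers or different flags may differ (for example, -fcommon reports the uninitialized global as C, and the optimizer may inline the static helper away).

```c
#include <assert.h>
#include <stdio.h>

int global_var = 5;          /* D: initialized data, globally visible       */
int uninit_global;           /* B: zero-initialized, globally visible       */
static int call_count;       /* b: BSS, local to this file                  */
const int limit = 10;        /* R: read-only data, globally visible         */

/* t: code local to this file */
static int helper_func(int x) { return x * 2; }

/* T: globally visible code. The printf call appears as U; on glibc the
   assert typically adds a second undefined symbol, __assert_fail. */
int public_entry(int x) {
    assert(x >= 0);
    call_count++;
    printf("call %d\n", call_count);
    return helper_func(x) + global_var + uninit_global + limit;
}
```

Compiling this with gcc -c and running nm on the result is a quick way to check your tool's symbol categories against a known input.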

2.3 Static vs Dynamic Linking

Static Linking (gcc -static):

All library code is copied into the executable
Larger binary, but self-contained
No runtime dependencies
Symbol resolution happens entirely at link time

Dynamic Linking (default):

Library code stays in shared libraries (.so files)
Smaller binaries, shared memory for common libraries
Resolution happens partly at load time
PLT (Procedure Linkage Table) and GOT (Global Offset Table) enable lazy binding

Static:  hello (executable) contains printf code
Dynamic: hello (executable) contains PLT stub → jumps to libc.so at runtime

2.4 Process Memory Layout

When the OS loads your program, it creates an address space:

High addresses (0x7fff...)
┌─────────────────────────────────────┐
│            Kernel Space             │  Not accessible to user code
├─────────────────────────────────────┤
│              Stack                  │  ↓ grows down (local vars, return addrs)
│                 │                   │
│                 ▼                   │
├─────────────────────────────────────┤
│                                     │  (unmapped region)
├─────────────────────────────────────┤
│                 ▲                   │
│                 │                   │
│              Heap                   │  ↑ grows up (malloc'd memory)
├─────────────────────────────────────┤
│         .bss (uninitialized)        │
├─────────────────────────────────────┤
│         .data (initialized)         │
├─────────────────────────────────────┤
│         .rodata (read-only)         │
├─────────────────────────────────────┤
│         .text (code)                │
└─────────────────────────────────────┘
Low addresses (0x0...)
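
A program can sample one address from each region of the diagram, which gives you something to compare against debugger output. This is a minimal sketch with hypothetical names; exact addresses, and even the relative order of some regions, vary with the OS, ASLR, and whether the executable is position-independent.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;             /* expected in .data   */
int zeroed_global;                       /* expected in .bss    */
static const char literal[] = "hi";      /* expected in .rodata */

typedef struct {
    uintptr_t text, rodata, data, bss, heap, stack;
} RegionSample;

RegionSample sample_regions(void) {
    int local = 0;                       /* lives in this stack frame */
    void *heap_block = malloc(16);
    assert(heap_block != NULL);

    RegionSample s;
    /* Casting a function pointer to an integer works on POSIX systems
       but is not guaranteed by strict ISO C. */
    s.text   = (uintptr_t)&sample_regions;
    s.rodata = (uintptr_t)literal;
    s.data   = (uintptr_t)&initialized_global;
    s.bss    = (uintptr_t)&zeroed_global;
    s.heap   = (uintptr_t)heap_block;
    s.stack  = (uintptr_t)&local;
    free(heap_block);
    return s;
}

void print_regions(const RegionSample *s) {
    printf("text   0x%" PRIxPTR "\n", s->text);
    printf("rodata 0x%" PRIxPTR "\n", s->rodata);
    printf("data   0x%" PRIxPTR "\n", s->data);
    printf("bss    0x%" PRIxPTR "\n", s->bss);
    printf("heap   0x%" PRIxPTR "\n", s->heap);
    printf("stack  0x%" PRIxPTR "\n", s->stack);
}
```

On a typical Linux x86-64 system the printed stack address is far above the others, matching the top of the diagram.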

2.5 Why This Matters

Understanding the compilation pipeline enables you to:

Debug effectively: A crash at address 0x401234 means something—you can trace it to the exact instruction and source line
Optimize intelligently: Understanding what the compiler does helps you write code it can optimize well
Write portable code: Knowing about linking helps you manage symbol visibility and avoid conflicts
Security analysis: Buffer overflows, return-oriented programming, and other attacks exploit this layout
Embedded systems: Resource-constrained systems require understanding exactly what ends up in the binary

2.6 Common Misconceptions

Misconception 1: “The compiler converts C directly to machine code” Reality: There are four distinct stages, each with its own tools and artifacts.

Misconception 2: “Object files are executables” Reality: Object files contain machine code but can’t run—they have unresolved symbols and no load information.

Misconception 3: “Static linking is always better for deployment” Reality: Dynamic linking saves memory (shared libraries), enables updates without recompilation, and is required for some features (plugins, LGPL compliance).

Misconception 4: “The stack grows up” Reality: On x86/x86-64, the stack grows DOWN (toward lower addresses). This is crucial for understanding buffer overflows.

3. Project Specification

3.1 What You Will Build

A command-line “pipeline explainer” that takes a C source file and produces a structured report showing:

Every artifact produced at each compilation stage
Symbol and section analysis of the object file
Linking information (resolved symbols, required libraries)
Runtime process layout (via controlled execution with debugger)

3.2 Functional Requirements

Pipeline Capture (--pipeline):
- Save preprocessed output (.i)
- Save assembly output (.s)
- Save object file (.o)
- Save final executable
Object Analysis (--analyze):
- List all sections with sizes and types
- List symbols by category (defined, undefined, local, global)
- Show relocation entries
Link Analysis (--link):
- Identify required shared libraries
- Compare static vs dynamic link outputs
- Show symbol resolution decisions
Runtime Inspection (--runtime):
- Launch under debugger, pause at main
- Report stack and heap boundaries
- Show loaded shared libraries
- Demonstrate memory regions
Crash Analysis (--crash <address>):
- Given a crash address, trace back to source line
- Show the relevant assembly
- Explain what stage produced that code

3.3 Non-Functional Requirements

Determinism: Same input produces same output (modulo runtime addresses)
Portability: Works on Linux x86-64 (primary) and macOS (stretch goal)
Educational: Output explains “why” not just “what”
Clean: Uses proper error handling, no crashes on malformed input

3.4 Example Usage / Output

$ ./pipeline-explorer hello.c

=== PIPELINE ANALYSIS for hello.c ===

[Stage 1: Preprocessing]
  Input:  hello.c (234 bytes)
  Output: hello.i (42,891 bytes)
  Included: stdio.h, stdlib.h, limits.h, ...
  Macros expanded: 3 (MAX, DEBUG, VERSION)

[Stage 2: Compilation]
  Input:  hello.i
  Output: hello.s (1,247 bytes)
  Functions generated: 2 (main, helper)
  Optimization level: -O0

[Stage 3: Assembly]
  Input:  hello.s
  Output: hello.o (2,456 bytes)

  Sections:
    .text    : 168 bytes (executable code)
    .rodata  : 24 bytes (string literals)
    .data    : 0 bytes
    .bss     : 0 bytes

  Symbol Table:
    DEFINED:           main (T, global), helper (t, local)
    UNDEFINED:         printf (U), exit (U)

  Relocations: 4 entries (first 2 shown)
    - printf @ offset 0x24 (PC-relative call)
    - .rodata @ offset 0x18 (data reference)

[Stage 4: Linking]
  Input:  hello.o + libc.so.6
  Output: hello (16,432 bytes)

  Symbol Resolution:
    printf -> libc.so.6 (0x7f... at runtime)
    exit   -> libc.so.6 (0x7f... at runtime)

  Dependencies: libc.so.6, ld-linux-x86-64.so.2

[Runtime Layout] (paused at main)
  Text:    0x401000 - 0x401234
  ROData:  0x402000 - 0x402100
  Data:    0x404000 - 0x404008
  BSS:     0x404008 - 0x404010
  Heap:    0x405000 - (grows up)
  Stack:   0x7fff... - (grows down)

  Stack at main():
    Return address: 0x7f4a3c...  (__libc_start_main+234)
    Saved RBP:      0x0
    Local vars:     argc @ rbp-4, argv @ rbp-16

=== END REPORT ===

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────────────┐
│                         pipeline-explorer                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌───────────┐ │
│  │   Compiler   │──▶│    Object    │──▶│     Link     │──▶│  Runtime  │ │
│  │   Invoker    │   │   Analyzer   │   │   Analyzer   │   │ Inspector │ │
│  └──────────────┘   └──────────────┘   └──────────────┘   └───────────┘ │
│         │                  │                  │                 │        │
│         ▼                  ▼                  ▼                 ▼        │
│  ┌──────────────────────────────────────────────────────────────────────┐│
│  │                        Report Generator                              ││
│  └──────────────────────────────────────────────────────────────────────┘│
│                                    │                                      │
│                                    ▼                                      │
│                             [Formatted Output]                            │
└─────────────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component	Responsibility	Key Decisions
Compiler Invoker	Runs gcc with flags to produce each stage artifact	Uses `-save-temps` or explicit stage flags; captures stderr for warnings
Object Analyzer	Parses ELF object files	Can use `readelf`/`objdump` output OR parse ELF directly (stretch goal)
Link Analyzer	Compares static/dynamic linking, identifies dependencies	Uses `ldd`, `nm`, compares sizes
Runtime Inspector	Launches executable under debugger, queries state	Uses GDB in batch mode with scripted commands
Report Generator	Formats analysis into readable output	Structured text, optionally JSON for machine consumption

4.3 Data Structures

// Core structures for the analyzer
#include <stddef.h>  // size_t
#include <stdint.h>  // uint64_t, uint32_t

typedef struct {
    char *name;
    size_t size;
    uint64_t address;
    char type;  // 'T', 'D', 'B', 'U', etc.
    int is_global;
} Symbol;

typedef struct {
    char *name;
    size_t size;
    uint64_t address;
    uint32_t flags;  // SHF_* flags
} Section;

typedef struct {
    uint64_t offset;
    char *symbol_name;
    int type;  // R_X86_64_* relocation type
} Relocation;

typedef struct {
    Section *sections;
    size_t section_count;
    Symbol *symbols;
    size_t symbol_count;
    Relocation *relocations;
    size_t relocation_count;
} ObjectAnalysis;

typedef struct {
    uint64_t text_start, text_end;
    uint64_t data_start, data_end;
    uint64_t heap_start;
    uint64_t stack_top;
    char **loaded_libraries;
    size_t library_count;
} RuntimeLayout;
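
Since the number of symbols is not known before parsing, the arrays in these structures have to grow as entries arrive. One possible approach, sketched below, is a doubling realloc. SymbolTable is a hypothetical cut-down container (the symbol array plus a capacity field that the structures above do not have), and deriving is_global from the case of the nm-style letter is a simplification that ignores special letters such as w and u.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *name;
    size_t size;
    uint64_t address;
    char type;       /* 'T', 'D', 'B', 'U', etc. */
    int is_global;
} Symbol;

/* Hypothetical container: symbol array, count, and allocated capacity. */
typedef struct {
    Symbol *symbols;
    size_t symbol_count;
    size_t symbol_capacity;
} SymbolTable;

/* Appends a copy of the symbol; returns 0 on success, -1 on allocation failure. */
int symbol_table_add(SymbolTable *t, const char *name,
                     uint64_t address, size_t size, char type) {
    assert(t != NULL && name != NULL);
    if (t->symbol_count == t->symbol_capacity) {
        size_t new_cap = t->symbol_capacity ? t->symbol_capacity * 2 : 8;
        Symbol *grown = realloc(t->symbols, new_cap * sizeof *grown);
        if (grown == NULL) return -1;
        t->symbols = grown;
        t->symbol_capacity = new_cap;
    }
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return -1;
    memcpy(copy, name, len);

    Symbol *s = &t->symbols[t->symbol_count++];
    s->name = copy;
    s->size = size;
    s->address = address;
    s->type = type;
    s->is_global = (type >= 'A' && type <= 'Z');  /* nm: uppercase = global */
    return 0;
}

void symbol_table_free(SymbolTable *t) {
    for (size_t i = 0; i < t->symbol_count; i++) free(t->symbols[i].name);
    free(t->symbols);
    t->symbols = NULL;
    t->symbol_count = t->symbol_capacity = 0;
}
```

The same pattern would apply to the section and relocation arrays.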

4.4 Algorithm Overview

Main Algorithm Flow:

Parse command line → determine source file and requested analyses
Invoke pipeline stages:
- Run gcc -E → capture preprocessed output
- Run gcc -S → capture assembly
- Run gcc -c → create object file
- Run gcc → create executable (both static and dynamic versions)
Analyze object file:
- Run readelf -h → header info
- Run readelf -S → section table
- Run readelf -s → symbol table
- Run readelf -r → relocations
- Parse outputs into data structures
Analyze linking:
- Compare file sizes
- Run ldd on dynamic version
- Run nm to see resolved symbols
Runtime inspection:
- Launch with gdb -batch -ex "..." executable
- Query memory mappings (info proc mappings)
- Query stack (info frame)
- Parse GDB output
Generate report → format and output all collected data

Complexity Analysis:

Time: O(n) where n is source file size, dominated by gcc invocations
Space: O(m) where m is size of largest artifact (typically the executable)

5. Implementation Guide

5.1 Development Environment Setup

# Required tools
sudo apt-get install gcc gdb binutils build-essential

# Verify installations
gcc --version
gdb --version
readelf --version
objdump --version

# Create project structure
mkdir -p pipeline-explorer/{src,include,tests,artifacts}
cd pipeline-explorer

5.2 Project Structure

pipeline-explorer/
├── src/
│   ├── main.c              # Entry point, CLI parsing
│   ├── compiler.c          # Invokes gcc stages
│   ├── object_analyzer.c   # Parses ELF via readelf/objdump
│   ├── link_analyzer.c     # Analyzes linking
│   ├── runtime_inspector.c # GDB automation
│   ├── report.c            # Output formatting
│   └── util.c              # String handling, process execution
├── include/
│   ├── pipeline.h          # Shared data structures
│   └── util.h              # Utility declarations
├── tests/
│   ├── simple.c            # Basic test case
│   ├── multifile/          # Multi-file test
│   └── expected/           # Expected outputs
├── Makefile
└── README.md

5.3 Implementation Phases

Phase 1: Foundation (Days 1-3)

Goals:

Set up project structure and build system
Implement basic pipeline invocation
Capture all stage artifacts

Tasks:

Create Makefile that builds your tool
Implement run_command() utility that executes shell commands and captures output
Implement invoke_preprocessor(), invoke_compiler(), invoke_assembler(), invoke_linker()
Verify artifacts are created in a temp directory

Checkpoint: ./pipeline-explorer hello.c produces hello.i, hello.s, hello.o, and hello in an artifacts directory.
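
The run_command() utility from the task list could be sketched with POSIX popen as follows. The signature is an assumption made for this example, not something the project prescribes. The sketch captures stdout only, so callers that also need stderr would append 2>&1 to the command string.

```c
#define _POSIX_C_SOURCE 200809L  /* popen/pclose under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Runs cmd through the shell and stores its stdout in *out as a
   malloc'd, NUL-terminated string (caller frees).
   Returns the command's exit status, or -1 on failure. */
int run_command(const char *cmd, char **out) {
    assert(cmd != NULL && out != NULL);
    *out = NULL;

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL) return -1;

    size_t len = 0, cap = 4096;
    char *buf = malloc(cap);
    if (buf == NULL) { pclose(pipe); return -1; }

    size_t n;
    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, pipe)) > 0) {
        len += n;
        if (cap - len <= 1) {            /* buffer full: double it */
            char *grown = realloc(buf, cap * 2);
            if (grown == NULL) { free(buf); pclose(pipe); return -1; }
            buf = grown;
            cap *= 2;
        }
    }
    buf[len] = '\0';

    int status = pclose(pipe);
    if (status == -1 || !WIFEXITED(status)) { free(buf); return -1; }
    *out = buf;
    return WEXITSTATUS(status);
}
```

A call such as run_command("gcc -E hello.c -o artifacts/hello.i 2>&1", &log) would then return gcc's exit status and leave any diagnostics in log.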

Phase 2: Object File Analysis (Days 4-7)

Goals:

Parse readelf output for sections, symbols, relocations
Build internal data structures
Format readable output

Tasks:

Run readelf -S and parse section table output
Run readelf -s and parse symbol table (handling local/global, types)
Run readelf -r and parse relocation entries
Connect symbols to their sections
Format and output the analysis

Checkpoint: Running on hello.o shows all sections with sizes, categorizes symbols correctly, and lists relocations.
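
As a starting point for the symbol parser, one row of readelf -s output can be split with sscanf. This sketch assumes the usual GNU readelf column order (Num, Value, Size, Type, Bind, Vis, Ndx, Name); the struct and function names are hypothetical. It does not handle very large sizes, which readelf may print in hexadecimal, and symbol names containing spaces are truncated at the first space.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t value;
    uint64_t size;
    char type[16];    /* FUNC, OBJECT, NOTYPE, ... */
    char bind[16];    /* LOCAL, GLOBAL, WEAK       */
    char ndx[16];     /* section index, UND, ABS   */
    char name[256];   /* may be empty              */
} ReadelfSymbol;

/* Returns 1 if the line is a symbol row, 0 otherwise (headers, blanks). */
int parse_readelf_symbol(const char *line, ReadelfSymbol *out) {
    assert(line != NULL && out != NULL);
    unsigned index;
    char vis[16];
    out->name[0] = '\0';  /* the name column can be missing */
    int fields = sscanf(line,
                        " %u: %" SCNx64 " %" SCNu64 " %15s %15s %15s %15s %255s",
                        &index, &out->value, &out->size,
                        out->type, out->bind, vis, out->ndx, out->name);
    return fields >= 7;
}

int symbol_is_undefined(const ReadelfSymbol *s) {
    return strcmp(s->ndx, "UND") == 0;
}

int symbol_is_global(const ReadelfSymbol *s) {
    return strcmp(s->bind, "GLOBAL") == 0;
}
```

Feeding each output line through parse_readelf_symbol and skipping the rows where it returns 0 is enough to separate the header lines from the data rows.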

Phase 3: Link Analysis (Days 8-10)

Goals:

Compare static vs dynamic linking
Identify all dependencies
Show symbol resolution

Tasks:

Create both static (-static) and dynamic versions
Use ldd to list shared library dependencies
Use nm on final executable to show resolved symbols
Calculate and compare file sizes
Explain the tradeoffs in output

Checkpoint: Output shows library dependencies, size comparison, and where undefined symbols were resolved.
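
The ldd output used in this phase is line-oriented, so a small parser is enough. The sketch below assumes the common glibc ldd format, name => path (address); the names are hypothetical. Lines without an arrow (the vDSO, the dynamic loader) and unresolved libraries reported as not found are rejected, and paths containing spaces are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char name[256];   /* e.g. libc.so.6 */
    char path[512];   /* resolved absolute path */
} LddEntry;

/* Returns 1 if the line has the form "name => /absolute/path ...", else 0. */
int parse_ldd_line(const char *line, LddEntry *out) {
    assert(line != NULL && out != NULL);
    if (sscanf(line, " %255s => %511s", out->name, out->path) != 2)
        return 0;
    return out->path[0] == '/';   /* "not found" does not start with '/' */
}

/* Convenience check, e.g. to confirm that libm was pulled in. */
int ldd_entry_is(const LddEntry *e, const char *name) {
    return strcmp(e->name, name) == 0;
}
```

For a statically linked binary, ldd reports that the file is not a dynamic executable, which this parser also rejects, so an empty dependency list is a usable signal for the static case.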

Phase 4: Runtime Inspection (Days 11-13)

Goals:

Automate GDB to pause and inspect running process
Extract memory layout
Show stack state at entry

Tasks:

Create GDB command script that sets breakpoint at main, runs, queries state
Parse info proc mappings output for memory regions
Parse info frame output for stack information
Handle GDB output parsing robustly
Integrate into final report

Checkpoint: Running on an executable shows correct memory layout matching expected regions.
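
One way to drive GDB for this phase is a single batch invocation that carries every query as an -ex argument. The helper below is a sketch with a hypothetical name. It only builds the command string, and it assumes the executable path has already been quoted for the shell.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes a gdb batch command line into buf.
   Returns 0 on success, -1 if the buffer is too small. */
int build_gdb_command(char *buf, size_t cap, const char *quoted_exe) {
    assert(buf != NULL && quoted_exe != NULL);
    assert(strlen(quoted_exe) > 0);
    int n = snprintf(buf, cap,
                     "gdb -q -batch "
                     "-ex \"break main\" "
                     "-ex \"run\" "
                     "-ex \"info proc mappings\" "
                     "-ex \"info frame\" "
                     "%s 2>&1",
                     quoted_exe);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Because all queries run in one session, GDB's startup cost is paid once. The resulting text still has to be split per command, and its exact layout differs between GDB versions.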

Phase 5: Polish & Integration (Day 14)

Goals:

Error handling for all edge cases
Clean output formatting
Testing on various inputs

Tasks:

Handle compilation errors gracefully
Test with multi-file programs
Test with C++ if time permits
Write documentation
Clean up code, remove debug output

Checkpoint: Tool handles malformed input gracefully, produces clean reports for variety of test cases.

5.4 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
ELF Parsing	Parse directly vs use readelf	Use readelf	Simpler, ELF parsing is a project unto itself
GDB Control	Expect scripts vs Python API	Expect scripts	More portable, simpler to implement
Output Format	Text only vs JSON option	Text with JSON flag	Text for humans, JSON for tooling integration
Temp Files	System temp vs local dir	Local artifacts/ dir	Easier debugging, user can inspect
Static Analysis	Required vs optional	Required with --static flag	Shows important tradeoff but not always available

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Test parsing functions in isolation	Symbol parser handles all types correctly
Integration Tests	Test full pipeline on sample programs	hello.c produces expected output
Regression Tests	Verify fixes don’t break existing behavior	Compare output to known-good baselines
Edge Cases	Handle unusual inputs	Empty file, no symbols, huge files

6.2 Critical Test Cases

Minimal Program:
```
int main() { return 0; }
```
Expected: The object file has no undefined symbols (_start comes from the C runtime startup files and is only added at link time), minimal sections

External Dependencies:

#include <stdio.h>
#include <math.h>
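int main(int argc, char **argv) { (void)argv; printf("%f\n", sin((double)argc)); return 0; }

Expected: Shows printf, sin as undefined → resolved from libc, libm (link with -lm; the argument is non-constant because GCC may evaluate sin of a constant at compile time and drop the call)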

Global Variables:

int initialized = 42;
int uninitialized;
const char *message = "hello";
int main() { return initialized + uninitialized; }

Expected: Shows .data, .bss, .rodata usage correctly

Multi-file:

// main.c
extern int helper(int);
int main() { return helper(5); }

// helper.c
int helper(int x) { return x * 2; }

Expected: Shows symbol resolution between files

Static vs Dynamic: Same program linked both ways, compare sizes and dependencies

6.3 Test Data

# Create test suite
mkdir -p tests

# Test 1: Minimal
echo 'int main() { return 0; }' > tests/minimal.c

# Test 2: Printf
echo '#include <stdio.h>
int main() { printf("hello\n"); return 0; }' > tests/printf.c

# Test 3: Global data
echo 'int x = 1; int y; const char *s = "hi";
int main() { return x + y; }' > tests/globals.c

# Expected output patterns to grep for
# minimal: no .data, minimal .text
# printf: printf in undefined symbols
# globals: all three sections populated

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Not handling spaces in paths	Tool fails on “/home/user/My Programs/test.c”	Quote all paths in shell commands
Parsing readelf wrong	Missing symbols or wrong types	Test parser on diverse binaries first
GDB version differences	Commands fail on different systems	Test GDB commands standalone first
Forgetting -g for debug info	No source line mapping	Always compile with -g for debug builds
Ignoring stderr	Silent failures	Capture and check stderr from all tools
Hardcoded paths	Works on your machine only	Resolve tools through PATH (e.g., execvp or the shell) instead of absolute paths
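
For the first pitfall in the table, quoting a path for a POSIX shell can be done by wrapping it in single quotes and rewriting each embedded single quote as '\''. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd, single-quoted copy of path that a POSIX shell
   will treat as one word. Returns NULL on allocation failure. */
char *shell_quote(const char *path) {
    assert(path != NULL);

    size_t quotes = 0;
    for (const char *p = path; *p; p++)
        if (*p == '\'') quotes++;

    /* Each ' becomes '\'' (3 extra bytes); add 2 enclosing quotes and a NUL. */
    char *out = malloc(strlen(path) + quotes * 3 + 3);
    if (out == NULL) return NULL;

    char *w = out;
    *w++ = '\'';
    for (const char *p = path; *p; p++) {
        if (*p == '\'') {
            memcpy(w, "'\\''", 4);   /* close quote, escaped quote, reopen */
            w += 4;
        } else {
            *w++ = *p;
        }
    }
    *w++ = '\'';
    *w = '\0';
    return out;
}
```

An alternative that avoids quoting altogether is to skip the shell and pass arguments as an array to fork and execvp, at the cost of handling the pipes yourself.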

7.2 Debugging Strategies

Print intermediate states: After each parsing step, dump the data structure
Test tools independently: Before integrating, verify readelf, objdump, gdb produce expected output
Use simple inputs first: Get minimal.c working perfectly before complex cases
Compare with manual: Run tools manually and compare to your parsed output
Binary diff for determinism: Same input should produce byte-identical artifacts (ignoring timestamps)

7.3 Performance Traps

Spawning too many processes: Batch related queries (one readelf call with multiple flags)
Reading huge outputs: For large binaries, process output line-by-line, don’t load all into memory
GDB startup time: GDB is slow to start; for multiple queries, use one session with multiple commands

8. Extensions & Challenges

8.1 Beginner Extensions

Add JSON output: --format json for machine-readable output
Colorized output: Highlight different symbol types, sections
Verbose mode: Show actual readelf/objdump commands being run

8.2 Intermediate Extensions

Direct ELF parsing: Read ELF format directly without external tools
Disassembly integration: Show assembly for specific functions
Diff mode: Compare two binaries’ symbols and sections
Cross-compilation support: Analyze ARM/RISC-V binaries

8.3 Advanced Extensions

Debug info parsing: Read DWARF to show full source mapping
Dynamic analysis: Use ptrace to trace actual symbol resolutions at runtime
Optimization analysis: Show what changes between -O0, -O1, -O2, -O3
Security audit: Check for common vulnerabilities (no stack canary, RELRO, etc.)

9. Real-World Connections

9.1 Industry Applications

Build systems (Bazel, CMake): Understand these pipelines to debug build failures
Package managers (apt, rpm): Know how shared libraries are managed
Containerization (Docker): Static linking simplifies container images
Embedded systems: Every byte counts; understanding sections is crucial
Reverse engineering: This is step 1 of analyzing any binary

9.2 Related Tools and Projects

binutils: The tools you’re wrapping (readelf, objdump, nm)
LLVM: Alternative compiler with similar pipeline concepts
pwntools: Python library for similar analysis (CTF-oriented)
Ghidra/radare2: Advanced binary analysis tools
libelf: Library for direct ELF parsing

9.3 Interview Relevance

This project prepares you for questions like:

“Walk me through what happens when you compile and run a C program”
“What’s the difference between static and dynamic linking?”
“How would you debug a segfault at address X?”
“Explain the process memory layout”
“What are symbols and relocations in object files?”

10. Resources

10.1 Essential Reading

CS:APP Chapter 1: “A Tour of Computer Systems” - Overview of the pipeline
CS:APP Chapter 7: “Linking” - Deep dive into object files and linking
“Linkers and Loaders” by John Levine: The definitive book on linking
System V ABI: Official specification for ELF format and calling conventions

10.2 Video Resources

MIT 6.004 lectures on compilation and linking
“How do compilers work?” on Computerphile YouTube channel
LiveOverflow series on binary exploitation (covers ELF in depth)

10.3 Tools & Documentation

readelf(1): man readelf - ELF file display
objdump(1): man objdump - Object file disassembly
nm(1): man nm - Symbol listing
ldd(1): man ldd - Shared library dependencies
ld(1): man ld - Linker documentation

Previous: None (this is the foundation project)
Next: P2 (Bitwise Data Inspector) builds on your understanding of how data is represented; P10 (ELF Link Map) goes deeper into linking

11. Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

I can explain each of the 4 pipeline stages without looking at notes
I can list the main ELF sections and what goes in each
I can explain what relocations are and why they’re needed
I understand the difference between static and dynamic linking
I can draw the process memory layout and explain each region

Implementation

Tool correctly identifies all pipeline artifacts
Symbol parsing correctly categorizes defined/undefined, local/global
Section sizes match what readelf reports
Runtime inspection shows correct memory regions
Error handling works for malformed input

Growth

I debugged at least one issue by examining the actual binary
I can now read assembly output and connect it to C source
I’m comfortable using readelf, objdump, nm, ldd, gdb

12. Submission / Completion Criteria

Minimum Viable Completion:

Pipeline stage artifacts are captured and reported
Object file sections and symbols are analyzed
Basic output formatting works
Runs without crashing on valid input

Full Completion:

All analysis modes work (pipeline, object, link, runtime)
Static vs dynamic comparison works
GDB integration shows memory layout
Clean error handling
Tested on multiple input files

Excellence (Going Above & Beyond):

Direct ELF parsing (no readelf dependency)
Cross-platform support (macOS)
JSON output format
Crash address analysis mode
DWARF debug info parsing

13. Real World Outcome

When you complete this project, here’s exactly what you’ll see when running your tool:

$ ./pipeline-explorer hello.c

=== PREPROCESSING STAGE ===
Input:  hello.c (42 lines)
Output: hello.i (1,847 lines)
Time:   0.023s

Macros expanded:
  MAX       → 100
  DEBUG     → 1
  VERSION   → "1.0.0"

Headers included:
  stdio.h     → /usr/include/stdio.h (925 lines)
  stdlib.h    → /usr/include/stdlib.h (743 lines)
  limits.h    → /usr/include/limits.h (179 lines)

=== COMPILATION STAGE ===
Input:  hello.i (1,847 lines)
Output: hello.s (89 lines)
Time:   0.156s

Functions generated:
  main         @ .text (32 instructions)
  helper       @ .text (18 instructions)

String literals found:
  .LC0: "Hello, %s!\n"
  .LC1: "Max value is %d"

=== ASSEMBLY STAGE ===
Input:  hello.s (89 lines)
Output: hello.o (2,456 bytes)
Time:   0.012s

ELF Sections:
  .text     : 168 bytes (executable code)
  .rodata   : 42 bytes  (read-only strings)
  .data     : 8 bytes   (initialized globals)
  .bss      : 16 bytes  (uninitialized globals)
  .symtab   : 312 bytes (symbol table)
  .strtab   : 89 bytes  (string table)
  .rela.text: 72 bytes  (relocations)

Symbol Table:
  GLOBAL DEFINED:
    main          T  0x0000000000000000  (32 bytes)
    global_count  D  0x0000000000000000  (8 bytes)

  LOCAL DEFINED:
    helper        t  0x0000000000000020  (18 bytes)
    .LC0          r  0x0000000000000000  (string)

  UNDEFINED (need linking):
    printf        U  (from libc)
    exit          U  (from libc)
    malloc        U  (from libc)

Relocations (4 entries):
  Offset     Type              Symbol
  0x00000015 R_X86_64_PLT32    printf
  0x00000024 R_X86_64_PC32     .LC0
  0x0000002f R_X86_64_PLT32    exit
  0x0000003a R_X86_64_PLT32    malloc

=== LINKING STAGE ===
Input:  hello.o + libc.so.6
Output: hello (16,432 bytes)
Time:   0.089s

Static vs Dynamic Comparison:
  Dynamic: 16,432 bytes  (uses shared libc)
  Static:  823,456 bytes (libc copied in)
  Ratio:   50x larger when static!

Shared Library Dependencies:
  libc.so.6           → /lib/x86_64-linux-gnu/libc.so.6
  ld-linux-x86-64.so.2 → /lib64/ld-linux-x86-64.so.2

Symbol Resolution:
  printf  → libc.so.6::printf   (PLT stub at 0x401030)
  exit    → libc.so.6::exit     (PLT stub at 0x401040)
  malloc  → libc.so.6::malloc   (PLT stub at 0x401050)

=== RUNTIME LAYOUT (paused at main) ===
Memory Map:
  0x00400000 - 0x00401000  r-xp  .text (code)
  0x00401000 - 0x00402000  r--p  .rodata
  0x00403000 - 0x00404000  rw-p  .data + .bss
  0x00404000 - 0x00425000  rw-p  [heap]
  0x7ffffffde000 - 0x7ffffffff000  rw-p  [stack]

Stack at main():
  Address           Value              Meaning
  0x7fffffffdde8    0x00007ffff7a2d840 return addr → __libc_start_main+234
  0x7fffffffdde0    0x00007fffffffde00 saved %rbp
  0x7fffffffddd8    0x0000000000000001 argc = 1
  0x7fffffffddd0    0x00007fffffffdeb8 argv pointer

=== PIPELINE COMPLETE ===
Total time: 0.280s
All artifacts saved to: ./artifacts/hello/

14. The Core Question You’re Answering

“When I type gcc hello.c -o hello and press Enter, what ACTUALLY happens inside my computer before I can run ./hello?”

This project transforms “the compiler does stuff” into a complete mental model of four distinct transformations, each with its own tools, formats, and purpose. You’ll understand why linking errors are different from compile errors, why symbols matter, and how your code becomes bytes that the CPU executes.

15. Concepts You Must Understand First

Before starting this project, ensure you understand these concepts:

Concept	Why It Matters	Where to Learn
C preprocessor directives (#include, #define)	You’ll trace how these expand	CS:APP 1.2, any C book Ch. 1
Basic assembly syntax (mov, call, ret)	You’ll read compiler output	CS:APP 3.1-3.4
What a function call does at the machine level	You’ll see call/ret in action	CS:APP 3.7
Hexadecimal notation	Object files are full of hex	CS:APP 2.1
What “address” means for code and data	Linking assigns addresses	CS:APP 1.4, 7.1
Difference between source, object, and executable files	Core of this project	CS:APP 1.2, Chapter 7 intro

16. Questions to Guide Your Design

Work through these questions BEFORE writing code:

Input/Output: What does your tool take as input? What outputs should it produce? Where should artifacts be saved?
Stage Separation: How will you invoke each stage of gcc separately? What flags control this?
Output Parsing: Tools like readelf and objdump produce text output. How will you parse this reliably? What about different versions?
Error Handling: What if the C file doesn’t compile? What if readelf isn’t installed? How do you report errors helpfully?
GDB Automation: How can you script GDB to pause at main and query process state? What’s the output format?
Static Linking: On modern systems, static linking may not work or may require special packages. How will you handle this?
Report Format: How will you structure the output? Text? JSON? Should sections be optional?

17. Thinking Exercise

Before writing any code, trace through this by hand:

Take this simple C program:

#include <stdio.h>
#define MSG "Hello"
int count = 0;
int main() {
    printf("%s: %d\n", MSG, count);
    return 0;
}

Exercise: On paper, answer:

Preprocessing: What does the file look like after #include and #define are processed? How many lines? What happened to MSG?
Compilation: How many functions will be in the assembly? What strings will be in .rodata? Where is count stored?
Assembly: What sections will the object file have? What symbols are defined? What symbols are undefined?
Linking: Where does the definition of printf come from? What happens if we link statically vs. dynamically?
Runtime: When the process starts, where in memory is count? Where is the string “Hello”? Where is printf’s code?

Verify your answers by running the actual commands (gcc -E, gcc -S, objdump, etc.) before implementing your tool.
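For the Runtime question, one way to check your paper answers is to have a program print where its own objects live and compare the addresses against the output of info proc mappings. The following is a minimal sketch, not part of the tool itself; the function name print_layout is an assumption made for illustration, and the section names in the comments describe typical GCC/Linux behavior rather than a guarantee.

```c
#include <assert.h>
#include <stdio.h>

int count = 0;   /* zero-initialized global: typically placed in .bss */

/* Hypothetical helper: prints one line per object of interest and
 * returns the number of lines successfully written. */
int print_layout(FILE *out)
{
    const char *msg = "Hello";  /* the literal itself typically lives in .rodata */
    int local = 0;              /* automatic variable: lives on the stack */
    int lines = 0;

    assert(out != NULL);

    lines += fprintf(out, "global count     at %p\n", (void *)&count) > 0;
    lines += fprintf(out, "literal \"Hello\"  at %p\n", (void *)msg) > 0;
    lines += fprintf(out, "local variable   at %p\n", (void *)&local) > 0;
    /* Casting a function pointer to void * is a widely supported
     * extension (POSIX relies on it) but is not strict ISO C. */
    lines += fprintf(out, "printf           at %p\n", (void *)printf) > 0;

    return lines;
}
```

Call print_layout(stdout) from main and match each address to a mapping. In a dynamically linked, non-PIE build the address printed for printf may be a PLT stub inside the executable rather than code inside libc, which is itself a useful observation for the Linking question.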

18. The Interview Questions They’ll Ask

After completing this project, you’ll be ready for these common interview questions:

“Walk me through what happens when you compile a C program.”
- Expected: Explain all 4 stages with specific tools (cpp, cc1, as, ld)
- Bonus: Mention intermediate file formats and what changes at each stage
“What’s the difference between a compile error and a link error?”
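- Expected: Compile errors are syntax/type issues found while translating one source file; link errors are unresolved or multiply defined symbols found when object files are combined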
- Bonus: Give example of each (“undeclared identifier” vs “undefined reference”)
“Explain static vs. dynamic linking. When would you use each?”
- Expected: Trade-offs of binary size, dependencies, update flexibility
- Bonus: Mention PLT/GOT, security implications (ASLR), and LGPL considerations
“What’s in an object file?”
- Expected: Machine code, data, symbol table, relocation entries
- Bonus: Explain why relocations exist (addresses aren’t known until linking)
“How would you debug a ‘segfault at address 0x401234’?”
- Expected: Use objdump/readelf/nm to find which section and function contain that address, and check whether it is in .text
- Bonus: Explain how to map back to a source line using debug info (for example, addr2line -e on a binary built with -g)
“What does ‘position-independent code’ mean?”
- Expected: Code that works regardless of where it’s loaded in memory
- Bonus: Explain how this relates to shared libraries and security (ASLR)
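The lookup behind the “segfault at address 0x401234” question can be sketched as a range search over function symbols: an address belongs to a function when it falls inside [value, value + size). The struct, the function names, and the sample addresses used in the usage note are hypothetical; in your tool the table would be filled from parsed readelf -s or nm output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for one FUNC symbol taken from the symbol table. */
struct func_sym {
    const char *name;
    unsigned long long addr;  /* symbol value: start address */
    unsigned long long size;  /* symbol size in bytes */
};

/* Return the symbol whose address range contains addr, or NULL. */
const struct func_sym *find_symbol(const struct func_sym *table, size_t n,
                                   unsigned long long addr)
{
    size_t i;

    assert(table != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* The end of the range is exclusive: addr + size is the next byte. */
        if (addr >= table[i].addr && addr - table[i].addr < table[i].size)
            return &table[i];
    }
    return NULL;
}

/* Format "0x... is name+0xoffset" into buf; returns snprintf's result. */
int format_location(const struct func_sym *table, size_t n,
                    unsigned long long addr, char *buf, size_t size)
{
    const struct func_sym *s = find_symbol(table, n, addr);

    if (s == NULL)
        return snprintf(buf, size, "0x%llx: not inside any known function", addr);
    return snprintf(buf, size, "0x%llx is %s+0x%llx",
                    addr, s->name, addr - s->addr);
}
```

With a made-up table containing helper at 0x401200 with size 0x40, the address 0x401234 resolves to helper at offset 0x34. A linear scan is enough for small binaries; sorting by address and using binary search is an option if the table grows large.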

19. Hints in Layers

If you’re stuck, reveal hints one at a time:

Hint 1: Getting Started

Start with just capturing the pipeline artifacts. Before parsing anything, make sure you can run:

gcc -E source.c -o source.i
gcc -S source.c -o source.s
gcc -c source.c -o source.o
gcc source.c -o executable

Then verify that each output file exists and is non-empty. Your tool essentially automates these four commands and adds analysis on top.
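The four commands above can be driven from a small stage table. This is a minimal sketch under stated assumptions: the struct, the STAGES table, and the function names are hypothetical, and it uses system() for brevity, which passes the command through the shell and is therefore unsafe for file names containing spaces or shell metacharacters (fork plus execvp avoids that).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stage table: one gcc flag and output suffix per stage. */
struct stage {
    const char *name;
    const char *flag;    /* empty string means full compile and link */
    const char *suffix;  /* appended to the output base name */
};

const struct stage STAGES[] = {
    { "preprocess", "-E", ".i" },
    { "compile",    "-S", ".s" },
    { "assemble",   "-c", ".o" },
    { "link",       "",   ""   },
};
enum { NUM_STAGES = sizeof STAGES / sizeof STAGES[0] };

/* Write the gcc command for one stage into buf.
 * Returns 0 on success, -1 if the command did not fit. */
int stage_command(const struct stage *st, const char *src, const char *base,
                  char *buf, size_t size)
{
    int n;

    assert(st != NULL && src != NULL && base != NULL && buf != NULL);
    if (strlen(st->flag) > 0)
        n = snprintf(buf, size, "gcc %s %s -o %s%s", st->flag, src, base, st->suffix);
    else
        n = snprintf(buf, size, "gcc %s -o %s", src, base);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* True when path names a regular, non-empty file. */
int artifact_exists(const char *path)
{
    struct stat sb;

    return stat(path, &sb) == 0 && S_ISREG(sb.st_mode) && sb.st_size > 0;
}

/* Run every stage in order; returns how many stages completed. */
int run_pipeline(const char *src, const char *base)
{
    char cmd[1024], out[512];
    int i;

    for (i = 0; i < NUM_STAGES; i++) {
        if (stage_command(&STAGES[i], src, base, cmd, sizeof cmd) != 0)
            break;
        if (system(cmd) != 0) {
            fprintf(stderr, "stage '%s' failed: %s\n", STAGES[i].name, cmd);
            break;
        }
        snprintf(out, sizeof out, "%s%s", base, STAGES[i].suffix);
        if (!artifact_exists(out)) {
            fprintf(stderr, "stage '%s' produced no artifact: %s\n", STAGES[i].name, out);
            break;
        }
    }
    return i;
}
```

Calling run_pipeline("source.c", "source") from main should return 4 when every stage succeeds. Keeping command construction separate from execution makes the command strings easy to unit test without invoking gcc.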

Hint 2: Parsing readelf Output

Don’t try to parse the raw binary ELF format yourself (that’s a project unto itself). Use readelf with flags:

readelf -h file.o - ELF header
readelf -S file.o - Section headers
readelf -s file.o - Symbol table
readelf -r file.o - Relocations

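Parse the text output line by line. Split on whitespace or use a regex rather than relying on fixed column positions, which can shift between readelf versions. Adding -W (wide output) keeps each entry on a single line.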
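As a sketch of line-by-line parsing, the function below reads one row of readelf -s output with sscanf and rejects header and banner lines. The struct and function names are hypothetical, and the column order (Num, Value, Size, Type, Bind, Vis, Ndx, Name) is an assumption based on common GNU readelf output; check it against your installed version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one row of `readelf -s` output. */
struct elf_symbol {
    int num;
    unsigned long long value;
    long long size;
    char type[16], bind[16], vis[16], ndx[16];
    char name[256];   /* empty for unnamed symbols such as entry 0 */
};

/* Parse one line such as
 *   "     9: 0000000000000000    30 FUNC    GLOBAL DEFAULT    1 main"
 * Returns 1 if the line is a symbol row, 0 for headers and other text. */
int parse_symbol_line(const char *line, struct elf_symbol *sym)
{
    int n;

    assert(line != NULL && sym != NULL);
    memset(sym, 0, sizeof *sym);

    /* %lli accepts decimal sizes as well as sizes printed with a 0x prefix. */
    n = sscanf(line, " %d: %llx %lli %15s %15s %15s %15s %255s",
               &sym->num, &sym->value, &sym->size,
               sym->type, sym->bind, sym->vis, sym->ndx, sym->name);

    /* Seven fields means a row without a name; eight means a named row. */
    return n >= 7;
}

/* A named symbol with section index UND must be resolved by the linker. */
int symbol_is_undefined(const struct elf_symbol *sym)
{
    assert(sym != NULL);
    return strcmp(sym->ndx, "UND") == 0 && sym->name[0] != '\0';
}
```

Feed each line read from the readelf output to parse_symbol_line and keep the rows that return 1. For hello.o, printf should be reported as undefined while main is defined in a numbered section.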

Hint 3: GDB Scripting

Create a file gdb_script.txt:

set pagination off
break main
run
info proc mappings
info registers
info frame
quit

Run with: gdb -batch -x gdb_script.txt ./executable

The output is text you can capture and parse.
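Capturing and parsing that output might look like the sketch below, which runs a command with popen and extracts the rows of info proc mappings. All names here are hypothetical. It assumes each mapping row starts with four hexadecimal fields (start, end, size, offset), optionally followed by a permissions column that only some GDB versions print, and then the object file; paths containing spaces are cut at the first space.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen and pclose */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one row of `info proc mappings`. */
struct mapping {
    unsigned long long start, end, size, offset;
    char perms[8];      /* empty when no permissions column is printed */
    char objfile[256];  /* empty for anonymous mappings */
};

/* A permissions token looks like "r-xp": four characters from "rwxps-". */
static int looks_like_perms(const char *tok)
{
    return strlen(tok) == 4 && strspn(tok, "rwxps-") == 4;
}

/* Returns 1 if line is a mapping row, 0 for any other GDB output. */
int parse_mapping_line(const char *line, struct mapping *m)
{
    char tok[256];
    int used = -1;

    assert(line != NULL && m != NULL);
    memset(m, 0, sizeof *m);

    if (sscanf(line, " %llx %llx %llx %llx %n",
               &m->start, &m->end, &m->size, &m->offset, &used) != 4 || used < 0)
        return 0;
    /* Sanity check that filters out unrelated lines containing hex numbers. */
    if (m->end <= m->start || m->end - m->start != m->size)
        return 0;

    line += used;
    if (sscanf(line, "%255s %n", tok, &used) != 1)
        return 1;                       /* neither perms nor objfile */
    if (looks_like_perms(tok)) {
        strcpy(m->perms, tok);
        line += used;
        if (sscanf(line, "%255s", tok) != 1)
            return 1;                   /* anonymous mapping */
    }
    strcpy(m->objfile, tok);
    return 1;
}

/* Run a command (e.g. the gdb -batch line above) and print its mappings.
 * Returns the number of mappings found, or -1 on failure. */
int dump_mappings(const char *command, FILE *out)
{
    char line[1024];
    struct mapping m;
    int found = 0;
    FILE *p = popen(command, "r");

    if (p == NULL)
        return -1;
    while (fgets(line, sizeof line, p) != NULL) {
        if (parse_mapping_line(line, &m)) {
            fprintf(out, "%#llx-%#llx %-4s %s\n", m.start, m.end,
                    m.perms[0] ? m.perms : "?",
                    m.objfile[0] ? m.objfile : "[anonymous]");
            found++;
        }
    }
    return pclose(p) == -1 ? -1 : found;
}
```

For example, dump_mappings("gdb -batch -x gdb_script.txt ./executable 2>&1", stdout) prints one summary line per mapping. Checking that end minus start equals the size column is a cheap way to avoid mistaking register or breakpoint lines for mapping rows.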

Hint 4: Static Linking Issues

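On modern systems, fully static linking may fail, often because the static C library (libc.a) is not installed. Make this an optional feature: your tool should report “static linking unavailable” rather than crash.

To attempt a static link: gcc -static source.c -o executable_static

If the link fails because libc.a is missing, install the static library. On Ubuntu/Debian it ships in libc6-dev (sudo apt-get install libc6-dev); on Fedora/RHEL it is in the separate glibc-static package.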
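One way to make the static step optional is to classify the status returned by system() instead of treating every non-zero result as fatal. This is a sketch under assumptions: the enum, the function names, and the messages are hypothetical, and exit status 127 is interpreted according to the usual shell convention for “command not found”.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical outcome codes for the optional static-link step. */
enum link_result {
    LINK_OK,           /* gcc -static succeeded */
    LINK_UNAVAILABLE,  /* gcc ran but the static link failed */
    LINK_NO_COMPILER,  /* the shell could not find or run gcc */
    LINK_SPAWN_ERROR   /* system() itself failed */
};

/* Translate a status value returned by system() into an outcome. */
enum link_result classify_status(int status)
{
    if (status == -1)
        return LINK_SPAWN_ERROR;
    if (!WIFEXITED(status))
        return LINK_UNAVAILABLE;      /* terminated by a signal */
    switch (WEXITSTATUS(status)) {
    case 0:   return LINK_OK;
    case 127: return LINK_NO_COMPILER;
    default:  return LINK_UNAVAILABLE;
    }
}

/* Attempt the static link; linker diagnostics are discarded here. */
enum link_result try_static_link(const char *src, const char *out)
{
    char cmd[1024];
    int n;

    assert(src != NULL && out != NULL);
    n = snprintf(cmd, sizeof cmd, "gcc -static %s -o %s 2>/dev/null", src, out);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return LINK_SPAWN_ERROR;
    return classify_status(system(cmd));
}

/* Human-readable message for the report. */
const char *link_result_message(enum link_result r)
{
    switch (r) {
    case LINK_OK:          return "static executable built";
    case LINK_UNAVAILABLE: return "static linking unavailable (is the static libc installed?)";
    case LINK_NO_COMPILER: return "gcc not found on PATH";
    default:               return "could not start the compiler";
    }
}
```

The report section for static linking can then print link_result_message(try_static_link("source.c", "executable_static")) and continue with the remaining stages whatever the outcome.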

20. Books That Will Help

Topic	Book	Chapter/Section
Overview of compilation	CS:APP 3rd Ed	Chapter 1.2 “Programs Are Translated by Other Programs into Different Forms”
Detailed compilation stages	CS:APP 3rd Ed	Chapter 1.3 “It Pays to Understand How Compilation Systems Work”
Object file formats	CS:APP 3rd Ed	Chapter 7.3-7.5 “Object Files”, “Relocatable Object Files”, “Symbols and Symbol Tables”
Symbol resolution	CS:APP 3rd Ed	Chapter 7.6 “Symbol Resolution”
Relocation	CS:APP 3rd Ed	Chapter 7.7 “Relocation”
Static vs Dynamic linking	CS:APP 3rd Ed	Chapter 7.10-7.11 “Dynamic Linking with Shared Libraries”, “Loading and Linking Shared Libraries from Applications”
Process memory layout	CS:APP 3rd Ed	Chapter 7.9 “Loading Executable Object Files”
Linker internals (deep dive)	Linkers & Loaders by John Levine	Chapters 1-4
ELF format details	System V ABI	Chapter 4-5 (ELF specification)
GNU toolchain	GCC Manual	Section on “Overall Options” for -E, -S, -c

This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.
