Project 1: "Hello, Toolchain" — Build Pipeline Explorer

Build a CLI tool that reveals every transformation your C code undergoes from source to running process.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C (Alternatives: Rust, Zig, C++) |
| Prerequisites | Basic C, comfort with build tools, basic debugging |
| Key Topics | Compilation pipeline, object files, symbols, linking, process layout |
| CS:APP Chapters | 1, 7 |

1. Learning Objectives

By completing this project, you will:

  1. Trace the complete compilation pipeline: Explain what happens at each stage from .c to running process
  2. Read object file metadata: Parse and interpret ELF section headers, symbol tables, and relocation entries
  3. Understand static vs dynamic linking: Predict how symbol resolution differs and demonstrate the tradeoffs
  4. Map runtime memory layout: Connect debugger output to the theoretical process address space model
  5. Debug using binary artifacts: Given a crash address, trace back through the pipeline to find the source
  6. Use professional tooling fluently: Master gcc, objdump, readelf, nm, ldd, and debuggers

2. Theoretical Foundation

2.1 The Compilation Pipeline

When you type gcc hello.c -o hello, you’re invoking not one tool but four distinct programs, each performing a specific transformation:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   hello.c   │────▶│   hello.i   │────▶│   hello.s   │────▶│   hello.o   │────▶│   hello     │
│   (source)  │ cpp │(preprocessed)│ cc1 │ (assembly)  │ as  │  (object)   │ ld  │(executable) │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
     Text              Text               Text              Binary             Binary

Stage 1: Preprocessing (cpp)

The C preprocessor handles all directives starting with #:

// Before preprocessing
#include <stdio.h>
#define MAX 100

int main() {
    printf("Max is %d\n", MAX);
}

// After preprocessing (hello.i) - simplified
// ... thousands of lines from stdio.h ...
typedef struct _IO_FILE FILE;
extern int printf(const char *__format, ...);
// ... more declarations ...

int main() {
    printf("Max is %d\n", 100);  // MAX replaced with 100
}

Key insight: The preprocessor is a text-to-text transformation. It knows nothing about C syntax—it just does text substitution and file inclusion.

Run: gcc -E hello.c -o hello.i
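
To see why the text-to-text nature matters, consider macro arguments: the preprocessor pastes them in without regard to operator precedence. The sketch below is illustrative only; the macro and function names are invented for this example and are not part of the project.

```c
#include <assert.h>  /* static_assert (C11) */

/* Illustrative macros (hypothetical names, not from the project). */
#define SQUARE_UNSAFE(x) x * x          /* argument pasted as raw text */
#define SQUARE_SAFE(x)   ((x) * (x))    /* parentheses preserve grouping */

/* Compile-time checks of what the expanded text evaluates to. */
static_assert(SQUARE_UNSAFE(1 + 2) == 5, "expands to 1 + 2 * 1 + 2");
static_assert(SQUARE_SAFE(1 + 2) == 9, "expands to ((1 + 2) * (1 + 2))");

/* After preprocessing this body reads: return 1 + 2 * 1 + 2; */
int square_unsafe_demo(void) { return SQUARE_UNSAFE(1 + 2); }

/* After preprocessing this body reads: return ((1 + 2) * (1 + 2)); */
int square_safe_demo(void) { return SQUARE_SAFE(1 + 2); }
```

Running gcc -E on a file like this shows the expanded expressions directly, which is a quick way to confirm what the compiler proper receives.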

Stage 2: Compilation (cc1)

The compiler proper transforms C into assembly. This is where:

  • Syntax is parsed into an Abstract Syntax Tree
  • Type checking occurs
  • Optimizations are applied
  • Code is generated for the target architecture

# hello.s (simplified x86-64)
    .section .rodata
.LC0:
    .string "Max is %d\n"

    .text
    .globl main
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $100, %esi        # Second argument (MAX value)
    leaq    .LC0(%rip), %rdi  # First argument (format string)
    movl    $0, %eax          # %al = 0: no vector registers used by this variadic call
    call    printf@PLT
    movl    $0, %eax          # return 0
    popq    %rbp
    ret

Run: gcc -S hello.c -o hello.s

Stage 3: Assembly (as)

The assembler converts human-readable assembly to machine code, producing an object file (.o). This is the first binary stage.

Object files contain:

  • Machine code (the .text section)
  • Data (.data, .rodata, .bss sections)
  • Symbol table (what symbols are defined/referenced)
  • Relocation entries (addresses to fix up during linking)

Run: gcc -c hello.c -o hello.o

Stage 4: Linking (ld)

The linker combines object files and libraries into an executable:

  1. Symbol resolution: Match each symbol reference to exactly one definition
  2. Relocation: Fix up addresses now that final layout is known
  3. Layout: Organize sections into a loadable format

┌─────────────────────────────────────────────────────────────┐
│                    Linker's Job                             │
├─────────────────────────────────────────────────────────────┤
│  Input: hello.o + libc.so (or libc.a)                       │
│                                                             │
│  1. Find definition for 'printf' → in libc                  │
│  2. Calculate final addresses for all symbols               │
│  3. Patch the call to printf (PLT stub or final address)    │
│  4. Create program headers for OS loader                    │
│  5. Output: Executable ELF file                             │
└─────────────────────────────────────────────────────────────┘
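
The symbol resolution step can be sketched as a lookup over all known definitions. This is a deliberately simplified model under stated assumptions: the Definition type and resolve_symbol function are hypothetical, and real linkers also handle weak symbols, archive search order, and shared-library interposition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one symbol definition seen by a linker. */
typedef struct {
    const char *name;    /* symbol name, e.g. "printf" */
    const char *source;  /* where it is defined, e.g. "libc.so.6" */
} Definition;

/* Result codes for resolve_symbol(). */
enum { RESOLVE_UNDEFINED = -1, RESOLVE_DUPLICATE = -2 };

/*
 * Returns the index of the single definition matching `name`,
 * RESOLVE_UNDEFINED if none exists (an "undefined reference" error), or
 * RESOLVE_DUPLICATE if more than one exists (a "multiple definition" error).
 */
int resolve_symbol(const Definition *defs, size_t count, const char *name)
{
    assert(defs != NULL && name != NULL);
    int found = RESOLVE_UNDEFINED;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(defs[i].name, name) != 0)
            continue;
        if (found != RESOLVE_UNDEFINED)
            return RESOLVE_DUPLICATE;  /* second match for the same name */
        found = (int)i;
    }
    return found;
}
```

The two error codes correspond to the two link failures you are most likely to meet in practice: a reference with no definition, and a name defined in more than one input file.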

2.2 ELF Object File Format

ELF (Executable and Linkable Format) is the standard binary format on Linux/Unix. Understanding it is crucial:

┌─────────────────────────────────────┐
│           ELF Header                │  Magic number, architecture, entry point
├─────────────────────────────────────┤
│       Program Headers               │  How to load into memory (for executables)
├─────────────────────────────────────┤
│         .text                       │  Executable code
├─────────────────────────────────────┤
│         .rodata                     │  Read-only data (string literals, constants)
├─────────────────────────────────────┤
│         .data                       │  Initialized global/static variables
├─────────────────────────────────────┤
│         .bss                        │  Uninitialized global/static (zeroed at load)
├─────────────────────────────────────┤
│         .symtab                     │  Symbol table
├─────────────────────────────────────┤
│         .strtab                     │  String table (names for symbols)
├─────────────────────────────────────┤
│         .rela.text                  │  Relocation entries for .text (.rel.text on 32-bit x86)
├─────────────────────────────────────┤
│       Section Headers               │  Describes all sections
└─────────────────────────────────────┘
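
As a small illustration of the header's role, the sketch below classifies a file from its first 18 bytes. It assumes a 64-bit little-endian ELF file and hardcodes the offsets and values that <elf.h> names EI_CLASS, EI_DATA, ET_REL, ET_EXEC, and ET_DYN; elf_kind is a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stddef.h>

/* Offsets and values from the ELF specification (also defined in <elf.h>). */
enum {
    EI_CLASS_OFFSET = 4,   /* e_ident[EI_CLASS] */
    EI_DATA_OFFSET  = 5,   /* e_ident[EI_DATA] */
    E_TYPE_OFFSET   = 16,  /* 16-bit e_type follows the 16-byte e_ident */
    CLASS_64        = 2,   /* ELFCLASS64 */
    DATA_LSB        = 1    /* ELFDATA2LSB (little-endian) */
};

/*
 * Classifies a buffer holding the first bytes of a file.
 * Returns "relocatable", "executable", "shared object", "other ELF",
 * or "not ELF64-LE" when the header is not a 64-bit little-endian ELF header.
 */
const char *elf_kind(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 18 ||
        buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F' ||
        buf[EI_CLASS_OFFSET] != CLASS_64 || buf[EI_DATA_OFFSET] != DATA_LSB)
        return "not ELF64-LE";

    /* e_type is stored little-endian: low byte first. */
    unsigned type = buf[E_TYPE_OFFSET] | (unsigned)buf[E_TYPE_OFFSET + 1] << 8;
    switch (type) {
    case 1:  return "relocatable";    /* ET_REL: a .o file */
    case 2:  return "executable";     /* ET_EXEC: non-PIE executable */
    case 3:  return "shared object";  /* ET_DYN: .so file or PIE executable */
    default: return "other ELF";
    }
}
```

Feeding it the first bytes of hello.o and of hello is one way to observe Misconception 2 in practice: both are ELF files, but only the object file has type ET_REL.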

Symbol Types

| Type | Meaning | Example |
|---|---|---|
| T | Text (code) symbol, globally visible | main |
| t | Text symbol, local to file | helper_func with static |
| D | Data symbol, globally visible | int global_var = 5; |
| B | BSS symbol (uninitialized) | int uninit_global; |
| U | Undefined (needs to be resolved) | Reference to printf |
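
As a quick way to see these letters, the following sketch defines one symbol of each kind. The identifiers are hypothetical, and the exact nm output depends on compiler version and flags (for example, older GCC defaults may report an uninitialized global as C rather than B, and a static function can disappear when inlined at higher optimization levels).

```c
#include <assert.h>

int global_var = 5;        /* D: initialized data, globally visible */
int uninit_global;         /* B: zero-initialized at load time */
static int file_counter;   /* b: BSS, local to this file (lowercase = local) */

/* t: static function, local to this file */
static int helper_func(int x)
{
    file_counter++;
    return x * 2;
}

/* T: code symbol, globally visible */
int compute(void)
{
    /* U: assert expands to a call into libc (__assert_fail on glibc) */
    assert(global_var >= 0);
    return helper_func(global_var) + uninit_global;
}
```

Compiling with gcc -c and running nm on the object file should list these names with the letters noted in the comments, plus an undefined entry for the libc routine behind assert.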

2.3 Static vs Dynamic Linking

Static Linking (gcc -static):

  • All library code is copied into the executable
  • Larger binary, but self-contained
  • No runtime dependencies
  • Symbol resolution happens entirely at link time

Dynamic Linking (default):

  • Library code stays in shared libraries (.so files)
  • Smaller binaries, shared memory for common libraries
  • Resolution happens partly at load time
  • PLT (Procedure Linkage Table) and GOT (Global Offset Table) enable lazy binding

Static:  hello (executable) contains printf code
Dynamic: hello (executable) contains PLT stub → jumps to libc.so at runtime

2.4 Process Memory Layout

When the OS loads your program, it creates an address space:

High addresses (0x7fff...)
┌─────────────────────────────────────┐
│            Kernel Space             │  Not accessible to user code
├─────────────────────────────────────┤
│              Stack                  │  ↓ grows down (local vars, return addrs)
│                 │                   │
│                 ▼                   │
├─────────────────────────────────────┤
│                                     │  (unmapped region)
├─────────────────────────────────────┤
│                 ▲                   │
│                 │                   │
│              Heap                   │  ↑ grows up (malloc'd memory)
├─────────────────────────────────────┤
│         .bss (uninitialized)        │
├─────────────────────────────────────┤
│         .data (initialized)         │
├─────────────────────────────────────┤
│         .rodata (read-only)         │
├─────────────────────────────────────┤
│         .text (code)                │
└─────────────────────────────────────┘
Low addresses (0x0...)
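
One way to connect this diagram to a real process is to sample an address from each region. The sketch below is an assumption-laden illustration (RegionSamples and sample_regions are hypothetical names); actual values vary between runs because of ASLR and position-independent executables, so only relative placement is meaningful.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int layout_initialized = 42;   /* lives in .data */
int layout_uninitialized;      /* lives in .bss */

/* One sample address per region of the process address space. */
typedef struct {
    uintptr_t text, rodata, data, bss, heap, stack;
} RegionSamples;

/* Fills `out` with one sample address per region; returns 0 on success. */
int sample_regions(RegionSamples *out)
{
    assert(out != NULL);
    static const char literal[] = "read-only";  /* typically placed in .rodata */
    int local = 0;                               /* on the stack */
    void *block = malloc(16);                    /* on the heap */
    if (block == NULL)
        return -1;

    out->text   = (uintptr_t)&sample_regions;
    out->rodata = (uintptr_t)literal;
    out->data   = (uintptr_t)&layout_initialized;
    out->bss    = (uintptr_t)&layout_uninitialized;
    out->heap   = (uintptr_t)block;
    out->stack  = (uintptr_t)&local;

    free(block);
    return 0;
}
```

Printing the six values with %#lx (after casting to unsigned long) and comparing them against /proc/self/maps or GDB's info proc mappings shows which mapping each symbol falls into.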

2.5 Why This Matters

Understanding the compilation pipeline enables you to:

  1. Debug effectively: A crash at address 0x401234 means something—you can trace it to the exact instruction and source line
  2. Optimize intelligently: Understanding what the compiler does helps you write code it can optimize well
  3. Write portable code: Knowing about linking helps you manage symbol visibility and avoid conflicts
  4. Security analysis: Buffer overflows, return-oriented programming, and other attacks exploit this layout
  5. Embedded systems: Resource-constrained systems require understanding exactly what ends up in the binary

2.6 Common Misconceptions

Misconception 1: “The compiler converts C directly to machine code”
Reality: There are four distinct stages, each with its own tools and artifacts.

Misconception 2: “Object files are executables”
Reality: Object files contain machine code but can’t run: they have unresolved symbols and no load information.

Misconception 3: “Static linking is always better for deployment”
Reality: Dynamic linking saves memory (shared libraries), lets a library be updated without relinking every program, and is required for runtime-loaded plugins; it is also the most common way to satisfy LGPL terms.

Misconception 4: “The stack grows up”
Reality: On x86/x86-64, the stack grows DOWN (toward lower addresses). This is crucial for understanding buffer overflows.


3. Project Specification

3.1 What You Will Build

A command-line “pipeline explainer” that takes a C source file and produces a structured report showing:

  1. Every artifact produced at each compilation stage
  2. Symbol and section analysis of the object file
  3. Linking information (resolved symbols, required libraries)
  4. Runtime process layout (via controlled execution with debugger)

3.2 Functional Requirements

  1. Pipeline Capture (--pipeline):
    • Save preprocessed output (.i)
    • Save assembly output (.s)
    • Save object file (.o)
    • Save final executable
  2. Object Analysis (--analyze):
    • List all sections with sizes and types
    • List symbols by category (defined, undefined, local, global)
    • Show relocation entries
  3. Link Analysis (--link):
    • Identify required shared libraries
    • Compare static vs dynamic link outputs
    • Show symbol resolution decisions
  4. Runtime Inspection (--runtime):
    • Launch under debugger, pause at main
    • Report stack and heap boundaries
    • Show loaded shared libraries
    • Demonstrate memory regions
  5. Crash Analysis (--crash <address>):
    • Given a crash address, trace back to source line
    • Show the relevant assembly
    • Explain what stage produced that code

3.3 Non-Functional Requirements

  • Determinism: Same input produces same output (modulo runtime addresses)
  • Portability: Works on Linux x86-64 (primary) and macOS (stretch goal)
  • Educational: Output explains “why” not just “what”
  • Clean: Uses proper error handling, no crashes on malformed input

3.4 Example Usage / Output

$ ./pipeline-explorer hello.c

=== PIPELINE ANALYSIS for hello.c ===

[Stage 1: Preprocessing]
  Input:  hello.c (234 bytes)
  Output: hello.i (42,891 bytes)
  Included: stdio.h, stdlib.h, limits.h, ...
  Macros expanded: 3 (MAX, DEBUG, VERSION)

[Stage 2: Compilation]
  Input:  hello.i
  Output: hello.s (1,247 bytes)
  Functions generated: 2 (main, helper)
  Optimization level: -O0

[Stage 3: Assembly]
  Input:  hello.s
  Output: hello.o (2,456 bytes)

  Sections:
    .text    : 168 bytes (executable code)
    .rodata  : 24 bytes (string literals)
    .data    : 0 bytes
    .bss     : 0 bytes

  Symbol Table:
    DEFINED:           main (T, global), helper (t, local)
    UNDEFINED:         printf (U), exit (U)

  Relocations: 4 entries
    - printf @ offset 0x24 (PC-relative call)
    - .rodata @ offset 0x18 (data reference)

[Stage 4: Linking]
  Input:  hello.o + libc.so.6
  Output: hello (16,432 bytes)

  Symbol Resolution:
    printf -> libc.so.6 (0x7f... at runtime)
    exit   -> libc.so.6 (0x7f... at runtime)

  Dependencies: libc.so.6, ld-linux-x86-64.so.2

[Runtime Layout] (paused at main)
  Text:    0x401000 - 0x401234
  ROData:  0x402000 - 0x402100
  Data:    0x404000 - 0x404008
  BSS:     0x404008 - 0x404010
  Heap:    0x405000 - (grows up)
  Stack:   0x7fff... - (grows down)

  Stack at main():
    Return address: 0x7f4a3c...  (__libc_start_main+234)
    Saved RBP:      0x0
    Local vars:     argc @ rbp-4, argv @ rbp-16

=== END REPORT ===

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────────────┐
│                         pipeline-explorer                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌───────────┐ │
│  │   Compiler   │──▶│    Object    │──▶│     Link     │──▶│  Runtime  │ │
│  │   Invoker    │   │   Analyzer   │   │   Analyzer   │   │ Inspector │ │
│  └──────────────┘   └──────────────┘   └──────────────┘   └───────────┘ │
│         │                  │                  │                 │        │
│         ▼                  ▼                  ▼                 ▼        │
│  ┌──────────────────────────────────────────────────────────────────────┐│
│  │                        Report Generator                              ││
│  └──────────────────────────────────────────────────────────────────────┘│
│                                    │                                      │
│                                    ▼                                      │
│                             [Formatted Output]                            │
└─────────────────────────────────────────────────────────────────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Compiler Invoker | Runs gcc with flags to produce each stage artifact | Uses -save-temps or explicit stage flags; captures stderr for warnings |
| Object Analyzer | Parses ELF object files | Can use readelf/objdump output OR parse ELF directly (stretch goal) |
| Link Analyzer | Compares static/dynamic linking, identifies dependencies | Uses ldd, nm, compares sizes |
| Runtime Inspector | Launches executable under debugger, queries state | Uses GDB in batch mode with scripted commands |
| Report Generator | Formats analysis into readable output | Structured text, optionally JSON for machine consumption |

4.3 Data Structures

// Core structures for the analyzer

#include <stddef.h>  // size_t
#include <stdint.h>  // uint32_t, uint64_t

typedef struct {
    char *name;
    size_t size;
    uint64_t address;
    char type;  // 'T', 'D', 'B', 'U', etc.
    int is_global;
} Symbol;

typedef struct {
    char *name;
    size_t size;
    uint64_t address;
    uint32_t flags;  // SHF_* flags
} Section;

typedef struct {
    uint64_t offset;
    char *symbol_name;
    int type;  // R_X86_64_* relocation type
} Relocation;

typedef struct {
    Section *sections;
    size_t section_count;
    Symbol *symbols;
    size_t symbol_count;
    Relocation *relocations;
    size_t relocation_count;
} ObjectAnalysis;

typedef struct {
    uint64_t text_start, text_end;
    uint64_t data_start, data_end;
    uint64_t heap_start;
    uint64_t stack_top;
    char **loaded_libraries;
    size_t library_count;
} RuntimeLayout;

4.4 Algorithm Overview

Main Algorithm Flow:

  1. Parse command line → determine source file and requested analyses
  2. Invoke pipeline stages:
    • Run gcc -E → capture preprocessed output
    • Run gcc -S → capture assembly
    • Run gcc -c → create object file
    • Run gcc → create executable (both static and dynamic versions)
  3. Analyze object file:
    • Run readelf -h → header info
    • Run readelf -S → section table
    • Run readelf -s → symbol table
    • Run readelf -r → relocations
    • Parse outputs into data structures
  4. Analyze linking:
    • Compare file sizes
    • Run ldd on dynamic version
    • Run nm to see resolved symbols
  5. Runtime inspection:
    • Launch with gdb -batch -ex "..." executable
    • Query memory mappings (info proc mappings)
    • Query stack (info frame)
    • Parse GDB output
  6. Generate report → format and output all collected data

Complexity Analysis:

  • Time: Roughly O(n) in the size of the preprocessed source for the parsing steps; in practice wall-clock time is dominated by the gcc, readelf, and gdb process launches
  • Space: O(m) where m is size of largest artifact (typically the executable)

5. Implementation Guide

5.1 Development Environment Setup

# Required tools
sudo apt-get install gcc gdb binutils build-essential

# Verify installations
gcc --version
gdb --version
readelf --version
objdump --version

# Create project structure
mkdir -p pipeline-explorer/{src,tests,artifacts}
cd pipeline-explorer

5.2 Project Structure

pipeline-explorer/
├── src/
│   ├── main.c              # Entry point, CLI parsing
│   ├── compiler.c          # Invokes gcc stages
│   ├── object_analyzer.c   # Parses ELF via readelf/objdump
│   ├── link_analyzer.c     # Analyzes linking
│   ├── runtime_inspector.c # GDB automation
│   ├── report.c            # Output formatting
│   └── util.c              # String handling, process execution
├── include/
│   ├── pipeline.h          # Shared data structures
│   └── util.h              # Utility declarations
├── tests/
│   ├── simple.c            # Basic test case
│   ├── multifile/          # Multi-file test
│   └── expected/           # Expected outputs
├── Makefile
└── README.md

5.3 Implementation Phases

Phase 1: Foundation (Days 1-3)

Goals:

  • Set up project structure and build system
  • Implement basic pipeline invocation
  • Capture all stage artifacts

Tasks:

  1. Create Makefile that builds your tool
  2. Implement run_command() utility that executes shell commands and captures output
  3. Implement invoke_preprocessor(), invoke_compiler(), invoke_assembler(), invoke_linker()
  4. Verify artifacts are created in a temp directory

Checkpoint: ./pipeline-explorer hello.c produces hello.i, hello.s, hello.o, and hello in an artifacts directory.
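
A minimal sketch of the run_command() utility from task 2, assuming a POSIX system with popen. The signature (command string, output buffer, capacity) is an assumption for illustration; it captures only stdout, so append 2>&1 to the command if stderr is needed as well.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose */
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

/*
 * Runs `cmd` through the shell, copies up to cap-1 bytes of its stdout
 * into `out` (always NUL-terminated), and returns the exit status,
 * or -1 if the command could not be started or did not exit normally.
 */
int run_command(const char *cmd, char *out, size_t cap)
{
    assert(cmd != NULL && out != NULL && cap > 0);
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;

    size_t used = 0;
    char chunk[256];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, pipe)) > 0) {
        /* Keep draining even when `out` is full so the child never blocks. */
        for (size_t i = 0; i < n && used + 1 < cap; i++)
            out[used++] = chunk[i];
    }
    out[used] = '\0';

    int status = pclose(pipe);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A call such as run_command("gcc -E hello.c -o artifacts/hello.i 2>&1", buf, sizeof buf) would then return gcc's exit status and leave any diagnostics in buf. Because the command goes through a shell, paths must be quoted before they are interpolated.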

Phase 2: Object File Analysis (Days 4-7)

Goals:

  • Parse readelf output for sections, symbols, relocations
  • Build internal data structures
  • Format readable output

Tasks:

  1. Run readelf -S and parse section table output
  2. Run readelf -s and parse symbol table (handling local/global, types)
  3. Run readelf -r and parse relocation entries
  4. Connect symbols to their sections
  5. Format and output the analysis

Checkpoint: Running on hello.o shows all sections with sizes, categorizes symbols correctly, and lists relocations.
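
Parsing a readelf -s row (task 2) might look like the sketch below. ParsedSymbol and parse_symtab_line are hypothetical names, and the column layout assumed here (Num, Value, Size, Type, Bind, Vis, Ndx, Name) matches typical GNU readelf output; other versions or tools may differ, so treat it as a starting point.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one symbol-table row. */
typedef struct {
    char name[128];
    uint64_t value;
    unsigned long size;
    int is_global;    /* 1 if Bind is GLOBAL or WEAK (visible to other files) */
    int is_defined;   /* 0 if Ndx is UND */
    int is_function;  /* 1 if Type is FUNC */
} ParsedSymbol;

/* Returns 1 and fills `out` for a symbol row, 0 for headers and other lines. */
int parse_symtab_line(const char *line, ParsedSymbol *out)
{
    assert(line != NULL && out != NULL);
    unsigned long long value;
    long size;
    char type[16], bind[16], vis[16], ndx[16], name[128] = "";

    /* %*u skips the row number; %li accepts decimal or 0x-prefixed sizes. */
    int fields = sscanf(line, " %*u: %llx %li %15s %15s %15s %15s %127s",
                        &value, &size, type, bind, vis, ndx, name);
    if (fields < 6)   /* name may be empty, so six fields is enough */
        return 0;

    snprintf(out->name, sizeof out->name, "%s", name);
    out->value = (uint64_t)value;
    out->size = (unsigned long)size;
    out->is_global = strcmp(bind, "GLOBAL") == 0 || strcmp(bind, "WEAK") == 0;
    out->is_defined = strcmp(ndx, "UND") != 0;
    out->is_function = strcmp(type, "FUNC") == 0;
    return 1;
}
```

Passing -W (wide output) to readelf avoids truncated symbol names, which would otherwise be copied into the name field as printed.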

Phase 3: Link Analysis (Days 8-10)

Goals:

  • Compare static vs dynamic linking
  • Identify all dependencies
  • Show symbol resolution

Tasks:

  1. Create both static (-static) and dynamic versions
  2. Use ldd to list shared library dependencies
  3. Use nm on final executable to show resolved symbols
  4. Calculate and compare file sizes
  5. Explain the tradeoffs in output

Checkpoint: Output shows library dependencies, size comparison, and where undefined symbols were resolved.
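
For task 2, each ldd line of the form name => path (address) can be split as sketched below. Dependency and parse_ldd_line are hypothetical names, and the sketch assumes paths without spaces and the output format of glibc's ldd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one shared-library dependency. */
typedef struct {
    char name[256];   /* e.g. "libc.so.6" */
    char path[1024];  /* e.g. "/lib/x86_64-linux-gnu/libc.so.6" */
} Dependency;

/*
 * Parses one line of ldd output.
 * Returns 1 for "name => /path (addr)", -1 for "name => not found",
 * and 0 for lines without a usable mapping (vdso, the loader itself).
 */
int parse_ldd_line(const char *line, Dependency *dep)
{
    assert(line != NULL && dep != NULL);
    if (sscanf(line, " %255s => %1023s", dep->name, dep->path) != 2)
        return 0;
    if (dep->path[0] == '/')
        return 1;
    if (strcmp(dep->path, "not") == 0)  /* first word of "not found" */
        return -1;
    return 0;
}
```

Reporting the -1 case separately is useful because a missing library is a load-time failure that the link step itself will not have caught.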

Phase 4: Runtime Inspection (Days 11-13)

Goals:

  • Automate GDB to pause and inspect running process
  • Extract memory layout
  • Show stack state at entry

Tasks:

  1. Create GDB command script that sets breakpoint at main, runs, queries state
  2. Parse info proc mappings output for memory regions
  3. Parse info frame output for stack information
  4. Handle GDB output parsing robustly
  5. Integrate into final report

Checkpoint: Running on an executable shows correct memory layout matching expected regions.
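
The GDB script from task 1 can also be passed on the command line. The sketch below builds such a command string; build_gdb_command is a hypothetical helper, and it assumes the path contains no single quotes. The GDB options used (-batch, -nx, -ex) and commands (break, run, info proc mappings, info frame) are standard, but their output format varies across GDB versions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes a gdb batch-mode command line for `exe` into `buf`.
 * Returns 0 on success, -1 if the buffer is too small or the path contains
 * a single quote (this sketch does not escape quotes).
 */
int build_gdb_command(char *buf, size_t cap, const char *exe)
{
    assert(buf != NULL && exe != NULL);
    if (strchr(exe, '\'') != NULL)
        return -1;

    int n = snprintf(buf, cap,
                     "gdb -batch -nx"            /* no prompt, skip .gdbinit */
                     " -ex 'break main'"
                     " -ex 'run'"
                     " -ex 'info proc mappings'"
                     " -ex 'info frame'"
                     " '%s' 2>&1",
                     exe);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Running all queries in one GDB session keeps startup cost to a single process launch; the resulting output still needs to be split per command before parsing.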

Phase 5: Polish & Integration (Day 14)

Goals:

  • Error handling for all edge cases
  • Clean output formatting
  • Testing on various inputs

Tasks:

  1. Handle compilation errors gracefully
  2. Test with multi-file programs
  3. Test with C++ if time permits
  4. Write documentation
  5. Clean up code, remove debug output

Checkpoint: Tool handles malformed input gracefully, produces clean reports for variety of test cases.

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| ELF Parsing | Parse directly vs use readelf | Use readelf | Simpler, ELF parsing is a project unto itself |
| GDB Control | Expect scripts vs Python API | Expect scripts | More portable, simpler to implement |
| Output Format | Text only vs JSON option | Text with JSON flag | Text for humans, JSON for tooling integration |
| Temp Files | System temp vs local dir | Local artifacts/ dir | Easier debugging, user can inspect |
| Static Analysis | Required vs optional | Required with --static flag | Shows important tradeoff but not always available |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test parsing functions in isolation | Symbol parser handles all types correctly |
| Integration Tests | Test full pipeline on sample programs | hello.c produces expected output |
| Regression Tests | Verify fixes don’t break existing behavior | Compare output to known-good baselines |
| Edge Cases | Handle unusual inputs | Empty file, no symbols, huge files |

6.2 Critical Test Cases

  1. Minimal Program:
    int main() { return 0; }
    

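    Expected: The object file has no undefined symbols (main is its only global definition; _start comes from the C runtime startup files at link time), minimal sections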

  2. External Dependencies:
    #include <stdio.h>
    #include <math.h>
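    int main(int argc, char **argv) { (void)argv; printf("%f\n", sin((double)argc)); return 0; }
    

    Expected: Shows printf, sin as undefined → resolved from libc, libm (link with -lm). The argument comes from argc because GCC can fold sin(1.0) into a constant at compile time, which would leave no reference to sin.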

  3. Global Variables:
    int initialized = 42;
    int uninitialized;
    const char *message = "hello";
    int main() { return initialized + uninitialized; }
    

    Expected: Shows .data, .bss, .rodata usage correctly

  4. Multi-file:
    // main.c
    extern int helper(int);
    int main() { return helper(5); }
    
    // helper.c
    int helper(int x) { return x * 2; }
    

    Expected: Shows symbol resolution between files

  5. Static vs Dynamic: Same program linked both ways, compare sizes and dependencies

6.3 Test Data

# Create test suite
mkdir -p tests

# Test 1: Minimal
echo 'int main() { return 0; }' > tests/minimal.c

# Test 2: Printf (quoted heredoc keeps \n literal in every shell)
cat > tests/printf.c <<'EOF'
#include <stdio.h>
int main() { printf("hello\n"); return 0; }
EOF

# Test 3: Global data
echo 'int x = 1; int y; const char *s = "hi";
int main() { return x + y; }' > tests/globals.c

# Expected output patterns to grep for
# minimal: no .data, minimal .text
# printf: printf in undefined symbols
# globals: all three sections populated

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Not handling spaces in paths | Tool fails on “/home/user/My Programs/test.c” | Quote all paths in shell commands |
| Parsing readelf wrong | Missing symbols or wrong types | Test parser on diverse binaries first |
| GDB version differences | Commands fail on different systems | Test GDB commands standalone first |
| Forgetting -g for debug info | No source line mapping | Always compile with -g for debug builds |
| Ignoring stderr | Silent failures | Capture and check stderr from all tools |
| Hardcoded paths | Works on your machine only | Locate tools with which or a PATH lookup |
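
For the first pitfall, one possible approach is to single-quote every path before placing it in a shell command. shell_quote below is a hypothetical helper that assumes a POSIX shell; avoiding the shell entirely (fork plus execvp with an argument array) sidesteps quoting altogether.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Writes `path` wrapped in single quotes into `out`, replacing each
 * embedded ' with '\'' so a POSIX shell sees one literal word.
 * Returns 0 on success, -1 if `out` is too small.
 */
int shell_quote(const char *path, char *out, size_t cap)
{
    assert(path != NULL && out != NULL);

    /* Opening quote, closing quote, and terminating NUL. */
    size_t needed = 3;
    for (const char *p = path; *p != '\0'; p++)
        needed += (*p == '\'') ? 4 : 1;
    if (needed > cap)
        return -1;

    char *w = out;
    *w++ = '\'';
    for (const char *p = path; *p != '\0'; p++) {
        if (*p == '\'') {
            /* Close the quote, emit an escaped quote, reopen the quote. */
            *w++ = '\''; *w++ = '\\'; *w++ = '\''; *w++ = '\'';
        } else {
            *w++ = *p;
        }
    }
    *w++ = '\'';
    *w = '\0';
    return 0;
}
```

The quoted result can be inserted into a command string with snprintf before the command is handed to the shell.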

7.2 Debugging Strategies

  • Print intermediate states: After each parsing step, dump the data structure
  • Test tools independently: Before integrating, verify readelf, objdump, gdb produce expected output
  • Use simple inputs first: Get minimal.c working perfectly before complex cases
  • Compare with manual: Run tools manually and compare to your parsed output
  • Binary diff for determinism: Same input should produce byte-identical artifacts (ignoring timestamps)

7.3 Performance Traps

  • Spawning too many processes: Batch related queries (one readelf call with multiple flags)
  • Reading huge outputs: For large binaries, process output line-by-line, don’t load all into memory
  • GDB startup time: GDB is slow to start; for multiple queries, use one session with multiple commands

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add JSON output: --format json for machine-readable output
  • Colorized output: Highlight different symbol types, sections
  • Verbose mode: Show actual readelf/objdump commands being run

8.2 Intermediate Extensions

  • Direct ELF parsing: Read ELF format directly without external tools
  • Disassembly integration: Show assembly for specific functions
  • Diff mode: Compare two binaries’ symbols and sections
  • Cross-compilation support: Analyze ARM/RISC-V binaries

8.3 Advanced Extensions

  • Debug info parsing: Read DWARF to show full source mapping
  • Dynamic analysis: Use ptrace to trace actual symbol resolutions at runtime
  • Optimization analysis: Show what changes between -O0, -O1, -O2, -O3
  • Security audit: Check for common vulnerabilities (no stack canary, RELRO, etc.)

9. Real-World Connections

9.1 Industry Applications

  • Build systems (Bazel, CMake): Understand these pipelines to debug build failures
  • Package managers (apt, rpm): Know how shared libraries are managed
  • Containerization (Docker): Static linking simplifies container images
  • Embedded systems: Every byte counts; understanding sections is crucial
  • Reverse engineering: This is step 1 of analyzing any binary

9.2 Related Tools & Projects

  • binutils: The tools you’re wrapping (readelf, objdump, nm)
  • LLVM: Alternative compiler with similar pipeline concepts
  • pwntools: Python library for similar analysis (CTF-oriented)
  • Ghidra/radare2: Advanced binary analysis tools
  • libelf: Library for direct ELF parsing

9.3 Interview Relevance

This project prepares you for questions like:

  • “Walk me through what happens when you compile and run a C program”
  • “What’s the difference between static and dynamic linking?”
  • “How would you debug a segfault at address X?”
  • “Explain the process memory layout”
  • “What are symbols and relocations in object files?”

10. Resources

10.1 Essential Reading

  • CS:APP Chapter 1: “A Tour of Computer Systems” - Overview of the pipeline
  • CS:APP Chapter 7: “Linking” - Deep dive into object files and linking
  • “Linkers and Loaders” by John Levine: The definitive book on linking
  • System V ABI: Official specification for ELF format and calling conventions

10.2 Video Resources

  • MIT 6.004 lectures on compilation and linking
  • “How do compilers work?” on Computerphile YouTube channel
  • LiveOverflow series on binary exploitation (covers ELF in depth)

10.3 Tools & Documentation

  • readelf(1): man readelf - ELF file display
  • objdump(1): man objdump - Object file disassembly
  • nm(1): man nm - Symbol listing
  • ldd(1): man ldd - Shared library dependencies
  • ld(1): man ld - Linker documentation

10.4 Related Projects

  • Previous: None (this is the foundation project)
  • Next: P2 (Bitwise Data Inspector) builds on your understanding of how data is represented; P10 (ELF Link Map) goes deeper into linking

11. Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain each of the 4 pipeline stages without looking at notes
  • I can list the main ELF sections and what goes in each
  • I can explain what relocations are and why they’re needed
  • I understand the difference between static and dynamic linking
  • I can draw the process memory layout and explain each region

Implementation

  • Tool correctly identifies all pipeline artifacts
  • Symbol parsing correctly categorizes defined/undefined, local/global
  • Section sizes match what readelf reports
  • Runtime inspection shows correct memory regions
  • Error handling works for malformed input

Growth

  • I debugged at least one issue by examining the actual binary
  • I can now read assembly output and connect it to C source
  • I’m comfortable using readelf, objdump, nm, ldd, gdb

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Pipeline stage artifacts are captured and reported
  • Object file sections and symbols are analyzed
  • Basic output formatting works
  • Runs without crashing on valid input

Full Completion:

  • All analysis modes work (pipeline, object, link, runtime)
  • Static vs dynamic comparison works
  • GDB integration shows memory layout
  • Clean error handling
  • Tested on multiple input files

Excellence (Going Above & Beyond):

  • Direct ELF parsing (no readelf dependency)
  • Cross-platform support (macOS)
  • JSON output format
  • Crash address analysis mode
  • DWARF debug info parsing

13. Real World Outcome

When you complete this project, here’s exactly what you’ll see when running your tool:

$ ./pipeline-explorer hello.c

=== PREPROCESSING STAGE ===
Input:  hello.c (42 lines)
Output: hello.i (1,847 lines)
Time:   0.023s

Macros expanded:
  MAX       → 100
  DEBUG     → 1
  VERSION   → "1.0.0"

Headers included:
  stdio.h     → /usr/include/stdio.h (925 lines)
  stdlib.h    → /usr/include/stdlib.h (743 lines)
  limits.h    → /usr/include/limits.h (179 lines)

=== COMPILATION STAGE ===
Input:  hello.i (1,847 lines)
Output: hello.s (89 lines)
Time:   0.156s

Functions generated:
  main         @ .text (32 instructions)
  helper       @ .text (18 instructions)

String literals found:
  .LC0: "Hello, %s!\n"
  .LC1: "Max value is %d"

=== ASSEMBLY STAGE ===
Input:  hello.s (89 lines)
Output: hello.o (2,456 bytes)
Time:   0.012s

ELF Sections:
  .text     : 168 bytes (executable code)
  .rodata   : 42 bytes  (read-only strings)
  .data     : 8 bytes   (initialized globals)
  .bss      : 16 bytes  (uninitialized globals)
  .symtab   : 312 bytes (symbol table)
  .strtab   : 89 bytes  (string table)
  .rela.text: 72 bytes  (relocations)

Symbol Table:
  GLOBAL DEFINED:
    main          T  0x0000000000000000  (32 bytes)
    global_count  D  0x0000000000000000  (8 bytes)

  LOCAL DEFINED:
    helper        t  0x0000000000000020  (18 bytes)
    .LC0          r  0x0000000000000000  (string)

  UNDEFINED (need linking):
    printf        U  (from libc)
    exit          U  (from libc)
    malloc        U  (from libc)

Relocations (4 entries):
  Offset     Type              Symbol
  0x00000015 R_X86_64_PLT32    printf
  0x00000024 R_X86_64_PC32     .LC0
  0x0000002f R_X86_64_PLT32    exit
  0x0000003a R_X86_64_PLT32    malloc

=== LINKING STAGE ===
Input:  hello.o + libc.so.6
Output: hello (16,432 bytes)
Time:   0.089s

Static vs Dynamic Comparison:
  Dynamic: 16,432 bytes  (uses shared libc)
  Static:  823,456 bytes (libc copied in)
  Ratio:   50x larger when static!

Shared Library Dependencies:
  libc.so.6           → /lib/x86_64-linux-gnu/libc.so.6
  ld-linux-x86-64.so.2 → /lib64/ld-linux-x86-64.so.2

Symbol Resolution:
  printf  → libc.so.6::printf   (PLT stub at 0x401030)
  exit    → libc.so.6::exit     (PLT stub at 0x401040)
  malloc  → libc.so.6::malloc   (PLT stub at 0x401050)

=== RUNTIME LAYOUT (paused at main) ===
Memory Map:
  0x00400000 - 0x00401000  r-xp  .text (code)
  0x00401000 - 0x00402000  r--p  .rodata
  0x00403000 - 0x00404000  rw-p  .data + .bss
  0x00404000 - 0x00425000  rw-p  [heap]
  0x7ffffffde000 - 0x7ffffffff000  rw-p  [stack]

Stack at main():
  Address           Value              Meaning
  0x7fffffffdde8    0x00007ffff7a2d840 return addr → __libc_start_main+234
  0x7fffffffdde0    0x00007fffffffde00 saved %rbp
  0x7fffffffddd8    0x0000000000000001 argc = 1
  0x7fffffffddd0    0x00007fffffffdeb8 argv pointer

=== PIPELINE COMPLETE ===
Total time: 0.280s
All artifacts saved to: ./artifacts/hello/

14. The Core Question You’re Answering

“When I type gcc hello.c -o hello and press Enter, what ACTUALLY happens inside my computer before I can run ./hello?”

This project transforms “the compiler does stuff” into a complete mental model of four distinct transformations, each with its own tools, formats, and purpose. You’ll understand why linking errors are different from compile errors, why symbols matter, and how your code becomes bytes that the CPU executes.


15. Concepts You Must Understand First

Before starting this project, ensure you understand these concepts:

Concept Why It Matters Where to Learn
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| C preprocessor directives (#include, #define) | You’ll trace how these expand | CS:APP 1.2, any C book Ch. 1 |
| Basic assembly syntax (mov, call, ret) | You’ll read compiler output | CS:APP 3.1-3.4 |
| What a function call does at the machine level | You’ll see call/ret in action | CS:APP 3.7 |
| Hexadecimal notation | Object files are full of hex | CS:APP 2.1 |
| What “address” means for code and data | Linking assigns addresses | CS:APP 1.4, 7.1 |
| Difference between source, object, and executable files | Core of this project | CS:APP 1.2, Chapter 7 intro |

16. Questions to Guide Your Design

Work through these questions BEFORE writing code:

  1. Input/Output: What does your tool take as input? What outputs should it produce? Where should artifacts be saved?

  2. Stage Separation: How will you invoke each stage of gcc separately? What flags control this?

  3. Output Parsing: Tools like readelf and objdump produce text output. How will you parse this reliably? What about different versions?

  4. Error Handling: What if the C file doesn’t compile? What if readelf isn’t installed? How do you report errors helpfully?

  5. GDB Automation: How can you script GDB to pause at main and query process state? What’s the output format?

  6. Static Linking: On modern systems, static linking may not work or may require special packages. How will you handle this?

  7. Report Format: How will you structure the output? Text? JSON? Should sections be optional?
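
For the error-handling question, one option is to check for required tools up front and fail with a clear message. The following is a rough sketch, assuming a POSIX system where `$PATH` is a colon-separated list; `tool_available` is a hypothetical helper name, and empty `$PATH` entries are skipped for simplicity.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if an executable called `name` is found
 * in one of the directories listed in $PATH, 0 otherwise. */
int tool_available(const char *name)
{
    const char *path = getenv("PATH");
    if (path == NULL || name == NULL || *name == '\0')
        return 0;

    while (*path != '\0') {
        size_t len = strcspn(path, ":");   /* length of this PATH entry */
        if (len > 0) {
            char candidate[4096];
            int n = snprintf(candidate, sizeof candidate, "%.*s/%s",
                             (int)len, path, name);
            /* Skip entries whose full path would not fit in the buffer. */
            if (n > 0 && (size_t)n < sizeof candidate &&
                access(candidate, X_OK) == 0)
                return 1;
        }
        path += len;
        if (*path == ':')
            path++;                        /* step over the separator */
    }
    return 0;
}

/* Reports every missing tool on stderr and returns how many are missing. */
int count_missing_tools(const char *const tools[], size_t count)
{
    int missing = 0;
    for (size_t i = 0; i < count; i++) {
        if (!tool_available(tools[i])) {
            fprintf(stderr, "error: required tool '%s' not found in PATH\n",
                    tools[i]);
            missing++;
        }
    }
    return missing;
}
```

A tool built this way could call `count_missing_tools` with `gcc`, `readelf`, `objdump`, and `gdb` before doing any work, and treat `gdb` as optional if the runtime section is skippable.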


17. Thinking Exercise

Before writing any code, trace through this by hand:

Take this simple C program:

#include <stdio.h>
#define MSG "Hello"
int count = 0;
int main() {
    printf("%s: %d\n", MSG, count);
    return 0;
}

Exercise: On paper, answer:

  1. Preprocessing: What does the file look like after #include and #define are processed? How many lines? What happened to MSG?

  2. Compilation: How many functions will be in the assembly? What strings will be in .rodata? Where is count stored?

  3. Assembly: What sections will the object file have? What symbols are defined? What symbols are undefined?

  4. Linking: Where does the definition of printf come from? What happens if we link statically vs. dynamically?

  5. Runtime: When the process starts, where in memory is count? Where is the string “Hello”? Where is printf’s code?

Verify your answers by running the actual commands (gcc -E, gcc -S, objdump, etc.) before implementing your tool.
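
To check your answer to the runtime question, a variant of the exercise program can print one address from each region. This is only a sketch: `print_addresses` is a hypothetical name, the section labels are the expected placements rather than something the program verifies, and the exact addresses depend on your system, on PIE, and on ASLR.

```c
#include <stdint.h>
#include <stdio.h>

#define MSG "Hello"
int count = 0;   /* zero-initialized global: expected in .bss */

/* Prints one address per region and returns the number of lines printed.
 * The function pointer goes through uintptr_t because ISO C does not
 * define a direct conversion from function pointer to void *. */
int print_addresses(void)
{
    int local = 0;            /* lives in the current stack frame */
    const char *msg = MSG;    /* string literal: expected in .rodata */
    int lines = 0;

    lines += printf("count  (.bss)        %p\n", (void *)&count) > 0;
    lines += printf("MSG    (.rodata)     %p\n", (const void *)msg) > 0;
    lines += printf("printf (libc or PLT) %p\n",
                    (void *)(uintptr_t)printf) > 0;
    lines += printf("local  (stack)       %p\n", (void *)&local) > 0;
    return lines;
}
```

Calling `print_addresses()` from `main` and comparing the output with the memory map from your debugger shows which mapping each address falls into.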


18. The Interview Questions They’ll Ask

After completing this project, you’ll be ready for these common interview questions:

  1. “Walk me through what happens when you compile a C program.”
    • Expected: Explain all 4 stages with specific tools (cpp, cc1, as, ld)
    • Bonus: Mention intermediate file formats and what changes at each stage
  2. “What’s the difference between a compile error and a link error?”
    • Expected: Compile errors are syntax/type issues within one source file; link errors are unresolved or multiply defined symbols across object files
    • Bonus: Give example of each (“undeclared identifier” vs “undefined reference”)
  3. “Explain static vs. dynamic linking. When would you use each?”
    • Expected: Trade-offs of binary size, dependencies, update flexibility
    • Bonus: Mention PLT/GOT, security implications (ASLR), and LGPL considerations
  4. “What’s in an object file?”
    • Expected: Machine code, data, symbol table, relocation entries
    • Bonus: Explain why relocations exist (addresses aren’t known until linking)
  5. “How would you debug a ‘segfault at address 0x401234’?”
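    • Expected: Use objdump/readelf to find which section and function contain that address, and check whether it falls in .text
    • Bonus: Explain how to map the address back to a source line using debug info (for example, addr2line -e on a binary built with -g)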
  6. “What does ‘position-independent code’ mean?”
    • Expected: Code that works regardless of where it’s loaded in memory
    • Bonus: Explain how this relates to shared libraries and security (ASLR)
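
The first step in the segfault question, deciding which section an address belongs to, can be sketched as a range lookup. The `Section` type and `find_section` function are hypothetical; in a real tool the table would be filled from the section headers reported by readelf -S.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one loaded section. */
typedef struct {
    const char *name;
    uint64_t    start;  /* first address in the section */
    uint64_t    size;   /* size in bytes */
} Section;

/* Returns the section containing addr, or NULL if no section does. */
const Section *find_section(const Section *sections, size_t count,
                            uint64_t addr)
{
    for (size_t i = 0; i < count; i++) {
        /* Subtracting first avoids overflow in start + size. */
        if (addr >= sections[i].start &&
            addr - sections[i].start < sections[i].size)
            return &sections[i];
    }
    return NULL;
}
```

If the lookup returns .text, the faulting address is an instruction and the disassembly around it is the next thing to inspect. If it returns NULL, the address is outside every known section, which usually points to a wild pointer rather than to code.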

19. Hints in Layers

If you’re stuck, reveal hints one at a time:

Hint 1: Getting Started

Start with just capturing the pipeline artifacts. Before parsing anything, make sure you can run:

gcc -E source.c -o source.i
gcc -S source.c -o source.s
gcc -c source.c -o source.o
gcc source.c -o executable

Then verify that each file exists. Your tool essentially automates these commands and adds analysis on top.
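
One way to automate the four commands above is to run each as a child process and stop at the first failure. This is a minimal sketch, assuming POSIX `fork`/`execvp`/`waitpid`; `run_command` and `run_pipeline` are hypothetical names, and the output file names are the ones used in the commands above.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] with its arguments and waits for it.
 * Returns the child's exit status, or -1 if it did not exit normally. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);   /* exec failed, e.g. the tool is not installed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Runs the four gcc invocations in order.
 * Returns 0 on success, otherwise the 1-based number of the failed stage. */
int run_pipeline(const char *src)
{
    char *s = (char *)src;   /* execvp takes char *; src is not modified */
    const char *labels[4] = {"preprocess", "compile", "assemble", "link"};
    char *stages[4][6] = {
        {"gcc", "-E", s, "-o", "source.i", NULL},
        {"gcc", "-S", s, "-o", "source.s", NULL},
        {"gcc", "-c", s, "-o", "source.o", NULL},
        {"gcc", s, "-o", "executable", NULL, NULL},
    };
    for (int i = 0; i < 4; i++) {
        if (run_command(stages[i]) != 0) {
            fprintf(stderr, "stage %d (%s) failed\n", i + 1, labels[i]);
            return i + 1;
        }
    }
    return 0;
}
```

Using an argument vector instead of `system()` avoids shell quoting problems when the source path contains spaces. An exit status of 127 is the value this sketch uses to signal that the program could not be executed.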

Hint 2: Parsing readelf Output

Don’t try to parse the raw binary ELF format yourself (that’s a project unto itself). Use readelf with flags:

  • readelf -h file.o - ELF header
  • readelf -S file.o - Section headers
  • readelf -s file.o - Symbol table
  • readelf -r file.o - Relocations

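Parse the text output line by line. Look for consistent column positions or use regex. Passing -W (--wide) keeps readelf from wrapping or truncating long lines, which makes the columns easier to parse.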
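
As an illustration of line-by-line parsing, the sketch below reads one row of `readelf -s` output with `sscanf`. It assumes the usual column order (Num, Value, Size, Type, Bind, Vis, Ndx, Name); the `SymRow` type and both function names are hypothetical, and the column layout may differ between binutils versions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one symbol-table row. */
typedef struct {
    int           num;
    unsigned long value;
    unsigned long size;
    char type[16], bind[16], vis[16], ndx[16], name[128];
} SymRow;

/* Parses one line of `readelf -s` output.
 * Returns 1 for a symbol row, 0 for headers, blank lines, or anything else. */
int parse_symbol_line(const char *line, SymRow *row)
{
    char size_text[32];
    row->name[0] = '\0';   /* some rows (e.g. symbol 0) have no name */

    int n = sscanf(line, " %d: %lx %31s %15s %15s %15s %15s %127s",
                   &row->num, &row->value, size_text,
                   row->type, row->bind, row->vis, row->ndx, row->name);
    if (n < 7)
        return 0;

    /* Base 0 accepts decimal and also a 0x-prefixed size, which readelf
     * may print for very large symbols. */
    row->size = strtoul(size_text, NULL, 0);
    return 1;
}

/* A symbol is undefined in this file when its section index is UND. */
int symbol_is_undefined(const SymRow *row)
{
    return strcmp(row->ndx, "UND") == 0;
}
```

For a hello-world object file, `main` would parse as a defined FUNC symbol while `printf` would have Ndx UND, which is exactly the set of symbols the linker must resolve.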

Hint 3: GDB Scripting

Create a file gdb_script.txt:

set pagination off
break main
run
info proc mappings
info registers
info frame
quit

Run with: gdb -batch -x gdb_script.txt ./executable

The output is text you can capture and parse.
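
Capturing that text from C can be done with `popen`. The following is a minimal sketch, assuming a POSIX system; `capture_output` is a hypothetical helper, and output beyond the buffer size is simply discarded.

```c
#define _POSIX_C_SOURCE 200809L   /* for popen/pclose */
#include <stdio.h>
#include <sys/wait.h>

/* Runs cmd through the shell and copies up to cap-1 bytes of its stdout
 * into buf as a NUL-terminated string.
 * Returns the command's exit status, or -1 on failure. */
int capture_output(const char *cmd, char *buf, size_t cap)
{
    if (buf == NULL || cap == 0)
        return -1;
    buf[0] = '\0';

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;

    size_t used = fread(buf, 1, cap - 1, pipe);
    buf[used] = '\0';

    /* Drain any remaining output so the child can finish normally. */
    char discard[256];
    while (fread(discard, 1, sizeof discard, pipe) > 0)
        ;

    int status = pclose(pipe);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A call such as `capture_output("gdb -batch -x gdb_script.txt ./executable 2>&1", buf, sizeof buf)` would then leave the debugger transcript in `buf` for line-by-line parsing. Because the command goes through the shell, paths containing spaces or shell metacharacters need quoting.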

Hint 4: Static Linking Issues

On modern systems, fully static linking may fail, often because the static C library (libc.a) is not installed. Make this an optional feature. Your tool should gracefully report “static linking unavailable” rather than crashing.

To attempt static: gcc -static source.c -o executable_static

On Ubuntu/Debian, libc.a is provided by the libc6-dev package: sudo apt-get install libc6-dev


20. Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| Overview of compilation | CS:APP 3rd Ed | Chapter 1.2 “Programs Are Translated by Other Programs” |
| Detailed compilation stages | CS:APP 3rd Ed | Chapter 1.3 “It Pays to Understand How Compilation Systems Work” |
| Object file formats | CS:APP 3rd Ed | Chapter 7.3-7.5 “Object Files”, “Relocatable Object Files”, “Symbols and Symbol Tables” |
| Symbol resolution | CS:APP 3rd Ed | Chapter 7.6 “Symbol Resolution” |
| Relocation | CS:APP 3rd Ed | Chapter 7.7 “Relocation” |
| Static vs Dynamic linking | CS:APP 3rd Ed | Chapter 7.10-7.11 “Dynamic Linking”, “Loading Shared Libraries” |
| Process memory layout | CS:APP 3rd Ed | Chapter 7.9 “Loading Executable Object Files” |
| Linker internals (deep dive) | Linkers & Loaders by John Levine | Chapters 1-4 |
| ELF format details | System V ABI | Chapter 4-5 (ELF specification) |
| GNU toolchain | GCC Manual | Section on “Overall Options” for -E, -S, -c |

This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.