Project 2: The Syscall Tracer (Mini-strace)

Build a minimal tracer that attaches to a process and prints each syscall with its return value.

Quick Reference

Attribute | Value
Difficulty | Advanced
Time Estimate | 1 week
Main Programming Language | C
Alternative Programming Languages | Rust, Go
Coolness Level | See REFERENCE.md (Level 4)
Business Potential | See REFERENCE.md (Level 2)
Prerequisites | Process control, syscalls, registers
Key Topics | ptrace, syscall ABI, signal delivery

1. Learning Objectives

By completing this project, you will:

  1. Explain how a tracer pauses and resumes a tracee.
  2. Identify syscall entry and exit points reliably.
  3. Interpret return values and error codes deterministically.
  4. Build a minimal tool that validates kernel behavior.

2. All Theory Needed (Per-Concept Breakdown)

Syscall Tracing with ptrace

Fundamentals: Syscall tracing is the practice of observing the exact system calls a process makes. On Linux, this is commonly done using ptrace, a debugging interface that allows one process (the tracer) to control another (the tracee). The tracer can pause the tracee at syscall boundaries, read registers to decode arguments, and resume execution. Understanding this interface gives you visibility into the kernel boundary and lets you turn vague questions like “why is it slow” into precise observations about which kernel services are used and how they behave.

Deep Dive: ptrace is a process control interface that allows a tracer to observe and manipulate a tracee. A tracer gets a tracee in one of two ways. It can launch the tracee itself: the child calls PTRACE_TRACEME and then execs, and the exec stops it with a SIGTRAP. Or it can attach to a running process: PTRACE_ATTACH sends the process a SIGSTOP. Either way, the tracee stops and the tracer learns about the stop through waitpid. The tracer can then resume the tracee with the PTRACE_SYSCALL request, which makes the kernel stop it again at the next syscall entry or exit. Each stop provides an opportunity to inspect registers and memory. The most reliable pattern is to alternate between entry and exit stops, capturing syscall numbers and arguments on entry and results on exit.
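For the attach path, a minimal sketch might look like the following. It assumes Linux with glibc, and `attach_and_wait` is a hypothetical helper name. The function attaches with PTRACE_ATTACH and waits until the tracee reports the SIGSTOP that the attach generates. It then enables PTRACE_O_TRACESYSGOOD so that later syscall stops are easy to recognize.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: attach to a running process and wait for the attach
 * stop. Returns 0 once the tracee is stopped with PTRACE_O_TRACESYSGOOD set
 * (ready for PTRACE_SYSCALL), or -1 on any failure. */
int attach_and_wait(pid_t pid)
{
    /* PTRACE_ATTACH sends the tracee a SIGSTOP. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;
    for (;;) {
        int status;
        if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
            return -1;                 /* error, or the tracee is gone */
        if (WSTOPSIG(status) == SIGSTOP)
            break;                     /* the attach stop: do not forward it */
        /* Another signal arrived first: deliver it and keep waiting. */
        if (ptrace(PTRACE_CONT, pid, NULL, (void *)(long)WSTOPSIG(status)) == -1)
            return -1;
    }
    /* Make syscall stops report SIGTRAP|0x80 instead of plain SIGTRAP. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        return -1;
    return 0;
}
```

The attach SIGSTOP is swallowed rather than forwarded, so the tracee does not stay stopped after the tracer resumes it. Attaching to an unrelated process may still be refused, depending on system ptrace restrictions.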

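System calls are identified by a numeric ID defined by the architecture’s ABI. The tracer must know which register holds that ID and where the arguments are placed. On x86-64, for example, ptrace exposes the number as orig_rax, the arguments in rdi, rsi, rdx, r10, r8, and r9, and the result in rax. These details differ by architecture, which is why strace has architecture-specific decoders. For a learning tool, it is acceptable to decode only a small subset of arguments and focus on the syscall number and return value. At the raw kernel level, a return value between -4095 and -1 signals failure and is the negated errno; the libc wrapper converts it to -1 and stores the code in errno. Any other value is a successful result. This provides a deterministic way to identify failures without guessing.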
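A sketch of that decoding step follows. It assumes x86-64 Linux; other architectures use different registers and struct fields. The code pulls the number, the six argument registers, and the result out of glibc's `struct user_regs_struct`, then maps the raw result to an errno. `struct syscall_regs`, `decode_regs`, and `raw_ret_to_errno` are hypothetical names.

```c
#include <errno.h>
#include <sys/user.h>

#if !defined(__x86_64__)
#error "this sketch hard-codes the x86-64 syscall register layout"
#endif

/* Hypothetical decoded view of one syscall. */
struct syscall_regs {
    long nr;                 /* syscall number */
    unsigned long args[6];   /* raw argument registers */
    long ret;                /* raw result (meaningful at the exit stop) */
};

/* x86-64: the number stays in orig_rax at both stops; arguments are in
 * rdi, rsi, rdx, r10, r8, r9; the result is in rax at the exit stop. */
struct syscall_regs decode_regs(const struct user_regs_struct *r)
{
    struct syscall_regs s = {
        .nr = (long)r->orig_rax,
        .args = { r->rdi, r->rsi, r->rdx, r->r10, r->r8, r->r9 },
        .ret = (long)r->rax,
    };
    return s;
}

/* A raw result in [-4095, -1] encodes -errno; anything else is success
 * (an fd, a byte count, an address, ...). Returns 0 on success. */
int raw_ret_to_errno(long ret)
{
    return (ret < 0 && ret >= -4095) ? (int)-ret : 0;
}
```

The code checks the [-4095, -1] range rather than any negative value because that range is the kernel's error convention. The check leaves room for legitimate results that look negative when read as signed numbers.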

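Signals complicate tracing. When a signal is about to be delivered to the tracee, the tracee stops first (a signal-delivery-stop). The tracer learns of this stop through waitpid, just as it does for a syscall stop. The tracer must then decide whether to deliver the signal, by passing its number when resuming, or to suppress it. If you always suppress signals, you will alter the program’s behavior. If you blindly pass the stop signal back on every resume, you also re-inject the SIGTRAP that marks syscall stops. If you mistake a signal stop for a syscall stop, your entry/exit pairing falls out of sync. A robust tracer avoids all of this by tracking why the process stopped, then resuming with the appropriate signal. With the PTRACE_O_TRACESYSGOOD option, syscall stops report SIGTRAP|0x80, which makes them easy to tell apart.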
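One way to encode that decision is to classify each waitpid status and forward a signal only for what looks like a signal-delivery-stop. This sketch assumes the tracer has set PTRACE_O_TRACESYSGOOD. The enum and function names are hypothetical. As a simplification, every other stop without a PTRACE_EVENT code is treated as signal delivery; ptrace(2) explains how group-stops differ and how PTRACE_GETSIGINFO tells them apart.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of a waitpid() status for a tracee. */
enum stop_kind { STOP_EXITED, STOP_SYSCALL, STOP_SIGNAL, STOP_OTHER };

enum stop_kind classify_stop(int status)
{
    if (WIFEXITED(status) || WIFSIGNALED(status))
        return STOP_EXITED;
    if (!WIFSTOPPED(status))
        return STOP_OTHER;
    /* With PTRACE_O_TRACESYSGOOD, syscall stops carry SIGTRAP|0x80. */
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    /* PTRACE_EVENT stops store the event number in bits 16 and up. */
    if ((status >> 16) != 0)
        return STOP_OTHER;
    return STOP_SIGNAL;
}

/* Signal to pass to the next PTRACE_SYSCALL: forward real signals, inject
 * nothing after syscall or event stops. */
int signal_to_deliver(int status)
{
    return classify_stop(status) == STOP_SIGNAL ? WSTOPSIG(status) : 0;
}
```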

Tracing introduces overhead, so keep the tracer’s work per stop small and its output deterministic. To keep results stable, use fixed commands and avoid timing-sensitive tests. A good tracer prints syscalls in a consistent order with clear formatting: for example, the syscall name (or number), the file path for open-like calls, and the return value. The goal is not a full-featured strace; it is to understand the syscall boundary and confirm your mental model with direct evidence.

The tracer is also a window into library behavior. A single command like ls may invoke dozens of syscalls for dynamic linking, locale loading, and directory traversal. This teaches an important lesson: a short command’s syscall activity is often dominated by implicit library work rather than your own code. By tracing, you can see those hidden steps and optimize or debug based on reality, not assumptions.

How this fits into the project: You will apply this concept in §3.1 to define tracer behavior, in §4.1 to design the control loop, and in §6.2 to create test cases. It also supports P03-build-your-own-shell.md, where the tracer can verify fork/exec behavior.

Definitions & key terms

  • ptrace: Kernel interface that allows one process to observe/control another.
  • Tracee: The process being traced.
  • Tracer: The process controlling the tracee.
  • Syscall stop: A ptrace-stop at syscall entry or exit.
  • errno: Error code describing why a syscall failed; the kernel returns it negated, and the libc wrapper stores it in errno.

Mental model diagram

Tracer
  |
  | attach
  v
Tracee --(entry or exit)--> syscall stop
  ^                         |
  | resume <----------------+

How it works

  1. Start or attach to a target process.
  2. Wait for a stop event.
  3. Inspect registers and determine stop type.
  4. Log syscall entry or exit data.
  5. Resume tracee and repeat.
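The five steps above can be sketched as a single loop. This is a minimal sketch, not the project's reference solution, and `trace_command` is a hypothetical name. It assumes x86-64 Linux with glibc and a single-threaded tracee, and it logs raw syscall numbers only. It uses PTRACE_O_TRACESYSGOOD to mark syscall stops and PTRACE_O_TRACEEXEC so that a later exec reports an event instead of a real SIGTRAP. PTRACE_O_EXITKILL makes sure the tracee dies if the tracer does.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(__x86_64__)
#error "this sketch reads x86-64 registers (orig_rax, rax)"
#endif

/* Launch argv[0] under ptrace and log each syscall by number to stderr.
 * Returns the tracee's exit status, 128+signal if a signal killed it,
 * or 2 if the child could not be started or traced. */
int trace_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return 2;
    if (pid == 0) {
        /* Child: request tracing, then exec; the exec stops it with SIGTRAP. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);                    /* exec failed: no stop happens */
    }

    int status;
    /* Steps 1-2: wait for the post-exec stop; an exit here means failure. */
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
        return 2;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)(PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC |
                              PTRACE_O_EXITKILL)) == -1)
        goto fail;

    int in_syscall = 0;   /* 0: next syscall stop is an entry; 1: an exit */
    int sig = 0;          /* signal to forward on the next resume */
    long nr = -1;         /* syscall number captured at entry */
    for (;;) {
        /* Step 5: resume until the next syscall boundary or signal. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig) == -1)
            goto fail;
        sig = 0;
        if (waitpid(pid, &status, 0) == -1)          /* Step 2 */
            goto fail;
        if (WIFEXITED(status) || WIFSIGNALED(status)) {
            if (in_syscall)   /* e.g. exit_group never returns */
                fprintf(stderr, "[pid %d] syscall_%ld() -> ?\n", (int)pid, nr);
            return WIFEXITED(status) ? WEXITSTATUS(status)
                                     : 128 + WTERMSIG(status);
        }
        /* Step 3: a non-syscall stop is a ptrace event (ignored) or a
         * signal-delivery-stop (forwarded on the next resume). */
        if (WSTOPSIG(status) != (SIGTRAP | 0x80)) {
            if ((status >> 16) == 0)
                sig = WSTOPSIG(status);
            continue;
        }
        /* Step 4: syscall stop; read registers and log at the exit. */
        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
            goto fail;
        if (!in_syscall)
            nr = (long)regs.orig_rax;
        else
            fprintf(stderr, "[pid %d] syscall_%ld() -> %lld\n",
                    (int)pid, nr, (long long)regs.rax);
        in_syscall = !in_syscall;
    }

fail:
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return 2;
}
```

A `main` could simply `return trace_command(argv + 1);`. The single `in_syscall` flag is only correct for one tracee; following children needs per-PID state.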

Minimal concrete example

Trace log (conceptual):
[pid 32010] openat("/etc/hostname") -> fd=3
[pid 32010] read(fd=3, bytes=64) -> 12
[pid 32010] close(fd=3) -> 0

Common misconceptions

  • “Tracing shows only my code.” It shows all syscalls, including library activity.
  • “Syscall numbers are stable across architectures.” They are not.
  • “Signals are separate from tracing.” Signals and tracing are deeply intertwined.

Check-your-understanding questions

  1. Why does a tracer see two stops per syscall?
  2. How does the tracer know a syscall failed?
  3. Why can a signal stop be confused with a syscall stop?
  4. What makes tracing slow?

Check-your-understanding answers

  1. One stop at entry to read arguments, one at exit to read return values.
  2. The raw return value is between -4095 and -1; negating it gives the errno value.
  3. Both are delivered as stops and require inspection to distinguish.
  4. Each syscall causes two stops, and each stop costs context switches between tracee and tracer plus the tracer’s own ptrace calls to read registers.

Real-world applications

  • Debugging permission and file access errors.
  • Profiling syscall-heavy workloads.
  • Understanding program startup behavior.

References

  • syscall(2) man page: https://man7.org/linux/man-pages/man2/syscall.2.html
  • ptrace(2) man page: https://man7.org/linux/man-pages/man2/ptrace.2.html
  • “The Linux Programming Interface” - process control chapters

Key insight: Tracing converts invisible kernel behavior into a deterministic log.

Summary: A syscall tracer is a microscope for the user-kernel boundary.

Homework/Exercises to practice the concept

  1. Trace a simple command and count unique syscalls.
  2. Identify which syscalls correspond to file operations.

Solutions to the homework/exercises

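  1. Expect a small set of distinct syscalls, many of them repeated (for example, several mmap calls during dynamic linking).
  2. Look for openat (or open), read, write, close, and stat-like calls such as fstat or newfstatat.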
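To check homework 1 programmatically, a tracer can keep a histogram indexed by syscall number and bump it at each syscall-entry stop. This is a minimal sketch with hypothetical names. The 1024-slot bound is an assumption that comfortably covers current x86-64 syscall numbers; numbers outside it are ignored.

```c
#include <stddef.h>

#define MAX_SYSCALL_NR 1024   /* assumed upper bound on syscall numbers */

struct syscall_histogram {
    unsigned long count[MAX_SYSCALL_NR];
};

/* Call once per syscall-entry stop with the decoded syscall number. */
void histogram_record(struct syscall_histogram *h, long nr)
{
    if (nr >= 0 && nr < MAX_SYSCALL_NR)
        h->count[nr]++;
}

/* Number of distinct syscalls seen so far. */
size_t histogram_unique(const struct syscall_histogram *h)
{
    size_t unique = 0;
    for (size_t i = 0; i < MAX_SYSCALL_NR; i++)
        if (h->count[i] != 0)
            unique++;
    return unique;
}
```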

3. Project Specification

3.1 What You Will Build

A command-line tool that launches a target program and prints a deterministic log of its syscalls, including return values and error codes. It focuses on clarity over completeness and supports a small set of decoded syscalls.

3.2 Functional Requirements

  1. Attach and trace: Launch or attach to a target process.
  2. Syscall logging: Print syscall name/number and return value.
  3. Signal safety: Continue tracing even when signals arrive.

3.3 Non-Functional Requirements

  • Performance: Acceptable for short-lived commands.
  • Reliability: Trace completes without hanging.
  • Usability: Output is readable and printed in the order the syscalls occur.

3.4 Example Usage / Output

$ ./mytrace /bin/echo hello
[pid 4021] write(1, "hello\n") -> 6
[pid 4021] exit_group(0)

3.5 Data Formats / Schemas / Protocols

  • Syscall log line: [pid N] name(args) -> result.
  • Error formatting: failures print -> -1 followed by the errno name, e.g. -> -1 (EPERM).
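A formatter for this line format could look like the sketch below. `format_log_line` and `errno_name` are hypothetical helpers. The errno table lists only a few names; a real tool would generate a full table (glibc 2.32 and later also provide strerrorname_np).

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: names for a few common errno values only. */
static const char *errno_name(int err)
{
    switch (err) {
    case EPERM:  return "EPERM";
    case ENOENT: return "ENOENT";
    case EINTR:  return "EINTR";
    case EBADF:  return "EBADF";
    case EACCES: return "EACCES";
    default:     return "E?";
    }
}

/* Format "[pid N] name(args) -> result" into buf. A raw result in
 * [-4095, -1] is a failure and is shown as "-> -1 (ENAME)". Returns the
 * snprintf result (the untruncated length). */
int format_log_line(char *buf, size_t len, int pid, const char *name,
                    const char *args, long raw_ret)
{
    if (raw_ret < 0 && raw_ret >= -4095)
        return snprintf(buf, len, "[pid %d] %s(%s) -> -1 (%s)", pid, name,
                        args, errno_name((int)-raw_ret));
    return snprintf(buf, len, "[pid %d] %s(%s) -> %ld", pid, name, args,
                    raw_ret);
}
```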

3.6 Edge Cases

  • Tracee exits before first stop.
  • Signals delivered during syscall tracing.
  • Unsupported syscall argument types.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Build in project-root and run ./mytrace /bin/ls.
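  • Tracing a child you launch normally needs no special privileges; attaching to an existing process may be restricted (for example by /proc/sys/kernel/yama/ptrace_scope) or blocked in some containers.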

3.7.2 Golden Path Demo (Deterministic)

Trace a fixed command like /bin/true for stable output.

3.7.3 If CLI: Exact terminal transcript

$ ./mytrace /bin/true
[pid 5012] exit_group(0)
# exit code: 0

Failure demo (deterministic):

$ ./mytrace /bin/does-not-exist
error: exec failed (ENOENT)
# exit code: 2

Exit codes:

  • 0 success
  • 2 invalid command or exec failure
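Detecting an exec failure reliably needs a channel from the child, because a failed execvp never produces a ptrace stop. One common technique, sketched here rather than taken from the project, is a close-on-exec pipe. A successful exec closes the write end, so the parent reads EOF; a failed exec writes errno into the pipe instead. `launch_traced` is a hypothetical name, and the sketch assumes Linux with glibc (pipe2 with O_CLOEXEC).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launch argv under PTRACE_TRACEME. On success, returns the child's pid
 * (stopped, or about to stop, at its post-exec SIGTRAP). On failure,
 * returns -1 and stores the failing errno in *exec_err. */
pid_t launch_traced(char *const argv[], int *exec_err)
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) == -1) {
        *exec_err = errno;
        return -1;
    }
    pid_t pid = fork();
    if (pid == -1) {
        *exec_err = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == 0)
            execvp(argv[0], argv);
        /* Only reached on failure: report errno through the pipe. */
        int err = errno;
        if (write(fds[1], &err, sizeof err) != (ssize_t)sizeof err)
            _exit(126);
        _exit(127);
    }
    close(fds[1]);
    /* A successful exec closes the write end (O_CLOEXEC): read sees EOF. */
    int err = 0;
    ssize_t n;
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);
    if (n == (ssize_t)sizeof err) {
        waitpid(pid, NULL, 0);   /* reap the failed child */
        *exec_err = err;
        return -1;
    }
    return pid;
}
```

When this returns -1, the caller can map the errno to a name, print `error: exec failed (ENOENT)`, and exit with status 2, as in the failure demo.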

4. Solution Architecture

4.1 High-Level Design

Tracer -> wait loop -> inspect registers -> log -> resume

4.2 Key Components

Component | Responsibility | Key Decisions
Attach logic | Start/attach to tracee | Use child tracing for determinism
Stop decoder | Identify syscall vs signal stop | Inspect wait status
Logger | Format output consistently | Minimal decoding for clarity

4.3 Data Structures (No Full Code)

  • Trace state: per-PID current phase (entry/exit).
  • Syscall record: number, args, return value, errno.
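A minimal sketch of these two structures follows, merged into one per-PID record for brevity. The fixed 64-slot table, `struct tracee`, and the function names are hypothetical. A tracer that follows children would look up the record for whatever PID waitpid returns.

```c
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACEES 64   /* assumed limit for this sketch */

struct tracee {
    pid_t pid;              /* 0 marks an unused slot */
    int in_syscall;         /* 0: next syscall stop is entry; 1: exit */
    long nr;                /* syscall number captured at entry */
    unsigned long args[6];  /* raw arguments captured at entry */
};

static struct tracee tracees[MAX_TRACEES];

/* Find the record for pid, creating it on first sight; NULL if full. */
struct tracee *tracee_get(pid_t pid)
{
    struct tracee *free_slot = NULL;
    for (size_t i = 0; i < MAX_TRACEES; i++) {
        if (tracees[i].pid == pid)
            return &tracees[i];
        if (tracees[i].pid == 0 && free_slot == NULL)
            free_slot = &tracees[i];
    }
    if (free_slot != NULL) {
        free_slot->pid = pid;
        free_slot->in_syscall = 0;
        free_slot->nr = -1;
    }
    return free_slot;
}

/* Forget a tracee once waitpid reports that it exited. */
void tracee_forget(pid_t pid)
{
    for (size_t i = 0; i < MAX_TRACEES; i++)
        if (tracees[i].pid == pid)
            tracees[i].pid = 0;
}

/* Flip the phase on each syscall stop; returns 1 if this stop is an entry. */
int tracee_on_syscall_stop(struct tracee *t)
{
    int is_entry = !t->in_syscall;
    t->in_syscall = !t->in_syscall;
    return is_entry;
}
```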

4.4 Algorithm Overview

Key Algorithm: syscall tracing loop

  1. Wait for stop event.
  2. Determine stop type.
  3. Read syscall number or return value.
  4. Log and resume.

Complexity Analysis:

  • Time: O(n) for n syscalls.
  • Space: O(p) for p traced processes.

5. Implementation Guide

5.1 Development Environment Setup

# Install a C compiler and strace for comparison, e.g. on Debian/Ubuntu:
sudo apt-get install build-essential strace

5.2 Project Structure

project-root/
├── src/
│   ├── tracer.c
│   └── decode.c
├── tests/
│   └── trace_tests.sh
└── README.md

5.3 The Core Question You’re Answering

“What does the kernel actually see when my program runs?”

5.4 Concepts You Must Understand First

  1. Syscall ABI
    • Where do syscall numbers and args live?
    • Book Reference: “The Linux Programming Interface” - Ch. 3-4
  2. Process stops
    • How does a tracer receive stop events?
    • Book Reference: “Advanced Programming in the UNIX Environment” - process control

5.5 Questions to Guide Your Design

  1. How will you pair syscall entry and exit stops?
  2. How will you display unsupported syscalls?

5.6 Thinking Exercise

Tracing State Machine

Draw the state machine for a tracee alternating between entry and exit stops.

5.7 The Interview Questions They’ll Ask

  1. “How does strace work internally?”
  2. “What is ptrace used for besides tracing?”
  3. “How do you detect syscall failure?”
  4. “Why is tracing slow?”

5.8 Hints in Layers

Hint 1: Start with syscall numbers. Log only numbers first, then add names.

Hint 2: Alternate stops. Track whether you are at entry or exit for each PID.

Hint 3: Signals. Forward real signals when resuming so program behavior is unchanged, but do not re-inject the SIGTRAP that marks a syscall stop.

Hint 4: Debugging. Compare your output with strace -f on the same command.
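For the second half of Hint 1, a small lookup table keyed by the SYS_* constants from <sys/syscall.h> keeps the numbers correct for the architecture you compile on. This sketch lists only a handful of entries, and `syscall_name` is a hypothetical name. It also assumes the tracee uses the same ABI as the tracer; a 64-bit tracer watching a 32-bit process would see different numbers.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/* Hypothetical partial table; unknown numbers fall back to NULL so the
 * caller can print the raw number instead. */
static const struct {
    long nr;
    const char *name;
} syscall_names[] = {
    { SYS_read, "read" },
    { SYS_write, "write" },
    { SYS_close, "close" },
    { SYS_openat, "openat" },
    { SYS_execve, "execve" },
    { SYS_brk, "brk" },
    { SYS_exit_group, "exit_group" },
};

const char *syscall_name(long nr)
{
    for (size_t i = 0; i < sizeof syscall_names / sizeof syscall_names[0]; i++)
        if (syscall_names[i].nr == nr)
            return syscall_names[i].name;
    return NULL;
}
```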

5.9 Books That Will Help

Topic | Book | Chapter
Syscalls | “The Linux Programming Interface” | Ch. 3-6
Process control | “Advanced Programming in the UNIX Environment” | Ch. 8

5.10 Implementation Phases

Phase 1: Foundation (1-2 days)

Goals:

  • Launch a child process under trace.
  • Log syscall numbers.

Tasks:

  1. Implement attach and wait loop.
  2. Print syscall number at each stop.

Checkpoint: ./mytrace /bin/true prints at least one syscall.

Phase 2: Core Functionality (2-3 days)

Goals:

  • Decode return values and errors.

Tasks:

  1. Detect entry vs exit stops.
  2. Format output with return values.

Checkpoint: Log shows success and failure codes.

Phase 3: Polish & Edge Cases (1-2 days)

Goals:

  • Handle signals and process exit.

Tasks:

  1. Forward signals properly.
  2. Exit cleanly on tracee termination.

Checkpoint: Tracer never hangs after tracee exits.

5.11 Key Implementation Decisions

Decision | Options | Recommendation | Rationale
Trace mode | Attach vs launch | Launch child | Deterministic output
Decoding depth | Full decode vs minimal | Minimal | Keep scope manageable

6. Testing Strategy

6.1 Test Categories

Category | Purpose | Examples
Unit Tests | Validate formatting | Parse log line output
Integration Tests | Trace real commands | /bin/true, /bin/ls
Edge Case Tests | Signal handling | Send SIGINT during trace

6.2 Critical Test Cases

  1. Trace /bin/true: prints a deterministic exit syscall.
  2. Trace missing binary: returns exec error and exit code 2.
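  3. Signal delivery: after SIGINT is forwarded to a tracee that handles it, tracing continues; if the tracee dies from it, the tracer reports this and exits without hanging.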

6.3 Test Data

Expected exit syscall for /bin/true: exit_group(0)

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall | Symptom | Solution
Wrong stop handling | Missing or mismatched syscalls | Track entry/exit per PID
Signal suppression | Tracee behavior changes | Forward signals
Unsupported arch | Nonsense output | Verify ABI for your arch

7.2 Debugging Strategies

  • Compare with strace: Use the same command to confirm.
  • Print raw registers: Confirm syscall numbers when unsure.

7.3 Performance Traps

Tracing adds overhead; avoid using it in tight performance benchmarks.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add syscall name lookup for common calls.
  • Filter by syscall number.

8.2 Intermediate Extensions

  • Decode open paths and file descriptors.
  • Follow child processes.
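Decoding an open path means copying a string out of the tracee's memory. One way is PTRACE_PEEKDATA, which returns one machine word per call while the tracee is stopped. The sketch assumes Linux with glibc, and `read_tracee_string` is a hypothetical helper. Because it reads whole words, a string that ends right before an unmapped page can make the next read fail, in which case the sketch returns -1. process_vm_readv(2) is an alternative for larger reads.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Copy a NUL-terminated string at addr in a stopped tracee into buf.
 * Returns the number of bytes copied (excluding the NUL), or -1 on error.
 * The result is truncated to fit buf and always NUL-terminated. */
long read_tracee_string(pid_t pid, unsigned long addr, char *buf, size_t len)
{
    size_t copied = 0;
    if (len == 0)
        return -1;
    while (copied + 1 < len) {
        errno = 0;   /* PEEKDATA returns data, so -1 alone is not an error */
        long word = ptrace(PTRACE_PEEKDATA, pid, (void *)(addr + copied), NULL);
        if (word == -1 && errno != 0) {
            buf[copied] = '\0';
            return -1;
        }
        char bytes[sizeof word];
        memcpy(bytes, &word, sizeof word);
        for (size_t i = 0; i < sizeof word && copied + 1 < len; i++) {
            buf[copied] = bytes[i];
            if (bytes[i] == '\0')
                return (long)copied;
            copied++;
        }
    }
    buf[copied] = '\0';
    return (long)copied;
}
```

For openat on x86-64, the path pointer is the second argument, which is in rsi at the syscall-entry stop.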

8.3 Advanced Extensions

  • Add time spent per syscall.
  • Output in JSON for tooling integration.

9. Real-World Connections

9.1 Industry Applications

  • Debugging tools: strace, ltrace, and profilers.
  • Security: auditing unexpected syscalls.

9.2 Related Tools

  • strace: https://strace.io/ - full-featured syscall tracer.
  • gdb: https://www.gnu.org/software/gdb/ - uses ptrace under the hood.

9.3 Interview Relevance

Tracing and syscall knowledge are common systems interview topics.


10. Resources

10.1 Essential Reading

  • “The Linux Programming Interface” - syscalls and process control
  • ptrace(2) man page

10.2 Video Resources

  • “Linux syscalls and tracing” - conference talks (search title)

10.3 Tools & Documentation

  • strace: https://strace.io/ - reference output
  • man7.org: https://man7.org/linux/man-pages/man2/ptrace.2.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how ptrace stops a process at syscall boundaries
  • I can explain how errno is derived from return values
  • I understand signal interactions with tracing

11.2 Implementation

  • All functional requirements are met
  • All test cases pass
  • Output is deterministic and readable

11.3 Growth

  • I can explain this project in an interview
  • I documented lessons learned
  • I can identify a next feature to add

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Trace a short command and log syscalls
  • Handle tracee exit without hanging
  • Produce deterministic output

Full Completion:

  • All minimum criteria plus:
  • Support a small set of decoded syscall names
  • Include a failure demo with exit codes

Excellence (Going Above & Beyond):

  • Follow child processes
  • Export structured output for analysis