Project 11: Process Debugging Toolkit

Build a multi-tool report for a PID: syscalls, open files, sockets, and memory.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2 weeks |
| Language | Bash/Python (Alternatives: Go, Rust) |
| Prerequisites | Projects 1-3, comfortable with tools |
| Key Topics | strace, pmap, lsof, /proc |

1. Learning Objectives

By completing this project, you will:

  1. Collect a live strace sample for a PID.
  2. List open files and sockets for the process.
  3. Report memory usage and mappings.
  4. Provide a concise diagnostic summary.

2. Theoretical Foundation

2.1 Core Concepts

  • Live observation: strace shows what the process is doing now.
  • Resource inventory: open files and sockets often explain behavior.
  • Memory footprint: RSS, VSZ, and heap size show how much memory the process holds; comparing snapshots over time shows whether usage is stable.

2.2 Why This Matters

A single tool rarely tells the full story. This toolkit integrates the most useful signals for PID debugging.

2.3 Historical Context / Background

Traditional Unix debugging relies on composing small tools; this project codifies that workflow.

2.4 Common Misconceptions

  • “lsof is enough”: It misses syscall behavior and memory trends.
  • “strace is always safe”: strace stops the process at every syscall via ptrace, which slows it down and can perturb timing; use short samples.

3. Project Specification

3.1 What You Will Build

A CLI tool debug-process <pid> that outputs a structured report: command info, live strace sample, open files/sockets, memory snapshot, and diagnosis.

3.2 Functional Requirements

  1. Accept a PID and verify it exists.
  2. Collect strace for a short duration.
  3. List open files and network sockets.
  4. Report memory metrics and map summary.
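
Requirement 1 can be sketched as below. This is a minimal illustration that assumes Linux with /proc mounted; the names `parse_pid` and `pid_exists` and the `proc_root` parameter are hypothetical and exist only so the check can be pointed at a test directory.

```python
import os


def parse_pid(text: str) -> int:
    """Convert a command-line argument to a positive PID or raise ValueError."""
    pid = int(text)  # raises ValueError for non-numeric input
    if pid <= 0:
        raise ValueError(f"PID must be positive, got {pid}")
    return pid


def pid_exists(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if a directory for this PID exists under proc_root."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))
```

A process can still exit between this check and the later collection steps, so each collector should tolerate a PID that disappears mid-run.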

3.3 Non-Functional Requirements

  • Safety: Avoid long-running strace sessions.
  • Reliability: Handle permission errors gracefully.
  • Usability: Sections are clearly labeled.

3.4 Example Usage / Output

$ ./debug-process 1234
Command: python3 server.py
Strace: epoll_wait -> accept -> read -> write
Open files: 12
Memory: RSS 156MB, Heap 45MB

3.5 Real World Outcome

You will run the tool and get a full report for a PID, suitable for incident notes. The output matches the example in Section 3.4.

4. Solution Architecture

4.1 High-Level Design

PID -> gather metadata -> strace sample -> lsof/pmap -> summary -> report

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Metadata | /proc status/cmdline | Prefer /proc |
| Strace sampler | Short tracing | 3-5 seconds |
| Resource lister | lsof or /proc/fd | Use lsof when available |
| Memory | pmap/smaps | Summarize heap and RSS |

4.3 Data Structures

report = {"cmd": "...", "strace": [], "files": [], "mem": {}}
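
If you prefer named fields over a bare dict, the same shape could be expressed as a dataclass. This is only one possible sketch; the class name `Report` is an assumption, and the four fields mirror the keys shown above.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Report:
    cmd: str = ""                                 # command line of the process
    strace: list = field(default_factory=list)    # sampled syscall lines
    files: list = field(default_factory=list)     # open files and sockets
    mem: dict = field(default_factory=dict)       # memory metrics


# asdict() converts back to the plain dict form shown above.
report = Report(cmd="python3 server.py")
report_dict = asdict(report)
```

Using `default_factory` gives every report its own lists and dict, so two reports never share mutable state.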

4.4 Algorithm Overview

Key Algorithm: Live Sample

  1. Attach strace for a fixed window.
  2. Parse a few representative syscalls.
  3. Summarize as a one-line activity description.

Complexity Analysis:

  • Time: O(1) per run, bounded by sample time
  • Space: O(n) lines captured
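
Step 1 of the algorithm might look like the sketch below. It assumes strace is installed and that you are allowed to attach to the target; the function names are hypothetical. strace writes its trace to stderr by default, so that is the stream captured here.

```python
import subprocess
from typing import List


def build_strace_cmd(pid: int) -> List[str]:
    # -p attaches to a running process; -T appends the time spent in each
    # syscall to the end of its line.
    return ["strace", "-T", "-p", str(pid)]


def run_for_window(cmd: List[str], seconds: float) -> List[str]:
    """Run cmd for at most `seconds` and return the lines it wrote to stderr."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    try:
        _, err = proc.communicate(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()              # SIGTERM; strace detaches from the process
        _, err = proc.communicate()   # collect what was captured so far
    return err.splitlines()


def sample_strace(pid: int, seconds: float = 5.0) -> List[str]:
    """Attach strace to pid for a fixed window and return the captured lines."""
    return run_for_window(build_strace_cmd(pid), seconds)
```

If strace is not installed, `subprocess.Popen` raises `FileNotFoundError`; if attaching is not permitted, strace exits early and its error message is in the returned lines.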

5. Implementation Guide

5.1 Development Environment Setup

strace -V
lsof -v

5.2 Project Structure

project-root/
├── debug_process.py
└── README.md

5.3 The Core Question You’re Answering

“What is this process doing right now, and what resources is it using?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. strace attach mode
  2. lsof output and fd types
  3. pmap/smaps memory summaries

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. How long should the strace sample be?
  2. What happens if you lack permission to trace?
  3. What is the minimum useful summary?

5.6 Thinking Exercise

Manual debug

Pick a PID and manually run strace -p, lsof -p, and pmap -x. Compare what each reveals.

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you debug a live process without stopping it?”
  2. “What does lsof show that ps does not?”
  3. “How do you identify I/O waits from syscalls?”

5.8 Hints in Layers

Hint 1: Keep strace short. Use timeout 5 strace -p <pid>.

Hint 2: Use /proc when lsof is missing. /proc/<pid>/fd can replace lsof.

Hint 3: Summarize. Report only the top 3 syscalls and top 5 files.
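
The /proc fallback from Hint 2 can be sketched as follows. Each entry in /proc/<pid>/fd is a symlink whose target names the open file, socket, or pipe. The function names and the `proc_root` parameter are hypothetical.

```python
import os
from typing import Dict


def list_fds(pid: int, proc_root: str = "/proc") -> Dict[int, str]:
    """Map each open fd number to the target of its /proc/<pid>/fd symlink."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
    return result


def fd_kind(target: str) -> str:
    """Classify a symlink target by its prefix."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"  # e.g. epoll or eventfd handles
    return "file"
```

Listing the fd directory of a process owned by another user raises `PermissionError` unless you run as root, so callers should be ready to catch it.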

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Live debugging | “Linux System Programming” | Ch. 10 |
| File descriptors | “TLPI” | Ch. 5 |
| Network debugging | “TCP/IP Illustrated” | Vol. 1 |

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

Goals:

  • Gather metadata and lsof output.

Tasks:

  1. Read cmdline and status.
  2. List open files.

Checkpoint: Output shows correct command and fd list.
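
A minimal sketch of the Phase 1 metadata step, assuming the standard Linux /proc layout: /proc/<pid>/cmdline holds NUL-separated arguments, and /proc/<pid>/status holds "Key:<tab>value" lines. The function names are hypothetical.

```python
import os
from typing import Dict


def parse_cmdline(raw: bytes) -> str:
    """Join the NUL-separated arguments from /proc/<pid>/cmdline with spaces."""
    parts = raw.rstrip(b"\0").split(b"\0")
    return " ".join(p.decode(errors="replace") for p in parts)


def parse_status(text: str) -> Dict[str, str]:
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_metadata(pid: int, proc_root: str = "/proc") -> Dict[str, str]:
    """Read the command line and a few status fields for one process."""
    base = os.path.join(proc_root, str(pid))
    with open(os.path.join(base, "cmdline"), "rb") as f:
        cmd = parse_cmdline(f.read())
    with open(os.path.join(base, "status")) as f:
        status = parse_status(f.read())
    return {"cmd": cmd, "state": status.get("State", ""), "rss": status.get("VmRSS", "")}
```

Kernel threads have an empty cmdline, so `cmd` can legitimately be an empty string; the `Name` field in status is a reasonable fallback in that case.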

Phase 2: Core Functionality (4-5 days)

Goals:

  • Add strace sampling and memory summary.

Tasks:

  1. Run short strace attach.
  2. Parse top syscalls and durations.
  3. Summarize memory with pmap.

Checkpoint: Report shows activity and memory values.
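
The memory summary can also be computed directly from /proc/<pid>/smaps instead of parsing pmap output. The sketch below sums the `Rss:` lines across all mappings and separately for the mapping labeled `[heap]`; the function name and the returned keys are hypothetical.

```python
import re
from typing import Dict

# A mapping header looks like:
#   01000000-01021000 rw-p 00000000 00:00 0    [heap]
# i.e. address range, permissions, offset, device, inode, optional name.
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$")


def summarize_smaps(text: str) -> Dict[str, int]:
    """Return total RSS and [heap] RSS in kB from the text of an smaps file."""
    total_kb = 0
    heap_kb = 0
    current_name = ""
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current_name = header.group(1).strip()
            continue
        if line.startswith("Rss:"):
            kb = int(line.split()[1])
            total_kb += kb
            if current_name == "[heap]":
                heap_kb += kb
    return {"rss_kb": total_kb, "heap_rss_kb": heap_kb}
```

Note that `[heap]` covers only the brk-managed heap. Large allocations that the allocator serves with mmap show up as anonymous mappings, so the heap figure can understate the memory a program has allocated.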

Phase 3: Polish & Edge Cases (2-3 days)

Goals:

  • Improve error handling and output.

Tasks:

  1. Handle permission denied gracefully.
  2. Add a summary diagnosis line.

Checkpoint: Report stays readable when run as a non-root user, including for PIDs you are not permitted to trace.
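
One way to keep the report readable when a section fails is to wrap each collector so that errors become warnings instead of tracebacks. The helper below is a hypothetical sketch; `collect_section` and its messages are not part of any existing tool.

```python
from typing import Any, Callable, Optional, Tuple


def collect_section(
    name: str, collector: Callable[..., Any], *args: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Run one collector and return (data, warning) instead of raising."""
    try:
        return collector(*args), None
    except PermissionError:
        return None, f"{name}: permission denied (run as root or as the process owner)"
    except (FileNotFoundError, ProcessLookupError):
        # Reading /proc/<pid>/... after the process has exited fails this way.
        return None, f"{name}: process no longer exists"
```

The report writer can then print the data when it is present and the warning otherwise, so one failed section does not hide the others.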

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Trace duration | 1s vs 5s | 5s | More representative |
| Output | verbose vs summary | summary first | Faster for triage |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Metadata | Verify cmdline | /proc/<pid>/cmdline |
| Strace | Verify capture | short sample |
| Files | Verify lsof | compare to /proc/fd |

6.2 Critical Test Cases

  1. PID with no permissions -> graceful warning.
  2. PID exits mid-run -> tool exits cleanly.
  3. Strace output parsed correctly.

6.3 Test Data

Strace line: read(7, ...) = 234 <0.004>
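
A parser for lines of this shape could be sketched as below. It assumes the trace was taken with `strace -T`, which appends the `<seconds>` suffix; lines that do not look like a completed syscall (signals, unfinished calls) are skipped. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# name(args) = return-value [error text] [<duration>]
SYSCALL = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+)"
    r"(?P<rest>.*?)(?:\s+<(?P<dur>\d+\.\d+)>)?$"
)


def parse_strace_line(line: str) -> Optional[Dict[str, object]]:
    """Return name, return value, and duration for one syscall line, else None."""
    m = SYSCALL.match(line.strip())
    if not m:
        return None
    dur = m.group("dur")
    return {
        "name": m.group("name"),
        "ret": m.group("ret"),
        "duration": float(dur) if dur else None,
    }


def top_syscalls(lines: List[str], n: int = 3) -> List[Tuple[str, int]]:
    """Count syscall names across the sample and return the n most frequent."""
    counts = Counter()
    for line in lines:
        parsed = parse_strace_line(line)
        if parsed:
            counts[parsed["name"]] += 1
    return counts.most_common(n)
```

Applied to the test line above, the parser yields the name `read`, the return value `234`, and a duration of 0.004 seconds.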

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Long strace | Process slows | Use timeout |
| Missing lsof | Empty output | Fall back to /proc/fd |
| Overly verbose | Hard to scan | Summarize first |

7.2 Debugging Strategies

  • Compare with manual commands for a known PID.
  • Add a debug flag that prints the raw output of each underlying command.

7.3 Performance Traps

strace adds overhead to every syscall the traced process makes; avoid running it continuously in production.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add JSON output.
  • Add a --summary-only flag.

8.2 Intermediate Extensions

  • Add socket summary (listen/established).
  • Add top threads by CPU from /proc.
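
For the socket summary, one possible approach is to match the socket inodes found in /proc/<pid>/fd against the kernel's TCP table. The sketch below handles only the IPv4 table (/proc/net/tcp); IPv6, UDP, and Unix sockets live in separate files with similar layouts. The function names are hypothetical, and the column positions assume the standard table layout where the state is the fourth field and the inode the tenth.

```python
import re
from collections import Counter
from typing import Dict, List

# Hex state codes used in /proc/net/tcp; other states are grouped as OTHER.
TCP_STATES = {"01": "ESTABLISHED", "0A": "LISTEN"}
SOCKET_LINK = re.compile(r"socket:\[(\d+)\]")


def parse_tcp_table(text: str) -> Dict[str, str]:
    """Map socket inode -> state name from the text of /proc/net/tcp."""
    states = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        states[fields[9]] = TCP_STATES.get(fields[3], "OTHER")
    return states


def summarize_sockets(fd_targets: List[str], tcp_text: str) -> Dict[str, int]:
    """Count TCP states for the sockets among a process's fd symlink targets."""
    table = parse_tcp_table(tcp_text)
    counts = Counter()
    for target in fd_targets:
        m = SOCKET_LINK.fullmatch(target)
        if m and m.group(1) in table:
            counts[table[m.group(1)]] += 1
    return dict(counts)
```

The TCP table lists every socket in the network namespace, which is why the inode match against the process's own descriptors is needed.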

8.3 Advanced Extensions

  • Integrate with Project 10 snapshot tool.
  • Add baseline comparisons.

9. Real-World Connections

9.1 Industry Applications

  • Rapid debugging of hung or slow services.

9.2 Related Open Source Projects

  • lsof: https://github.com/lsof-org/lsof
  • strace: https://strace.io

9.3 Interview Relevance

  • Combining tools to diagnose processes is a key systems skill.

10. Resources

10.1 Essential Reading

  • lsof(8) - man 8 lsof
  • strace(1) - man 1 strace

10.2 Video Resources

  • Live process debugging talks (search “strace lsof pmap”)

10.3 Tools & Documentation

  • /proc/<pid>/fd and /proc/<pid>/smaps

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain what strace reveals.
  • I can interpret lsof output.
  • I can explain RSS vs heap.

11.2 Implementation

  • Report includes syscalls, fds, and memory.
  • Errors are handled gracefully.
  • Output is easy to scan.

11.3 Growth

  • I can apply the toolkit during incidents.
  • I can extend the toolkit with new sections.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Generate a multi-section report for a PID.

Full Completion:

  • Include strace sample, open files, and memory summary.

Excellence (Going Above & Beyond):

  • Add socket analysis and baseline comparisons.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.