Project 11: Process Debugging Toolkit
Build a multi-tool report for a PID: syscalls, open files, sockets, and memory.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2 weeks |
| Language | Bash/Python (Alternatives: Go, Rust) |
| Prerequisites | Projects 1-3; comfort with command-line tools |
| Key Topics | strace, pmap, lsof, /proc |
1. Learning Objectives
By completing this project, you will:
- Collect a live strace sample for a PID.
- List open files and sockets for the process.
- Report memory usage and mappings.
- Provide a concise diagnostic summary.
2. Theoretical Foundation
2.1 Core Concepts
- Live observation: strace shows what the process is doing now.
- Resource inventory: open files and sockets often explain behavior.
- Memory footprint: RSS, VSZ, and heap size show how much memory the process holds; comparing snapshots over time shows whether usage is stable or growing.
2.2 Why This Matters
A single tool rarely tells the full story. This toolkit integrates the most useful signals for PID debugging.
2.3 Historical Context / Background
Traditional Unix debugging relies on composing small tools; this project codifies that workflow.
2.4 Common Misconceptions
- “lsof is enough”: It misses syscall behavior and memory trends.
- “strace is always safe”: It slows the traced process and can perturb timing; use short samples.
3. Project Specification
3.1 What You Will Build
A CLI tool debug-process <pid> that outputs a structured report: command info, live strace sample, open files/sockets, memory snapshot, and diagnosis.
3.2 Functional Requirements
- Accept a PID and verify it exists.
- Collect strace for a short duration.
- List open files and network sockets.
- Report memory metrics and map summary.
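The first requirement could be sketched as follows, assuming a Linux system with procfs mounted. The helper names `parse_pid` and `pid_exists` and the `proc_root` parameter are illustrative choices, not part of the project specification; `proc_root` exists only so the check can be exercised against a temporary directory.

```python
import os


def parse_pid(text: str) -> int:
    """Convert a command-line argument to a positive PID or raise ValueError."""
    pid = int(text)  # raises ValueError for non-numeric input
    if pid <= 0:
        raise ValueError("PID must be positive, got {}".format(pid))
    return pid


def pid_exists(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if a procfs entry exists for this PID.

    Assumption: procfs is mounted at proc_root, as on typical Linux systems.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))
```

A process can still exit between this check and later collection steps, so the check narrows the failure window rather than closing it.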
3.3 Non-Functional Requirements
- Safety: Avoid long-running strace sessions.
- Reliability: Handle permission errors gracefully.
- Usability: Sections are clearly labeled.
3.4 Example Usage / Output
$ ./debug-process 1234
Command: python3 server.py
Strace: epoll_wait -> accept -> read -> write
Open files: 12
Memory: RSS 156MB, Heap 45MB
3.5 Real World Outcome
You will run the tool against a live PID and get a full report suitable for pasting into incident notes. The output has the same shape as the example in section 3.4.
4. Solution Architecture
4.1 High-Level Design
PID -> gather metadata -> strace sample -> lsof/pmap -> summary -> report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Metadata | /proc status/cmdline | Prefer /proc |
| Strace sampler | Short tracing | 3-5 seconds |
| Resource lister | lsof or /proc/fd | Use lsof when available |
| Memory | pmap/smaps | Summarize heap and RSS |
4.3 Data Structures
report = {"cmd": "...", "strace": [], "files": [], "mem": {}}
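One way to turn that dictionary into the labeled output from section 3.4 is sketched below. The function name `render_report` and the memory keys `rss_mb` and `heap_mb` are assumptions made for illustration; the project does not fix the contents of `mem`.

```python
def render_report(report: dict) -> str:
    """Format the report dict as labeled lines, one section per line."""
    mem = report.get("mem", {})
    # Assumed keys: "rss_mb" and "heap_mb" are illustrative, not prescribed.
    mem_parts = []
    if "rss_mb" in mem:
        mem_parts.append("RSS {}MB".format(mem["rss_mb"]))
    if "heap_mb" in mem:
        mem_parts.append("Heap {}MB".format(mem["heap_mb"]))

    strace_line = " -> ".join(report.get("strace", [])) or "no sample"
    return "\n".join([
        "Command: " + (report.get("cmd") or "unknown"),
        "Strace: " + strace_line,
        "Open files: " + str(len(report.get("files", []))),
        "Memory: " + (", ".join(mem_parts) or "unavailable"),
    ])
```

Every section falls back to a placeholder, so a partially filled report (for example, when tracing was not permitted) still renders.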
4.4 Algorithm Overview
Key Algorithm: Live Sample
- Attach strace for a fixed window.
- Parse a few representative syscalls.
- Summarize as a one-line activity description.
Complexity Analysis:
- Time: wall-clock time is bounded by the sample window; parsing is O(n) in the lines captured
- Space: O(n) lines captured
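The parse-and-summarize steps above might look like the sketch below. It assumes the captured lines are already in memory and picks "representative" syscalls by frequency, then lists them in order of first appearance; that selection rule and the name `summarize_activity` are illustrative choices, not the only reasonable ones.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a strace line, optionally
# preceded by a "[pid N]" prefix or a bare PID column.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def summarize_activity(lines, top_n=3):
    """Return a one-line description built from the most frequent syscalls."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # skips signal lines, resumed fragments, and exit notices
            counts[match.group(1)] += 1
    if not counts:
        return "no syscalls observed"
    top = {name for name, _ in counts.most_common(top_n)}
    # Counter keeps insertion order, so this lists names by first appearance.
    return " -> ".join(name for name in counts if name in top)
```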
5. Implementation Guide
5.1 Development Environment Setup
strace -V
lsof -v
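The tool can make the same check at startup. The sketch below uses `shutil.which` from the standard library; the function name `find_missing_tools` and the injectable `which` parameter are assumptions added so the check can be tested without depending on what is installed.

```python
import shutil


def find_missing_tools(tools=("strace", "lsof", "pmap"), which=shutil.which):
    """Return the names of required tools that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

A non-empty result is a cue to print a warning and switch to the /proc-based fallbacks rather than to abort.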
5.2 Project Structure
project-root/
├── debug_process.py
└── README.md
5.3 The Core Question You’re Answering
“What is this process doing right now, and what resources is it using?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- strace attach mode
- lsof output and fd types
- pmap/smaps memory summaries
5.5 Questions to Guide Your Design
Before implementing, think through these:
- How long should the strace sample be?
- What happens if you lack permission to trace?
- What is the minimum useful summary?
5.6 Thinking Exercise
Manual debug
Pick a PID and manually run strace -p, lsof -p, and pmap -x. Compare what each reveals.
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you debug a live process without stopping it?”
- “What does lsof show that ps does not?”
- “How do you identify I/O waits from syscalls?”
5.8 Hints in Layers
Hint 1: Keep strace short
Use timeout 5 strace -p <pid>.
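From Python, the same bounded trace could be run as sketched below. This assumes coreutils `timeout` and `strace` are on PATH; `build_strace_cmd` and `sample_strace` are hypothetical helper names. The `-T` flag is added so each line carries the time spent in the syscall.

```python
import subprocess


def build_strace_cmd(pid: int, seconds: int = 5) -> list:
    """Build an argv list that traces one PID for a bounded time."""
    return [
        "timeout", str(seconds),  # coreutils timeout ends the sample
        "strace",
        "-p", str(pid),           # attach to a running process
        "-T",                     # append time spent in each syscall
    ]


def sample_strace(pid: int, seconds: int = 5) -> list:
    """Run the bounded trace and return captured lines (empty on failure)."""
    try:
        proc = subprocess.run(
            build_strace_cmd(pid, seconds),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,  # strace writes its trace to stderr
            text=True,
            check=False,             # timeout exits 124 when the limit is hit
        )
    except FileNotFoundError:
        return []                    # timeout itself is not installed
    return proc.stderr.splitlines()
```

Attach failures, such as a missing ptrace permission, appear as messages in the captured stderr rather than as Python exceptions, so the caller still needs to inspect the lines.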
Hint 2: Use /proc when lsof missing
/proc/<pid>/fd can replace lsof.
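A minimal sketch of that fallback follows. Each entry in the fd directory is a symbolic link whose target names the open file, socket, or pipe. The helper names and the `proc_root` parameter are illustrative assumptions; the classification covers only the common target prefixes.

```python
import os


def list_fds(pid: int, proc_root: str = "/proc") -> dict:
    """Map each fd number to its link target, e.g. 4 -> 'socket:[12345]'."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    # listdir may raise PermissionError or FileNotFoundError; callers decide
    # how to report that.
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except (OSError, ValueError):
            continue  # fd closed mid-scan, or an unexpected entry name
    return targets


def classify_target(target: str) -> str:
    """Classify a link target by the prefix used in /proc fd links."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"
```

Unlike lsof, this gives only the socket inode, not the addresses; resolving those takes a further lookup and is left out of this sketch.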
Hint 3: Summarize
Report only the top 3 syscalls and top 5 files.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Live debugging | “Linux System Programming” | Ch. 10 |
| File descriptors | “TLPI” | Ch. 5 |
| Network debugging | “TCP/IP Illustrated” | Vol. 1 |
5.10 Implementation Phases
Phase 1: Foundation (3-4 days)
Goals:
- Gather metadata and lsof output.
Tasks:
- Read cmdline and status.
- List open files.
Checkpoint: Output shows correct command and fd list.
Phase 2: Core Functionality (4-5 days)
Goals:
- Add strace sampling and memory summary.
Tasks:
- Run short strace attach.
- Parse top syscalls and durations.
- Summarize memory with pmap.
Checkpoint: Report shows activity and memory values.
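For the memory values, one alternative to parsing pmap output is to read procfs directly. The sketch below parses the text of /proc/<pid>/status and /proc/<pid>/smaps; both functions take file content as a string so they can be tested without a live process. The function names and result keys are assumptions for illustration.

```python
def parse_status_kb(status_text: str) -> dict:
    """Extract VmRSS and VmSize (in kB) from /proc/<pid>/status content."""
    wanted = {"VmRSS": "rss_kb", "VmSize": "vsz_kb"}
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[wanted[key]] = int(rest.split()[0])  # value precedes "kB"
    return result


def heap_rss_kb(smaps_text: str) -> int:
    """Sum the Rss of mappings labeled [heap] in /proc/<pid>/smaps content."""
    total = 0
    in_heap = False
    for line in smaps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "-" in fields[0] and not fields[0].endswith(":"):
            # Mapping header: "start-end perms offset dev inode [path]"
            in_heap = fields[-1] == "[heap]"
        elif in_heap and fields[0] == "Rss:":
            total += int(fields[1])
    return total
```

The `[heap]` label covers only the brk-managed region. Large allocations served by anonymous mmap regions are not counted, so treat this figure as a lower bound on allocator usage.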
Phase 3: Polish & Edge Cases (2-3 days)
Goals:
- Improve error handling and output.
Tasks:
- Handle permission denied gracefully.
- Add a summary diagnosis line.
Checkpoint: Report stays readable when run without root against a PID the tool is not permitted to trace.
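One way to keep a single failing section from aborting the whole report is to wrap each collector, as sketched below. The name `collect_section` and the warning wording are assumptions; the exception types are the ones Python raises for permission and missing-process errors on procfs reads.

```python
def collect_section(name, collector):
    """Run one collector; return (data, None) or (None, warning string)."""
    try:
        return collector(), None
    except PermissionError:
        return None, name + ": permission denied (run as root or as the process owner)"
    except (FileNotFoundError, ProcessLookupError):
        return None, name + ": process exited during collection"
```

The report can then print the warning in place of the section body, which also covers the case where the PID exits mid-run.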
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Trace duration | 1s vs 5s | 5s | More representative |
| Output | verbose vs summary | summary first | Faster for triage |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Metadata | Verify cmdline | /proc/<pid>/cmdline |
| Strace | Verify capture | short sample |
| Files | Verify lsof | compare to /proc/fd |
6.2 Critical Test Cases
- PID with no permissions -> graceful warning.
- PID exits mid-run -> tool exits cleanly.
- Strace output parsed correctly.
6.3 Test Data
Strace line: read(7, ...) = 234 <0.004>
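A parser for lines of that shape could be sketched as below. It assumes single-process output without a PID prefix and treats the trailing `<seconds>` field as optional, since it appears only when strace runs with `-T`. The name `parse_strace_line` and the returned keys are illustrative.

```python
import re

# Trailing "<seconds>" field, present only with strace -T.
DURATION_RE = re.compile(r"\s*<(\d+\.\d+)>\s*$")


def parse_strace_line(line: str):
    """Parse 'name(args) = result <secs>' into a dict, or return None."""
    duration = None
    match = DURATION_RE.search(line)
    if match:
        duration = float(match.group(1))
        line = line[:match.start()]
    # Split on the last " = " so an equals sign inside the arguments is ignored.
    call, sep, result = line.rpartition(" = ")
    name, paren, _ = call.partition("(")
    if not sep or not paren or not name.isidentifier():
        return None  # signal lines, exit notices, unfinished or resumed calls
    return {"name": name, "result": result.strip(), "duration": duration}
```

The result is kept as a string because strace prints error returns with an errno name and description after the numeric value.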
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Long strace | Process slows | Use timeout |
| Missing lsof | Empty output | Fall back to /proc/fd |
| Overly verbose | Hard to scan | Summarize first |
7.2 Debugging Strategies
- Compare with manual commands for a known PID.
- Add a debug flag that prints the raw output of each underlying command.
7.3 Performance Traps
strace stops the traced process at every syscall it reports, which adds overhead; avoid running it continuously in production.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add JSON output.
- Add a --summary-only flag.
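Both beginner extensions fit in a few lines with the standard library. In the sketch below, `--summary-only` comes from the list above, while the `--json` flag name and the helper names `build_parser` and `report_to_json` are assumptions.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface: a PID plus two optional output switches."""
    parser = argparse.ArgumentParser(prog="debug-process")
    parser.add_argument("pid", type=int, help="PID of the process to inspect")
    parser.add_argument("--json", action="store_true",
                        help="emit the report as JSON (assumed flag name)")
    parser.add_argument("--summary-only", action="store_true",
                        help="print only the diagnosis summary")
    return parser


def report_to_json(report: dict) -> str:
    """Serialize the report dict with stable key order for easy diffing."""
    return json.dumps(report, indent=2, sort_keys=True)
```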
8.2 Intermediate Extensions
- Add socket summary (listen/established).
- Add top threads by CPU from /proc.
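For the per-thread extension, a sketch that reads /proc/<pid>/task/<tid>/stat is shown below. The values are cumulative clock ticks since each thread started, not a current rate; a rate would need two samples. Function names and the `proc_root` parameter are illustrative assumptions.

```python
import os


def cpu_ticks_from_stat(stat_text: str) -> int:
    """Return utime + stime (clock ticks) from one stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split after the last ')'.
    rest = stat_text[stat_text.rfind(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def top_threads(pid: int, limit: int = 5, proc_root: str = "/proc") -> list:
    """Return up to `limit` (tid, ticks) pairs, busiest first."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    usage = []
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as handle:
                usage.append((int(tid), cpu_ticks_from_stat(handle.read())))
        except (OSError, ValueError, IndexError):
            continue  # thread exited mid-scan or the line was malformed
    return sorted(usage, key=lambda item: item[1], reverse=True)[:limit]
```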
8.3 Advanced Extensions
- Integrate with Project 10 snapshot tool.
- Add baseline comparisons.
9. Real-World Connections
9.1 Industry Applications
- Rapid debugging of hung or slow services.
9.2 Related Open Source Projects
- lsof: https://github.com/lsof-org/lsof
- strace: https://strace.io
9.3 Interview Relevance
- Combining tools to diagnose processes is a key systems skill.
10. Resources
10.1 Essential Reading
- lsof(8) - man 8 lsof
- strace(1) - man 1 strace
10.2 Video Resources
- Live process debugging talks (search “strace lsof pmap”)
10.3 Tools & Documentation
- **/proc/<pid>/fd** and **/proc/<pid>/smaps**
10.4 Related Projects in This Series
- Service Watchdog: use this toolkit for health checks.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain what strace reveals.
- I can interpret lsof output.
- I can explain RSS vs heap.
11.2 Implementation
- Report includes syscalls, fds, and memory.
- Errors are handled gracefully.
- Output is easy to scan.
11.3 Growth
- I can apply the toolkit during incidents.
- I can extend the toolkit with new sections.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Generate a multi-section report for a PID.
Full Completion:
- Include strace sample, open files, and memory summary.
Excellence (Going Above & Beyond):
- Add socket analysis and baseline comparisons.
This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.