Project 10: Performance Snapshot Tool
Capture a full system diagnostic snapshot in one command.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week |
| Language | Python (Alternatives: Go, Rust, Bash) |
| Prerequisites | Projects 1, 3, 5 |
| Key Topics | holistic diagnostics, USE method, correlation |
1. Learning Objectives
By completing this project, you will:
- Collect CPU, memory, disk, and network metrics in one run.
- Summarize top processes by CPU and memory.
- Detect common anomalies from thresholds.
- Produce a readable, shareable report.
2. Theoretical Foundation
2.1 Core Concepts
- USE method: Utilization, Saturation, Errors across system resources.
- Correlation: High load combined with high I/O wait usually points to a disk bottleneck, because Linux load counts tasks blocked in uninterruptible I/O.
- Baseline: Meaningful alerts require comparison to normal behavior.
2.2 Why This Matters
When incidents happen, you need a fast, repeatable way to capture evidence.
2.3 Historical Context / Background
System snapshots are common in SRE practice; many on-call playbooks start with a standard data capture.
2.4 Common Misconceptions
- “One metric explains everything”: You need a full picture.
- “Top process explains load”: Load can be I/O-driven.
3. Project Specification
3.1 What You Will Build
A command-line tool that gathers system metrics, summarizes issues, and writes a report to disk.
3.2 Functional Requirements
- Collect load, CPU times, memory, and swap.
- Capture top processes by CPU and memory.
- Include recent kernel warnings.
- Produce a summary section with recommendations.
3.3 Non-Functional Requirements
- Performance: A snapshot should complete in a few seconds; avoid slow collectors in the critical path.
- Reliability: Handle missing tools gracefully.
- Usability: Clear sections and headings.
3.4 Example Usage / Output
$ ./perf-snapshot --output report.txt
Summary: HIGH LOAD, MEMORY PRESSURE
Top CPU: python3 (145%)
3.5 Real World Outcome
You will run a single command and get a report file that opens with the quick summary shown in 3.4, followed by detailed per-resource sections.
4. Solution Architecture
4.1 High-Level Design
collect metrics -> analyze -> summarize -> write report
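A minimal sketch of that pipeline in Python; the function names and report filename are illustrative assumptions, not a prescribed API:

```python
# Illustrative top-level pipeline; names are assumptions, not a spec.

def collect() -> dict:
    # Gather raw metrics (load, memory, top processes); filled in during later phases.
    return {"summary": [], "cpu": {}, "mem": {}, "top": []}

def analyze(report: dict) -> list[str]:
    # Apply thresholds and return human-readable warning flags.
    return []

def write_report(path: str, flags: list[str], report: dict) -> None:
    with open(path, "w") as f:
        f.write("Summary: " + (", ".join(flags) or "OK") + "\n")
        f.write(repr(report) + "\n")

if __name__ == "__main__":
    snapshot = collect()
    write_report("report.txt", analyze(snapshot), snapshot)
```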
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Collector | Run commands and read /proc | Use /proc when possible |
| Analyzer | Apply thresholds | Use CPU count for load |
| Reporter | Write report | Plain text with sections |
4.3 Data Structures
report = {"summary": [], "cpu": {}, "mem": {}, "top": []}
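Populated, the same structure might look like this (all values illustrative):

```python
report = {
    "summary": ["HIGH LOAD", "MEMORY PRESSURE"],            # warning flags
    "cpu": {"load1": 9.8, "load5": 7.2, "ncpu": 4},          # load averages, core count
    "mem": {"total_kb": 16311492, "available_kb": 412032},   # from /proc/meminfo
    "top": [("python3", 145.0), ("postgres", 38.2)],         # (command, %CPU) pairs
}
```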
4.4 Algorithm Overview
Key Algorithm: Summary Generation
- Evaluate thresholds.
- Add summary lines for each detected issue.
- Print recommendations based on flags.
Complexity Analysis:
- Time: O(1) per threshold check; process sorting is delegated to ps.
- Space: O(1) beyond the captured command output.
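A minimal sketch of that pass in Python, assuming the report structure above; the cutoff values are assumptions to calibrate for your systems:

```python
import os

def summarize(report: dict, ncpu: int | None = None) -> list[str]:
    """Return warning flags derived from simple thresholds."""
    ncpu = ncpu or os.cpu_count() or 1
    flags = []
    # Load is only meaningful relative to the number of CPUs (see 7.1).
    if report["cpu"].get("load1", 0.0) > ncpu:
        flags.append("HIGH LOAD")
    mem = report["mem"]
    # MemAvailable below ~10% of MemTotal is a common pressure signal.
    if mem.get("available_kb", 1) < 0.10 * mem.get("total_kb", 0):
        flags.append("MEMORY PRESSURE")
    return flags
```

Taking ncpu as an optional parameter keeps the function pure and unit-testable, which section 6.2 relies on.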
5. Implementation Guide
5.1 Development Environment Setup
python3 --version
5.2 Project Structure
project-root/
├── perf_snapshot.py
└── README.md
5.3 The Core Question You’re Answering
“How do I capture everything I need for triage in one shot?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- USE method
- Key metrics per resource
- Top process identification
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Which data sources are required for CPU, memory, disk, and network?
- How do you keep the report short but useful?
- What thresholds should trigger warnings?
5.6 Thinking Exercise
Manual snapshot
Collect a snapshot manually using uptime, free, vmstat, ps, and dmesg. Compare to what your tool will automate.
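If you want to script the manual pass, a throwaway runner like this works; exact flags vary by distro, and dmesg may require privileges:

```python
import subprocess

COMMANDS = [["uptime"], ["free", "-m"], ["vmstat", "1", "2"],
            ["ps", "aux", "--sort=-%cpu"], ["dmesg", "--level=warn,err"]]

for cmd in COMMANDS:
    print(f"=== {' '.join(cmd)} ===")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        print(result.stdout or result.stderr)
    except FileNotFoundError:
        print("(command not installed)")
```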
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “Walk me through a fast diagnosis for a slow server.”
- “What is the USE method?”
- “How do you identify top resource consumers quickly?”
5.8 Hints in Layers
Hint 1: Use /proc
Prefer /proc/loadavg and /proc/meminfo.
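Both files are plain text, so parsing takes only a few lines; field layout follows proc(5):

```python
def read_load() -> tuple[float, float, float]:
    # /proc/loadavg looks like "0.42 0.35 0.30 1/123 4567"; the first three
    # fields are the 1-, 5-, and 15-minute load averages.
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_meminfo() -> dict[str, int]:
    # /proc/meminfo lines look like "MemTotal: 16311492 kB"; values are in kB.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])
    return info
```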
Hint 2: Use ps sorting
ps aux --sort=-%cpu and --sort=-%mem.
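A sketch of capturing the top consumers this way (--sort is procps-ng syntax; the helper name is an assumption):

```python
import subprocess

def top_processes(sort_key: str = "-%cpu", limit: int = 5) -> list[str]:
    # ps does the sorting; keep the header row plus the top N entries.
    out = subprocess.run(["ps", "aux", f"--sort={sort_key}"],
                         capture_output=True, text=True, check=True).stdout
    return out.splitlines()[: limit + 1]
```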
Hint 3: Keep sections short
Lead with the summary and top consumers.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| USE method | “Systems Performance” | Ch. 2 |
| CPU analysis | “Systems Performance” | Ch. 6 |
| Memory analysis | “Systems Performance” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Foundation (2 days)
Goals:
- Collect core metrics.
Tasks:
- Read load and memory from /proc.
- Capture top processes.
Checkpoint: Raw data collected into report.
Phase 2: Core Functionality (3 days)
Goals:
- Add summary and recommendations.
Tasks:
- Apply thresholds.
- Add summary section.
Checkpoint: Summary flags match system state.
Phase 3: Polish & Edge Cases (2 days)
Goals:
- Add kernel warnings and clean formatting.
Tasks:
- Add dmesg tail.
- Format report sections.
Checkpoint: Report is readable and consistent.
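Because dmesg may be absent or restricted (kernel.dmesg_restrict), the capture should degrade rather than crash; one possible sketch:

```python
import subprocess

def kernel_warnings(lines: int = 20) -> str:
    # --level filters to warnings and errors (util-linux dmesg).
    try:
        out = subprocess.run(["dmesg", "--level=warn,err"],
                             capture_output=True, text=True, timeout=5)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return "(dmesg unavailable)"
    if out.returncode != 0:
        return "(dmesg failed; may need elevated privileges)"
    return "\n".join(out.stdout.splitlines()[-lines:])
```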
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Output format | text vs JSON | text | Fast to read |
| Data sources | commands vs /proc | /proc when possible | Predictable |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Collection | Ensure metrics gathered | loadavg, meminfo |
| Analysis | Threshold warnings | Simulated load |
| Output | Report format | Snapshot file |
6.2 Critical Test Cases
- High load triggers warning.
- Low MemAvailable triggers warning.
- Report file is created and readable.
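These cases are easiest to pin down when the analyzer is a pure function; for example, assuming the summarize() sketch from section 4.4:

```python
from perf_snapshot import summarize  # assuming the sketch from section 4.4 lives here

def test_high_load_triggers_warning():
    report = {"cpu": {"load1": 9.5}, "mem": {"total_kb": 1000, "available_kb": 500}}
    assert "HIGH LOAD" in summarize(report, ncpu=4)

def test_low_memavailable_triggers_warning():
    report = {"cpu": {"load1": 0.1}, "mem": {"total_kb": 1000, "available_kb": 50}}
    assert "MEMORY PRESSURE" in summarize(report, ncpu=4)
```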
6.3 Test Data
Simulated load > CPU count
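One way to generate that load without extra tools is a short busy-loop script (illustrative; the 1-minute load average needs time to climb, so let it run before snapshotting):

```python
import multiprocessing
import os

def burn() -> None:
    # Pin a core with a busy loop.
    while True:
        pass

if __name__ == "__main__":
    # Spawn more busy workers than cores so the load average exceeds the CPU count.
    for _ in range((os.cpu_count() or 1) + 2):
        multiprocessing.Process(target=burn, daemon=True).start()
    input("Generating load; press Enter to stop.\n")
```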
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing tools | Empty sections | Provide fallbacks |
| Overlong report | Hard to read | Use summary section |
| Wrong thresholds | False alarms | Calibrate with CPU count |
7.2 Debugging Strategies
- Print raw values next to computed warnings.
- Compare with top and free output.
7.3 Performance Traps
Avoid long-running commands in the snapshot path.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add JSON output.
- Add a concise one-line summary mode.
8.2 Intermediate Extensions
- Add network errors from /proc/net/dev.
- Add disk stats from /proc/diskstats.
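As a starting point, /proc/net/dev has two header lines followed by per-interface counter rows; a sketch of pulling error counters (field positions per proc(5); the helper name is an assumption):

```python
def net_errors() -> dict[str, tuple[int, int]]:
    # After the interface name, counter field 2 is RX errors and field 10 is TX errors.
    stats = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:
            iface, counters = line.split(":", 1)
            fields = counters.split()
            stats[iface.strip()] = (int(fields[2]), int(fields[10]))
    return stats
```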
8.3 Advanced Extensions
- Add HTML report with charts.
- Add a repeatable baseline comparison.
9. Real-World Connections
9.1 Industry Applications
- Incident response: capture evidence for postmortems.
9.2 Related Open Source Projects
- sysstat: https://github.com/sysstat/sysstat
- sar: https://linux.die.net/man/1/sar
9.3 Interview Relevance
- System diagnosis workflows are common for SRE roles.
10. Resources
10.1 Essential Reading
- uptime(1) - man 1 uptime
- free(1) - man 1 free
10.2 Video Resources
- Brendan Gregg USE method talks (search “USE method”)
10.3 Tools & Documentation
- /proc/loadavg and /proc/meminfo
10.4 Related Projects in This Series
- Process Debugging Toolkit: deeper per-PID analysis.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the USE method.
- I can interpret load and memory signals.
- I can correlate metrics to causes.
11.2 Implementation
- Snapshot collects all sections.
- Summary highlights issues correctly.
- Report is readable.
11.3 Growth
- I can customize thresholds for my systems.
- I can apply the report to a real incident.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Collect and output core metrics in one report.
Full Completion:
- Add summary warnings and top processes.
Excellence (Going Above & Beyond):
- Add HTML/JSON outputs and baseline comparison.
This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.