Project 10: Performance Snapshot Tool

Capture a full system diagnostic snapshot in one command.

Quick Reference

Attribute       Value
Difficulty      Intermediate
Time Estimate   1 week
Language        Python (Alternatives: Go, Rust, Bash)
Prerequisites   Projects 1, 3, 5
Key Topics      holistic diagnostics, USE method, correlation

1. Learning Objectives

By completing this project, you will:

  1. Collect CPU, memory, disk, and network metrics in one run.
  2. Summarize top processes by CPU and memory.
  3. Detect common anomalies from thresholds.
  4. Produce a readable, shareable report.

2. Theoretical Foundation

2.1 Core Concepts

  • USE method: Utilization, Saturation, Errors across system resources.
  • Correlation: High load combined with high I/O wait points to a disk bottleneck, not a CPU one.
  • Baseline: Meaningful alerts require comparison to normal behavior.
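
To make these concrete, here is a minimal load-per-CPU check, assuming Linux /proc; a sustained value above 1.0 suggests saturation:

import os

def load_per_cpu():
    # /proc/loadavg begins with the 1-, 5-, and 15-minute load averages
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return load1 / (os.cpu_count() or 1)

print(f"load per CPU: {load_per_cpu():.2f}")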

2.2 Why This Matters

When incidents happen, you need a fast, repeatable way to capture evidence.

2.3 Historical Context / Background

System snapshots are common in SRE practice; many on-call playbooks start with a standard data capture.

2.4 Common Misconceptions

  • “One metric explains everything”: You need a full picture.
  • “Top process explains load”: Load can be I/O-driven.

3. Project Specification

3.1 What You Will Build

A command-line tool that gathers system metrics, summarizes issues, and writes a report to disk.

3.2 Functional Requirements

  1. Collect load, CPU times, memory, and swap.
  2. Capture top processes by CPU and memory.
  3. Include recent kernel warnings.
  4. Produce a summary section with recommendations.

3.3 Non-Functional Requirements

  • Performance: A snapshot should complete within a few seconds; avoid slow collectors.
  • Reliability: Handle missing tools gracefully.
  • Usability: Clear sections and headings.

3.4 Example Usage / Output

$ ./perf-snapshot --output report.txt
Summary: HIGH LOAD, MEMORY PRESSURE
Top CPU: python3 (145%)

3.5 Real World Outcome

You will run a single command and get a report file with a quick summary followed by the detailed sections, as shown in the example above.

4. Solution Architecture

4.1 High-Level Design

collect metrics -> analyze -> summarize -> write report
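
A sketch of that pipeline as Python, with stub stage functions standing in for the components described in 4.2:

def collect():
    # placeholder: the real collector reads /proc and runs commands
    return {"summary": [], "cpu": {}, "mem": {}, "top": []}

def analyze(report):
    # placeholder: the real analyzer applies thresholds (section 4.4)
    pass

def write_report(report, path):
    with open(path, "w") as f:
        f.write("SUMMARY: " + ", ".join(report["summary"]) + "\n")

def main(output_path="report.txt"):
    report = collect()
    analyze(report)
    write_report(report, output_path)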

4.2 Key Components

Component  Responsibility               Key Decisions
Collector  Run commands and read /proc  Use /proc when possible
Analyzer   Apply thresholds             Use CPU count for load
Reporter   Write report                 Plain text with sections

4.3 Data Structures

report = {"summary": [], "cpu": {}, "mem": {}, "top": []}
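
For orientation, a populated report might look like this (all values are illustrative):

report = {
    "summary": ["HIGH LOAD", "MEMORY PRESSURE"],
    "cpu": {"load1": 9.4, "cpus": 4},
    "mem": {"total_kb": 16303412, "available_kb": 412044},
    "top": [{"pid": 4231, "cmd": "python3", "cpu": 145.0}],
}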

4.4 Algorithm Overview

Key Algorithm: Summary Generation

  1. Evaluate thresholds.
  2. Add summary lines for each detected issue.
  3. Print recommendations based on flags (a minimal sketch follows).
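
A minimal sketch of these steps, using the report structure from 4.3; the 1.0 load-per-CPU and 10% MemAvailable thresholds are assumptions to calibrate for your systems:

def summarize(report):
    flags, recs = [], []
    if report["cpu"]["load1"] > report["cpu"]["cpus"]:
        flags.append("HIGH LOAD")
        recs.append("Check top CPU consumers and I/O wait.")
    if report["mem"]["available_kb"] < 0.10 * report["mem"]["total_kb"]:
        flags.append("MEMORY PRESSURE")
        recs.append("Check top memory consumers and swap usage.")
    report["summary"] = flags
    return flags, recs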

Complexity Analysis:

  • Time: O(1) (a fixed set of threshold checks)
  • Space: O(1) (a handful of flags and recommendation strings)

5. Implementation Guide

5.1 Development Environment Setup

$ python3 --version

5.2 Project Structure

project-root/
├── perf_snapshot.py
└── README.md

5.3 The Core Question You’re Answering

“How do I capture everything I need for triage in one shot?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. USE method
  2. Key metrics per resource
  3. Top process identification

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Which data sources are required for CPU, memory, disk, and network?
  2. How do you keep the report short but useful?
  3. What thresholds should trigger warnings?

5.6 Thinking Exercise

Manual snapshot

Collect a snapshot manually using uptime, free, vmstat, ps, and dmesg. Compare to what your tool will automate.
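
One possible sequence (standard procps and util-linux tools; dmesg may need elevated privileges on some systems):

$ uptime
$ free -h
$ vmstat 1 5
$ ps aux --sort=-%cpu | head
$ ps aux --sort=-%mem | head
$ dmesg | tail -n 20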

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Walk me through a fast diagnosis for a slow server.”
  2. “What is the USE method?”
  3. “How do you identify top resource consumers quickly?”

5.8 Hints in Layers

Hint 1: Use /proc. Prefer /proc/loadavg and /proc/meminfo.

Hint 2: Use ps sorting. ps aux --sort=-%cpu and ps aux --sort=-%mem.

Hint 3: Keep sections short. Lead with the summary and top consumers.
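
A sketch of Hint 2, assuming a procps-style ps on PATH:

import subprocess

def top_processes(sort_key="-%cpu", n=5):
    out = subprocess.run(
        ["ps", "aux", f"--sort={sort_key}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[:n + 1]  # header plus top n rows

print("\n".join(top_processes()))
print("\n".join(top_processes("-%mem")))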

5.9 Books That Will Help

Topic            Book                   Chapter
USE method       “Systems Performance”  Ch. 2
CPU analysis     “Systems Performance”  Ch. 6
Memory analysis  “Systems Performance”  Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (2 days)

Goals:

  • Collect core metrics.

Tasks:

  1. Read load and memory from /proc.
  2. Capture top processes.

Checkpoint: Raw data collected into report.
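
A minimal sketch for Task 1's memory half, parsing /proc/meminfo into a dict:

def read_meminfo():
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            mem[key] = int(rest.split()[0])  # first field is the number (usually kB)
    return mem

mem = read_meminfo()
print(mem["MemTotal"], mem["MemAvailable"])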

Phase 2: Core Functionality (3 days)

Goals:

  • Add summary and recommendations.

Tasks:

  1. Apply thresholds.
  2. Add summary section.

Checkpoint: Summary flags match system state.

Phase 3: Polish & Edge Cases (2 days)

Goals:

  • Add kernel warnings and clean formatting.

Tasks:

  1. Add dmesg tail.
  2. Format report sections.

Checkpoint: Report is readable and consistent.
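
A sketch of the dmesg tail, assuming the util-linux dmesg (which supports --level); where dmesg is missing or restricted, the section degrades to a note instead of failing:

import subprocess

def dmesg_tail(n=20):
    try:
        out = subprocess.run(
            ["dmesg", "--level=warn,err"],
            capture_output=True, text=True, timeout=5,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ["dmesg not available"]
    if out.returncode != 0:  # e.g. kernel.dmesg_restrict=1 without root
        return ["dmesg unavailable: " + out.stderr.strip()]
    return out.stdout.splitlines()[-n:]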

5.11 Key Implementation Decisions

Decision       Options            Recommendation       Rationale
Output format  text vs JSON       text                 Fast to read
Data sources   commands vs /proc  /proc when possible  Predictable

6. Testing Strategy

6.1 Test Categories

Category    Purpose                  Examples
Collection  Ensure metrics gathered  loadavg, meminfo
Analysis    Threshold warnings       Simulated load
Output      Report format            Snapshot file

6.2 Critical Test Cases

  1. High load triggers warning.
  2. Low MemAvailable triggers warning.
  3. Report file is created and readable (a test sketch follows).
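
A sketch of test case 1, written against the summarize() sketch from section 4.4:

def test_high_load_triggers_warning():
    # assumes summarize() from the section 4.4 sketch is importable
    report = {
        "cpu": {"load1": 8.0, "cpus": 4},
        "mem": {"total_kb": 1000, "available_kb": 900},
    }
    flags, _ = summarize(report)
    assert "HIGH LOAD" in flags
    assert "MEMORY PRESSURE" not in flags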

6.3 Test Data

Simulated load > CPU count
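
One throwaway way to generate such load from Python (not part of the tool; press Enter to stop):

import multiprocessing, os

def burn():
    while True:
        pass

if __name__ == "__main__":
    # start a few more busy workers than there are CPUs
    for _ in range((os.cpu_count() or 1) + 2):
        multiprocessing.Process(target=burn, daemon=True).start()
    input("Generating load; press Enter to stop.\n")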

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall           Symptom         Solution
Missing tools     Empty sections  Provide fallbacks
Overlong report   Hard to read    Use summary section
Wrong thresholds  False alarms    Calibrate with CPU count

7.2 Debugging Strategies

  • Print raw values next to computed warnings.
  • Compare with top and free output.

7.3 Performance Traps

Avoid long-running commands in the snapshot path.
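
One way to enforce this is a bounded runner, so a single slow tool cannot stall the snapshot:

import subprocess

def run_bounded(cmd, timeout=5):
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout,
        ).stdout
    except subprocess.TimeoutExpired:
        return "(skipped: " + " ".join(cmd) + f" exceeded {timeout}s)"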


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add JSON output.
  • Add a concise one-line summary mode.

8.2 Intermediate Extensions

  • Add network errors from /proc/net/dev (a parsing sketch follows this list).
  • Add disk stats from /proc/diskstats.
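
For the network item above, a minimal sketch that reads per-interface error counters from /proc/net/dev:

def net_errors():
    errs = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            # fields[2] is receive errors, fields[10] is transmit errors
            errs[iface.strip()] = (int(fields[2]), int(fields[10]))
    return errs

print(net_errors())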

8.3 Advanced Extensions

  • Add HTML report with charts.
  • Add a repeatable baseline comparison.

9. Real-World Connections

9.1 Industry Applications

  • Incident response: capture evidence for postmortems.

9.2 Related Tools

  • sysstat: https://github.com/sysstat/sysstat
  • sar: https://linux.die.net/man/1/sar

9.3 Interview Relevance

  • System diagnosis workflows are common for SRE roles.

10. Resources

10.1 Essential Reading

  • uptime(1) - man 1 uptime
  • free(1) - man 1 free

10.2 Video Resources

  • Brendan Gregg USE method talks (search “USE method”)

10.3 Tools & Documentation

  • /proc/loadavg and /proc/meminfo

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the USE method.
  • I can interpret load and memory signals.
  • I can correlate metrics to causes.

11.2 Implementation

  • Snapshot collects all sections.
  • Summary highlights issues correctly.
  • Report is readable.

11.3 Growth

  • I can customize thresholds for my systems.
  • I can apply the report to a real incident.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Collect and output core metrics in one report.

Full Completion:

  • Add summary warnings and top processes.

Excellence (Going Above & Beyond):

  • Add HTML/JSON outputs and baseline comparison.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.