Project 10: Performance Snapshot Tool

Capture a full system diagnostic snapshot in one command.

Quick Reference

Attribute       Value
Difficulty      Intermediate
Time Estimate   1 week
Language        Python (Alternatives: Go, Rust, Bash)
Prerequisites   Projects 1, 3, 5
Key Topics      holistic diagnostics, USE method, correlation

1. Learning Objectives

By completing this project, you will:

  1. Collect CPU, memory, disk, and network metrics in one run.
  2. Summarize top processes by CPU and memory.
  3. Detect common anomalies from thresholds.
  4. Produce a readable, shareable report.

2. Theoretical Foundation

2.1 Core Concepts

  • USE method: Utilization, Saturation, Errors across system resources.
  • Correlation: High load combined with high I/O wait points to a disk bottleneck, not a CPU one.
  • Baseline: Meaningful alerts require comparison to normal behavior.
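
To make these concrete, here is a minimal load-per-CPU check, assuming Linux /proc; a sustained value above 1.0 suggests saturation:

import os

def load_per_cpu():
    # /proc/loadavg begins with the 1-, 5-, and 15-minute load averages
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return load1 / (os.cpu_count() or 1)

print(f"load per CPU: {load_per_cpu():.2f}")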

2.2 Why This Matters

When incidents happen, you need a fast, repeatable way to capture evidence.

2.3 Historical Context / Background

System snapshots are common in SRE practice; many on-call playbooks start with a standard data capture.

2.4 Common Misconceptions

  • “One metric explains everything”: You need a full picture.
  • “Top process explains load”: Load can be I/O-driven.

3. Project Specification

3.1 What You Will Build

A command-line tool that gathers system metrics, summarizes issues, and writes a report to disk.

3.2 Functional Requirements

  1. Collect load, CPU times, memory, and swap.
  2. Capture top processes by CPU and memory.
  3. Include recent kernel warnings.
  4. Produce a summary section with recommendations.

3.3 Non-Functional Requirements

  • Performance: A snapshot should complete within a few seconds; avoid slow collectors.
  • Reliability: Handle missing tools gracefully.
  • Usability: Clear sections and headings.

3.4 Example Usage / Output

$ ./perf-snapshot --output report.txt
Summary: HIGH LOAD, MEMORY PRESSURE
Top CPU: python3 (145%)

3.5 Real World Outcome

You will run a single command and get a report file with a quick summary followed by the detailed sections, as shown in the example above.

4. Solution Architecture

4.1 High-Level Design

collect metrics -> analyze -> summarize -> write report
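
A sketch of that pipeline as Python, with stub stage functions standing in for the components described in 4.2:

def collect():
    # placeholder: the real collector reads /proc and runs commands
    return {"summary": [], "cpu": {}, "mem": {}, "top": []}

def analyze(report):
    # placeholder: the real analyzer applies thresholds (section 4.4)
    pass

def write_report(report, path):
    with open(path, "w") as f:
        f.write("SUMMARY: " + ", ".join(report["summary"]) + "\n")

def main(output_path="report.txt"):
    report = collect()
    analyze(report)
    write_report(report, output_path)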

4.2 Key Components

Component  Responsibility               Key Decisions
Collector  Run commands and read /proc  Use /proc when possible
Analyzer   Apply thresholds             Use CPU count for load
Reporter   Write report                 Plain text with sections

4.3 Data Structures

report = {"summary": [], "cpu": {}, "mem": {}, "top": []}
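
For orientation, a populated report might look like this (all values are illustrative):

report = {
    "summary": ["HIGH LOAD", "MEMORY PRESSURE"],
    "cpu": {"load1": 9.4, "cpus": 4},
    "mem": {"total_kb": 16303412, "available_kb": 412044},
    "top": [{"pid": 4231, "cmd": "python3", "cpu": 145.0}],
}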

4.4 Algorithm Overview

Key Algorithm: Summary Generation

  1. Evaluate thresholds.
  2. Add summary lines for each detected issue.
  3. Print recommendations based on flags (a minimal sketch follows).
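
A minimal sketch of these steps, using the report structure from 4.3; the 1.0 load-per-CPU and 10% MemAvailable thresholds are assumptions to calibrate for your systems:

def summarize(report):
    flags, recs = [], []
    if report["cpu"]["load1"] > report["cpu"]["cpus"]:
        flags.append("HIGH LOAD")
        recs.append("Check top CPU consumers and I/O wait.")
    if report["mem"]["available_kb"] < 0.10 * report["mem"]["total_kb"]:
        flags.append("MEMORY PRESSURE")
        recs.append("Check top memory consumers and swap usage.")
    report["summary"] = flags
    return flags, recs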

Complexity Analysis:

  • Time: O(1) (a fixed set of threshold checks)
  • Space: O(1) (a handful of flags and recommendation strings)

5. Implementation Guide

5.1 Development Environment Setup

$ python3 --version

5.2 Project Structure

project-root/
├── perf_snapshot.py
└── README.md

5.3 The Core Question You’re Answering

“How do I capture everything I need for triage in one shot?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. USE method
  2. Key metrics per resource
  3. Top process identification

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Which data sources are required for CPU, memory, disk, and network?
  2. How do you keep the report short but useful?
  3. What thresholds should trigger warnings?

5.6 Thinking Exercise

Manual snapshot

Collect a snapshot manually using uptime, free, vmstat, ps, and dmesg. Compare to what your tool will automate.
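
One possible sequence (standard procps and util-linux tools; dmesg may need elevated privileges on some systems):

$ uptime
$ free -h
$ vmstat 1 5
$ ps aux --sort=-%cpu | head
$ ps aux --sort=-%mem | head
$ dmesg | tail -n 20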

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Walk me through a fast diagnosis for a slow server.”
  2. “What is the USE method?”
  3. “How do you identify top resource consumers quickly?”

5.8 Hints in Layers

Hint 1: Use /proc. Prefer /proc/loadavg and /proc/meminfo.

Hint 2: Use ps sorting. ps aux --sort=-%cpu and ps aux --sort=-%mem.

Hint 3: Keep sections short. Lead with the summary and top consumers.
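
A sketch of Hint 2, assuming a procps-style ps on PATH:

import subprocess

def top_processes(sort_key="-%cpu", n=5):
    out = subprocess.run(
        ["ps", "aux", f"--sort={sort_key}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[:n + 1]  # header plus top n rows

print("\n".join(top_processes()))
print("\n".join(top_processes("-%mem")))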

5.9 Books That Will Help

Topic            Book                   Chapter
USE method       “Systems Performance”  Ch. 2
CPU analysis     “Systems Performance”  Ch. 6
Memory analysis  “Systems Performance”  Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (2 days)

Goals:

  • Collect core metrics.

Tasks:

  1. Read load and memory from /proc.
  2. Capture top processes.

Checkpoint: Raw data collected into report.
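
A minimal sketch for Task 1's memory half, parsing /proc/meminfo into a dict:

def read_meminfo():
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            mem[key] = int(rest.split()[0])  # first field is the number (usually kB)
    return mem

mem = read_meminfo()
print(mem["MemTotal"], mem["MemAvailable"])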

Phase 2: Core Functionality (3 days)

Goals:

  • Add summary and recommendations.

Tasks:

  1. Apply thresholds.
  2. Add summary section.

Checkpoint: Summary flags match system state.

Phase 3: Polish & Edge Cases (2 days)

Goals:

  • Add kernel warnings and clean formatting.

Tasks:

  1. Add dmesg tail.
  2. Format report sections.

Checkpoint: Report is readable and consistent.
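
A sketch of the dmesg tail, assuming the util-linux dmesg (which supports --level); where dmesg is missing or restricted, the section degrades to a note instead of failing:

import subprocess

def dmesg_tail(n=20):
    try:
        out = subprocess.run(
            ["dmesg", "--level=warn,err"],
            capture_output=True, text=True, timeout=5,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ["dmesg not available"]
    if out.returncode != 0:  # e.g. kernel.dmesg_restrict=1 without root
        return ["dmesg unavailable: " + out.stderr.strip()]
    return out.stdout.splitlines()[-n:]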

5.11 Key Implementation Decisions

Decision       Options            Recommendation       Rationale
Output format  text vs JSON       text                 Fast to read
Data sources   commands vs /proc  /proc when possible  Predictable

6. Testing Strategy

6.1 Test Categories

Category    Purpose                  Examples
Collection  Ensure metrics gathered  loadavg, meminfo
Analysis    Threshold warnings       Simulated load
Output      Report format            Snapshot file

6.2 Critical Test Cases

  1. High load triggers warning.
  2. Low MemAvailable triggers warning.
  3. Report file is created and readable (a test sketch follows).
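
A sketch of test case 1, written against the summarize() sketch from section 4.4:

def test_high_load_triggers_warning():
    # assumes summarize() from the section 4.4 sketch is importable
    report = {
        "cpu": {"load1": 8.0, "cpus": 4},
        "mem": {"total_kb": 1000, "available_kb": 900},
    }
    flags, _ = summarize(report)
    assert "HIGH LOAD" in flags
    assert "MEMORY PRESSURE" not in flags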

6.3 Test Data

Simulated load > CPU count
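
One throwaway way to generate such load from Python (not part of the tool; press Enter to stop):

import multiprocessing, os

def burn():
    while True:
        pass

if __name__ == "__main__":
    # start a few more busy workers than there are CPUs
    for _ in range((os.cpu_count() or 1) + 2):
        multiprocessing.Process(target=burn, daemon=True).start()
    input("Generating load; press Enter to stop.\n")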

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall           Symptom         Solution
Missing tools     Empty sections  Provide fallbacks
Overlong report   Hard to read    Use summary section
Wrong thresholds  False alarms    Calibrate with CPU count

7.2 Debugging Strategies

  • Print raw values next to computed warnings.
  • Compare with top and free output.

7.3 Performance Traps

Avoid long-running commands in the snapshot path.
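
One way to enforce this is a bounded runner, so a single slow tool cannot stall the snapshot:

import subprocess

def run_bounded(cmd, timeout=5):
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout,
        ).stdout
    except subprocess.TimeoutExpired:
        return "(skipped: " + " ".join(cmd) + f" exceeded {timeout}s)"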


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add JSON output.
  • Add a concise one-line summary mode.

8.2 Intermediate Extensions

  • Add network errors from /proc/net/dev (a parsing sketch follows this list).
  • Add disk stats from /proc/diskstats.
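
For the network item above, a minimal sketch that reads per-interface error counters from /proc/net/dev:

def net_errors():
    errs = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            # fields[2] is receive errors, fields[10] is transmit errors
            errs[iface.strip()] = (int(fields[2]), int(fields[10]))
    return errs

print(net_errors())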

8.3 Advanced Extensions

  • Add HTML report with charts.
  • Add a repeatable baseline comparison.

9. Real-World Connections

9.1 Industry Applications

  • Incident response: capture evidence for postmortems.

9.2 Related Tools

  • sysstat: https://github.com/sysstat/sysstat
  • sar: https://linux.die.net/man/1/sar

9.3 Interview Relevance

  • System diagnosis workflows are common for SRE roles.

10. Resources

10.1 Essential Reading

  • uptime(1) - man 1 uptime
  • free(1) - man 1 free

10.2 Video Resources

  • Brendan Gregg USE method talks (search “USE method”)

10.3 Tools & Documentation

  • /proc/loadavg and /proc/meminfo

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the USE method.
  • I can interpret load and memory signals.
  • I can correlate metrics to causes.

11.2 Implementation

  • Snapshot collects all sections.
  • Summary highlights issues correctly.
  • Report is readable.

11.3 Growth

  • I can customize thresholds for my systems.
  • I can apply the report to a real incident.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Collect and output core metrics in one report.

Full Completion:

  • Add summary warnings and top processes.

Excellence (Going Above & Beyond):

  • Add HTML/JSON outputs and baseline comparison.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.