Project 5: System Health Monitor

Build a real-time dashboard that shows load, memory, swap, and VM stats in one view.

Quick Reference

Attribute       Value
Difficulty      Beginner
Time Estimate   Weekend
Language        Bash (Alternatives: Python, Go, Rust)
Prerequisites   Basic shell scripting
Key Topics      load average, free/available memory, vmstat

1. Learning Objectives

By completing this project, you will:

  1. Parse load averages from /proc/loadavg.
  2. Extract memory metrics from /proc/meminfo and free.
  3. Interpret vmstat fields for CPU and I/O pressure.
  4. Present metrics with thresholds and trend indicators.

2. Theoretical Foundation

2.1 Core Concepts

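  • Load average: On Linux, an exponentially damped average of the number of tasks that are runnable or in uninterruptible sleep (usually waiting on I/O), reported over 1, 5, and 15 minutes.
  • Memory accounting: MemFree counts only memory that is not in use at all, while MemAvailable estimates how much can be allocated without swapping, including reclaimable page cache.
  • vmstat: Periodic samples of run queue length, swap and block I/O, and the CPU time breakdown.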

2.2 Why This Matters

Many production slowdowns show up in these metrics first. Knowing how to read them gives you a fast signal during incidents.

2.3 Historical Context / Background

These metrics have existed in some form since early Unix and remain the foundation for monitoring systems and dashboards.

2.4 Common Misconceptions

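  • “Free memory must be high”: Linux deliberately uses otherwise idle memory for page cache and gives it back under pressure, so low free memory is normal.
  • “Load average equals CPU%”: On Linux it also counts tasks in uninterruptible sleep, typically waiting on I/O, so load can be high while the CPU is mostly idle.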
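
To see the first misconception on a live system, it helps to print MemFree and MemAvailable side by side. The following is a minimal sketch, assuming a Linux kernel that exposes MemAvailable in /proc/meminfo (3.14 or later); meminfo_kb is a hypothetical helper name used only for illustration.

```shell
# meminfo_kb KEY [FILE]: print one field (in kB) from a meminfo-style file.
# FILE defaults to /proc/meminfo; passing a file makes the helper testable.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    free_kb=$(meminfo_kb MemFree)
    avail_kb=$(meminfo_kb MemAvailable)
    echo "MemFree:      ${free_kb} kB"
    echo "MemAvailable: ${avail_kb} kB"
fi
```

On a machine that has been running for a while, MemAvailable is usually much larger than MemFree because most of the page cache can be reclaimed.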

3. Project Specification

3.1 What You Will Build

A Bash dashboard that refreshes every N seconds, showing load, memory, swap, and vmstat values with basic status flags.

3.2 Functional Requirements

  1. Show 1/5/15 minute load averages.
  2. Show total/used/available memory and swap usage.
  3. Show vmstat fields (r, b, si, so, wa).

3.3 Non-Functional Requirements

  • Performance: Minimal overhead.
  • Reliability: Handles missing tools gracefully.
  • Usability: Clear labels and consistent units.

3.4 Example Usage / Output

$ ./health-monitor --interval 2
Load: 1.25 1.50 1.75
Mem: 12.4G used / 16G total (avail 3.4G)
Swap: 0.4G / 8G
vmstat: r=2 b=0 wa=1 si=0 so=0

3.5 Real World Outcome

You will run the script and get a concise view of system health that refreshes at the chosen interval, as in the example output in section 3.4.

4. Solution Architecture

4.1 High-Level Design

collect metrics -> compute thresholds -> render -> sleep -> repeat
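
The pipeline above can be sketched as a small loop. This is only an illustration under stated assumptions: the stage names (collect_metrics, compute_flags, render), the LOADAVG_FILE override, and the optional COUNT argument are hypothetical and exist here so the loop can be exercised without running forever.

```shell
# Placeholder stages; a real script would fill these in.
collect_metrics() { read -r LOAD_1 LOAD_5 LOAD_15 rest < "${LOADAVG_FILE:-/proc/loadavg}"; }
compute_flags()   { :; }   # thresholds would be evaluated here
render()          { printf 'Load: %s %s %s\n' "$LOAD_1" "$LOAD_5" "$LOAD_15"; }

# run_loop INTERVAL [COUNT]
# COUNT limits the number of refreshes; 0 (the default) runs until interrupted.
run_loop() {
    interval=$1
    count=${2:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        collect_metrics
        compute_flags
        render
        i=$((i + 1))
        # Sleep between refreshes, but not after the final one.
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: run_loop 2      (refresh every 2 seconds until Ctrl-C)
```

Keeping each stage in its own function makes it easy to test collection and rendering separately from the timing logic.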

4.2 Key Components

Component       Responsibility        Key Decisions
Load reader     Read /proc/loadavg    Prefer /proc over uptime
Memory reader   Read /proc/meminfo    Use MemAvailable
vmstat reader   Run vmstat 1 2        Use the last data line (the first reports averages since boot)
Renderer        Format output         Fixed units and labels
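
The vmstat reader is the least obvious component, so here is a hedged sketch of one way to parse it. It assumes the usual procps layout (one banner line, one line of column names, then data lines) and looks columns up by name rather than by position; parse_vmstat is a hypothetical function name.

```shell
# parse_vmstat: read `vmstat 1 2` output on stdin and print "r b si so wa"
# taken from the last data line. Columns are located via the header names.
parse_vmstat() {
    awk '
        NR == 2 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # column names
        NR > 2  {                                                 # data lines
            r  = $(col["r"]);  b  = $(col["b"])
            si = $(col["si"]); so = $(col["so"]); wa = $(col["wa"])
        }
        END     { print r, b, si, so, wa }
    '
}

# Example (takes about one second):
#   vmstat 1 2 | parse_vmstat
```

Because later data lines overwrite earlier ones, the values printed are from the final sample, which reflects the last second rather than the average since boot.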

4.3 Data Structures

LOAD_1=; LOAD_5=; LOAD_15=
MEM_TOTAL=; MEM_AVAIL=
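
One possible way to populate these variables is shown below. It is a sketch, not a reference implementation: read_load, read_mem, and kb_to_gib are hypothetical helper names, and each reader accepts an optional file argument so it can be pointed at a fixture instead of /proc.

```shell
# read_load [FILE]: set LOAD_1, LOAD_5, LOAD_15 from a loadavg-style file.
read_load() {
    read -r LOAD_1 LOAD_5 LOAD_15 rest < "${1:-/proc/loadavg}"
}

# read_mem [FILE]: set MEM_TOTAL and MEM_AVAIL (both in kB) from a meminfo-style file.
read_mem() {
    MEM_TOTAL=$(awk '$1 == "MemTotal:" { print $2; exit }' "${1:-/proc/meminfo}")
    MEM_AVAIL=$(awk '$1 == "MemAvailable:" { print $2; exit }' "${1:-/proc/meminfo}")
}

# kb_to_gib KB: format a kB count as GiB with one decimal place.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1fG", kb / 1048576 }'
}
```

Storing raw kB values and converting only at render time keeps threshold arithmetic simple and the displayed units consistent.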

4.4 Algorithm Overview

Key Algorithm: Thresholding

  1. Compare load to CPU count.
  2. Compare MemAvailable to total.
  3. Flag swap in/out activity.

Complexity Analysis:

  • Time: O(1) per refresh
  • Space: O(1)
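
The three thresholding steps might look like the sketch below. The cut-off values (0.7 and 1.0 load per CPU, 20% and 10% available memory) and the function names are illustrative assumptions, not recommendations from the source; pick values that suit your own systems.

```shell
# load_status LOAD NCPU: OK below 0.7 per CPU, WARN below 1.0, otherwise ALERT.
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        per_cpu = load / ncpu
        if (per_cpu >= 1.0)      print "ALERT"
        else if (per_cpu >= 0.7) print "WARN"
        else                     print "OK"
    }'
}

# mem_status AVAIL_KB TOTAL_KB: flag low available memory as a share of total.
mem_status() {
    awk -v a="$1" -v t="$2" 'BEGIN {
        pct = 100 * a / t
        if (pct < 10)      print "ALERT"
        else if (pct < 20) print "WARN"
        else               print "OK"
    }'
}

# swap_status SI SO: any swap-in or swap-out activity in the sample is a warning.
swap_status() {
    if [ "$1" -gt 0 ] || [ "$2" -gt 0 ]; then echo "WARN"; else echo "OK"; fi
}
```

awk handles the floating-point comparisons because shell arithmetic is integer-only.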

5. Implementation Guide

5.1 Development Environment Setup

uname -a
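
Beyond confirming the kernel, it can be useful to check that the tools the dashboard relies on are installed. A minimal sketch follows; require_tools is a hypothetical helper name, and which tools to check is up to you.

```shell
# require_tools NAME...: report missing commands on stderr.
# Returns 0 if every command is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   require_tools awk vmstat free || echo "some metrics will be skipped" >&2
```

The same check can be reused at runtime to skip a section of the dashboard instead of exiting when a tool such as vmstat is absent.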

5.2 Project Structure

project-root/
├── health-monitor
└── README.md

5.3 The Core Question You’re Answering

“Is the system slow due to CPU, memory pressure, or I/O wait?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Load averages
    • Compare to CPU count.
  2. MemAvailable
    • Why it differs from MemFree.
  3. vmstat fields
    • r, b, wa, si, so.

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. What thresholds indicate warning or alert?
  2. How often should you sample to avoid noise?
  3. Should you keep a small history for trend arrows?
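
For the trend question, one simple option is to compare two values and ignore small changes. The sketch below assumes a hypothetical trend helper and a default noise margin of 0.05; both are illustrative choices.

```shell
# trend PREV CURR [EPS]: print "up", "down", or "flat".
# Changes no larger than EPS (default 0.05) count as flat to suppress noise.
trend() {
    awk -v p="$1" -v c="$2" -v eps="${3:-0.05}" 'BEGIN {
        d = c - p
        if (d > eps)       print "up"
        else if (d < -eps) print "down"
        else               print "flat"
    }'
}
```

Note that the load averages already carry some history: calling trend with the 15-minute value as PREV and the 1-minute value as CURR gives a rough direction without storing any samples.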

5.6 Thinking Exercise

Compare raw sources

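Compare the output of uptime, /proc/loadavg, and vmstat 1 2. Confirm that uptime and /proc/loadavg report the same load averages, then check whether the r and b columns of vmstat (runnable and blocked tasks) are plausible given the 1-minute load.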

5.7 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What does load average represent?”
  2. “Why is low MemFree not necessarily a problem?”
  3. “What does high wa in vmstat mean?”

5.8 Hints in Layers

Hint 1: Use /proc. /proc/loadavg is easy to parse.

Hint 2: Use MemAvailable. It is the best signal of usable memory.

Hint 3: Use vmstat 1 2. Ignore the first data line, which reports averages since boot.

5.9 Books That Will Help

Topic          Book                         Chapter
Load average   “How Linux Works”            Ch. 8
Memory stats   “Linux System Programming”   Ch. 4
vmstat         “Systems Performance”        Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (Half day)

Goals:

  • Parse load and memory.

Tasks:

  1. Read /proc/loadavg.
  2. Read /proc/meminfo.

Checkpoint: Values match uptime and free.

Phase 2: Core Functionality (Half day)

Goals:

  • Add vmstat and thresholds.

Tasks:

  1. Parse vmstat sample.
  2. Compute warning flags.

Checkpoint: Warnings align with actual load.

Phase 3: Polish & Edge Cases (Half day)

Goals:

  • Add formatting and refresh loop.

Tasks:

  1. Print clear output.
  2. Handle missing tools.

Checkpoint: Dashboard refreshes cleanly.

5.11 Key Implementation Decisions

Decision       Options                  Recommendation   Rationale
Data sources   /proc vs commands        /proc            Stable parsing
Output         One-line vs multi-line   Multi-line       Readability

6. Testing Strategy

6.1 Test Categories

Category     Purpose             Examples
Parsing      Validate values     Compare with free
Thresholds   Validate warnings   Simulate high load
Refresh      Validate loop       Run 1 min

6.2 Critical Test Cases

  1. Load averages match /proc/loadavg.
  2. MemAvailable matches free output.
  3. vmstat fields align with vmstat output.
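
These cases are easier to check repeatably against a fixture file than against a live system. The following is a small sketch of that idea; assert_eq and parse_load1 are hypothetical names, and the fixture reuses the sample load values from this guide.

```shell
# assert_eq EXPECTED ACTUAL LABEL: report one comparison; non-zero on mismatch.
assert_eq() {
    if [ "$1" = "$2" ]; then
        echo "ok: $3"
    else
        echo "FAIL: $3 (expected '$1', got '$2')"
        return 1
    fi
}

# parse_load1 FILE: stand-in for the parser under test; prints the 1-minute load.
parse_load1() {
    read -r one rest < "$1"
    echo "$one"
}

# Fixture-based check: the parser must return exactly what the file contains.
fixture=$(mktemp)
echo '1.25 1.50 1.75 2/345 6789' > "$fixture"
assert_eq "1.25" "$(parse_load1 "$fixture")" "1-minute load from fixture"
rm -f "$fixture"
```

The same pattern extends to saved copies of /proc/meminfo and vmstat output, which lets the parsing tests run on any machine.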

6.3 Test Data

Sample load: 1.25 1.50 1.75

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                 Symptom           Solution
Using MemFree only      False alarms      Use MemAvailable
Parsing vmstat header   Bad values        Skip the two header lines and use the last data line
Comparing load to 1.0   Wrong threshold   Normalize by CPU count

7.2 Debugging Strategies

  • Print raw /proc values alongside parsed output.
  • Check CPU count from /proc/cpuinfo.
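
Checking the CPU count can be as simple as counting processor entries. This sketch assumes the common /proc/cpuinfo layout in which each logical CPU has a line starting with "processor"; some architectures format the file differently, in which case nproc or getconf _NPROCESSORS_ONLN are alternatives. cpu_count is a hypothetical helper name.

```shell
# cpu_count [FILE]: count "processor" entries in a cpuinfo-style file.
# FILE defaults to /proc/cpuinfo.
cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example:
#   echo "CPUs: $(cpu_count)"
```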

7.3 Performance Traps

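Very short intervals can cause self-induced load, because each refresh spawns several processes; an interval of 1-2 seconds or longer is a reasonable default. Note that vmstat 1 2 itself takes about one second to return.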


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add disk usage with df -h.
  • Add CPU usage from /proc/stat.
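
For the /proc/stat extension, CPU usage comes from the difference between two readings of the aggregate cpu line. The sketch below is one hedged way to compute it; cpu_busy_pct is a hypothetical name, and treating iowait as idle time is a choice made here, not the only valid one.

```shell
# cpu_busy_pct LINE1 LINE2
# LINE1 and LINE2 are the aggregate "cpu ..." lines from two /proc/stat reads.
# Prints the share of non-idle time between them as a percentage.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            idle[NR] = $5 + $6      # idle + iowait
        }
        END {
            dt = t[2] - t[1]
            di = idle[2] - idle[1]
            if (dt <= 0) print "0.0"
            else printf "%.1f\n", 100 * (dt - di) / dt
        }'
}

# Example (takes one second):
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   echo "CPU busy: $(cpu_busy_pct "$a" "$b")%"
```

The counters in /proc/stat are cumulative since boot, which is why a single reading is not enough to show current usage.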

8.2 Intermediate Extensions

  • Track trends across 5 samples.
  • Add JSON output.
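
JSON output can be produced with printf alone when every value is numeric. The sketch below assumes that condition holds and uses hypothetical key names (load, mem_total_kb, mem_avail_kb); string values would need proper escaping or a tool such as jq.

```shell
# render_json LOAD_1 LOAD_5 LOAD_15 MEM_TOTAL_KB MEM_AVAIL_KB
# Emits one JSON object per call; all arguments must be plain numbers.
render_json() {
    printf '{"load":[%s,%s,%s],"mem_total_kb":%s,"mem_avail_kb":%s}\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example:
#   render_json 1.25 1.50 1.75 16384000 3400000
```

Printing one object per line makes the output easy to append to a log and to process line by line later.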

8.3 Advanced Extensions

  • Build a small ncurses UI.
  • Add alert hooks (email/webhook).

9. Real-World Connections

9.1 Industry Applications

  • Basic triage during incident response.

9.2 Related Open Source Projects

  • collectd: https://collectd.org
  • node_exporter: https://github.com/prometheus/node_exporter

9.3 Interview Relevance

  • Load and memory interpretation is standard Linux interview material.

10. Resources

10.1 Essential Reading

  • free(1) - man 1 free
  • vmstat(8) - man 8 vmstat

10.2 Video Resources

  • Linux performance basics (search “Linux load average”)

10.3 Tools & Documentation

  • /proc/meminfo and /proc/loadavg

11. Self-Assessment Checklist

11.1 Understanding

  • I can interpret load averages.
  • I can explain MemAvailable.
  • I can read vmstat fields.

11.2 Implementation

  • Metrics are parsed correctly.
  • Thresholds are reasonable.
  • Dashboard refreshes smoothly.

11.3 Growth

  • I can explain system health to a teammate.
  • I can extend the dashboard with new metrics.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Display load, memory, and vmstat in a loop.

Full Completion:

  • Add thresholds and clear status labels.

Excellence (Going Above & Beyond):

  • Add historical trends and JSON export.

This guide was generated from LINUX_SYSTEM_TOOLS_MASTERY.md. For the complete learning path, see the parent directory.