Project 4: System Resource Monitor

Build a CLI monitor that samples CPU load, memory usage, and disk usage at fixed intervals and writes clean CSV output for analysis.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8 to 12 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 2: Practical and Useful |
| Business Potential | Level 3: Service & Support |
| Prerequisites | Shell basics, basic understanding of CPU/memory terms |
| Key Topics | /proc metrics, load average vs CPU usage, memory/disk parsing, CSV logging, sampling intervals |

1. Learning Objectives

By completing this project, you will:

  1. Interpret system metrics correctly (load average, memory usage, disk usage).
  2. Extract numeric values from command output reliably.
  3. Produce a clean CSV time series with consistent timestamps.
  4. Understand sampling intervals and their trade-offs.
  5. Build a monitor that runs for long periods without corrupting output.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: System Metrics and the /proc Interface

Fundamentals

Linux exposes system metrics through both commands (free, df, uptime) and virtual files in /proc. These numbers are not arbitrary; they reflect how the kernel measures load, memory, and disk usage. Load average is not the same as CPU utilization: it represents the number of runnable or waiting processes over time. Memory usage includes caches and buffers, which are reclaimable. Disk usage is typically measured by filesystem statistics, not by raw disk size. If you misunderstand these metrics, your monitoring script will produce misleading results. This concept gives you the mental model needed to interpret the numbers correctly.

Deep Dive into the Concept

The /proc filesystem is a pseudo-filesystem provided by the kernel. Files like /proc/loadavg and /proc/meminfo are generated on the fly and contain real-time system data. /proc/loadavg includes the 1, 5, and 15-minute load averages. A load average of 1.0 means, on a single-core system, that there is on average one runnable process. On multi-core systems, a load of 1.0 may indicate idle capacity. This is why load must be interpreted relative to CPU count. It is common to compute load / cores to determine relative saturation.
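
As a quick sketch of that normalization, assuming nproc from GNU coreutils is available:

# Normalize the 1-minute load by core count; values near or above 1.0 suggest saturation
cores=$(nproc)
load=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN {printf "load per core: %.2f\n", l/c}'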

Memory metrics are also nuanced. free reports total, used, and available memory. How “used” is calculated has changed between versions of free (older versions counted caches and buffers as used), and cached memory is not necessarily a problem because the kernel can reclaim it. The available field is often the most meaningful because it estimates how much memory can be allocated to new workloads without swapping. If you simply compute used/total, you may overstate memory pressure. A monitoring script should choose one metric and document it. For this project, you can compute used_percent as 100 * (total - available) / total to reflect pressure rather than raw usage.
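
A minimal sketch of that formula, read directly from /proc/meminfo (both fields are in kB, so the units cancel; MemAvailable requires kernel 3.14 or later):

# Memory pressure: 100 * (MemTotal - MemAvailable) / MemTotal
mem=$(awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2} END {printf "%.0f", 100*(total-avail)/total}' /proc/meminfo)
printf 'mem_used_percent=%s\n' "$mem"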

Disk usage is obtained via df, which reports filesystem usage. df / shows the root filesystem usage percentage. If the system uses multiple mount points, you might want to monitor a specific mount like / or /var. Disk usage percentage is a simple metric but is often the most actionable. For more detailed analysis, you could also record total and available space in bytes, but for this project, a percentage is sufficient.

CPU usage can be measured in multiple ways. A naive approach is to parse top, but this is complex and inconsistent. A simpler approach is to use load average from /proc/loadavg, which is a well-defined metric. If you want CPU percentage, you can read /proc/stat twice and compute deltas, but that adds complexity. For this project, load average plus memory and disk usage is enough to monitor overall health. The key is to document what each metric means so the user does not misinterpret it.
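
If you do want a CPU percentage later, here is a rough sketch of the /proc/stat delta approach; the field layout follows man 5 proc, and idle plus iowait is treated as idle time:

# Aggregate "cpu" line fields: user nice system idle iowait irq softirq steal ...
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
idle=$(( (i2 + w2) - (i1 + w1) ))
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
awk -v t="$total" -v i="$idle" 'BEGIN {printf "cpu_percent: %.1f\n", 100*(t-i)/t}'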

Parsing /proc and command output requires careful text processing. Use awk to extract fields, and ensure you remove percent signs or units. Always test your parsing commands in isolation before embedding them in a script. If the output format varies between distributions, choose the most stable sources: /proc/loadavg, /proc/meminfo, and df -P (POSIX output) are relatively stable. df -P forces a portable output format that is easier to parse.

Finally, be aware of measurement timing. If you read memory and disk at different times in a loop, they will not correspond to the exact same instant, but for this monitoring script that is acceptable. The important part is consistent sampling intervals and timestamps. Your script should record the timestamp of each sample and include it in the CSV output so later analysis can align metrics correctly.

How This Fits in Projects

This concept is essential for §3.2 (Functional Requirements) and §3.4 (Example Output), because you define the specific metrics and how they are computed. It also informs §5.4 and §6.2 test cases. These metric interpretation skills also help in P01 Website Status Checker when you measure response times.

Definitions & Key Terms

  • Load average: Average number of runnable or uninterruptibly blocked processes, averaged over 1, 5, and 15 minutes.
  • /proc: Virtual filesystem exposing kernel data structures.
  • Available memory: Estimate of memory that can be allocated without swapping.
  • Filesystem usage: Percentage of space used on a mounted filesystem.
  • Sampling: Periodic measurement over time.

Mental Model Diagram (ASCII)

/proc -> metrics -> parse -> normalize -> log

How It Works (Step-by-Step)

  1. Read /proc/loadavg for load average.
  2. Read /proc/meminfo or free for memory totals.
  3. Read df -P / for disk usage.
  4. Normalize values into percentages.
  5. Emit a CSV line with timestamp.

Invariants: metrics are numeric; output must be consistent. Failure modes: parsing wrong field, locale differences, missing commands.

Minimal Concrete Example

# 1-minute load average is the first field of /proc/loadavg
load=$(awk '{print $1}' /proc/loadavg)
# free: $2 = total, $7 = available; percent reflects pressure, not raw "used"
mem=$(free | awk '/^Mem:/ {printf "%.0f", ($2-$7)/$2*100}')
# df -P: row 2, field 5 is "Use%"; strip the % sign
disk=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')

Common Misconceptions

  • “Load average equals CPU usage.” -> It measures runnable processes, not CPU percent.
  • “Used memory is always bad.” -> Caches and buffers are reclaimable.
  • “df shows disk usage for the whole system.” -> It shows a specific filesystem.

Check-Your-Understanding Questions

  1. Why is load average not the same as CPU usage?
  2. Why does free show high used memory even on idle systems?
  3. Why is df -P better for scripts?

Check-Your-Understanding Answers

  1. Load average measures runnable/blocked processes, not percent CPU time.
  2. Linux uses memory for caches; it is not necessarily pressure.
  3. It enforces a consistent, parseable output format.

Real-World Applications

  • Monitoring servers for overload
  • Capacity planning
  • Baseline performance measurement

References

  • How Linux Works (Ward), Ch. 4
  • The Linux Command Line (Shotts), Ch. 10

Key Insight

Metrics are only useful when you understand what they actually measure.

Summary

Understanding system metrics prevents false alarms and misleading reports.

Homework/Exercises to Practice the Concept

  1. Compare load average and CPU usage during an idle period.
  2. Compute memory pressure using MemAvailable from /proc/meminfo.
  3. Print disk usage percentage for / using df -P.

Solutions to the Homework/Exercises

  1. uptime; top -bn1 | head -n 5
  2. awk '/MemAvailable/ {avail=$2} /MemTotal/ {total=$2} END {print 100*(total-avail)/total}' /proc/meminfo
  3. df -P / | awk 'NR==2 {print $5}'

Concept 2: Sampling, Time-Series Logging, and CSV Hygiene

Fundamentals

A monitor is only as good as its sampling strategy and output format. Sampling defines how often you collect metrics, and it shapes what you can detect. A 1-second interval captures spikes but generates large logs; a 60-second interval is lighter but misses short events. CSV is the simplest format for time-series data because it is easy to parse and import into spreadsheets or scripts. But CSV output must be consistent: a fixed header, stable ordering, and no extra whitespace or logging mixed into the data. This concept teaches you how to design a monitoring loop that is reliable, long-running, and produces clean data.

Deep Dive into the Concept

Sampling is a trade-off between resolution and overhead. If your interval is too short, the monitor itself can consume CPU and disk; if it is too long, you miss transient spikes. For most learning purposes, 5 seconds is a good default. Your script should accept an interval argument and validate it as a positive integer. It should also handle termination gracefully so the CSV file is not corrupted. For example, if the user hits Ctrl+C, you should not leave a partially written line. A trap handler can cleanly exit or ensure the last line is complete.
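
A sketch of that validation and signal wiring (the error text mirrors the failure demo in §3.7.4):

# Reject anything that is not a positive integer
interval=${1:-5}
case $interval in
  ''|*[!0-9]*|0) printf 'ERROR: interval must be a positive integer\n' >&2; exit 1 ;;
esac
# Exit cleanly on Ctrl+C so no partially written CSV line is left behind
trap 'printf "stopping\n"; exit 0' INT TERM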

CSV hygiene is about keeping data and logs separate. The CSV file should contain only data lines, with a single header at the top. Your script should check if the file exists; if not, write the header. If it exists, append without duplicating the header. This is crucial for long-running monitors. Use printf to avoid unexpected formatting. Avoid echo because its behavior can vary, especially with -e and escape sequences. Set LC_ALL=C to ensure numeric formatting uses . as the decimal separator.
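
In practice that is a single line near the top of the script:

export LC_ALL=C  # pin the locale so printf and awk always emit "." as the decimal separator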

Timestamp consistency is another critical point. Decide on a timestamp format, such as ISO 8601 (YYYY-MM-DDTHH:MM:SSZ), and stick to it. Use UTC (date -u) to avoid confusion across time zones. If you want deterministic output for tests, allow SOURCE_DATE_EPOCH to override the current time. This is a standard convention for reproducible builds and works well for test fixtures.
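
A sketch of a timestamp helper, assuming GNU date (the -d @epoch form is a GNU extension). This minimal version pins every row to the same instant; matching the advancing timestamps in §3.7.3 means adding sample_index * interval to the base epoch, which is left as part of the exercise:

# ISO 8601 UTC timestamp; SOURCE_DATE_EPOCH overrides the clock for tests
timestamp() {
  if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
    date -u -d "@$SOURCE_DATE_EPOCH" +%FT%TZ
  else
    date -u +%FT%TZ
  fi
}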

When logging, you must decide where to store your CSV. If you log to the current directory, a cron job might run with a different working directory. Therefore, use an explicit output path or resolve it relative to the script location. This avoids logs going to unexpected places. Also consider log rotation: if you run a monitor for days, the CSV file can grow large. A simple approach is to include the date in the filename so a new file is created each day. This can be an extension for advanced learners.
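
A common sketch for anchoring the log next to the script; treating the second positional argument as the output path is an assumption of this example, consistent with §3.2:

# Resolve the script's own directory so cron's working directory is irrelevant
script_dir=$(cd "$(dirname "$0")" && pwd)
out="${2:-$script_dir/resource_log.csv}"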

Finally, think about measurement jitter. If you simply sleep $interval at the end of each loop, the actual interval will be interval + measurement_time. This might be fine, but if you want accurate spacing, you can measure the time each iteration takes and adjust sleep accordingly. This is optional, but understanding this effect is part of good monitoring design. The key is to document your approach so the data is interpreted correctly.
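
A sketch of that correction using whole-second arithmetic in bash; collect_one_sample is a hypothetical function that takes one measurement and appends one row:

start=$(date +%s)
collect_one_sample  # hypothetical: read metrics, append one CSV row
elapsed=$(( $(date +%s) - start ))
sleep $(( interval > elapsed ? interval - elapsed : 0 ))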

How This Fits in Projects

This concept governs §3.4 (Example Output), §5.2 (Project Structure), and §6.2 (Test Cases). The same output hygiene applies to P01 Website Status Checker and P02 Log File Analyzer.

Definitions & Key Terms

  • Sampling interval: Time between measurements.
  • CSV header: The first line describing column names.
  • Deterministic output: Output that does not change across runs with the same inputs.
  • Trap handler: Shell mechanism to handle signals cleanly.
  • Jitter: Variation in sampling interval due to processing time.

Mental Model Diagram (ASCII)

loop -> measure -> format -> append -> sleep -> repeat

How It Works (Step-by-Step)

  1. Parse interval and output file path.
  2. Write CSV header if file missing.
  3. Loop: read metrics, format line, append.
  4. Sleep for the interval.
  5. Handle signals to exit cleanly.

Invariants: header appears once; each line has the same number of fields. Failure modes: duplicate headers, mixed logs in CSV, negative intervals.

Minimal Concrete Example

# Write the header only when the output file does not exist yet
if [ ! -f "$out" ]; then
  printf "timestamp,mem_used_percent,cpu_load_1m,disk_used_percent\n" > "$out"
fi
# Append one row per sample; printf keeps the formatting predictable
printf "%s,%s,%s,%s\n" "$ts" "$mem" "$load" "$disk" >> "$out"

Common Misconceptions

  • “Any timestamp is fine.” -> Inconsistent formats make analysis difficult.
  • “Headers can repeat.” -> It breaks CSV parsers.
  • “Sleep always equals interval.” -> Processing time adds overhead.

Check-Your-Understanding Questions

  1. Why should the CSV header be written only once?
  2. How can you make timestamps consistent across machines?
  3. What is sampling jitter and why does it matter?

Check-Your-Understanding Answers

  1. Multiple headers break parsers and complicate analysis.
  2. Use UTC and a fixed format like ISO 8601.
  3. It is the deviation from expected interval, which affects analysis accuracy.

Real-World Applications

  • Performance monitoring and alerting
  • Capacity planning dashboards
  • Incident investigation with time-series data

References

  • Effective Shell (Kerr), Ch. 6
  • The Linux Command Line (Shotts), Ch. 6

Key Insight

Clean time-series data requires disciplined formatting and consistent timing.

Summary

Sampling and CSV hygiene turn raw metrics into usable data.

Homework/Exercises to Practice the Concept

  1. Write a loop that logs a counter to a CSV every 2 seconds.
  2. Add a trap handler that prints a message on Ctrl+C.
  3. Modify the loop to avoid duplicate headers.

Solutions to the Homework/Exercises

  1. i=0; while :; do printf "%s,%s\n" "$(date -u +%FT%TZ)" "$i" >> log.csv; i=$((i+1)); sleep 2; done
  2. trap 'echo "stopping"; exit 0' INT
  3. if [ ! -f log.csv ]; then printf "ts,val\n" > log.csv; fi

3. Project Specification

3.1 What You Will Build

You will build monitor.sh, a CLI script that samples CPU load (1m), memory usage percentage, and disk usage percentage at a configurable interval and appends rows to a CSV file. It will create the CSV header once, support deterministic timestamps for testing, and run continuously until interrupted. It will not implement alerts or charts.

3.2 Functional Requirements

  1. Interval argument: Accept a sampling interval in seconds.
  2. CSV output: Write to resource_log.csv or a provided path.
  3. Metrics: Record timestamp, mem_used_percent, cpu_load_1m, disk_used_percent.
  4. Header: Ensure header appears exactly once.
  5. Determinism: Respect SOURCE_DATE_EPOCH for fixed timestamps when provided.
  6. Exit codes: 0 on clean exit, 1 on usage error.

3.3 Non-Functional Requirements

  • Reliability: Run for at least 1 hour without corrupting output.
  • Performance: Minimal overhead; use lightweight commands.
  • Usability: Clear message on start and graceful shutdown on Ctrl+C.

3.4 Example Usage / Output

$ ./monitor.sh 5
Monitoring system... press Ctrl+C to stop

$ head -n 5 resource_log.csv
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
2025-12-31T12:00:00Z,35,0.12,45
2025-12-31T12:00:05Z,36,0.18,45
2025-12-31T12:00:10Z,36,0.22,45

3.5 Data Formats / Schemas / Protocols

CSV schema:

timestamp,mem_used_percent,cpu_load_1m,disk_used_percent

3.6 Edge Cases

  • Interval is non-numeric or zero.
  • Output file is not writable.
  • free or df not available.
  • Locale uses comma for decimal separator.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cd /path/to/monitor
chmod +x monitor.sh
TZ=UTC SOURCE_DATE_EPOCH=1767182400 ./monitor.sh 5

3.7.2 Golden Path Demo (Deterministic)

With TZ=UTC and SOURCE_DATE_EPOCH=1767182400 (2025-12-31T12:00:00Z), your first line should show that timestamp.

3.7.3 If CLI: Exact Terminal Transcript

$ TZ=UTC SOURCE_DATE_EPOCH=1767182400 ./monitor.sh 5
Monitoring system... press Ctrl+C to stop

$ head -n 3 resource_log.csv
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
2025-12-31T12:00:00Z,35,0.12,45
2025-12-31T12:00:05Z,36,0.18,45

3.7.4 Failure Demo (Bad Input)

$ ./monitor.sh abc
ERROR: interval must be a positive integer
Exit code: 1

4. Solution Architecture

4.1 High-Level Design

parse args -> ensure header -> loop (collect -> format -> append) -> sleep

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Arg Parser | Validate interval and output path | Default to 5 seconds |
| Metrics Collector | Read /proc and commands | Prefer /proc for load |
| Formatter | CSV line generation | ISO 8601 UTC timestamps |
| Writer | Append to CSV | Header only once |

4.3 Data Structures (No Full Code)

interval=5
out="resource_log.csv"

4.4 Algorithm Overview

Key Algorithm: Sampling Loop

  1. Validate interval.
  2. Write header if needed.
  3. Loop: read metrics, format line, append.
  4. Sleep for interval.
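
A skeleton sketch of that loop, assuming the timestamp helper and metric extraction lines sketched in §2 are in place:

# One row per iteration; the metric reads happen inside the loop
while :; do
  ts=$(timestamp)  # helper from Concept 2
  load=$(awk '{print $1}' /proc/loadavg)
  mem=$(free | awk '/^Mem:/ {printf "%.0f", ($2-$7)/$2*100}')
  disk=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
  printf '%s,%s,%s,%s\n' "$ts" "$mem" "$load" "$disk" >> "$out"
  sleep "$interval"
done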

Complexity Analysis:

  • Time: O(1) per iteration (a fixed number of reads and parses).
  • Space: O(1) plus output file.

5. Implementation Guide

5.1 Development Environment Setup

which awk df free

5.2 Project Structure

monitor/
├── monitor.sh
├── resource_log.csv
└── tests/
    └── fixtures/

5.3 The Core Question You’re Answering

“How can I continuously measure system health using standard CLI tools?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. System metrics meaning (Concept 1)
  2. Sampling and CSV hygiene (Concept 2)

5.5 Questions to Guide Your Design

  1. What interval gives useful data without too much overhead?
  2. Which memory metric best represents pressure?
  3. Should you log to a fixed file or rotate by date?
  4. What happens if commands fail temporarily?

5.6 Thinking Exercise

If a CPU spike lasts 1 second, how likely is a 10-second sampler to detect it? What interval would you choose for short spikes?

5.7 The Interview Questions They’ll Ask

  1. “What is the difference between load average and CPU usage?”
  2. “Why is MemAvailable more useful than MemFree?”
  3. “How do you ensure CSV output stays clean?”
  4. “How would you test a long-running monitor?”

5.8 Hints in Layers

Hint 1: Load average

load=$(awk '{print $1}' /proc/loadavg)

Hint 2: Memory usage

mem=$(free | awk '/^Mem:/ {printf "%.0f", ($2-$7)/$2*100}')

Hint 3: Disk usage

disk=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Processes and load | The Linux Command Line | Ch. 10 |
| System internals | How Linux Works | Ch. 4 |

5.10 Implementation Phases

Phase 1: Foundation (2 hours)

Goals:

  • Read metrics once and print

Tasks:

  1. Implement metric extraction.
  2. Print formatted line to stdout.

Checkpoint: One correct line printed with numeric values.

Phase 2: Core Functionality (3 hours)

Goals:

  • Loop and append to CSV

Tasks:

  1. Add interval loop.
  2. Create CSV header if missing.

Checkpoint: CSV grows with consistent lines.

Phase 3: Polish & Edge Cases (2 hours)

Goals:

  • Deterministic timestamps
  • Input validation

Tasks:

  1. Add SOURCE_DATE_EPOCH support.
  2. Validate interval argument.

Checkpoint: Script exits with error on invalid interval.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| CPU metric | loadavg vs percent | loadavg | Simpler and consistent |
| Timestamp format | local vs UTC | UTC | Easier comparisons |
| Output file | fixed vs rotating | fixed | Simple base case |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit Tests | Validate parsing | Numeric extraction from commands |
| Integration Tests | Run loop for 3 samples | Correct CSV format |
| Edge Case Tests | Invalid interval | Error message and exit code |

6.2 Critical Test Cases

  1. Valid run: produces header and first sample.
  2. Invalid interval: exits 1 with error.
  3. CSV header: appears only once when appending.
  4. Deterministic time: with SOURCE_DATE_EPOCH, timestamps are fixed.
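
A minimal smoke-test sketch for cases 1 and 3, assuming the default output path from §3.2:

# Run two samples deterministically, then check the output shape
rm -f resource_log.csv
TZ=UTC SOURCE_DATE_EPOCH=1767182400 ./monitor.sh 1 &
pid=$!
sleep 3
kill -INT "$pid"; wait "$pid" 2>/dev/null
[ "$(head -n 1 resource_log.csv)" = "timestamp,mem_used_percent,cpu_load_1m,disk_used_percent" ] || printf 'FAIL: header\n'
[ "$(grep -c '^timestamp,' resource_log.csv)" -eq 1 ] || printf 'FAIL: duplicate header\n'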

6.3 Test Data

interval=1
SOURCE_DATE_EPOCH=1767182400

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Header duplicates | CSV has repeated headers | Write header only if file missing |
| Wrong memory field | Values look too high | Use MemAvailable for pressure |
| Locale issues | Commas in decimals | Set LC_ALL=C |

7.2 Debugging Strategies

  • Print raw command output before parsing.
  • Run with set -x to trace variable values.
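
For example, to capture a trace while keeping the CSV clean (xtrace output goes to stderr):

bash -x ./monitor.sh 5 2> trace.log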

7.3 Performance Traps

  • Extremely short intervals (<1s) can consume significant CPU and disk.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add disk usage for /var.
  • Add optional output path argument.

8.2 Intermediate Extensions

  • Rotate CSV daily.
  • Add memory used in MB alongside percent.

8.3 Advanced Extensions

  • Compute CPU percent from /proc/stat deltas.
  • Add alert thresholds and notifications.

9. Real-World Connections

9.1 Industry Applications

  • Ops monitoring: quick health visibility.
  • Capacity planning: long-term trend analysis.

9.2 Related Tools

  • collectd: system statistics collection daemon.
  • node_exporter: Prometheus metrics exporter.

9.3 Interview Relevance

  • Systems knowledge: interpreting load and memory.
  • Data handling: reliable logging and sampling.

10. Resources

10.1 Essential Reading

  • How Linux Works by Brian Ward - Ch. 4
  • The Linux Command Line by William E. Shotts - Ch. 10

10.2 Video Resources

  • “Linux Performance Basics” (any systems course)

10.3 Tools & Documentation

  • /proc: man proc
  • free/df: man free, man df

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain what load average means.
  • I can explain why MemAvailable is useful.
  • I can describe the trade-off of sampling intervals.

11.2 Implementation

  • CSV output is clean and consistent.
  • Interval argument is validated.
  • Script runs for 1 hour without corruption.

11.3 Growth

  • I can import the CSV into a tool and plot it.
  • I can explain my metric choices.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Script logs timestamps, load, memory, and disk usage.
  • Header appears exactly once.
  • Interval argument works.

Full Completion:

  • Deterministic timestamps supported.
  • Handles invalid input with clear errors.

Excellence (Going Above & Beyond):

  • CPU percent calculation with /proc/stat.
  • Daily CSV rotation with archival.