Project 4: System Resource Monitor
Build a CLI monitor that samples CPU load, memory usage, and disk usage at fixed intervals and writes clean CSV output for analysis.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8 to 12 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 2: Practical and Useful |
| Business Potential | Level 3: Service & Support |
| Prerequisites | Shell basics, basic understanding of CPU/memory terms |
| Key Topics | /proc metrics, load average vs CPU usage, memory/disk parsing, CSV logging, sampling intervals |
1. Learning Objectives
By completing this project, you will:
- Interpret system metrics correctly (load average, memory usage, disk usage).
- Extract numeric values from command output reliably.
- Produce a clean CSV time series with consistent timestamps.
- Understand sampling intervals and their trade-offs.
- Build a monitor that runs for long periods without corrupting output.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: System Metrics and the /proc Interface
Fundamentals
Linux exposes system metrics through both commands (free, df, uptime) and virtual files in /proc. These numbers are not arbitrary; they reflect how the kernel measures load, memory, and disk usage. Load average is not the same as CPU utilization: it represents the number of runnable or waiting processes over time. Memory usage includes caches and buffers, which are reclaimable. Disk usage is typically measured by filesystem statistics, not by raw disk size. If you misunderstand these metrics, your monitoring script will produce misleading results. This concept gives you the mental model needed to interpret the numbers correctly.
Deep Dive into the Concept
The /proc filesystem is a pseudo-filesystem provided by the kernel. Files like /proc/loadavg and /proc/meminfo are generated on the fly and contain real-time system data. /proc/loadavg includes the 1, 5, and 15-minute load averages. A load average of 1.0 means, on a single-core system, that there is on average one runnable process. On multi-core systems, a load of 1.0 may indicate idle capacity. This is why load must be interpreted relative to CPU count. It is common to compute load / cores to determine relative saturation.
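For example, a quick relative-saturation check might look like this (a sketch; `nproc` is assumed available, as it is on most Linux systems):

```bash
# Relative saturation: 1-minute load divided by CPU core count.
cores=$(nproc)
load=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN {printf "load per core: %.2f\n", l / c}'
```

A value near or above 1.0 per core suggests the CPUs are saturated.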
Memory metrics are also nuanced. free reports total, used, and available memory. The “used” number includes caches and buffers, which are not necessarily a problem because the kernel can reclaim them. The available field is often the most meaningful because it estimates how much memory is actually available for new workloads. If you simply compute used/total, you might overstate memory pressure. A monitoring script should choose a metric and explain it. For this project, you can compute used_percent as 100 * (total - available) / total to reflect pressure rather than raw usage.
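A minimal sketch of that calculation, reading `/proc/meminfo` directly (both fields are reported in kB, so the units cancel):

```bash
# used_percent = 100 * (MemTotal - MemAvailable) / MemTotal
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END {printf "%.0f\n", 100 * (t - a) / t}' /proc/meminfo
```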
Disk usage is obtained via df, which reports filesystem usage. df / shows the root filesystem usage percentage. If the system uses multiple mount points, you might want to monitor a specific mount like / or /var. Disk usage percentage is a simple metric but is often the most actionable. For more detailed analysis, you could also record total and available space in bytes, but for this project, a percentage is sufficient.
CPU usage can be measured in multiple ways. A naive approach is to parse top, but this is complex and inconsistent. A simpler approach is to use load average from /proc/loadavg, which is a well-defined metric. If you want CPU percentage, you can read /proc/stat twice and compute deltas, but that adds complexity. For this project, load average plus memory and disk usage is enough to monitor overall health. The key is to document what each metric means so the user does not misinterpret it.
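If you do attempt the delta method later, a simplified sketch looks like this (it ignores irq, softirq, and steal time, so treat the result as approximate):

```bash
# Two snapshots of the aggregate "cpu" line in /proc/stat, one second apart.
# Leading fields: user nice system idle iowait ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
total=$(( busy + (i2 + w2) - (i1 + w1) ))
echo "cpu busy: $(( 100 * busy / total ))%"
```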
Parsing /proc and command output requires careful text processing. Use awk to extract fields, and ensure you remove percent signs or units. Always test your parsing commands in isolation before embedding them in a script. If the output format varies between distributions, choose the most stable sources: /proc/loadavg, /proc/meminfo, and df -P (POSIX output) are relatively stable. df -P forces a portable output format that is easier to parse.
Finally, be aware of measurement timing. If you read memory and disk at different times in a loop, they will not correspond to the exact same instant, but for this monitoring script that is acceptable. The important part is consistent sampling intervals and timestamps. Your script should record the timestamp of each sample and include it in the CSV output so later analysis can align metrics correctly.
How This Fits in Projects
This concept is essential for §3.2 (Functional Requirements) and §3.4 (Example Output), because you define the specific metrics and how they are computed. It also informs §5.4 and §6.2 test cases. These metric interpretation skills also help in P01 Website Status Checker when you measure response times.
Definitions & Key Terms
- Load average: Average number of runnable/blocked processes.
- /proc: Virtual filesystem exposing kernel data structures.
- Available memory: Estimate of memory that can be allocated without swapping.
- Filesystem usage: Percentage of space used on a mounted filesystem.
- Sampling: Periodic measurement over time.
Mental Model Diagram (ASCII)
/proc -> metrics -> parse -> normalize -> log
How It Works (Step-by-Step)
- Read `/proc/loadavg` for the load average.
- Read `/proc/meminfo` or `free` for memory totals.
- Read `df -P /` for disk usage.
- Normalize values into percentages.
- Emit a CSV line with a timestamp.
Invariants: metrics are numeric; output must be consistent. Failure modes: parsing wrong field, locale differences, missing commands.
Minimal Concrete Example
load=$(awk '{print $1}' /proc/loadavg)                        # 1-minute load average
mem=$(free | awk '/^Mem:/ {printf "%.0f", ($2-$7)/$2*100}')   # 100*(total-available)/total
disk=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')     # df use% with % sign stripped
Common Misconceptions
- “Load average equals CPU usage.” -> It measures runnable processes, not CPU percent.
- “Used memory is always bad.” -> Caches and buffers are reclaimable.
- “df shows disk usage for the whole system.” -> It shows a specific filesystem.
Check-Your-Understanding Questions
- Why is load average not the same as CPU usage?
- Why does `free` show high used memory even on idle systems?
- Why is `df -P` better for scripts?
Check-Your-Understanding Answers
- Load average measures runnable/blocked processes, not percent CPU time.
- Linux uses memory for caches; it is not necessarily pressure.
- It enforces a consistent, parseable output format.
Real-World Applications
- Monitoring servers for overload
- Capacity planning
- Baseline performance measurement
Where You’ll Apply It
- Project 4: §3.2, §3.4, §5.4
- Also used in: P01 Website Status Checker (timing discipline)
References
- How Linux Works (Ward), Ch. 4
- The Linux Command Line (Shotts), Ch. 10
Key Insight
Metrics are only useful when you understand what they actually measure.
Summary
Understanding system metrics prevents false alarms and misleading reports.
Homework/Exercises to Practice the Concept
- Compare load average and CPU usage during an idle period.
- Compute memory pressure using `MemAvailable` from `/proc/meminfo`.
- Print disk usage percentage for `/` using `df -P`.
Solutions to the Homework/Exercises
- `uptime; top -bn1 | head -n 5`
- `awk '/MemAvailable/ {avail=$2} /MemTotal/ {total=$2} END {print 100*(total-avail)/total}' /proc/meminfo`
- `df -P / | awk 'NR==2 {print $5}'`
Concept 2: Sampling, Time-Series Logging, and CSV Hygiene
Fundamentals
A monitor is only as good as its sampling strategy and output format. Sampling defines how often you collect metrics, and it shapes what you can detect. A 1-second interval captures spikes but generates large logs; a 60-second interval is lighter but misses short events. CSV is the simplest format for time-series data because it is easy to parse and import into spreadsheets or scripts. But CSV output must be consistent: a fixed header, stable ordering, and no extra whitespace or logging mixed into the data. This concept teaches you how to design a monitoring loop that is reliable, long-running, and produces clean data.
Deep Dive into the Concept
Sampling is a trade-off between resolution and overhead. If your interval is too short, the monitor itself can consume CPU and disk; if it is too long, you miss transient spikes. For most learning purposes, 5 seconds is a good default. Your script should accept an interval argument and validate it as a positive integer. It should also handle termination gracefully so the CSV file is not corrupted. For example, if the user hits Ctrl+C, you should not leave a partially written line. A trap handler can cleanly exit or ensure the last line is complete.
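A minimal trap sketch (the message and exit code are your choice):

```bash
# Exit cleanly on Ctrl+C or termination. Because each sample is written
# with a single printf call, stopping between samples leaves the CSV whole.
trap 'printf "stopping monitor\n" >&2; exit 0' INT TERM
```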
CSV hygiene is about keeping data and logs separate. The CSV file should contain only data lines, with a single header at the top. Your script should check if the file exists; if not, write the header. If it exists, append without duplicating the header. This is crucial for long-running monitors. Use printf to avoid unexpected formatting. Avoid echo because its behavior can vary, especially with -e and escape sequences. Set LC_ALL=C to ensure numeric formatting uses . as the decimal separator.
Timestamp consistency is another critical point. Decide on a timestamp format, such as ISO 8601 (YYYY-MM-DDTHH:MM:SSZ), and stick to it. Use UTC (date -u) to avoid confusion across time zones. If you want deterministic output for tests, allow SOURCE_DATE_EPOCH to override the current time. This is a standard convention for reproducible builds and works well for test fixtures.
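One way to wire this in, assuming GNU `date` (BSD/macOS `date` would use `-r "$epoch"` instead of `-d "@$epoch"`):

```bash
# Honor SOURCE_DATE_EPOCH when set; otherwise use the current time.
epoch="${SOURCE_DATE_EPOCH:-$(date +%s)}"
ts=$(date -u -d "@$epoch" +%Y-%m-%dT%H:%M:%SZ)
```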
When logging, you must decide where to store your CSV. If you log to the current directory, a cron job might run with a different working directory. Therefore, use an explicit output path or resolve it relative to the script location. This avoids logs going to unexpected places. Also consider log rotation: if you run a monitor for days, the CSV file can grow large. A simple approach is to include the date in the filename so a new file is created each day. This can be an extension for advanced learners.
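The date-in-filename variant is a one-line sketch:

```bash
# A new file per UTC day, e.g. resource_log_2025-12-31.csv
out="resource_log_$(date -u +%F).csv"
```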
Finally, think about measurement jitter. If you simply sleep $interval at the end of each loop, the actual interval will be interval + measurement_time. This might be fine, but if you want accurate spacing, you can measure the time each iteration takes and adjust sleep accordingly. This is optional, but understanding this effect is part of good monitoring design. The key is to document your approach so the data is interpreted correctly.
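If you choose the adjusted approach, a drift-free loop sketch looks like this (`interval` is assumed to be already validated):

```bash
# Schedule samples on a fixed grid instead of sleeping a fixed amount.
next=$(( $(date +%s) + interval ))
while :; do
  # ... collect metrics and append one CSV line here ...
  now=$(date +%s)
  [ "$next" -gt "$now" ] && sleep $(( next - now ))
  next=$(( next + interval ))
done
```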
How This Fits in Projects
This concept governs §3.4 (Example Output), §5.2 (Project Structure), and §6.2 (Test Cases). The same output hygiene applies to P01 Website Status Checker and P02 Log File Analyzer.
Definitions & Key Terms
- Sampling interval: Time between measurements.
- CSV header: The first line describing column names.
- Deterministic output: Output that does not change across runs with the same inputs.
- Trap handler: Shell mechanism to handle signals cleanly.
- Jitter: Variation in sampling interval due to processing time.
Mental Model Diagram (ASCII)
loop -> measure -> format -> append -> sleep -> repeat
How It Works (Step-by-Step)
- Parse interval and output file path.
- Write CSV header if file missing.
- Loop: read metrics, format line, append.
- Sleep for the interval.
- Handle signals to exit cleanly.
Invariants: header appears once; each line has the same number of fields. Failure modes: duplicate headers, mixed logs in CSV, negative intervals.
Minimal Concrete Example
if [ ! -f "$out" ]; then   # write the header exactly once, when the file does not yet exist
  printf "timestamp,mem_used_percent,cpu_load_1m,disk_used_percent\n" > "$out"
fi
printf "%s,%s,%s,%s\n" "$ts" "$mem" "$load" "$disk" >> "$out"   # one printf call per row keeps lines whole
Common Misconceptions
- “Any timestamp is fine.” -> Inconsistent formats make analysis difficult.
- “Headers can repeat.” -> It breaks CSV parsers.
- “Sleep always equals interval.” -> Processing time adds overhead.
Check-Your-Understanding Questions
- Why should the CSV header be written only once?
- How can you make timestamps consistent across machines?
- What is sampling jitter and why does it matter?
Check-Your-Understanding Answers
- Multiple headers break parsers and complicate analysis.
- Use UTC and a fixed format like ISO 8601.
- It is the deviation from expected interval, which affects analysis accuracy.
Real-World Applications
- Performance monitoring and alerting
- Capacity planning dashboards
- Incident investigation with time-series data
Where You’ll Apply It
- Project 4: §3.4, §5.2, §6.2
- Also used in: P01 Website Status Checker, P02 Log File Analyzer
References
- Effective Shell (Kerr), Ch. 6
- The Linux Command Line (Shotts), Ch. 6
Key Insight
Clean time-series data requires disciplined formatting and consistent timing.
Summary
Sampling and CSV hygiene turn raw metrics into usable data.
Homework/Exercises to Practice the Concept
- Write a loop that logs a counter to a CSV every 2 seconds.
- Add a trap handler that prints a message on Ctrl+C.
- Modify the loop to avoid duplicate headers.
Solutions to the Homework/Exercises
- `i=0; while :; do printf "%s,%s\n" "$(date -u +%FT%TZ)" "$i" >> log.csv; i=$((i+1)); sleep 2; done`
- `trap 'echo "stopping"; exit 0' INT`
- `if [ ! -f log.csv ]; then echo "ts,val" > log.csv; fi`
3. Project Specification
3.1 What You Will Build
You will build monitor.sh, a CLI script that samples CPU load (1m), memory usage percentage, and disk usage percentage at a configurable interval and appends rows to a CSV file. It will create the CSV header once, support deterministic timestamps for testing, and run continuously until interrupted. It will not implement alerts or charts.
3.2 Functional Requirements
- Interval argument: Accept a sampling interval in seconds.
- CSV output: Write to `resource_log.csv` or a provided path.
- Metrics: Record `timestamp,mem_used_percent,cpu_load_1m,disk_used_percent`.
- Header: Ensure the header appears exactly once.
- Determinism: Respect `SOURCE_DATE_EPOCH` for fixed timestamps when provided.
- Exit codes: 0 on clean exit, 1 on usage error.
3.3 Non-Functional Requirements
- Reliability: Run for at least 1 hour without corrupting output.
- Performance: Minimal overhead; use lightweight commands.
- Usability: Clear message on start and graceful shutdown on Ctrl+C.
3.4 Example Usage / Output
$ ./monitor.sh 5
Monitoring system... press Ctrl+C to stop
$ head -n 5 resource_log.csv
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
2025-12-31T12:00:00Z,35,0.12,45
2025-12-31T12:00:05Z,36,0.18,45
2025-12-31T12:00:10Z,36,0.22,45
3.5 Data Formats / Schemas / Protocols
CSV schema:
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
3.6 Edge Cases
- Interval is non-numeric or zero.
- Output file is not writable.
- `free` or `df` not available.
- Locale uses a comma for the decimal separator.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cd /path/to/monitor
chmod +x monitor.sh
TZ=UTC SOURCE_DATE_EPOCH=1767182400 ./monitor.sh 5
3.7.2 Golden Path Demo (Deterministic)
With TZ=UTC and SOURCE_DATE_EPOCH=1767182400 (2025-12-31T12:00:00Z), your first line should show that timestamp.
3.7.3 If CLI: Exact Terminal Transcript
$ TZ=UTC SOURCE_DATE_EPOCH=1767182400 ./monitor.sh 5
Monitoring system... press Ctrl+C to stop
$ head -n 3 resource_log.csv
timestamp,mem_used_percent,cpu_load_1m,disk_used_percent
2025-12-31T12:00:00Z,35,0.12,45
2025-12-31T12:00:05Z,36,0.18,45
3.7.4 Failure Demo (Bad Input)
$ ./monitor.sh abc
ERROR: interval must be a positive integer
Exit code: 1
4. Solution Architecture
4.1 High-Level Design
parse args -> ensure header -> loop (collect -> format -> append) -> sleep
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Arg Parser | Validate interval and output path | Default to 5 seconds |
| Metrics Collector | Read /proc and commands | Prefer /proc for load |
| Formatter | CSV line generation | ISO 8601 UTC timestamps |
| Writer | Append to CSV | Header only once |
4.3 Data Structures (No Full Code)
interval=5
out="resource_log.csv"
4.4 Algorithm Overview
Key Algorithm: Sampling Loop
- Validate interval.
- Write header if needed.
- Loop: read metrics, format line, append.
- Sleep for interval.
Complexity Analysis:
- Time: O(1) per iteration; each sample does a constant amount of work.
- Space: O(1) plus output file.
5. Implementation Guide
5.1 Development Environment Setup
which awk df free
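If any of them are missing, a short loop (a sketch) reports which one before you start coding:

```bash
# Verify required tools are on PATH.
for cmd in awk df free date; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd" >&2
done
```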
5.2 Project Structure
monitor/
├── monitor.sh
├── resource_log.csv
└── tests/
└── fixtures/
5.3 The Core Question You’re Answering
“How can I continuously measure system health using standard CLI tools?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- System metrics meaning (Concept 1)
- Sampling and CSV hygiene (Concept 2)
5.5 Questions to Guide Your Design
- What interval gives useful data without too much overhead?
- Which memory metric best represents pressure?
- Should you log to a fixed file or rotate by date?
- What happens if commands fail temporarily?
5.6 Thinking Exercise
If a CPU spike lasts 1 second, how likely is a 10-second sampler to detect it? What interval would you choose for short spikes?
5.7 The Interview Questions They’ll Ask
- “What is the difference between load average and CPU usage?”
- “Why is `MemAvailable` more useful than `MemFree`?”
- “How do you ensure CSV output stays clean?”
- “How would you test a long-running monitor?”
5.8 Hints in Layers
Hint 1: Load average
load=$(awk '{print $1}' /proc/loadavg)
Hint 2: Memory usage
mem=$(free | awk '/^Mem:/ {printf "%.0f", ($2-$7)/$2*100}')
Hint 3: Disk usage
disk=$(df -P / | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
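Hint 4: Timestamp and CSV line
One possible way to combine the pieces above (a sketch; `$out` is your CSV path):

```bash
ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # ISO 8601, UTC
printf "%s,%s,%s,%s\n" "$ts" "$mem" "$load" "$disk" >> "$out"
```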
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Processes and load | The Linux Command Line | Ch. 10 |
| System internals | How Linux Works | Ch. 4 |
5.10 Implementation Phases
Phase 1: Foundation (2 hours)
Goals:
- Read metrics once and print
Tasks:
- Implement metric extraction.
- Print formatted line to stdout.
Checkpoint: One correct line printed with numeric values.
Phase 2: Core Functionality (3 hours)
Goals:
- Loop and append to CSV
Tasks:
- Add interval loop.
- Create CSV header if missing.
Checkpoint: CSV grows with consistent lines.
Phase 3: Polish & Edge Cases (2 hours)
Goals:
- Deterministic timestamps
- Input validation
Tasks:
- Add `SOURCE_DATE_EPOCH` support.
- Validate the interval argument.
Checkpoint: Script exits with error on invalid interval.
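One validation sketch (any equivalent check that rejects non-digits and zero is fine):

```bash
# Accept only a positive integer interval.
case "$1" in
  ''|*[!0-9]*) echo "ERROR: interval must be a positive integer" >&2; exit 1 ;;
esac
[ "$1" -gt 0 ] || { echo "ERROR: interval must be a positive integer" >&2; exit 1; }
interval=$1
```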
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| CPU metric | loadavg vs percent | loadavg | simpler and consistent |
| Timestamp format | local vs UTC | UTC | easier comparisons |
| Output file | fixed vs rotating | fixed | simple base case |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate parsing | numeric extraction from commands |
| Integration Tests | Run loop for 3 samples | correct CSV format |
| Edge Case Tests | Invalid interval | error and exit code |
6.2 Critical Test Cases
- Valid run: produces header and first sample.
- Invalid interval: exits 1 with error.
- CSV header: appears only once when appending.
- Deterministic time: with `SOURCE_DATE_EPOCH`, timestamps are fixed.
6.3 Test Data
interval=1
SOURCE_DATE_EPOCH=1767182400
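Two quick checks you might script against these values (a sketch; `timeout` from GNU coreutils is assumed):

```bash
# Edge case: a bad interval must print an error and exit 1.
./monitor.sh abc
[ $? -eq 1 ] && echo "PASS: invalid interval exits 1"

# Determinism: with a pinned epoch, the first data line is predictable.
rm -f resource_log.csv
TZ=UTC SOURCE_DATE_EPOCH=1767182400 timeout 2 ./monitor.sh 1
head -n 2 resource_log.csv   # expect 2025-12-31T12:00:00Z on the first data line
```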
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Header duplicates | CSV has repeated headers | Write header only if file missing |
| Wrong memory field | Values look too high | Use MemAvailable for pressure |
| Locale issues | Commas in decimals | Set LC_ALL=C |
7.2 Debugging Strategies
- Print raw command output before parsing.
- Run with `set -x` to trace variable values.
7.3 Performance Traps
- Extremely short intervals (<1s) can consume significant CPU and disk.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add disk usage for `/var`.
- Add an optional output path argument.
8.2 Intermediate Extensions
- Rotate CSV daily.
- Add memory used in MB alongside percent.
8.3 Advanced Extensions
- Compute CPU percent from `/proc/stat` deltas.
- Add alert thresholds and notifications.
9. Real-World Connections
9.1 Industry Applications
- Ops monitoring: quick health visibility.
- Capacity planning: long-term trend analysis.
9.2 Related Open Source Projects
- collectd: system statistics collection daemon.
- node_exporter: Prometheus metrics exporter.
9.3 Interview Relevance
- Systems knowledge: interpreting load and memory.
- Data handling: reliable logging and sampling.
10. Resources
10.1 Essential Reading
- How Linux Works by Brian Ward - Ch. 4
- The Linux Command Line by William E. Shotts - Ch. 10
10.2 Video Resources
- “Linux Performance Basics” (any systems course)
10.3 Tools & Documentation
- /proc: `man proc`
- free/df: `man free`, `man df`
10.4 Related Projects in This Series
- Project 1: Website Status Checker: output formatting discipline
- Project 3: Automated Backup Script: cron-safe logging
11. Self-Assessment Checklist
11.1 Understanding
- I can explain what load average means.
- I can explain why `MemAvailable` is useful.
- I can describe the trade-off of sampling intervals.
11.2 Implementation
- CSV output is clean and consistent.
- Interval argument is validated.
- Script runs for 1 hour without corruption.
11.3 Growth
- I can import the CSV into a tool and plot it.
- I can explain my metric choices.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Script logs timestamps, load, memory, and disk usage.
- Header appears exactly once.
- Interval argument works.
Full Completion:
- Deterministic timestamps supported.
- Handles invalid input with clear errors.
Excellence (Going Above & Beyond):
- CPU percent calculation with `/proc/stat`.
- Daily CSV rotation with archival.