Project 2: Log Hunter

Build a log filtering tool that extracts high-signal errors, shows context, and produces a deterministic incident summary.

Quick Reference

Attribute                          Value
Difficulty                         Beginner
Time Estimate                      4-8 hours
Main Programming Language         Bash
Alternative Programming Languages  Python
Coolness Level                     Level 3 - “incident responder”
Business Potential                 High (production observability)
Prerequisites                      regex basics, shell pipelines
Key Topics                         grep ERE, context lines, exit codes, aggregation

1. Learning Objectives

By completing this project, you will:

  1. Design regex patterns that reduce false positives in log streams.
  2. Control grep output for line numbers, filenames, and context.
  3. Interpret grep exit codes correctly in automation.
  4. Produce a summary of unique errors with counts.
  5. Build a deterministic report from noisy logs.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Regex Matching for Log Streams (ERE and Boundaries)

Fundamentals

Grep reads input line-by-line and applies a regular expression to each line. In log analysis, your regex is the difference between signal and noise. Basic regex features like literals, alternation, anchors, and character classes are enough to build high-quality filters, but only if you understand how matching works.

Grep uses leftmost-longest matching for POSIX ERE: it chooses the earliest match on the line and then the longest possible match at that position. If you write a broad pattern like ERROR|WARN, you might match lines that contain those tokens in unrelated contexts. Good log hunting starts with precise, anchored patterns that reflect the log format.

Logs are not free-form text; they usually follow a grammar. Whether it is an ISO timestamp, a severity token, or a request ID, the structure of the line gives you a roadmap for matching. Treating logs as structured text reduces false positives and makes results explainable. That is the difference between “grep until it looks right” and “grep with a model of the data.” Your regex is a model of the log format, and every piece of that model should map to a real field in the log line.

Deep Dive into the concept

Log formats are semi-structured. Many logs start with timestamps, then a severity token, then a message. A regex that mirrors that structure is more accurate than a free-form token match. For example, ^[0-9]{4}-[0-9]{2}-[0-9]{2}.*(ERROR|FATAL) is more precise than ERROR|FATAL because it requires the line to look like a timestamped log. Anchors (^ and $) are essential for reducing false positives.

POSIX ERE does not support lazy quantifiers, so you must control breadth with character classes and explicit delimiters. For example, if a log format is LEVEL: message, you can use ^(ERROR|FATAL): to avoid matching message text. Word boundaries in grep are not identical to \b in PCRE; grep has -w for whole words, but it treats only [A-Za-z0-9_] as word characters, so hyphens and other punctuation count as boundaries. If your logs contain hyphenated tokens, -w can match or skip them in ways you did not intend. Understanding these subtleties prevents missed matches and overmatching.
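
A quick way to see the -w behavior, assuming GNU grep (the tokens here are illustrative):

# -w counts only [A-Za-z0-9_] as word characters, so hyphens act as boundaries.
printf 'db-timeout\nplain timeout\ntimeouts\n' | grep -w 'timeout'
# Prints "db-timeout" and "plain timeout" (hyphen and space are boundaries)
# but not "timeouts" (the trailing "s" is a word character).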

Another important detail is case sensitivity. Most system logs use uppercase severity tokens, but application logs vary. grep -i is convenient, but it can be too permissive. A better approach is to normalize or to explicitly include variants, such as (ERROR|Error|error), if you want to be strict. When you are hunting incidents, precision beats recall; false positives can drown the real signal.

Log lines can also be multi-line in practice, such as stack traces. Grep is line-based, so you must decide if you treat stack traces as context rather than matches. One strategy is to match the header line (the exception line) and show context with -C or -A. Another is to use tools that can join lines, but for this project you stay within grep and leverage context to reveal stack traces.
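
A hedged sketch of the header-line strategy (the exception tokens are illustrative, not a fixed standard):

# Match the first line of a stack trace, then show the next 5 frames as context.
grep -n -A 5 -E 'Exception|Traceback' app.log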

The leftmost-longest rule can surprise you when you use .* in the middle of a pattern. .* is greedy and will consume as much as possible, which can push your match past the intended token. Instead, prefer explicit classes like [^ ]+ for a non-space token or [^]]+ for bracketed fields. The key is to model the log grammar rather than guess with a broad pattern.
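
To see the difference, compare a greedy wildcard with an explicit token class (the key=value fields are illustrative):

# Greedy: .* extends to the LAST " status=" on the line, not the first.
grep -o -E 'user=.* status=' app.log
# Explicit: [^ ]+ stops at the first space, matching exactly one token.
grep -o -E 'user=[^ ]+' app.log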

Finally, you must test your regex on a representative sample. Build a small fixture of log lines, mark which ones should match, and iterate. Good log hunting is as much about test data as about regex syntax. If your pattern captures the exact format of the logs you care about, the rest of the pipeline becomes reliable.
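
One minimal way to run such a fixture, sketched with hypothetical file names:

# expected.txt lists exactly the lines that should match; fail loudly on drift.
grep -E "$PATTERN" fixtures/sample.log > actual.txt
diff -u expected.txt actual.txt && echo "pattern OK"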

Another layer of nuance is that many systems emit multiple log formats in the same file. For example, a web server might log access lines and error lines in the same directory. If you use a single pattern that matches both, you will lose precision. A better strategy is to define multiple patterns and annotate each with a label (for example, ERROR, FATAL, PANIC). This allows your report to show counts by category. Even if you implement the first version with a single regex, you should design the script so that it can accept a list of patterns later.
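
A sketch of the labeled-pattern idea, assuming a simple tab-separated label/regex file (patterns.tsv is hypothetical):

# patterns.tsv holds one "LABEL<TAB>regex" entry per line.
while IFS=$'\t' read -r label regex; do
  count=$(grep -c -E "$regex" app.log || true)  # -c still prints 0 on no match
  printf '%s\t%s\n' "$label" "$count"
done < patterns.tsv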

You also need to consider encoding and control characters. Logs can contain ANSI color codes or non-printable bytes. Grep will happily match them, but your output may be hard to read. A robust log hunter strips or normalizes these sequences before aggregation. That can be as simple as sed -r 's/\x1B\[[0-9;]*[mK]//g', or piping through tr -cd '\11\12\15\40-\176' to remove non-printable characters. This is an advanced but practical detail if you are dealing with colored or binary-adjacent logs.

Finally, consider performance. Grep is fast, but it still scans every byte of input. Anchored patterns let the engine reject non-matching lines early, and POSIX ERE engines avoid the catastrophic backtracking that can plague PCRE patterns; even so, overly broad patterns cost time and create large outputs. The most performance-friendly approach is to make your pattern as specific as the log format allows.

How this fits into the project

This project depends on accurate regex selection. Your pattern determines which lines appear in the report and which lines get counted as incidents.

Definitions & key terms

  • ERE: Extended Regular Expressions used by grep -E.
  • anchor: ^ or $, used to match line boundaries.
  • alternation: A|B to match either pattern.
  • leftmost-longest: POSIX regex rule for choosing a match.

Mental model diagram (ASCII)

log line: 2026-01-01T10:01:00 ERROR Database timeout
pattern : ^[0-9-]+T[0-9:]+ (ERROR|FATAL)
match   : --------------------^^^^^

How it works (step-by-step)

  1. Grep reads a line from the log.
  2. The regex engine finds the leftmost match.
  3. If a match exists, the line is selected.
  4. Output formatting flags add filename, line number, and context.
  5. Invariant: the pattern must match the intended log grammar, not arbitrary text.
  6. Failure modes: unanchored patterns, mixed log formats, or multiline logs that need context.

Minimal concrete example

grep -n -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}.*(ERROR|FATAL)' app.log

Common misconceptions

  • “Regex is just string contains” -> False; anchors and classes change meaning.
  • “-w is the same as \b” -> False; word characters are limited.
  • “Adding .* makes a pattern safer” -> False; it often makes it too broad.

Check-your-understanding questions

  1. Why does anchoring to the start of line reduce false positives?
  2. What does leftmost-longest mean in POSIX ERE?
  3. Why can grep -w miss hyphenated tokens?
  4. How can you match only lines with a severity token after a timestamp?

Check-your-understanding answers

  1. It enforces log structure and prevents matching message text.
  2. The engine picks the earliest match and extends it as far as possible.
  3. -w only treats alphanumerics and underscore as word characters.
  4. Use a pattern like ^[0-9-]+T[0-9:]+ (ERROR|FATAL).

Real-world applications

  • Incident triage and error burst detection.
  • Security log scanning for known signatures.
  • Quality control in ETL pipelines.

Where you will apply it

In the matcher stage of this project: the default severity pattern and any user-supplied --pattern value are ERE filters built on these rules.

References

  • The Linux Command Line (Shotts), Chapter 19
  • man grep
  • POSIX regex specification

Key insights

Regex is a modeling tool. Match the log grammar, not random tokens.

Summary

Accurate log hunting starts with precise, anchored regex patterns that reflect real log structure.

Homework/Exercises to practice the concept

  1. Write a regex that matches only ERROR lines with ISO-8601 timestamps.
  2. Identify two false positives in a sample log and refine the regex.
  3. Compare grep -E with and without anchors on the same dataset.

Solutions to the homework/exercises

  1. grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]+ ERROR' log.txt
  2. Anchor to the log grammar and require a delimiter after the severity token, for example '^[0-9-]+T[0-9:]+ (ERROR|FATAL) '.
  3. Observe how the unanchored version matches unrelated lines.

2.2 Context, Exit Codes, and Aggregation Pipelines

Fundamentals

Grep is a line filter, but log hunting requires context and summary. Flags like -n, -H, and -C add line numbers, filenames, and surrounding lines so you can interpret errors. Exit codes also matter: grep returns 0 for match, 1 for no match, and 2 for errors. In a script, treating exit code 1 as failure is wrong because it means “no matches” rather than a crash. Aggregation is done with sort, uniq -c, and awk to count unique messages and rank them. This is how you turn a thousand log lines into a top-10 incident report.

These mechanics are also how you build confidence in your report. If you are on call, you need to know whether “no errors” means the system is healthy or your pattern is wrong. That is why exit code handling and explicit headers are critical: they are the guardrails that tell you whether the pipeline actually worked. A log hunter is a trust-building tool, not just a grep command.

Deep Dive into the concept

Context lines are your bridge from a matching line to the surrounding event. -C N shows N lines before and after, -A N shows N lines after, and -B N shows N lines before. For stack traces, a common approach is -A 5 to show the first few frames. You must control context size to avoid flooding the report. If the report is intended for humans, limit context to a small number and provide a separate raw output file for deeper analysis.

Exit codes are subtle but critical. In scripts, set -e will terminate on non-zero exit codes. If you use grep in such a script, a “no matches” result will exit with 1 and abort the script unless you explicitly handle it. A robust log hunter checks $? or uses a pattern like grep ... || true for the selection stage, and then makes a decision about whether to mark the report as empty or failed. Distinguishing between “no errors found” and “grep failed” is a fundamental automation practice.
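
A hedged sketch of that decision logic inside a set -e script (pattern and log are assumed variables):

set -e
status=0
matches=$(grep -E "$pattern" "$log") || status=$?
case "$status" in
  0) printf '%s\n' "$matches" | sort | uniq -c ;;      # matches found; aggregate
  1) echo "NO_MATCHES"; exit 1 ;;                      # clean "nothing found" path
  *) echo "grep failed (status $status)" >&2; exit 2 ;;
esac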

Aggregation involves normalizing lines. If log lines contain timestamps or request IDs, you should strip or transform those fields before counting unique messages. The simplest method is awk or sed to remove a leading timestamp and keep the severity token plus message. You then sort the normalized messages and count with uniq -c. Since uniq only counts adjacent duplicates, sort is mandatory. Determinism depends on LC_ALL=C as in Project 1.
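
A minimal sketch of that normalize-sort-count stage, assuming each line starts with an ISO timestamp:

# Strip the leading timestamp token, then count unique severity+message pairs.
grep -h -E 'ERROR|FATAL' *.log \
  | sed -E 's/^[0-9TZ:.-]+ //' \
  | LC_ALL=C sort \
  | uniq -c \
  | LC_ALL=C sort -rn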

Another issue is file selection. In real systems, logs are rotated; you may have app.log, app.log.1, app.log.2.gz. Grep can read multiple files, but it does not handle gzip without zgrep. For this project, you can restrict to uncompressed logs or optionally detect gzip and use zgrep. The key is to be explicit about what you include. If you include only *.log, you may miss archived errors.
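
One hedged way to be explicit about rotated files, routing compressed ones through zgrep (log_dir and pattern are assumed variables):

for f in "$log_dir"/app.log*; do
  case "$f" in
    *.gz) zgrep -h -E "$pattern" "$f" ;;  # zgrep decompresses on the fly
    *)    grep -h -E "$pattern" "$f" ;;
  esac
done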

Finally, reporting format matters. Your report should include a header with the scan time, the pattern used, and the number of files scanned. This makes the output auditable. It should also include a summary section and a sample context section. Think of the output as a mini incident report: it must answer “what happened”, “where”, and “how often”.
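
The header itself can be a few printf lines; the variable names here are assumptions that mirror the report format in section 3.5:

{
  printf 'SCAN_TIME=%s\n' "$scan_time"
  printf 'PATTERN=%s\n' "$pattern"
  printf 'FILES_SCANNED=%s\n' "$files_scanned"
} > "$report_file"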

Another subtlety is pipeline failure propagation. If you use set -e or pipefail, a grep exit code of 1 will cause the entire script to abort unless you explicitly handle it. A robust pattern is to capture the output into a variable or temp file, check the exit code, and then continue to aggregation only when appropriate. Similarly, if you tee intermediate output, remember that a pipeline reports only its last command's exit code, so grep's status is masked unless you check $PIPESTATUS (or enable pipefail). These details matter when you turn a shell pipeline into a reusable tool.
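
In bash, PIPESTATUS exposes each stage's exit code; a hedged sketch:

grep -E "$pattern" "$log" | tee matches.txt | wc -l
grep_status=${PIPESTATUS[0]}  # tee and wc succeed even when grep matched nothing
if [ "$grep_status" -eq 1 ]; then
  echo "NO_MATCHES"
fi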

Aggregation should also normalize message text. In practice, logs often include request IDs, user IDs, or timestamps within the message body. If you count raw lines, you may end up with thousands of “unique” errors that are actually the same failure with different IDs. A good log hunter strips variable segments or uses regex capturing to extract only the stable portion of the message. This can be done with awk or a small sed script, and it makes the counts far more useful.
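
A small sed sketch of that normalization; the ID and duration formats are illustrative assumptions, and messages.txt stands in for the normalized match list:

# Collapse volatile fields so identical failures count together.
sed -E \
  -e 's/req-[0-9a-f]+/req-<ID>/g' \
  -e 's/[0-9]+ms/<N>ms/g' \
  messages.txt | sort | uniq -c | sort -rn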

Finally, consider output determinism and readability. Sorting by count is helpful, but for equal counts you should break ties by message text to ensure stable output. You should also cap the number of results in the report (for example, top 20) to keep it readable. If the report is too long, engineers will not read it. A good log hunter optimizes for clarity, not raw volume.
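
A deterministic top-N stage might look like this (20 is the cap suggested above; messages.txt holds the normalized messages):

LC_ALL=C sort messages.txt | uniq -c \
  | LC_ALL=C sort -k1,1nr -k2 \
  | head -n 20  # count descending, ties broken by message text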

How this fits into the project

This project uses context flags to show evidence, and aggregation pipelines to produce a ranked summary of errors.

Definitions & key terms

  • context lines: lines before or after a match shown by -B, -A, -C.
  • exit status: grep return code (0 match, 1 no match, 2 error).
  • aggregation: counting and ranking unique messages.
  • normalization: removing volatile fields like timestamps.

Mental model diagram (ASCII)

logs -> grep (matches) -> normalize -> sort -> uniq -c -> report
           | context lines |

How it works (step-by-step)

  1. Select matching lines with grep.
  2. Capture context lines for evidence.
  3. Normalize message text to remove timestamps.
  4. Sort and count unique messages.
  5. Render a human-readable summary.
  6. Invariant: summary counts are based on normalized messages, not raw lines.
  7. Failure modes: grep exit code 1 treated as fatal, unsorted input to uniq, or excessive context.

Minimal concrete example

grep -h -E 'ERROR|FATAL' *.log | awk '{print $2, $3, $4}' | sort | uniq -c | sort -rn

Common misconceptions

  • “grep returning 1 means failure” -> False; it means no matches.
  • “uniq counts duplicates anywhere” -> False; input must be sorted.
  • “context lines are always safe” -> False; too much context can leak sensitive data.

Check-your-understanding questions

  1. What exit code does grep return when there are no matches?
  2. Why must you sort before uniq -c?
  3. How do you include filenames in grep output?
  4. Why would you normalize log lines before counting them?

Check-your-understanding answers

  1. Exit code 1.
  2. uniq only counts adjacent duplicates.
  3. Use -H or search multiple files to include filenames.
  4. To remove volatile fields and group identical messages.

Real-world applications

  • Incident reports in on-call rotations.
  • Monitoring pipelines that summarize errors per hour.
  • Compliance audits for recurring failures.

Where you will apply it

In the reporting stage of this project: evidence context, the NO_MATCHES exit path, and the ranked summary all rely on these mechanics.

References

  • man grep
  • The Linux Command Line (Shotts), Chapter 20
  • Effective Shell (Kerr), Chapter 6

Key insights

A log hunter is a pipeline. Matching is only the first step; reporting is the real product.

Summary

Context control, exit code handling, and aggregation transform raw grep output into a reliable incident report.

Homework/Exercises to practice the concept

  1. Build a pipeline that strips timestamps and counts unique errors.
  2. Write a script that treats grep exit code 1 as “no findings” instead of failure.
  3. Compare output with -C 1 vs -C 5 and document the difference.

Solutions to the homework/exercises

  1. grep -h -E 'ERROR|FATAL' *.log | awk '{print $2, $3, $4}' | sort | uniq -c | sort -rn
  2. Run grep -E 'ERROR' app.log, capture status=$?, and treat status 1 as “no findings” while reserving failure handling for status 2.
  3. Use diff between outputs and note how context volume grows.

3. Project Specification

3.1 What You Will Build

A CLI tool that scans a directory of log files, extracts error lines using a configurable regex, includes context around each match, and produces a deterministic summary report of unique errors with counts.

3.2 Functional Requirements

  1. Pattern filtering: accept a regex pattern and search all matching logs.
  2. Context output: include line numbers, filenames, and N context lines.
  3. Summary aggregation: produce ranked counts of unique error messages.
  4. Exit codes: differentiate between no matches and hard errors.
  5. Deterministic output: stable order of summary lines.

3.3 Non-Functional Requirements

  • Performance: handle 50MB of logs in under 10 seconds on a laptop.
  • Reliability: exit code 1 indicates no matches, not a crash.
  • Usability: default pattern for common severities.

3.4 Example Usage / Output

$ ./log_hunter.sh /var/log/app --pattern 'ERROR|FATAL|panic' --context 2

3.5 Data Formats / Schemas / Protocols

Report format (text):

SCAN_TIME=2026-01-01T12:00:00
PATTERN=ERROR|FATAL|panic
FILES_SCANNED=12

Top messages:
  19 ERROR Database timeout
  12 FATAL Out of memory

Sample context:
/path/app.log:3121:ERROR Database timeout
/path/app.log.1:97:FATAL Out of memory

3.6 Edge Cases

  • Logs with no matches.
  • Large lines (stack traces).
  • Mixed encodings or binary data (see the sketch below).
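
For the binary-data case, GNU grep's binary-file handling can be set explicitly; a hedged sketch (suspicious.log is a placeholder name):

# -I skips files grep classifies as binary instead of printing
# "Binary file ... matches"; -a forces a file to be treated as text.
grep -r -I -n -E 'ERROR|FATAL' ./fixtures/logs
grep -a -n -E 'ERROR' suspicious.log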

3.7 Real World Outcome

A concise incident report that an on-call engineer can read in minutes.

3.7.1 How to Run (Copy/Paste)

./log_hunter.sh ./fixtures/logs --pattern 'ERROR|FATAL|panic' --context 2

3.7.2 Golden Path Demo (Deterministic)

Use a fixed fixture dataset and a frozen scan timestamp of 2026-01-01T12:00:00 in the report header.

3.7.3 If CLI: exact terminal transcript

$ ./log_hunter.sh ./fixtures/logs --pattern 'ERROR|FATAL|panic' --context 2
[2026-01-01T12:00:00] TARGET=./fixtures/logs
[2026-01-01T12:00:00] PATTERN=ERROR|FATAL|panic
[2026-01-01T12:00:00] CONTEXT=2
[2026-01-01T12:00:00] FILES=3
[2026-01-01T12:00:00] REPORT=log_report_2026-01-01.txt
[2026-01-01T12:00:00] DONE

$ cat log_report_2026-01-01.txt
Top messages:
  3 ERROR Database timeout
  1 FATAL Out of memory

Sample context:
./fixtures/logs/app.log:12:ERROR Database timeout
./fixtures/logs/app.log:13:Connection reset by peer
./fixtures/logs/app.log:14:Retrying request

Failure demo (no matches):

$ ./log_hunter.sh ./fixtures/logs --pattern 'DOES_NOT_EXIST'
[2026-01-01T12:00:00] NO_MATCHES
EXIT_CODE=1

Exit codes:

  • 0: matches found and report generated
  • 1: no matches found
  • 2: invalid arguments or read errors

4. Solution Architecture

4.1 High-Level Design

logs -> grep filter -> context extract -> normalize -> sort/uniq -> report

4.2 Key Components

Component   Responsibility                  Key Decisions
CLI parser  parse target, pattern, context  sensible defaults
Matcher     run grep with pattern           use -E and -n -H
Normalizer  strip timestamps                awk field selection
Aggregator  count unique messages           sort | uniq -c

4.3 Data Structures (No Full Code)

message_counts: map[string]int

4.4 Algorithm Overview

Key Algorithm: Error Summary

  1. Run grep to select matching lines.
  2. Normalize lines to remove timestamps.
  3. Sort and count unique messages.
  4. Render top N results.

Complexity Analysis:

  • Time: O(n log n)
  • Space: O(n) for summary processing

5. Implementation Guide

5.1 Development Environment Setup

# No extra dependencies required

5.2 Project Structure

project-root/
├── log_hunter.sh
├── fixtures/
│   └── logs/
└── README.md

5.3 The Core Question You’re Answering

“How do I filter massive logs into a concise, useful incident report?”

5.4 Concepts You Must Understand First

  1. Regex matching and anchors
  2. Context flags and exit codes
  3. Aggregation pipelines and sorting

5.5 Questions to Guide Your Design

  1. What severity tokens are meaningful for your logs?
  2. How many context lines are helpful without noise?
  3. What should happen if there are zero matches?

5.6 Thinking Exercise

Draft a regex that matches only lines starting with ISO timestamps followed by a severity token. Then decide how you would count unique messages.

5.7 The Interview Questions They’ll Ask

  1. “Why does grep return exit code 1 sometimes?”
  2. “How do you show context around a match?”
  3. “Why must you sort before uniq?”

5.8 Hints in Layers

Hint 1: Basic match

grep -n -H -E 'ERROR|FATAL|panic' *.log

Hint 2: Add context

grep -n -H -C 2 -E 'ERROR|FATAL|panic' *.log

Hint 3: Summarize

grep -h -E 'ERROR|FATAL|panic' *.log | awk '{print $2, $3, $4}' | sort | uniq -c | sort -rn

5.9 Books That Will Help

Topic            Book                             Chapter
Regex basics     The Linux Command Line (Shotts)  Ch. 19
Text processing  The Linux Command Line (Shotts)  Ch. 20
Shell scripting  Effective Shell (Kerr)           Ch. 6

5.10 Implementation Phases

Phase 1: Foundation (1-2 hours)

Goals:

  • Parse args and validate target
  • Run basic grep match

Tasks:

  1. Implement --pattern and --context flags.
  2. Verify grep exit codes.

Checkpoint: matching lines printed with line numbers.

Phase 2: Core Functionality (2-3 hours)

Goals:

  • Context extraction and summary report

Tasks:

  1. Add -C and -H for context.
  2. Normalize and count unique messages.

Checkpoint: report shows top messages with counts.

Phase 3: Polish & Edge Cases (1-2 hours)

Goals:

  • Deterministic output and clean headers

Tasks:

  1. Add report headers and fixed ordering.
  2. Handle no-match case gracefully.

Checkpoint: NO_MATCHES report created with exit code 1.

5.11 Key Implementation Decisions

Decision        Options                  Recommendation  Rationale
Pattern syntax  BRE vs ERE               ERE             easier alternation
Context size    0-10 lines               2 lines         enough evidence, low noise
Summary key     full line vs normalized  normalized      reduces duplicates

6. Testing Strategy

6.1 Test Categories

Category           Purpose            Examples
Unit Tests         regex correctness  pattern matches fixtures
Integration Tests  end-to-end report  run on fixture logs
Edge Case Tests    no matches         empty report case

6.2 Critical Test Cases

  1. No matches: exit code 1 and explicit message.
  2. Case variation: ensure case handling is correct.
  3. Large file: pipeline still completes and summary is sorted.

6.3 Test Data

fixtures/logs/app.log
fixtures/logs/app.log.1

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                   Symptom           Solution
Over-broad regex          too many matches  anchor to log format
Treating exit 1 as error  script fails      handle no-match case
Missing normalization     many duplicates   strip timestamps

7.2 Debugging Strategies

  • Test regex on a 20-line fixture before scanning all logs.
  • Add set -x and inspect pipeline stages.
  • Use tee to capture intermediate output.

7.3 Performance Traps

Unbounded context on huge logs can produce massive output. Limit context size and cap report size.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --ignore-case flag.
  • Output JSON summary for easy ingestion.

8.2 Intermediate Extensions

  • Add gzip support with zgrep.
  • Add severity ranking and weighting.

8.3 Advanced Extensions

  • Build a continuous log watcher with tail -F.
  • Correlate errors across multiple services.

9. Real-World Connections

9.1 Industry Applications

  • On-call incident summaries.
  • SLA error rate tracking.
  • Compliance monitoring for critical failures.

9.2 Related Tools

  • logrotate: log management and rotation.
  • ripgrep: high-performance text search.

9.3 Interview Relevance

  • Regex and text filtering.
  • Exit code handling.
  • Pipeline design for reports.

10. Resources

10.1 Essential Reading

  • The Linux Command Line (Shotts), Chapters 19-20
  • Effective Shell (Kerr), Chapter 6

10.2 Video Resources

  • “Regex for Log Analysis” (conference talk)
  • “Unix Pipelines for Ops” (YouTube)

10.3 Tools & Documentation

  • man grep
  • man sort
  • man uniq

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain leftmost-longest matching.
  • I can explain grep exit codes.
  • I understand why sorting is required for uniq.

11.2 Implementation

  • Report includes counts and sample context.
  • Script handles no matches correctly.
  • Output is deterministic.

11.3 Growth

  • I can propose one improvement to pattern quality.
  • I documented a false positive and fixed it.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Pattern search works and produces a summary report.
  • Context lines are included for at least one match.
  • Exit code 1 used for no matches.

Full Completion:

  • Deterministic summary with sorted output.
  • Patterns anchored to log format.

Excellence (Going Above & Beyond):

  • Report includes severity weighting and top-N by service.
  • Support for compressed logs included.