Project 2: Log File Analyzer
Build a pipeline or short script that turns raw access logs into a ranked, human-readable summary of client IPs and requested URLs, with optional filtering by status code and date.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8 to 12 hours |
| Main Programming Language | Bash (with awk/sort/uniq) |
| Alternative Programming Languages | Python, Go, Ruby |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | Level 2: Internal Analytics |
| Prerequisites | Shell basics, comfort reading logs |
| Key Topics | Log formats, pipelines, awk field extraction, regex filtering, aggregation |
1. Learning Objectives
By completing this project, you will:
- Parse real access logs and extract relevant fields reliably.
- Build a multi-stage pipeline that transforms large datasets into summaries.
- Use awk, sort, and uniq -c correctly and efficiently.
- Handle malformed input without breaking your analysis.
- Produce a report that can be compared across runs.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: Access Log Formats and Field Extraction
Fundamentals
Access logs are structured but not self-describing. Each line represents a single request and encodes information in a fixed order: client IP, timestamp, request line, status code, and response size. To analyze logs correctly, you must know the format and how to extract the fields you care about. Apache and Nginx often use the Common Log Format (CLF) or Combined Log Format. Both are whitespace-separated, but the request line is quoted and contains spaces, which means naive splitting can break. awk solves this by treating whitespace as field separators, but you must be aware of which fields contain quotes and how many fields you can trust. A good analyzer starts by assuming the log format and then validating that assumption by checking field counts and patterns.
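Before writing any extraction logic, you can test the format assumption directly by looking at the distribution of field counts. A minimal sketch (access.log is a placeholder path):
# How many whitespace-separated fields does each line have?
awk '{print NF}' access.log | sort -n | uniq -c
# A clean CLF file shows one dominant count (10); Combined logs vary
# because the quoted user-agent itself contains spaces.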
Deep Dive into the Concept
The Common Log Format typically looks like this:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
The Combined Log Format adds referrer and user-agent fields at the end, which are also quoted:
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0"
These formats are mostly whitespace-separated, but the quoted fields contain spaces, so you cannot blindly split the line into tokens and assume field 7 is always the URL unless you know the format. In CLF, the request line spans fields 6 through 8 under default whitespace splitting: field 6 is the quoted method, field 7 is the path, and field 8 is the protocol. In Combined Log Format, the same positions still exist, but extra quoted fields appear later. This is why a safe approach is to extract the request line by parsing between quotes, or at least to check that field 6 starts with a quote and field 8 ends with a quote. Another robust method is to run awk with the double quote as a custom field separator (-F'"') so the line splits on quotes; the request line then appears as $2, with the remaining fields in $1 and $3.
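A minimal sketch of that quote-based approach, assuming CLF/Combined input where $2 holds the request line after splitting on double quotes:
# Split on double quotes: $1 = IP/ident/user/timestamp, $2 = request line.
awk -F'"' '{
  n = split($2, req, " ")   # req[1]=method, req[2]=path, req[3]=protocol
  split($1, pre, " ")       # pre[1]=client IP
  if (n >= 2) print pre[1], req[1], req[2]
}' access.log | head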
Understanding timestamps is another challenge. Logs use formats like [10/Oct/2000:13:55:36 -0700]. If you want to filter by date, you must parse or match the date portion carefully. A simple grep for 21/Dec/2025 is often enough for daily filters, but if you need ranges you will need to parse timestamps into comparable values. For this project, a simple pattern match by date prefix is acceptable, but you must document that limitation.
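A date-prefix filter of the kind described here might look like the sketch below, assuming the CLF timestamp sits in field 4 (e.g., [21/Dec/2025:15:00:00):
# Keep only lines whose timestamp starts with the given date.
awk -v d="21/Dec/2025" 'index($4, "[" d) == 1' access.log | head
# Quick-and-dirty alternative mentioned above: grep '21/Dec/2025' access.log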
Malformed lines are common: log rotation interruptions, truncated lines, or custom log formats. A robust analyzer should ignore lines that do not match the expected pattern rather than failing. You can use awk 'NF >= 9' as a minimum guard for CLF, and additional checks like ($9 ~ /^[0-9][0-9][0-9]$/) to ensure the status code is numeric. This protects your analysis from garbage input. It also makes your results more reliable, because you are not counting broken lines.
Field extraction should be explicit and testable. If you say “field 1 is IP, field 9 is status” you should verify that assumption by printing those fields for a few lines. awk '{print $1, $9}' access.log | head is a simple sanity check. For URLs, the path is usually field 7 in CLF, but only if the request line is properly quoted. If you use a quote-based field separator, you can extract the request line as $2 and then split it into method, path, and protocol. This method is more robust and is worth practicing because it generalizes to other logs with quoted fields.
When you are done extracting, treat the fields as data. Do not rely on manual inspection for correctness; instead, count the number of lines processed, the number skipped, and the number that matched filters. This gives you confidence that your analyzer did not silently drop data. You can include these counts in your final report as a sanity check. This approach mirrors real operational analytics pipelines: correctness is about accounting, not just formatting.
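One way to build that accounting into a single awk pass (a sketch, assuming CLF field positions; the counts go to stderr so they do not pollute the pipeline):
# Track processed, matched, and skipped lines while extracting IPs.
awk '
  { processed++ }
  NF >= 9 && $9 ~ /^[0-9][0-9][0-9]$/ { matched++; print $1; next }
  { skipped++ }
  END { printf "processed=%d matched=%d skipped=%d\n", processed, matched, skipped > "/dev/stderr" }
' access.log | sort | uniq -c | sort -nr | head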
How This Fits in Projects
This concept is central to §3 (Project Specification) and §5.4 (Concepts You Must Understand First). It is also used in the filtering logic in §5.5 and the test cases in §6.2. The same field-extraction discipline applies to P04 System Resource Monitor when parsing command output.
Definitions & Key Terms
- Common Log Format (CLF): Standard web server log format.
- Combined Log Format: CLF plus referrer and user-agent fields.
- Request line: The quoted string containing method, path, and protocol.
- Malformed line: A log entry that does not match expected format.
- Field separator: The character(s) used to split input into fields.
Mental Model Diagram (ASCII)
Raw log line
|
v
Split into fields ----> validate pattern ----> extract IP/status/URL
How It Works (Step-by-Step)
- Read each log line as a full record.
- Validate minimum field count and pattern.
- Extract IP from field 1.
- Extract status code from field 9 (CLF assumption).
- Extract URL from request line (field 7 or quote-based parsing).
- Emit fields for downstream aggregation.
Invariants: the request line is quoted; status is numeric. Failure modes: custom log formats, missing quotes, truncated lines.
Minimal Concrete Example
# Quote-based parsing
awk -F'"' '{print $1, $2}' access.log | head
Common Misconceptions
- “Field 7 is always the URL.” -> Only if the log format matches CLF and is well-formed.
- “All lines are valid.” -> Production logs often contain malformed lines.
- “grep is enough.” -> Without proper parsing, you can miscount fields.
Check-Your-Understanding Questions
- Why does awk '{print $7}' sometimes fail to extract the URL?
- What is the difference between CLF and Combined Log Format?
- How can you detect malformed lines quickly?
Check-Your-Understanding Answers
- Because the request line is quoted and contains spaces; field positions can shift.
- Combined adds referrer and user-agent fields at the end.
- Check NF and validate that the status field is numeric.
Real-World Applications
- Website analytics dashboards
- Security analysis (top IPs, suspicious paths)
- Capacity planning using traffic patterns
Where You’ll Apply It
- Project 2: §3.2, §5.4, §6.2
- Also used in: P04 System Resource Monitor
References
- The Linux Command Line (Shotts), Ch. 19-20
- Wicked Cool Shell Scripts (Taylor), Ch. 1
Key Insight
Log parsing is a contract between the log format and your extraction logic.
Summary
Correct field extraction is the difference between meaningful analytics and noise.
Homework/Exercises to Practice the Concept
- Identify the IP, status, and URL fields from three sample log lines.
- Write an awk command that prints only valid lines with numeric status.
- Use -F'"' to split on quotes and extract the request line.
Solutions to the Homework/Exercises
- Print field numbers with awk '{for (i=1;i<=NF;i++) print i,$i}'.
- awk 'NF >= 9 && $9 ~ /^[0-9][0-9][0-9]$/' access.log
- awk -F'"' '{print $2}' access.log | head
Concept 2: Pipelines, Filtering, and Aggregation
Fundamentals
The Unix philosophy encourages building complex behavior by chaining simple tools. In this project, you will build a pipeline where each command performs a small transformation: extract a field, filter by a condition, sort, count, and rank. The order of stages matters. sort must come before uniq -c, and numeric sorting must be explicit with sort -nr. These details are what make the pipeline correct. You will also learn to reduce input size early by filtering before sorting, which is a key performance idea.
Deep Dive into the Concept
A pipeline connects the stdout of one command to the stdin of the next. This enables a streaming dataflow: each line is processed in sequence and no intermediate files are required. For log analysis, this is ideal because logs can be large and you want to avoid loading everything into memory. A typical pipeline might look like:
awk '...extract...' access.log | sort | uniq -c | sort -nr | head -n 10
The first awk stage extracts the field you care about (IP or URL) and optionally filters by status code or date. This stage is crucial because it reduces the data volume. If you wait to filter until after sorting, you waste time sorting irrelevant lines. This is a fundamental principle: filter early, aggregate later.
sort groups identical values together. uniq -c counts consecutive duplicates, which is why sorting is required. The output of uniq -c includes counts with leading spaces; sort -nr still compares them correctly because numeric sorting ignores leading blanks, and you can normalize the spacing with awk if you need clean columns. head -n 10 limits output to the top N results. The pipeline is deterministic if you control locale; set LC_ALL=C so sorting is byte-based and consistent across machines. This is particularly important if your logs contain non-ASCII characters or your environment uses a locale with different collation rules.
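A small sketch of pinning the locale so the report is byte-for-byte comparable across machines:
# Force byte-wise collation for every stage of the pipeline.
export LC_ALL=C
awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 10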
Filtering with grep or awk can be done in many ways. grep is great for simple string matches, but awk can combine filtering and extraction in one step, reducing process overhead. For example, awk '$9==404 {print $1}' both filters and extracts the IP. This is more efficient than grep followed by awk. Regex filtering is powerful but should be used carefully; in log analysis, it’s often better to filter by exact numeric status rather than by textual matches. When you do use regex, remember that grep and awk use slightly different regex flavors; always test on a small sample.
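Combining the status and date filters with extraction in a single awk stage, as described above (a sketch assuming CLF field positions; the variable names are illustrative):
# One process filters by status and date and emits only the URL path.
awk -v st=404 -v d="21/Dec/2025" '$9 == st && index($4, "[" d) == 1 {print $7}' access.log |
  sort | uniq -c | sort -nr | head -n 10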
Another important concept is stability and reproducibility. If you are comparing results across runs, you need deterministic ordering when counts are equal. sort -nr will not guarantee stable ordering for equal values. If you want deterministic output, you can add a secondary sort key, such as the field itself: sort -k1,1nr -k2,2. This ensures that ties are broken consistently. This matters in monitoring reports where you want consistent output for diffs.
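A tiny demonstration of the tie-breaking behavior (the printf input stands in for uniq -c output):
printf '  3 /a\n  3 /b\n 10 /c\n' | sort -k1,1nr -k2,2
# Output: /c first (count 10), then /a and /b in a fixed order despite equal counts.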
Finally, pipelines should report errors. If awk fails due to an invalid script or sort fails because of disk space, the pipeline may still exit with the status of the last command by default. You can use set -o pipefail to ensure failures are detected. In a log analyzer, a silent failure is worse than a loud one. Consider writing a wrapper script that checks exit codes and prints a clear error message if any stage fails.
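A hedged sketch of that wrapper idea: run the pipeline with pipefail enabled so a failure in any stage is reported, capture the full result, and only then trim to the top N (which also keeps head from terminating the pipeline early):
#!/usr/bin/env bash
set -u -o pipefail

# The exit status reflects every stage of the pipeline, not just the last.
if ! counts=$(awk 'NF >= 9 {print $1}' "access.log" | sort | uniq -c | sort -nr); then
  echo "ERROR: log analysis pipeline failed" >&2
  exit 1
fi
printf '%s\n' "$counts" | head -n 10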
How This Fits in Projects
This concept defines the implementation approach in §4 (Solution Architecture) and the pipeline examples in §3.4. It also informs the testing strategy in §6.2 by defining where to validate each pipeline stage. These pipeline skills also apply directly to P01 Website Status Checker when formatting output.
Definitions & Key Terms
- Pipeline: A chain of commands connected by pipes.
- Aggregation: Summarizing data (counts, top N).
- Filtering: Removing lines that do not match conditions.
- Deterministic sorting: Sorting with explicit keys and locale.
- Streaming: Processing data line-by-line without full buffering.
Mental Model Diagram (ASCII)
log -> extract -> filter -> sort -> count -> rank -> report
How It Works (Step-by-Step)
- Extract relevant field(s) with awk.
- Filter by status or date at extraction time.
- Sort the extracted values.
- Count duplicates with uniq -c.
- Sort counts descending.
- Take the top N results.
Invariants: sort before uniq -c; filter early for efficiency.
Failure modes: unsorted input to uniq, locale-dependent ordering.
Minimal Concrete Example
awk '$9==404 {print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
Common Misconceptions
- “uniq -c counts regardless of order.” -> It only counts adjacent duplicates.
- “Sorting after uniq is enough.” -> You must sort before uniq -c.
- “Locale does not matter.” -> It changes sort order and can break tests.
Check-Your-Understanding Questions
- Why must sort come before uniq -c?
- How do you make output deterministic when counts tie?
- Why is filtering early more efficient?
Check-Your-Understanding Answers
- uniq only counts adjacent duplicates, so identical values must be grouped.
- Use a secondary sort key such as the value itself (sort -k1,1nr -k2,2).
- It reduces the amount of data that later stages must process.
Real-World Applications
- Top talkers in firewall logs
- Most common endpoints in API logs
- Error frequency analysis
Where You’ll Apply It
- Project 2: §3.4, §4.4, §6.2
- Also used in: P01 Website Status Checker
References
- The Linux Command Line (Shotts), Ch. 6 and 20
- Wicked Cool Shell Scripts (Taylor), Ch. 1-2
Key Insight
A pipeline is a dataflow graph; correctness depends on the order of stages.
Summary
Filtering, sorting, and aggregation are the backbone of CLI analytics.
Homework/Exercises to Practice the Concept
- Build a pipeline to list the top 5 URLs in a log file.
- Modify it to show only 404 errors.
- Add a stable secondary sort key so ties are predictable.
Solutions to the Homework/Exercises
- awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 5
- awk '$9==404 {print $7}' access.log | sort | uniq -c | sort -nr | head -n 5
- awk '{print $7}' access.log | sort | uniq -c | sort -k1,1nr -k2,2 | head -n 5
3. Project Specification
3.1 What You Will Build
You will build a CLI pipeline or short script named analyze_log.sh that reads an access log and outputs the top 10 client IPs and top 10 requested URLs. It will support optional filters by status code and date prefix, handle malformed lines gracefully, and print a deterministic summary. It will not implement a full log parser or support arbitrary log formats beyond CLF/Combined; it will document assumptions explicitly.
3.2 Functional Requirements
- Input: Accept a log file path.
- Parsing: Assume CLF/Combined format and document this assumption.
- Filtering: Optional --status filter (e.g., 404) and --date filter (e.g., 21/Dec/2025).
- Aggregation: Output top 10 IPs and top 10 URLs.
- Malformed lines: Skip invalid lines and report the count skipped.
- Determinism: Use LC_ALL=C and stable sorting.
- Exit codes: 0 on success, 1 on usage error, 2 if file not readable.
3.3 Non-Functional Requirements
- Performance: Must handle multi-GB logs using streaming commands.
- Reliability: No partial output if input file is missing.
- Usability: Clear usage message and consistent output format.
3.4 Example Usage / Output
$ ./analyze_log.sh access.log --status 404
Top 10 IPs for status 404:
52 203.0.113.9
41 198.51.100.12
30 192.0.2.55
Top 10 URLs for status 404:
18 /robots.txt
12 /favicon.ico
9 /admin
Skipped malformed lines: 3
3.5 Data Formats / Schemas / Protocols
Input log format (CLF/Combined example):
203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234
Output format:
- Two ranked lists (IPs and URLs)
- Optional skipped line count
- Deterministic numeric ordering
3.6 Edge Cases
- Log file contains malformed or truncated lines.
- Log file uses unexpected format (non-CLF).
- Status code field missing or non-numeric.
- URL field missing or a literal -.
- File is large and requires streaming.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
LC_ALL=C ./analyze_log.sh access.log --status 404 --date 21/Dec/2025
3.7.2 Golden Path Demo (Deterministic)
Set LC_ALL=C to ensure sort ordering and stable output for comparisons.
3.7.3 If CLI: Exact Terminal Transcript
$ LC_ALL=C ./analyze_log.sh access.log --status 404
Top 10 IPs for status 404:
52 203.0.113.9
41 198.51.100.12
30 192.0.2.55
Top 10 URLs for status 404:
18 /robots.txt
12 /favicon.ico
9 /admin
Skipped malformed lines: 3
Exit code: 0
3.7.4 Failure Demo (Bad Input)
$ ./analyze_log.sh missing.log
ERROR: cannot read log file: missing.log
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
log file -> filter -> extract -> sort -> uniq -c -> sort -nr -> top N
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Parser | Validate line format | Use NF and status numeric checks |
| Filter | Apply status/date constraints | Implement in awk for speed |
| Aggregator | Count top IPs and URLs | Sort before uniq |
| Reporter | Format output | Deterministic ordering and counts |
4.3 Data Structures (No Full Code)
status_filter=""
date_filter=""
invalid_count=0
4.4 Algorithm Overview
Key Algorithm: Top-N Aggregation
- Read log lines.
- Filter malformed lines.
- Extract field (IP or URL).
- Sort values.
- Count with uniq -c.
- Sort counts descending and take top N.
Complexity Analysis:
- Time: O(n log n) due to sorting.
- Space: O(n) for sort buffers (external sort if large).
5. Implementation Guide
5.1 Development Environment Setup
which awk sort uniq
5.2 Project Structure
analyze-log/
├── analyze_log.sh
├── sample.log
└── tests/
└── fixtures/
5.3 The Core Question You’re Answering
“How do I convert raw log noise into a ranked summary that reveals real behavior?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Log format assumptions and field positions (Concept 1)
- Pipeline ordering and aggregation (Concept 2)
5.5 Questions to Guide Your Design
- Which log format are you assuming and how will you validate it?
- Should malformed lines be counted or ignored silently?
- How will you keep output deterministic for comparisons?
- Will you allow filtering by method or only by status/date?
5.6 Thinking Exercise
Given the log line below, identify the IP, URL, and status code:
203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234
Explain why your field extraction method works even if the user-agent field exists.
5.7 The Interview Questions They’ll Ask
- “Why must sort come before uniq -c?”
- “How do you handle malformed log lines?”
- “How would you speed this up for very large logs?”
- “How do you ensure deterministic output?”
5.8 Hints in Layers
Hint 1: Extract IPs
awk 'NF >= 9 {print $1}' access.log
Hint 2: Filter by status
awk 'NF >= 9 && $9 == 404 {print $1}' access.log
Hint 3: Top N
awk 'NF >= 9 {print $1}' access.log | sort | uniq -c | sort -nr | head -n 10
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Redirection and pipes | The Linux Command Line | Ch. 6 |
| Regex and text parsing | The Linux Command Line | Ch. 19-20 |
| Practical pipelines | Wicked Cool Shell Scripts | Ch. 1-2 |
5.10 Implementation Phases
Phase 1: Foundation (2 hours)
Goals:
- Parse file and extract IPs
- Produce a top 10 list
Tasks:
- Implement basic pipeline.
- Verify output on small sample.
Checkpoint: Top 10 IPs printed correctly.
Phase 2: Core Functionality (3 hours)
Goals:
- Add URL list and status filter
- Add malformed line handling
Tasks:
- Implement filter flags.
- Add NF checks and skip counters.
Checkpoint: Filters work and skipped line count is correct.
Phase 3: Polish & Edge Cases (2 hours)
Goals:
- Deterministic ordering
- Clear usage output
Tasks:
- Set LC_ALL=C in the script.
- Add usage/help and exit codes.
Checkpoint: Output format is stable and script exits correctly on bad input.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Parsing strategy | Field numbers vs quote-split | Field numbers + validation | Simpler and adequate for CLF |
| Filtering | grep vs awk | awk | One process, combined filter/extract |
| Output | Free-form vs stable | Stable columns | Easier to compare across runs |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate parsing logic | IP/status extraction |
| Integration Tests | Run pipeline on sample log | top 10 IPs/URLs |
| Edge Case Tests | Malformed lines | truncated lines, missing fields |
6.2 Critical Test Cases
- Valid CLF line: Extract IP, status, URL correctly.
- Malformed line: Skipped and counted.
- Status filter: Only counts matching status.
- Deterministic ordering: Ties break consistently.
6.3 Test Data
203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234
198.51.100.12 - - [21/Dec/2025:15:01:00 +0000] "GET /robots.txt HTTP/1.1" 404 234
broken line here
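One way to turn fixtures like these into a repeatable check is a golden-file comparison. A sketch, assuming hypothetical fixture names (tests/fixtures/sample.log, tests/fixtures/expected_404.txt):
# Compare the analyzer's output against a stored expected report.
LC_ALL=C ./analyze_log.sh tests/fixtures/sample.log --status 404 > /tmp/actual.txt
if diff -u tests/fixtures/expected_404.txt /tmp/actual.txt; then
  echo "PASS: status filter report matches"
else
  echo "FAIL: output differs from golden file" >&2
  exit 1
fi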
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No sort before uniq | Wrong counts | Add sort before uniq -c |
| Wrong field numbers | Empty output | Inspect awk '{print NF, $0}' |
| Locale differences | Output order changes | Set LC_ALL=C |
7.2 Debugging Strategies
- Print field positions with awk '{for (i=1;i<=NF;i++) print i,$i}'.
- Test each stage of the pipeline separately (see the sketch below).
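A sketch of stage-by-stage inspection: cut the pipeline short after each command and eyeball the intermediate output.
awk 'NF >= 9 {print $1}' access.log | head                    # extraction correct?
awk 'NF >= 9 {print $1}' access.log | sort | head             # values grouped?
awk 'NF >= 9 {print $1}' access.log | sort | uniq -c | head   # counts plausible?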
7.3 Performance Traps
- Sorting large logs without enough disk space can fail; use a /tmp with enough free space, or sort -T to choose another temporary directory (example below).
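If the default temporary directory is too small, sort can be pointed elsewhere; the path below is only an example:
# Keep sort's temporary files on a filesystem with enough free space.
awk '{print $7}' access.log | sort -T /var/tmp | uniq -c | sort -nr | head -n 10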
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a --top N flag.
- Add a --method GET filter.
8.2 Intermediate Extensions
- Output CSV for IP counts and URL counts.
- Add a separate report for user agents.
8.3 Advanced Extensions
- Support JSON logs by adding a jq-based path.
- Add a histogram of response sizes.
9. Real-World Connections
9.1 Industry Applications
- Security: identify brute-force or scanning IPs.
- Performance: spot hot endpoints.
9.2 Related Open Source Projects
- GoAccess: real-time web log analyzer.
- AWStats: classic log analysis tool.
9.3 Interview Relevance
- Data pipelines: streaming vs batch processing.
- Text processing: regex and field extraction.
10. Resources
10.1 Essential Reading
- The Linux Command Line by William E. Shotts - Ch. 6, 19, 20
- Wicked Cool Shell Scripts by Dave Taylor - Ch. 1-2
10.2 Video Resources
- “Intro to Log Analysis” (any systems or security course)
10.3 Tools & Documentation
- awk: man awk for field splitting and regex
- sort/uniq: man sort, man uniq
10.4 Related Projects in This Series
- Project 1: Website Status Checker: similar parsing and reporting
- Project 3: Automated Backup Script: shell scripting robustness
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the difference between CLF and Combined logs.
- I can justify why sort must precede uniq -c.
- I can describe how to detect malformed lines.
11.2 Implementation
- All functional requirements are met.
- Output is deterministic and reproducible.
- Script exits with correct codes on errors.
11.3 Growth
- I can analyze a log file without guessing field positions.
- I can explain this pipeline in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Script outputs top 10 IPs and URLs.
- Optional status filter works.
- Malformed lines are skipped and counted.
Full Completion:
- All minimum criteria plus:
- Deterministic ordering with LC_ALL=C and tie-breaking.
- Clear usage/help and exit codes.
Excellence (Going Above & Beyond):
- Supports multiple log formats with a --format flag.
- Produces CSV and JSON summaries for integration.