Project 2: Log File Analyzer

Build a pipeline or short script that turns raw access logs into a ranked, human-readable summary of client IPs and requested URLs, with optional filtering by status code and date.

Quick Reference

  • Difficulty: Level 2 (Intermediate)
  • Time Estimate: 8 to 12 hours
  • Main Programming Language: Bash (with awk/sort/uniq)
  • Alternative Programming Languages: Python, Go, Ruby
  • Coolness Level: Level 3 (Genuinely Clever)
  • Business Potential: Level 2 (Internal Analytics)
  • Prerequisites: Shell basics, comfort reading logs
  • Key Topics: Log formats, pipelines, awk field extraction, regex filtering, aggregation

1. Learning Objectives

By completing this project, you will:

  1. Parse real access logs and extract relevant fields reliably.
  2. Build a multi-stage pipeline that transforms large datasets into summaries.
  3. Use awk, sort, and uniq -c correctly and efficiently.
  4. Handle malformed input without breaking your analysis.
  5. Produce a report that can be compared across runs.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Access Log Formats and Field Extraction

Fundamentals

Access logs are structured but not self-describing. Each line represents a single request and encodes information in a fixed order: client IP, timestamp, request line, status code, and response size. To analyze logs correctly, you must know the format and how to extract the fields you care about. Apache and Nginx often use the Common Log Format (CLF) or Combined Log Format. Both are whitespace-separated, but the request line is quoted and contains spaces, which means naive splitting can break. awk solves this by treating whitespace as field separators, but you must be aware of which fields contain quotes and how many fields you can trust. A good analyzer starts by assuming the log format and then validating that assumption by checking field counts and patterns.

Deep Dive into the Concept

The Common Log Format typically looks like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

The Combined Log Format adds referrer and user-agent fields at the end, which are also quoted:

127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0"

These formats are mostly whitespace-separated, but the quoted fields contain spaces, so you cannot blindly split the line into tokens and assume field 7 is always the URL unless you know the format. In CLF, the request line spans fields 6 through 8 under default whitespace splitting: field 6 is the method with a leading quote, field 7 is the path, and field 8 is the protocol with a trailing quote. In the Combined Log Format, the same positions still exist, but extra quoted fields appear later. This is why a safe approach is to extract the request line by parsing between quotes, or at least to check that field 6 starts with a quote and field 8 ends with one. Another robust method is to run awk with the double quote as the field separator (-F'"'), so the request line becomes $2 while $1 carries the IP, user, and timestamp and $3 carries the status and size.
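
A minimal sketch of the quote-based approach (access.log is a placeholder file name); once the request line is isolated as $2, it can be split into its three parts:

# Split each line on double quotes, then split the request line on spaces
awk -F'"' '{
    n = split($2, req, " ")              # req[1]=method, req[2]=path, req[3]=protocol
    if (n >= 3) print req[1], req[2]     # skip malformed request lines
}' access.log | head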

Understanding timestamps is another challenge. Logs use formats like [10/Oct/2000:13:55:36 -0700]. If you want to filter by date, you must parse or match the date portion carefully. A simple grep for 21/Dec/2025 is often enough for daily filters, but if you need ranges you will need to parse timestamps into comparable values. For this project, a simple pattern match by date prefix is acceptable, but you must document that limitation.
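
For example, a day filter can be a fixed-string match on the whole line or a prefix match on the timestamp field (both sketches assume CLF field positions; the date is illustrative):

# Whole-line fixed-string match: simple, but can match the date anywhere on the line
grep -F "21/Dec/2025" access.log

# Prefix match on the timestamp field, which begins with "[" in CLF
awk -v d="21/Dec/2025" 'index($4, "[" d) == 1' access.log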

Malformed lines are common: log rotation interruptions, truncated lines, or custom log formats. A robust analyzer should ignore lines that do not match the expected pattern rather than failing. You can use awk 'NF >= 9' as a minimum guard for CLF, and additional checks like ($9 ~ /^[0-9][0-9][0-9]$/) to ensure the status code is numeric. This protects your analysis from garbage input. It also makes your results more reliable, because you are not counting broken lines.
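
Putting these guards together, a sketch of a guarded extraction under the CLF assumption:

# Keep only lines that look like CLF, then print IP, status, and path
awk 'NF >= 9 && $9 ~ /^[0-9][0-9][0-9]$/ {print $1, $9, $7}' access.log | head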

Field extraction should be explicit and testable. If you say “field 1 is IP, field 9 is status” you should verify that assumption by printing those fields for a few lines. awk '{print $1, $9}' access.log | head is a simple sanity check. For URLs, the request line is usually field 7 in CLF, but only if the request line is properly quoted. If you use a quote-based field separator, you can extract the request line as $2 and then split it into method, path, and protocol. This method is more robust and is worth practicing because it generalizes to other logs with quoted fields.

When you are done extracting, treat the fields as data. Do not rely on manual inspection for correctness; instead, count the number of lines processed, the number skipped, and the number that matched filters. This gives you confidence that your analyzer did not silently drop data. You can include these counts in your final report as a sanity check. This approach mirrors real operational analytics pipelines: correctness is about accounting, not just formatting.
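
A minimal accounting sketch under the CLF assumption, with a hard-coded 404 filter for illustration; the counters go to stderr so they do not pollute the data stream (writing to /dev/stderr works in gawk and on Linux generally):

awk -v status=404 '
    { processed++ }                                             # every line read
    NF < 9 || $9 !~ /^[0-9][0-9][0-9]$/ { skipped++; next }     # malformed: count and drop
    $9 == status { matched++; print $1 }                        # matching lines feed the pipeline
    END {
        printf "processed=%d skipped=%d matched=%d\n",
               processed + 0, skipped + 0, matched + 0 > "/dev/stderr"
    }
' access.log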

How This Fits in Projects

This concept is central to §3 (Project Specification) and §5.4 (Concepts You Must Understand First). It is also used in the filtering logic in §5.5 and the test cases in §6.2. The same field-extraction discipline applies to P04 System Resource Monitor when parsing command output.

Definitions & Key Terms

  • Common Log Format (CLF): Standard web server log format.
  • Combined Log Format: CLF plus referrer and user-agent fields.
  • Request line: The quoted string containing method, path, and protocol.
  • Malformed line: A log entry that does not match expected format.
  • Field separator: The character(s) used to split input into fields.

Mental Model Diagram (ASCII)

Raw log line
   |
   v
Split into fields ----> validate pattern ----> extract IP/status/URL

How It Works (Step-by-Step)

  1. Read each log line as a full record.
  2. Validate minimum field count and pattern.
  3. Extract IP from field 1.
  4. Extract status code from field 9 (CLF assumption).
  5. Extract URL from request line (field 7 or quote-based parsing).
  6. Emit fields for downstream aggregation.

Invariants: the request line is quoted; status is numeric. Failure modes: custom log formats, missing quotes, truncated lines.

Minimal Concrete Example

# Quote-based parsing: with -F'"' the request line is $2; $1 holds the IP, user, and timestamp
awk -F'"' '{print $1, $2}' access.log | head

Common Misconceptions

  • “Field 7 is always the URL.” -> Only if the log format matches CLF and is well-formed.
  • “All lines are valid.” -> Production logs often contain malformed lines.
  • “grep is enough.” -> Without proper parsing, you can miscount fields.

Check-Your-Understanding Questions

  1. Why does awk '{print $7}' sometimes fail to extract the URL?
  2. What is the difference between CLF and Combined Log Format?
  3. How can you detect malformed lines quickly?

Check-Your-Understanding Answers

  1. Because the request line is quoted and contains spaces; field positions can shift.
  2. Combined adds referrer and user-agent fields at the end.
  3. Check NF and validate that the status field is numeric.

Real-World Applications

  • Website analytics dashboards
  • Security analysis (top IPs, suspicious paths)
  • Capacity planning using traffic patterns

Where You’ll Apply It

  • P04 System Resource Monitor, which relies on the same field-extraction discipline when parsing command output.

References

  • The Linux Command Line (Shotts), Ch. 19-20
  • Wicked Cool Shell Scripts (Taylor), Ch. 1

Key Insight

Log parsing is a contract between the log format and your extraction logic.

Summary

Correct field extraction is the difference between meaningful analytics and noise.

Homework/Exercises to Practice the Concept

  1. Identify the IP, status, and URL fields from three sample log lines.
  2. Write an awk command that prints only valid lines with numeric status.
  3. Use -F'"' to split on quotes and extract the request line.

Solutions to the Homework/Exercises

  1. Print field numbers with awk '{for (i=1;i<=NF;i++) print i,$i}'.
  2. awk 'NF >= 9 && $9 ~ /^[0-9][0-9][0-9]$/' access.log
  3. awk -F'"' '{print $2}' access.log | head

Concept 2: Pipelines, Filtering, and Aggregation

Fundamentals

The Unix philosophy encourages building complex behavior by chaining simple tools. In this project, you will build a pipeline where each command performs a small transformation: extract a field, filter by a condition, sort, count, and rank. The order of stages matters. sort must come before uniq -c, and numeric sorting must be explicit with sort -nr. These details are what make the pipeline correct. You will also learn to reduce input size early by filtering before sorting, which is a key performance idea.

Deep Dive into the Concept

A pipeline connects the stdout of one command to the stdin of the next. This enables a streaming dataflow: each line is processed in sequence and no intermediate files are required. For log analysis, this is ideal because logs can be large and you want to avoid loading everything into memory. A typical pipeline might look like:

awk '...extract...' access.log | sort | uniq -c | sort -nr | head -n 10

The first awk stage extracts the field you care about (IP or URL) and optionally filters by status code or date. This stage is crucial because it reduces the data volume. If you wait to filter until after sorting, you waste time sorting irrelevant lines. This is a fundamental principle: filter early, aggregate later.

sort groups identical values together. uniq -c counts consecutive duplicates, which is why sorting is required. The output of uniq -c prefixes each value with a right-aligned count, leading spaces included; you can reformat this with awk, and sort -nr handles it regardless because its numeric comparison skips leading blanks. head -n 10 limits output to the top N results. The pipeline is deterministic only if you control the locale: set LC_ALL=C so sorting is byte-based and consistent across machines. This is particularly important if your logs contain non-ASCII characters or your environment uses a locale with different collation rules.
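
A quick way to pin the locale for every stage is to export it once (shown here for a simple URL ranking):

export LC_ALL=C
awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 10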

Filtering with grep or awk can be done in many ways. grep is great for simple string matches, but awk can combine filtering and extraction in one step, reducing process overhead. For example, awk '$9==404 {print $1}' both filters and extracts the IP. This is more efficient than grep followed by awk. Regex filtering is powerful but should be used carefully; in log analysis, it’s often better to filter by exact numeric status rather than by textual matches. When you do use regex, remember that grep and awk use slightly different regex flavors; always test on a small sample.
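
A small comparison of the two filtering styles under the CLF assumption:

# Two processes; the textual pattern can also match "404" elsewhere on the line
grep ' 404 ' access.log | awk '{print $1}'

# One process; tests the status field exactly and extracts in the same step
awk '$9 == 404 {print $1}' access.log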

Another important concept is stability and reproducibility. If you are comparing results across runs, you need deterministic ordering when counts are equal. sort -nr will not guarantee stable ordering for equal values. If you want deterministic output, you can add a secondary sort key, such as the field itself: sort -k1,1nr -k2,2. This ensures that ties are broken consistently. This matters in monitoring reports where you want consistent output for diffs.

Finally, pipelines should report errors. If awk fails due to an invalid script, or sort fails because it runs out of disk space, the pipeline still exits with the status of the last command by default, so the failure can go unnoticed. Use set -o pipefail so that a failure in any stage is reflected in the pipeline's exit status. In a log analyzer, a silent failure is worse than a loud one. Consider writing a wrapper script that checks exit codes and prints a clear error message if any stage fails.
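
A minimal wrapper sketch (the log and report file names are illustrative). Note that head runs on the saved result rather than inside the guarded pipeline, because head exiting early can make upstream stages die with SIGPIPE and look like failures under pipefail:

#!/usr/bin/env bash
set -u -o pipefail
export LC_ALL=C

logfile=${1:-access.log}

if ! awk 'NF >= 9 {print $1}' "$logfile" | sort | uniq -c | sort -nr > ip_counts.txt; then
    echo "ERROR: analysis pipeline failed for $logfile" >&2
    exit 1
fi

head -n 10 ip_counts.txt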

How This Fits in Projects

This concept defines the implementation approach in §4 (Solution Architecture) and the pipeline examples in §3.4. It also informs the testing strategy in §6.2 by defining where to validate each pipeline stage. These pipeline skills also apply directly to P01 Website Status Checker when formatting output.

Definitions & Key Terms

  • Pipeline: A chain of commands connected by pipes.
  • Aggregation: Summarizing data (counts, top N).
  • Filtering: Removing lines that do not match conditions.
  • Deterministic sorting: Sorting with explicit keys and locale.
  • Streaming: Processing data line-by-line without full buffering.

Mental Model Diagram (ASCII)

log -> extract -> filter -> sort -> count -> rank -> report

How It Works (Step-by-Step)

  1. Extract relevant field(s) with awk.
  2. Filter by status or date at extraction time.
  3. Sort the extracted values.
  4. Count duplicates with uniq -c.
  5. Sort counts descending.
  6. Take the top N results.

Invariants: sort before uniq -c; filter early for efficiency. Failure modes: unsorted input to uniq, locale-dependent ordering.

Minimal Concrete Example

awk '$9==404 {print $1}' access.log | sort | uniq -c | sort -nr | head -n 10

Common Misconceptions

  • “uniq -c counts regardless of order.” -> It only counts adjacent duplicates.
  • “Sorting after uniq is enough.” -> You must sort before uniq -c.
  • “Locale does not matter.” -> It changes sort order and can break tests.

Check-Your-Understanding Questions

  1. Why must sort come before uniq -c?
  2. How do you make output deterministic when counts tie?
  3. Why is filtering early more efficient?

Check-Your-Understanding Answers

  1. uniq only counts adjacent duplicates, so identical values must be grouped.
  2. Use a secondary sort key such as the value itself (sort -k1,1nr -k2,2).
  3. It reduces the amount of data that later stages must process.

Real-World Applications

  • Top talkers in firewall logs
  • Most common endpoints in API logs
  • Error frequency analysis

Where You’ll Apply It

  • P01 Website Status Checker, which reuses these pipeline skills when formatting output.

References

  • The Linux Command Line (Shotts), Ch. 6 and 20
  • Wicked Cool Shell Scripts (Taylor), Ch. 1-2

Key Insight

A pipeline is a dataflow graph; correctness depends on the order of stages.

Summary

Filtering, sorting, and aggregation are the backbone of CLI analytics.

Homework/Exercises to Practice the Concept

  1. Build a pipeline to list the top 5 URLs in a log file.
  2. Modify it to show only 404 errors.
  3. Add a stable secondary sort key so ties are predictable.

Solutions to the Homework/Exercises

  1. awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -n 5
  2. awk '$9==404 {print $7}' access.log | sort | uniq -c | sort -nr | head -n 5
  3. awk '{print $7}' access.log | sort | uniq -c | sort -k1,1nr -k2,2 | head -n 5

3. Project Specification

3.1 What You Will Build

You will build a CLI pipeline or short script named analyze_log.sh that reads an access log and outputs the top 10 client IPs and top 10 requested URLs. It will support optional filters by status code and date prefix, handle malformed lines gracefully, and print a deterministic summary. It will not implement a full log parser or support arbitrary log formats beyond CLF/Combined; it will document assumptions explicitly.

3.2 Functional Requirements

  1. Input: Accept a log file path.
  2. Parsing: Assume CLF/Combined format and document this assumption.
  3. Filtering: Optional --status filter (e.g., 404) and --date filter (e.g., 21/Dec/2025).
  4. Aggregation: Output top 10 IPs and top 10 URLs.
  5. Malformed lines: Skip invalid lines and report the count skipped.
  6. Determinism: Use LC_ALL=C and stable sorting.
  7. Exit codes: 0 on success, 1 on usage error, 2 if file not readable.

3.3 Non-Functional Requirements

  • Performance: Must handle multi-GB logs using streaming commands.
  • Reliability: No partial output if input file is missing.
  • Usability: Clear usage message and consistent output format.

3.4 Example Usage / Output

$ ./analyze_log.sh access.log --status 404
Top 10 IPs for status 404:
   52 203.0.113.9
   41 198.51.100.12
   30 192.0.2.55

Top 10 URLs for status 404:
   18 /robots.txt
   12 /favicon.ico
    9 /admin

Skipped malformed lines: 3

3.5 Data Formats / Schemas / Protocols

Input log format (CLF/Combined example):

203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234

Output format:

  • Two ranked lists (IPs and URLs)
  • Optional skipped line count
  • Deterministic numeric ordering

3.6 Edge Cases

  • Log file contains malformed or truncated lines.
  • Log file uses unexpected format (non-CLF).
  • Status code field missing or non-numeric.
  • URL field missing or -.
  • File is large and requires streaming.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

LC_ALL=C ./analyze_log.sh access.log --status 404 --date 21/Dec/2025

3.7.2 Golden Path Demo (Deterministic)

Set LC_ALL=C to ensure sort ordering and stable output for comparisons.

3.7.3 If CLI: Exact Terminal Transcript

$ LC_ALL=C ./analyze_log.sh access.log --status 404
Top 10 IPs for status 404:
   52 203.0.113.9
   41 198.51.100.12
   30 192.0.2.55

Top 10 URLs for status 404:
   18 /robots.txt
   12 /favicon.ico
    9 /admin

Skipped malformed lines: 3
Exit code: 0

3.7.4 Failure Demo (Bad Input)

$ ./analyze_log.sh missing.log
ERROR: cannot read log file: missing.log
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

log file -> filter -> extract -> sort -> uniq -c -> sort -nr -> top N

4.2 Key Components

  • Parser: validates line format (decision: use NF and numeric status checks)
  • Filter: applies status/date constraints (decision: implement in awk for speed)
  • Aggregator: counts top IPs and URLs (decision: sort before uniq)
  • Reporter: formats output (decision: deterministic ordering and counts)

4.3 Data Structures (No Full Code)

status_filter=""    # optional --status value (e.g., 404)
date_filter=""      # optional --date prefix (e.g., 21/Dec/2025)
invalid_count=0     # number of malformed lines skipped

4.4 Algorithm Overview

Key Algorithm: Top-N Aggregation

  1. Read log lines.
  2. Filter malformed lines.
  3. Extract field (IP or URL).
  4. Sort values.
  5. Count with uniq -c.
  6. Sort counts descending and take top N.

Complexity Analysis:

  • Time: O(n log n) due to sorting.
  • Space: up to O(n) on disk; sort keeps memory bounded by spilling to temporary files (external merge sort) on large inputs.

5. Implementation Guide

5.1 Development Environment Setup

# Confirm the required tools are on the PATH
which awk sort uniq

5.2 Project Structure

analyze-log/
├── analyze_log.sh
├── sample.log
└── tests/
    └── fixtures/

5.3 The Core Question You’re Answering

“How do I convert raw log noise into a ranked summary that reveals real behavior?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Log format assumptions and field positions (Concept 1)
  2. Pipeline ordering and aggregation (Concept 2)

5.5 Questions to Guide Your Design

  1. Which log format are you assuming and how will you validate it?
  2. Should malformed lines be counted or ignored silently?
  3. How will you keep output deterministic for comparisons?
  4. Will you allow filtering by method or only by status/date?

5.6 Thinking Exercise

Given the log line below, identify the IP, URL, and status code:

203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234

Explain why your field extraction method works even if the user-agent field exists.

5.7 The Interview Questions They’ll Ask

  1. “Why must sort come before uniq -c?”
  2. “How do you handle malformed log lines?”
  3. “How would you speed this up for very large logs?”
  4. “How do you ensure deterministic output?”

5.8 Hints in Layers

Hint 1: Extract IPs

awk 'NF >= 9 {print $1}' access.log

Hint 2: Filter by status

awk 'NF >= 9 && $9 == 404 {print $1}' access.log

Hint 3: Top N

awk 'NF >= 9 {print $1}' access.log | sort | uniq -c | sort -nr | head -n 10

5.9 Books That Will Help

  • Redirection and pipes: The Linux Command Line, Ch. 6
  • Regex and text parsing: The Linux Command Line, Ch. 19-20
  • Practical pipelines: Wicked Cool Shell Scripts, Ch. 1-2

5.10 Implementation Phases

Phase 1: Foundation (2 hours)

Goals:

  • Parse file and extract IPs
  • Produce a top 10 list

Tasks:

  1. Implement basic pipeline.
  2. Verify output on small sample.

Checkpoint: Top 10 IPs printed correctly.

Phase 2: Core Functionality (3 hours)

Goals:

  • Add URL list and status filter
  • Add malformed line handling

Tasks:

  1. Implement filter flags.
  2. Add NF checks and skip counters.

Checkpoint: Filters work and skipped line count is correct.

Phase 3: Polish & Edge Cases (2 hours)

Goals:

  • Deterministic ordering
  • Clear usage output

Tasks:

  1. Set LC_ALL=C in script.
  2. Add usage/help and exit codes.

Checkpoint: Output format is stable and script exits correctly on bad input.
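
A minimal sketch of the Phase 3 plumbing, assuming the flag names and exit codes from §3.2; the aggregation pipeline itself is deliberately left as a placeholder:

#!/usr/bin/env bash
# analyze_log.sh: locale pinning, usage message, and exit codes only
set -u -o pipefail
export LC_ALL=C

usage() { echo "Usage: $0 LOGFILE [--status CODE] [--date DD/Mon/YYYY]" >&2; }

[ $# -ge 1 ] || { usage; exit 1; }
logfile=$1; shift
[ -r "$logfile" ] || { echo "ERROR: cannot read log file: $logfile" >&2; exit 2; }

status_filter=""
date_filter=""
while [ $# -gt 0 ]; do
    case $1 in
        --status) [ $# -ge 2 ] || { usage; exit 1; }; status_filter=$2; shift 2 ;;
        --date)   [ $# -ge 2 ] || { usage; exit 1; }; date_filter=$2;   shift 2 ;;
        *)        usage; exit 1 ;;
    esac
done

# ... build the awk | sort | uniq -c | sort -nr | head -n 10 reports here ...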

5.11 Key Implementation Decisions

  • Parsing strategy: field numbers vs. quote-split -> field numbers plus validation (simpler and adequate for CLF)
  • Filtering: grep vs. awk -> awk (one process, combined filter/extract)
  • Output: free-form vs. stable columns -> stable columns (easier to compare across runs)

6. Testing Strategy

6.1 Test Categories

  • Unit tests: validate parsing logic (IP/status extraction)
  • Integration tests: run the pipeline on a sample log (top 10 IPs/URLs)
  • Edge case tests: malformed input (truncated lines, missing fields)

6.2 Critical Test Cases

  1. Valid CLF line: Extract IP, status, URL correctly.
  2. Malformed line: Skipped and counted.
  3. Status filter: Only counts matching status.
  4. Deterministic ordering: Ties break consistently.

6.3 Test Data

203.0.113.9 - - [21/Dec/2025:15:00:00 +0000] "GET /index.html HTTP/1.1" 404 1234
198.51.100.12 - - [21/Dec/2025:15:01:00 +0000] "GET /robots.txt HTTP/1.1" 404 234
broken line here
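
A hedged sketch of an integration check built on this fixture (the fixture path and the exact report wording are assumptions; adjust them to match your script):

#!/usr/bin/env bash
# tests/run_fixture_test.sh: the fixture above has 2 valid lines and 1 malformed line
set -u
fixture=tests/fixtures/sample.log

out=$(LC_ALL=C ./analyze_log.sh "$fixture" --status 404) || { echo "FAIL: non-zero exit"; exit 1; }

if echo "$out" | grep -q "Skipped malformed lines: 1"; then
    echo "PASS: malformed line skipped and counted"
else
    echo "FAIL: expected 'Skipped malformed lines: 1' in the report"
    exit 1
fi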

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

  • No sort before uniq -> wrong counts. Fix: add sort before uniq -c.
  • Wrong field numbers -> empty output. Fix: inspect with awk '{print NF, $0}'.
  • Locale differences -> output order changes. Fix: set LC_ALL=C.

7.2 Debugging Strategies

  • Print field positions with awk '{for (i=1;i<=NF;i++) print i,$i}'.
  • Test each stage of the pipeline separately.

7.3 Performance Traps

  • sort writes temporary files (to $TMPDIR or /tmp by default); on large logs it can fail if that filesystem is too small. Point it at a roomier scratch directory with sort -T.
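
For example (the scratch directory is illustrative and should have free space):

awk '{print $1}' big_access.log | sort -T /data/tmp | uniq -c | sort -T /data/tmp -nr | head -n 10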

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --top N flag.
  • Add --method GET filter.

8.2 Intermediate Extensions

  • Output CSV for IP counts and URL counts.
  • Add a separate report for user agents.

8.3 Advanced Extensions

  • Support JSON logs by adding a jq-based path.
  • Add a histogram of response sizes.

9. Real-World Connections

9.1 Industry Applications

  • Security: identify brute-force or scanning IPs.
  • Performance: spot hot endpoints.

9.2 Existing Tools

  • GoAccess: real-time web log analyzer.
  • AWStats: classic log analysis tool.

9.3 Interview Relevance

  • Data pipelines: streaming vs batch processing.
  • Text processing: regex and field extraction.

10. Resources

10.1 Essential Reading

  • The Linux Command Line by William E. Shotts - Ch. 6, 19, 20
  • Wicked Cool Shell Scripts by Dave Taylor - Ch. 1-2

10.2 Video Resources

  • “Intro to Log Analysis” (any systems or security course)

10.3 Tools & Documentation

  • awk: man awk for field splitting and regex
  • sort/uniq: man sort, man uniq

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the difference between CLF and Combined logs.
  • I can justify why sort must precede uniq -c.
  • I can describe how to detect malformed lines.

11.2 Implementation

  • All functional requirements are met.
  • Output is deterministic and reproducible.
  • Script exits with correct codes on errors.

11.3 Growth

  • I can analyze a log file without guessing field positions.
  • I can explain this pipeline in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Script outputs top 10 IPs and URLs.
  • Optional status filter works.
  • Malformed lines are skipped and counted.

Full Completion:

  • All minimum criteria plus:
  • Deterministic ordering with LC_ALL=C and tie-breaking.
  • Clear usage/help and exit codes.

Excellence (Going Above & Beyond):

  • Supports multiple log formats with a --format flag.
  • Produces CSV and JSON summaries for integration.