Web Server Log Processor

Goal: Build a robust log ingestion pipeline that parses raw web server logs into a structured SQLite database. You will master schema design for semi-structured data, reliable parsing strategies, and data-quality checks that prevent corrupted analytics. By the end, you will be able to show that each parsed field is correct, complete, and traceable back to the original log line.

Context and Problem

  • Real-world scenario: A website team wants to analyze traffic, errors, and user agents from raw Nginx logs.
  • Stakeholders and constraints: SREs need error rates, Marketing needs top pages, and Product needs unique visitors. Constraints include messy, inconsistently formatted log lines and the need to process large files quickly.
  • What happens if this system fails? Incident analysis draws the wrong conclusions and SLA reporting becomes inaccurate.

Real World Outcome

  • You run a parser that populates a SQLite database and prints a summary.
  • Example CLI transcript:
$ python process_logs.py data/access.log --db data/logs.db
Parsed lines=120000 ok=119842 bad=158
Inserted rows=119842
Top statuses: 200=108402 404=6421 500=319
  • Example table row:
logs
id=89421
ip=192.168.10.44
timestamp=2024-05-12T14:22:01Z
method=GET
path=/products/123
status=200
bytes=5123
user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5)
referrer=https://example.com/
raw_line_hash=6f3f...
  • Example query output:
SELECT status, COUNT(*) FROM logs GROUP BY status ORDER BY COUNT(*) DESC;
-- 200|108402
-- 404|6421
-- 500|319

Core Concepts

  • Parsing semi-structured text: Extracting fields reliably with a regex or a tokenizer (see the parser sketch after this list).
  • Schema design: Choosing types and columns for log analytics.
  • Batch inserts: Writing to SQLite efficiently without lock contention.
  • Traceability: Linking parsed rows back to the exact raw line.
  • Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
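
A minimal parser sketch, assuming the default Nginx "combined" log format; the regex, the field names, and the choice of SHA-256 for the line hash are illustrative assumptions, not the only correct design:

import hashlib
import re

# Assumes Nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str) -> dict | None:
    """Return a field dict for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    # Hash the exact raw line so every row stays traceable to its source.
    fields["raw_line_hash"] = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return fields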

Architecture

+-----------------+     +-------------------+     +--------------------+
| access.log      | --> | parser + validator| --> | SQLite logs.db     |
+-----------------+     +-------------------+     +--------------------+
        |                         |
        |                         +--> rejected_lines.csv
        |
        +--> checksum + line count
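
A small sketch for the checksum + line count branch above, so a run can later be reconciled against the exact input file; the function name is illustrative:

import hashlib

def file_fingerprint(path: str) -> tuple[str, int]:
    """Return (sha256 hex digest, line count) for the input log file."""
    digest = hashlib.sha256()
    lines = 0
    with open(path, "rb") as f:
        for chunk in f:        # binary iteration yields one line per b"\n"
            digest.update(chunk)
            lines += 1
    return digest.hexdigest(), lines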

Data Model

  • Table: logs (see the schema sketch after this list)
    • id INTEGER PRIMARY KEY
    • ip TEXT
    • timestamp TEXT (ISO 8601)
    • method TEXT
    • path TEXT
    • status INTEGER
    • bytes INTEGER
    • user_agent TEXT
    • referrer TEXT
    • raw_line_hash TEXT
  • Example record:
(89421,'192.168.10.44','2024-05-12T14:22:01Z','GET','/products/123',200,5123,'Mozilla/5.0', 'https://example.com/','6f3f...')
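
One way to create this schema from Python. NOT NULL constraints are deliberately omitted so the completeness checks later stay meaningful, and the index on status is an assumption added to speed up the summary query:

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS logs (
    id            INTEGER PRIMARY KEY,
    ip            TEXT,
    timestamp     TEXT,    -- ISO 8601, UTC
    method        TEXT,
    path          TEXT,
    status        INTEGER,
    bytes         INTEGER,
    user_agent    TEXT,
    referrer      TEXT,
    raw_line_hash TEXT
);
CREATE INDEX IF NOT EXISTS idx_logs_status ON logs(status);
"""

def init_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)    # runs both statements in one call
    return conn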

Implementation Plan

  1. Define the log format and write a parser (regex or token-based).
  2. Parse each line into a dict; validate required fields.
  3. Normalize timestamps to UTC in ISO 8601 format.
  4. Create the SQLite schema with appropriate types.
  5. Insert rows in batches (e.g., 1,000 rows per transaction); a sketch covering steps 3 and 5 follows this list.
  6. Write bad lines to rejected_lines.csv with reason and line number.
  7. Print a summary report with counts and top status codes.
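
A sketch of steps 3 and 5, assuming Nginx's default $time_local format and the 1,000-row batch size from step 5:

import sqlite3
from datetime import datetime, timezone

BATCH_SIZE = 1000
INSERT_SQL = (
    "INSERT INTO logs (ip, timestamp, method, path, status, bytes, "
    "user_agent, referrer, raw_line_hash) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)"
)

def to_utc_iso(time_local: str) -> str:
    """Normalize e.g. '12/May/2024:14:22:01 +0200' to '2024-05-12T12:22:01Z'."""
    dt = datetime.strptime(time_local, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def insert_batches(conn: sqlite3.Connection, rows) -> int:
    inserted, batch = 0, []
    for row in rows:              # row = tuple matching INSERT_SQL's columns
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            with conn:            # one transaction per batch
                conn.executemany(INSERT_SQL, batch)
            inserted += len(batch)
            batch.clear()
    if batch:                     # flush the final partial batch
        with conn:
            conn.executemany(INSERT_SQL, batch)
        inserted += len(batch)
    return inserted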

Validation and Data Quality

  • Accuracy: Randomly sample 20 rows and compare parsed values to the original log lines.
  • Completeness: SELECT COUNT(*) FROM logs WHERE ip IS NULL OR status IS NULL; should return 0.
  • Consistency: Status is a three-digit integer in the 100-599 range; bytes >= 0.
  • Timeliness: Timestamps fall within the log file's time range.
  • Traceability: raw_line_hash is stored and matches the hash of the original line. (The completeness and consistency checks are sketched after this list.)
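
The completeness and consistency checks as runnable queries; the 100-599 bound is an assumption about the valid HTTP status range:

import sqlite3

CHECKS = {
    "completeness": "SELECT COUNT(*) FROM logs WHERE ip IS NULL OR status IS NULL",
    "status_range": "SELECT COUNT(*) FROM logs WHERE status < 100 OR status > 599",
    "bytes_nonneg": "SELECT COUNT(*) FROM logs WHERE bytes < 0",
}

def run_quality_checks(conn: sqlite3.Connection) -> dict[str, int]:
    """Each query counts violations, so every check should return 0."""
    # For the accuracy check, sample rows with:
    #   SELECT * FROM logs ORDER BY RANDOM() LIMIT 20
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}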

Failure Modes and Debugging

  • Regex drift: Logs change format after a server config update.
    • Symptom: a spike in rejected_lines.csv. Fix by updating the parser and versioning the supported formats.
  • SQLite locked: Multiple processes write concurrently.
    • Symptom: “database is locked” errors. Fix by using a single writer or enabling WAL mode (sketched below).
  • Timestamp parsing failures: Timezone offset not handled.
    • Symptom: NULL timestamps. Fix by using explicit timezone parsing and UTC conversion.
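
If readers must query while the single writer is loading, WAL mode can be enabled per database; a minimal sketch (PRAGMA synchronous=NORMAL is a common pairing with WAL, not a requirement):

import sqlite3

conn = sqlite3.connect("data/logs.db", timeout=30)   # wait up to 30 s on a lock
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; safe enough with WAL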

Definition of Done

  • Parser loads a large log file with <1% rejected lines (unless the input itself is malformed).
  • SQLite database contains the expected schema, and its row count matches the number of successfully parsed (ok) lines.
  • Summary query results match expected counts.
  • rejected_lines.csv includes line number and reason for every rejected entry.
  • Sampling validation confirms parsed fields are accurate.