Web Server Log Processor
Goal: Build a robust log ingestion pipeline that parses raw web server logs into a structured SQLite database. You will master schema design for semi-structured data, reliable parsing strategies, and data-quality checks that prevent corrupted analytics. By the end, you will be able to demonstrate that each parsed field is correct, complete, and traceable back to the original log line.
Context and Problem
- Real-world scenario: A website team wants to analyze traffic, errors, and user agents from raw Nginx logs.
- Stakeholders and constraints: SREs need error rates, Marketing needs top pages, Product needs unique visitors. Constraints include messy log lines and performance requirements.
- What happens if this system fails? Incident analysis is wrong and SLA reporting is inaccurate.
Real World Outcome
- You run a parser that loads a SQLite database and prints a summary.
- Example CLI transcript:
$ python process_logs.py data/access.log --db data/logs.db
Parsed lines=120000 ok=119842 bad=158
Inserted rows=119842
Top statuses: 200=108402 404=6421 500=319
- Example table row:
logs
id=89421
ip=192.168.10.44
timestamp=2024-05-12T14:22:01Z
method=GET
path=/products/123
status=200
bytes=5123
user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5)
referrer=https://example.com/
raw_line_hash=6f3f...
- Example query output:
SELECT status, COUNT(*) FROM logs GROUP BY status ORDER BY COUNT(*) DESC;
-- 200|108402
-- 404|6421
-- 500|319
Core Concepts
- Parsing semi-structured text: Extracting fields reliably with a regex or a tokenizer (see the parsing sketch after this list).
- Schema design: Choosing types and columns for log analytics.
- Batch inserts: Writing to SQLite efficiently without locking.
- Traceability: Linking parsed rows back to the exact raw line.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
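As a concrete starting point, here is a minimal parsing sketch in Python. It assumes the default Nginx "combined" log format; the pattern, the parse_line name, and the field names are illustrative and should be adapted to your server's actual log_format directive.

import re

# Assumes the default Nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent
# "$http_referer" "$http_user_agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str):
    """Return a dict of raw fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

A tokenizer that splits on quotes and brackets is a reasonable alternative when request lines contain characters that a single regex handles poorly.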
Architecture
+-----------------+ +-------------------+ +--------------------+
| access.log | --> | parser + validator| --> | SQLite logs.db |
+-----------------+ +-------------------+ +--------------------+
| |
| +--> rejected_lines.csv
|
+--> checksum + line count
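The checksum + line count step in the diagram can be a small helper such as the following sketch; SHA-256 over the raw file is an assumption, and any stable digest would serve the same traceability purpose.

import hashlib

def file_fingerprint(path: str):
    """Return (sha256 hex digest, line count) for the raw log file."""
    digest = hashlib.sha256()
    lines = 0
    with open(path, "rb") as fh:
        for raw_line in fh:          # iterating a binary file yields one line per loop
            digest.update(raw_line)
            lines += 1
    return digest.hexdigest(), lines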
Data Model
- Table: logs (see the CREATE TABLE sketch at the end of this section)
  - id INTEGER PRIMARY KEY
  - ip TEXT
  - timestamp TEXT (ISO 8601)
  - method TEXT
  - path TEXT
  - status INTEGER
  - bytes INTEGER
  - user_agent TEXT
  - referrer TEXT
  - raw_line_hash TEXT
- Example record:
(89421,'192.168.10.44','2024-05-12T14:22:01Z','GET','/products/123',200,5123,'Mozilla/5.0', 'https://example.com/','6f3f...')
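A minimal schema sketch in Python, assuming the columns listed above; the NOT NULL constraints and index choices are illustrative and only matter for the summary queries.

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS logs (
    id            INTEGER PRIMARY KEY,
    ip            TEXT NOT NULL,
    timestamp     TEXT NOT NULL,        -- ISO 8601, UTC
    method        TEXT,
    path          TEXT,
    status        INTEGER NOT NULL,
    bytes         INTEGER,
    user_agent    TEXT,
    referrer      TEXT,
    raw_line_hash TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_logs_status ON logs(status);
CREATE INDEX IF NOT EXISTS idx_logs_path   ON logs(path);
"""

def create_schema(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn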
Implementation Plan
- Define the log format and write a parser (regex or token-based).
- Parse each line into a dict; validate required fields.
- Normalize timestamps to UTC ISO format.
- Create the SQLite schema with appropriate types.
- Insert rows in batches (e.g., 1000 rows per transaction).
- Write bad lines to rejected_lines.csv with the reason and line number (see the ingest sketch after this list).
- Print a summary report with counts and top status codes.
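A sketch of the ingest loop, assuming the parse_line and create_schema helpers sketched above; the batch size, column order, and rejection reasons are illustrative.

import csv
import hashlib
from datetime import datetime, timezone

BATCH_SIZE = 1000  # rows per transaction; tune for your hardware

# Nginx time_local, e.g. "12/May/2024:14:22:01 +0000"
NGINX_TIME = "%d/%b/%Y:%H:%M:%S %z"

def to_utc_iso(time_local: str) -> str:
    """Normalize an Nginx time_local value to UTC ISO 8601, e.g. 2024-05-12T14:22:01Z."""
    dt = datetime.strptime(time_local, NGINX_TIME).astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def ingest(log_path: str, conn, rejected_path: str = "rejected_lines.csv"):
    ok = bad = 0
    batch = []
    insert_sql = (
        "INSERT INTO logs (ip, timestamp, method, path, status, bytes,"
        " user_agent, referrer, raw_line_hash) VALUES (?,?,?,?,?,?,?,?,?)"
    )
    with open(log_path, encoding="utf-8", errors="replace") as src, \
         open(rejected_path, "w", newline="") as rej:
        writer = csv.writer(rej)
        writer.writerow(["line_number", "reason", "raw_line"])
        for lineno, line in enumerate(src, start=1):
            fields = parse_line(line)
            if fields is None:
                writer.writerow([lineno, "no regex match", line.rstrip("\n")])
                bad += 1
                continue
            try:
                ts = to_utc_iso(fields["time"])
            except ValueError:
                writer.writerow([lineno, "bad timestamp", line.rstrip("\n")])
                bad += 1
                continue
            raw_hash = hashlib.sha256(line.encode("utf-8")).hexdigest()
            batch.append((fields["ip"], ts, fields["method"], fields["path"],
                          fields["status"], fields["bytes"],
                          fields["user_agent"], fields["referrer"], raw_hash))
            ok += 1
            if len(batch) >= BATCH_SIZE:
                with conn:  # one transaction per batch keeps the write lock short
                    conn.executemany(insert_sql, batch)
                batch.clear()
    if batch:  # flush the final partial batch
        with conn:
            conn.executemany(insert_sql, batch)
    return ok, bad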
Validation and Data Quality
- Accuracy: Randomly sample 20 rows and compare parsed values to original line.
- Completeness: SELECT COUNT(*) FROM logs WHERE ip IS NULL OR status IS NULL; should return 0.
- Consistency: Status is a 3-digit integer; bytes >= 0.
- Timeliness: Timestamps fall within log file range.
- Traceability: raw_line_hash is stored and matches the hash of the original line (see the check sketch after this list).
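These checks can be automated with a small helper like the following sketch, assuming the schema sketched earlier; the sample size and value ranges are illustrative.

import sqlite3

def run_quality_checks(conn: sqlite3.Connection, sample_size: int = 20) -> None:
    # Completeness: required fields must never be NULL.
    missing = conn.execute(
        "SELECT COUNT(*) FROM logs"
        " WHERE ip IS NULL OR status IS NULL OR timestamp IS NULL"
    ).fetchone()[0]
    assert missing == 0, f"{missing} rows are missing required fields"

    # Consistency: status is a 3-digit code and bytes is non-negative.
    bad_values = conn.execute(
        "SELECT COUNT(*) FROM logs WHERE status < 100 OR status > 599 OR bytes < 0"
    ).fetchone()[0]
    assert bad_values == 0, f"{bad_values} rows have out-of-range status or bytes"

    # Timeliness: the stored range should match the log file's own time range.
    first, last = conn.execute("SELECT MIN(timestamp), MAX(timestamp) FROM logs").fetchone()
    print(f"timestamp range: {first} .. {last}")

    # Accuracy: dump a random sample for manual comparison against the raw lines.
    for row in conn.execute(
        "SELECT id, ip, timestamp, status, path, raw_line_hash FROM logs"
        " ORDER BY RANDOM() LIMIT ?", (sample_size,)
    ):
        print(row)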
Failure Modes and Debugging
- Regex drift: Logs change format after server config update.
- Symptom: spike in rejected_lines.csv. Fix by updating parser and versioning formats.
- SQLite locked: Multiple processes write concurrently.
- Symptom: “database is locked” errors. Fix by using a single writer or enabling WAL mode (see the sketch after this list).
- Timestamp parsing failures: Timezone offset not handled.
- Symptom: NULL timestamps. Fix by using explicit timezone parsing and UTC conversion.
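For the locking failure mode, one common mitigation is WAL mode plus a generous busy timeout. A minimal sketch, assuming the sqlite3 standard library:

import sqlite3

conn = sqlite3.connect("data/logs.db", timeout=30)  # wait up to 30 s if the db is busy
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the single writer
conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL for bulk loads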
Definition of Done
- Parser loads a large log file with <1% rejected lines (unless the input itself is malformed).
- SQLite database contains expected schema and row count matches ok lines.
- Summary query results match expected counts.
- rejected_lines.csv includes line number and reason for every rejected entry.
- Sampling validation confirms parsed fields are accurate.