Project 7: Stats Engine

Build a deterministic repo analytics tool that counts files, lines of code, and recent activity.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 1 week
Main Programming Language Bash
Alternative Programming Languages Python
Coolness Level Level 4 - “codebase analyst”
Business Potential Medium (developer analytics)
Prerequisites find, wc, sort, awk
Key Topics aggregation, line counting, determinism, timestamps

1. Learning Objectives

By completing this project, you will:

  1. Count files and lines by language in a repo.
  2. Build aggregation pipelines with wc and awk.
  3. Generate deterministic reports with stable ordering.
  4. Identify top recently modified files by timestamp.
  5. Handle large repos without scanning vendor directories.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Aggregation and Counting Pipelines (wc, awk, and accuracy)

Fundamentals

Counting files and lines is not as simple as wc -l. When you run find -exec wc -l {} +, the output includes per-file counts and a total line. You must parse that total correctly. Aggregation requires a stable pipeline that does not double count or include non-source files. The key is to define a clear file selection strategy and then apply counting consistently across file types.

Counting is also a definition. Are you counting physical lines, logical lines, or non-empty lines? Different stakeholders care about different definitions, and a stats tool must choose one and document it. For this project, we use physical line counts because they are reproducible and easy to compute, but a good report should make that explicit.

Deep Dive into the concept

Counting is an aggregation problem. A repo stats tool must answer questions like “How many Python files are there?” and “How many lines of JavaScript code?” The naive approach is find . -name '*.py' -type f | wc -l for file counts and find . -name '*.py' -type f -exec wc -l {} + | tail -1 for line counts. But this has pitfalls. wc -l output includes filenames; wc prints a total line only when given more than one file; and if the file list exceeds the argument-length limit, find invokes wc several times, producing several total lines of which tail -1 keeps only the last. You must parse totals correctly, or you will report incorrect numbers.

Another pitfall is comments and blank lines. This project does not require logical line counts, but you should document what your counts represent (physical lines). If you attempt to exclude blank lines, you need a filter like grep -v '^\s*$', but then your pipeline becomes more complex. For this project, we keep it simple but explicit.

File selection is critical. If you include node_modules, your counts will be meaningless. Use find with prune to exclude vendor and build directories. The aggregation should be performed on the same file list to keep counts consistent. It is easy to accidentally run two different find commands with slightly different predicates and end up with mismatched counts. A robust approach is to generate a file list once, then reuse it for all counts.
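
The build-the-list-once approach can be sketched as follows. The fixture layout and the exclusion list (node_modules, .git) are illustrative assumptions, not requirements:

```shell
# Build the file list once, then derive every metric from it.
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Tiny fixture: two source files plus a vendor directory that must be pruned.
mkdir -p "$tmp/src" "$tmp/node_modules"
printf 'a\nb\n' > "$tmp/src/app.py"
printf 'x\n'    > "$tmp/src/util.py"
printf 'junk\n' > "$tmp/node_modules/dep.py"

# One traversal, one prune rule; everything else reads the saved list.
find "$tmp" \( -name node_modules -o -name .git \) -prune -o \
     -type f -name '*.py' -print > "$tmp/files.txt"

file_count=$(wc -l < "$tmp/files.txt")
# xargs splits on whitespace; use -print0 / xargs -0 if paths may contain spaces.
line_count=$(xargs cat < "$tmp/files.txt" | wc -l)
echo "files=$file_count lines=$line_count"
```

Because both metrics read the same files.txt, they cannot drift apart the way two separately written find commands can.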

awk is useful for aggregation. For example, if you want total lines and counts per file type in a single pass, you can extract extensions and accumulate counts in an associative array. This is more advanced but produces a cleaner report and reduces redundant traversal.
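
A single-pass sketch of that idea, feeding batched wc output into an awk associative array (the fixture files here are hypothetical):

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf 'a\nb\n'    > "$tmp/app.py"
printf 'c\n'       > "$tmp/lib.py"
printf 'x\ny\nz\n' > "$tmp/app.js"

# wc emits "<count> <path>" per file; awk buckets both metrics by extension.
result=$(find "$tmp" -type f -exec wc -l {} + | awk '
  $2 == "total" { next }                 # skip the wc grand-total line(s)
  { n = split($2, p, "."); ext = p[n]    # extension = text after the last dot
    files[ext]++; lines[ext] += $1 }
  END { for (e in files) print e, files[e], lines[e] }' | LC_ALL=C sort)
echo "$result"
```

One traversal and one counting pass replace a separate find-plus-wc pipeline per extension.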

Finally, performance matters. Running wc -l on each file individually is slower than batching with -exec ... +. The batching approach reduces process overhead. For large repos, you may want to limit scanning to tracked files (via git ls-files), but that is optional here.

Another subtlety is encoding. If your repo includes files with unusual encodings, wc still counts newlines, but other tools in your pipeline might choke or misinterpret characters. For a stats tool, you can usually ignore encoding because you are counting newlines, not parsing content. Still, you should be aware that some files may not be valid UTF-8, and your report should not assume they are.

It is also worth considering counting stability across refactors. For example, if you want to compare counts over time, you must ensure you are including the same file sets. Changing the extension list or prune rules will create artificial deltas. A robust tool records its selection rules in the report header so that future comparisons are meaningful.

A practical pitfall is line-ending differences. Files with Windows-style CRLF still count as one line per newline, but if you later add logic to strip blank lines or comments, CR characters can interfere. If you extend the tool, normalize line endings or ensure your filters handle \r. Another common pitfall is generated code: if your repo includes generated files (protobufs, build artifacts), counts can jump dramatically. Decide whether generated files are in scope and document that choice clearly.

How this fits into the project

This project uses aggregation to produce per-language file and line counts plus overall totals.

Definitions & key terms

  • aggregation: combining many values into a summary (sum, count).
  • physical lines: literal newline-delimited lines in a file.
  • batching: processing many files per command invocation.
  • file selection: the set of files included in stats.

Mental model diagram (ASCII)

file list -> wc -l -> parse totals -> report

How it works (step-by-step)

  1. Build a safe file list per extension.
  2. Use wc -l to count lines.
  3. Parse totals from wc output.
  4. Aggregate results and render report.
  5. Invariant: counts are computed from the same file list used for all metrics.
  6. Failure modes: parsing the wrong wc line, or mismatched file selection across metrics.

Minimal concrete example

find . -name '*.py' -type f -exec wc -l {} + | tail -1
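
Note that tail -1 assumes wc ran exactly once over multiple files: with a single file wc prints no total line at all, and on very large file lists find may invoke wc several times, producing several total lines. A more robust sketch sums the per-file counts and skips the totals:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf '1\n2\n' > "$tmp/a.py"
printf '3\n'    > "$tmp/b.py"

# Summing per-file counts (and skipping every "total" line) stays correct
# even if find splits the file list across several wc invocations.
total=$(find "$tmp" -name '*.py' -type f -exec wc -l {} + |
        awk '$2 != "total" { s += $1 } END { print s }')
echo "$total"
```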

Common misconceptions

  • “wc -l already gives totals” -> True, but you must parse correctly.
  • “All lines are equal” -> Not always; document what you count.
  • “Multiple find commands are fine” -> Risky if predicates differ.

Check-your-understanding questions

  1. Why does wc -l output include filenames?
  2. Why might two counts differ if you run two separate find commands?
  3. What is the difference between physical and logical lines?
  4. Why is batching faster than per-file execution?

Check-your-understanding answers

  1. When given filename arguments, wc labels each count with its filename so per-file results can be distinguished.
  2. Small predicate differences can include different files.
  3. Physical lines are literal; logical lines exclude blanks/comments.
  4. Fewer process invocations reduce overhead.

Real-world applications

  • Repo statistics for engineering dashboards.
  • Estimating effort for migrations.
  • Monitoring codebase growth over time.

Where you will apply it

You will apply these counting pipelines in the Counter component (Section 4.2) and in Phase 2 of the implementation.

References

  • The Linux Command Line (Shotts), Chapter 20
  • man wc
  • man awk

Key insights

Aggregation is reliable only when the file list is consistent and well-defined.

Summary

Counts are easy to compute but easy to get wrong. Define the file list and parse totals carefully.

Homework/Exercises to practice the concept

  1. Count lines in .py files and verify totals manually for a small fixture.
  2. Write an awk script that totals counts per extension.
  3. Compare per-file wc vs batched wc performance.

Solutions to the homework/exercises

  1. Use a small fixture repo and sum counts by hand.
  2. awk -F. '{ext=$NF; counts[ext]+=1} END {for (e in counts) print e, counts[e]}' (note: a path without a dot reports the whole path as its "extension").
  3. Use time to compare two methods.

2.2 Deterministic Ordering and Timestamp Analysis

Fundamentals

A stats report must be reproducible. Sorting with a fixed locale (LC_ALL=C) ensures consistent ordering. When listing recent files, you must format timestamps consistently and sort by time. find -printf '%T+ %p' produces sortable ISO-like timestamps. Without deterministic ordering, stats reports cannot be diffed reliably.

Determinism is also about secondary ordering rules. If two file types have the same count, the order should still be stable. That means adding a secondary sort key (like the extension name) when you sort by count. This makes diffs clean and prevents spurious ordering changes that look like real data changes.

Deep Dive into the concept

Determinism in reporting has multiple dimensions. First, ordering: the same data should produce the same output order. Because filesystem traversal order is not guaranteed, you must sort explicitly. LC_ALL=C sets a bytewise collation order so the sort is stable across locales. Without this, systems with different locale settings can produce different output, which is disastrous for CI or auditing.

Second, timestamp formatting. find -printf '%T+' produces a timestamp with high precision, which sorts lexicographically. This is ideal for reporting “most recently modified” files. But you must decide whether to display the full timestamp or truncate to seconds for readability. The report should document what precision is used.

Third, a deterministic report should include a scan time or run ID. This makes it clear when the data was collected. If you do not include scan time, you can confuse readers when stats appear to change unexpectedly due to file modifications.

Fourth, when counting files by extension, you should order the output alphabetically or by count. Sorting by count is useful for ranking, but the ties should be broken by name for stability. This is a subtle but important detail if you want reproducible output.

Finally, consider that timestamps can change during the scan. A file modified while the scan is running may appear in an unexpected position. This is acceptable but should be documented. For full determinism, you would need snapshotting, which is out of scope. The practical solution is to accept small race conditions but make the output as stable as possible through sorting.

Another important point is to normalize time zones. If you print timestamps in local time, reports created on machines in different time zones will differ even if the files are identical. To avoid this, choose UTC (for example by running find with TZ=UTC) or include the time zone offset in the timestamp: find -printf '%TY-%Tm-%TdT%TH:%TM:%TS%Tz' can include the offset. Note that the seconds and offset directives are %TS and %Tz; a bare %S or %z means something else to find. This makes your report more portable and easier to compare across environments.
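
A small sketch of time-zone normalization, assuming GNU find and GNU touch (the fixture file and its mtime are made up):

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
touch -d '2026-01-01 12:00:00 UTC' "$tmp/a.py"   # GNU touch: fixed UTC mtime

# TZ=UTC makes the %T directives render in UTC on every machine.
# GNU find's %TS prints fractional seconds, so truncate to whole seconds.
stamp=$(TZ=UTC find "$tmp" -name a.py \
        -printf '%TY-%Tm-%TdT%TH:%TM:%TS\n' | cut -c1-19)
echo "$stamp"
```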

When sorting recent files, you should also decide whether to include symlinks and directories. A stats engine focused on code metrics usually excludes directories and may choose to include symlinks as paths rather than following them. This decision affects the output and should be documented to avoid confusion about why some files appear or do not appear.

Deterministic ordering also benefits from explicit tie-breaking in every summary section. If you sort by count, add a secondary key for file extension. If you sort by time, add a secondary key for path to make ordering stable when timestamps are equal (common when files are generated in batches). These tiny details make diffs clean and prevent churn in CI reports that would otherwise show spurious changes.
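
The tie-breaking rule can be demonstrated directly with sort; the per-extension counts here are hypothetical, with a deliberate tie between py and js:

```shell
# Primary key: count (field 2), numeric, descending.
# Secondary key: extension (field 1), bytewise ascending, to break ties.
out=$(printf 'py 12\nsh 7\njs 12\n' |
      LC_ALL=C sort -k2,2nr -k1,1)
echo "$out"
```

Without the -k1,1 tie-break, the relative order of py and js would depend on input order rather than on the data.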

You can also include a lightweight hash of the report inputs, such as a hash of the file list or a count of files scanned, in the header. This doesn’t make the report deterministic by itself, but it provides a checksum for sanity. If two reports have the same hash but different ordering, you know the difference is a formatting issue rather than a data change. This small addition can save a lot of debugging time when reports are generated on different machines.

How this fits into the project

This project uses sorted outputs for counts and for the “top modified files” section, making reports stable and diffable.

Definitions & key terms

  • determinism: stable output given the same input.
  • collation: the ordering rules used by sort.
  • timestamp precision: the granularity of time data.
  • race condition: file changes during scan affecting results.

Mental model diagram (ASCII)

raw stats -> sort (LC_ALL=C) -> deterministic report

How it works (step-by-step)

  1. Generate raw counts and timestamp lines.
  2. Apply LC_ALL=C sort.
  3. Format and render report with a scan header.
  4. Invariant: the same input always produces the same ordering.
  5. Failure modes: locale drift, missing secondary sort keys, or time zone mismatch.

Minimal concrete example

find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5

Common misconceptions

  • “Filesystem order is stable” -> False; never rely on it.
  • “Sort is the same everywhere” -> False; locale affects collation.
  • “Timestamps are immutable” -> False; files can change mid-scan.

Check-your-understanding questions

  1. Why do we set LC_ALL=C?
  2. How does find -printf '%T+' help with sorting by time?
  3. Why should reports include scan time?
  4. How do you make ordering deterministic when counts tie?

Check-your-understanding answers

  1. It ensures predictable bytewise collation.
  2. The format is lexicographically sortable.
  3. It documents when data was collected.
  4. Add a secondary sort key like filename.

Real-world applications

  • Repo metrics dashboards.
  • Codebase change monitoring.
  • Release readiness reporting.

Where you will apply it

You will apply deterministic ordering in the Reporter and Recent components (Section 4.2) and in Phase 3 of the implementation.

References

  • man sort
  • The Linux Command Line (Shotts), Chapter 20

Key insights

Determinism is what makes a report trustworthy over time.

Summary

Sorting with fixed locale and consistent timestamp formatting makes stats reports reproducible and auditable.

Homework/Exercises to practice the concept

  1. Run the stats tool twice and compare outputs.
  2. Change locale and observe differences without LC_ALL=C.
  3. Build a sorted list of recent files with a fixed format.

Solutions to the homework/exercises

  1. diff report1.txt report2.txt should be stable.
  2. LANG=de_DE.UTF-8 sort vs LC_ALL=C sort.
  3. find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5.

3. Project Specification

3.1 What You Will Build

A CLI tool that reports counts of files and lines by language and lists the most recently modified files, with deterministic ordering and clear headers.

3.2 Functional Requirements

  1. File counts: count files by extension.
  2. Line counts: count total lines per extension.
  3. Recent files: list top N modified files.
  4. Deterministic output: sorted and stable.
  5. Exclusions: prune vendor directories.

3.3 Non-Functional Requirements

  • Performance: handle large repos efficiently.
  • Reliability: a report is still produced when minor, non-fatal errors occur (exit code 1).
  • Usability: clear report format.

3.4 Example Usage / Output

$ ./stats_engine.sh ~/Projects/app

3.5 Data Formats / Schemas / Protocols

Report format:

SCAN_TIME=2026-01-01T12:00:00
Python files: 84 (22410 lines)
JavaScript files: 41 (9380 lines)
Top 5 recent files:
  2025-12-31T22:10:05 src/api/auth.py

3.6 Edge Cases

  • Repos with no files of a given type.
  • Large files affecting performance.
  • Files modified during scan.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./stats_engine.sh ./fixtures/repo

3.7.2 Golden Path Demo (Deterministic)

Use a fixture repo and fixed scan time 2026-01-01T12:00:00.

3.7.3 If CLI: exact terminal transcript

$ ./stats_engine.sh ./fixtures/repo
[2026-01-01T12:00:00] TARGET=./fixtures/repo
[2026-01-01T12:00:00] REPORT=stats_2026-01-01.txt
[2026-01-01T12:00:00] DONE

$ cat stats_2026-01-01.txt
Python files: 3 (120 lines)
JavaScript files: 2 (80 lines)
Top 3 recent files:
  2025-12-31T22:10:05 src/api/auth.py
  2025-12-31T20:01:00 src/db/schema.sql

Failure demo (no target):

$ ./stats_engine.sh /no/such/path
[2026-01-01T12:00:00] ERROR: target not found
EXIT_CODE=2

Exit codes:

  • 0: success
  • 1: partial success (minor errors)
  • 2: invalid arguments or missing path

4. Solution Architecture

4.1 High-Level Design

find + prune -> file list -> counts -> sort -> report

4.2 Key Components

Component Responsibility Key Decisions
Selector build file lists prune vendor dirs
Counter count lines/files batched wc
Reporter render stats stable ordering
Recent list top modified sort by timestamp

4.3 Data Structures (No Full Code)

counts: map[ext]{files, lines}

4.4 Algorithm Overview

Key Algorithm: Repo Stats

  1. Build file lists by extension.
  2. Count files and lines per extension.
  3. Generate recent file list.
  4. Sort and render report.

Complexity Analysis:

  • Time: O(n log n) due to sorting
  • Space: O(n) for file lists
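
Three of the four steps can be sketched as one function (the recent-files step follows the same pruned-traversal pattern). The extension list, prune rules, and fixture are illustrative; the real stats_engine.sh would take the target from "$1" and map extensions to language names:

```shell
set -eu

stats_engine() {
  target="$1"
  [ -d "$target" ] || { echo "ERROR: target not found" >&2; return 2; }

  list="$(mktemp)"
  # Step 1: one pruned traversal; every metric reads the same list.
  find "$target" \( -name node_modules -o -name .git \) -prune -o \
       -type f \( -name '*.py' -o -name '*.js' \) -print > "$list"

  # Steps 2 and 4: batched wc, one awk pass, deterministic ordering.
  xargs wc -l < "$list" | awk '
    $2 == "total" { next }
    { n = split($2, p, "."); files[p[n]]++; lines[p[n]] += $1 }
    END { for (e in files)
            printf "%s files: %d (%d lines)\n", e, files[e], lines[e] }' |
    LC_ALL=C sort
  rm -f "$list"
}

# Demo on a throwaway fixture.
tmp="$(mktemp -d)"
printf 'a\nb\n' > "$tmp/app.py"
printf 'x\n'    > "$tmp/app.js"
mkdir "$tmp/node_modules"
printf 'junk\n' > "$tmp/node_modules/dep.js"
report=$(stats_engine "$tmp")
echo "$report"
rm -rf "$tmp"
```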

5. Implementation Guide

5.1 Development Environment Setup

# Requires find, wc, sort, awk

5.2 Project Structure

project-root/
├── stats_engine.sh
├── fixtures/
│   └── repo/
└── README.md

5.3 The Core Question You’re Answering

“How do I build a reliable, reproducible codebase report using only Unix tools?”

5.4 Concepts You Must Understand First

  1. Aggregation and accurate counting
  2. Deterministic sorting and timestamps

5.5 Questions to Guide Your Design

  1. Which file types should be included?
  2. How will you handle vendor directories?
  3. How will you present the top modified files?

5.6 Thinking Exercise

Design a pipeline that counts Python files, sums their lines, and lists the five most recently modified files.

5.7 The Interview Questions They’ll Ask

  1. “Why is uniq wrong without sort?”
  2. “How do you count lines across many files safely?”
  3. “How do you avoid scanning vendor directories?”

5.8 Hints in Layers

Hint 1: Count files

find . -name '*.py' -type f | wc -l

Hint 2: Count lines

find . -name '*.py' -type f -exec wc -l {} + | tail -1

Hint 3: Recent files

find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5
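
Hint 4: Exclude vendor directories

The hints above scan everything; requirement 5 (exclusions) needs -prune. A runnable sketch, assuming node_modules and .git as the exclusion list:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/src" "$tmp/node_modules"
touch "$tmp/src/app.py" "$tmp/node_modules/dep.py"

# -prune stops descent into matched directories; -print fires only on the
# other branch of the -o, so dep.py never appears in the output.
found=$(find "$tmp" \( -name node_modules -o -name .git \) -prune -o \
             -type f -name '*.py' -print)
echo "$found"
```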

5.9 Books That Will Help

Topic Book Chapter
Text processing The Linux Command Line (Shotts) Ch. 20
Find The Linux Command Line (Shotts) Ch. 17
Pipelines Effective Shell (Kerr) Ch. 6

5.10 Implementation Phases

Phase 1: Foundation (1-2 days)

Goals:

  • Build file lists by extension
  • Exclude vendor directories

Tasks:

  1. Implement prune logic.
  2. Define extension list.

Checkpoint: file lists are correct.

Phase 2: Core Functionality (2-3 days)

Goals:

  • Count files and lines

Tasks:

  1. Implement batched wc.
  2. Parse totals and format counts.

Checkpoint: counts match manual checks.

Phase 3: Polish & Edge Cases (1-2 days)

Goals:

  • Deterministic output
  • Recent files list

Tasks:

  1. Add LC_ALL=C sort.
  2. Render report header with scan time.

Checkpoint: report is stable across runs.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
File list source find vs git ls-files find + prune works for any repo
Line counting per-file vs batched batched faster
Output ordering by name vs by count by count with tie-break useful and stable

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests parsing totals wc output parsing
Integration Tests full report fixture repo
Edge Case Tests empty repo no files case

6.2 Critical Test Cases

  1. Empty repo: report still produced.
  2. Vendor dir: excluded from counts.
  3. Large file: counts still correct.
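
The determinism property from Section 2.2 is also easy to check automatically: run twice, compare byte for byte. A sketch, with a stub report function standing in for the real ./stats_engine.sh:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
touch "$tmp/b.py" "$tmp/a.py"

report() {  # stand-in for ./stats_engine.sh "$1"; swap in the real script
  find "$1" -type f -name '*.py' | LC_ALL=C sort
}

run1=$(report "$tmp")
run2=$(report "$tmp")
[ "$run1" = "$run2" ] && echo "deterministic"
```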

6.3 Test Data

fixtures/repo/src/app.py
fixtures/repo/src/app.js

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Wrong totals line counts off parse wc total correctly
Unstable output diff noise use LC_ALL=C sort
Vendor files included huge counts prune directories

7.2 Debugging Strategies

  • Print intermediate file lists.
  • Validate counts on a small fixture.
  • Use set -x to trace steps.

7.3 Performance Traps

Scanning large repos without pruning can take minutes. Always exclude vendor and build directories.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a --include flag for custom extensions.
  • Output JSON in addition to text.

8.2 Intermediate Extensions

  • Count blank lines and comment lines separately.
  • Add per-directory statistics.

8.3 Advanced Extensions

  • Integrate with git to track churn and blame.
  • Export metrics to Prometheus.

9. Real-World Connections

9.1 Industry Applications

  • Engineering metrics dashboards.
  • Audit of codebase growth.
  • Pre-migration assessments.

9.2 Related Tools

  • cloc: code line counting tool.
  • tokei: code statistics by language.

9.3 Interview Relevance

  • Aggregation and counting accuracy.
  • Deterministic reporting.
  • Safe traversal and pruning.

10. Resources

10.1 Essential Reading

  • The Linux Command Line (Shotts), Chapter 20
  • Effective Shell (Kerr), Chapter 6

10.2 Video Resources

  • “Counting Lines of Code” (YouTube)
  • “Deterministic Reports” (talk)

10.3 Tools & Documentation

  • man wc
  • man sort

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain batched wc output.
  • I can explain why sort order matters.
  • I can explain mtime semantics.

11.2 Implementation

  • Report includes counts and recent files.
  • Vendor directories are excluded.
  • Output is deterministic.

11.3 Growth

  • I can propose a metric improvement.
  • I documented one performance risk.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • File and line counts reported.
  • Recent files list included.

Full Completion:

  • Deterministic output with scan header.
  • Exclusion list implemented.

Excellence (Going Above & Beyond):

  • Per-directory stats and churn metrics.
  • Export to JSON or metrics system.