Project 7: Stats Engine

Build a deterministic repo analytics tool that counts files, lines of code, and recent activity.

Quick Reference

Attribute Value
Difficulty Expert
Time Estimate 1 week
Main Programming Language Bash
Alternative Programming Languages Python
Coolness Level Level 4 - “codebase analyst”
Business Potential Medium (developer analytics)
Prerequisites find, wc, sort, awk
Key Topics aggregation, line counting, determinism, timestamps

1. Learning Objectives

By completing this project, you will:

  1. Count files and lines by language in a repo.
  2. Build aggregation pipelines with wc and awk.
  3. Generate deterministic reports with stable ordering.
  4. Identify top recently modified files by timestamp.
  5. Handle large repos without scanning vendor directories.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Aggregation and Counting Pipelines (wc, awk, and accuracy)

Fundamentals

Counting files and lines is not as simple as wc -l. When you run find -exec wc -l {} +, the output includes per-file counts and a total line. You must parse that total correctly. Aggregation requires a stable pipeline that does not double count or include non-source files. The key is to define a clear file selection strategy and then apply counting consistently across file types.

Counting is also a definition. Are you counting physical lines, logical lines, or non-empty lines? Different stakeholders care about different definitions, and a stats tool must choose one and document it. For this project, we use physical line counts because they are reproducible and easy to compute, but a good report should make that explicit.

Deep Dive into the concept

Counting is an aggregation problem. A repo stats tool must answer questions like “How many Python files are there?” and “How many lines of JavaScript code?” The naive approach is find . -name '*.py' -type f | wc -l for file counts and find . -name '*.py' -type f -exec wc -l {} + | tail -1 for line counts. But this has pitfalls. wc -l output includes filenames; wc prints a total line only when given more than one file; and if the file list exceeds the argument-length limit, find invokes wc several times, producing several total lines of which tail -1 keeps only the last. You must parse totals correctly, or you will report incorrect numbers.

Another pitfall is comments and blank lines. This project does not require logical line counts, but you should document what your counts represent (physical lines). If you attempt to exclude blank lines, you need a filter like grep -v '^\s*$', but then your pipeline becomes more complex. For this project, we keep it simple but explicit.

File selection is critical. If you include node_modules, your counts will be meaningless. Use find with prune to exclude vendor and build directories. The aggregation should be performed on the same file list to keep counts consistent. It is easy to accidentally run two different find commands with slightly different predicates and end up with mismatched counts. A robust approach is to generate a file list once, then reuse it for all counts.
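
The build-the-list-once approach can be sketched as follows. The fixture layout and the exclusion list (node_modules, .git) are illustrative assumptions, not requirements:

```shell
# Build the file list once, then derive every metric from it.
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT

# Tiny fixture: two source files plus a vendor directory that must be pruned.
mkdir -p "$tmp/src" "$tmp/node_modules"
printf 'a\nb\n' > "$tmp/src/app.py"
printf 'x\n'    > "$tmp/src/util.py"
printf 'junk\n' > "$tmp/node_modules/dep.py"

# One traversal, one prune rule; everything else reads the saved list.
find "$tmp" \( -name node_modules -o -name .git \) -prune -o \
     -type f -name '*.py' -print > "$tmp/files.txt"

file_count=$(wc -l < "$tmp/files.txt")
# xargs splits on whitespace; use -print0 / xargs -0 if paths may contain spaces.
line_count=$(xargs cat < "$tmp/files.txt" | wc -l)
echo "files=$file_count lines=$line_count"
```

Because both metrics read the same files.txt, they cannot drift apart the way two separately written find commands can.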

awk is useful for aggregation. For example, if you want total lines and counts per file type in a single pass, you can extract extensions and accumulate counts in an associative array. This is more advanced but produces a cleaner report and reduces redundant traversal.
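
A single-pass sketch of that idea, feeding batched wc output into an awk associative array (the fixture files here are hypothetical):

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf 'a\nb\n'    > "$tmp/app.py"
printf 'c\n'       > "$tmp/lib.py"
printf 'x\ny\nz\n' > "$tmp/app.js"

# wc emits "<count> <path>" per file; awk buckets both metrics by extension.
result=$(find "$tmp" -type f -exec wc -l {} + | awk '
  $2 == "total" { next }                 # skip the wc grand-total line(s)
  { n = split($2, p, "."); ext = p[n]    # extension = text after the last dot
    files[ext]++; lines[ext] += $1 }
  END { for (e in files) print e, files[e], lines[e] }' | LC_ALL=C sort)
echo "$result"
```

One traversal and one counting pass replace a separate find-plus-wc pipeline per extension.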

Finally, performance matters. Running wc -l on each file individually is slower than batching with -exec ... +. The batching approach reduces process overhead. For large repos, you may want to limit scanning to tracked files (via git ls-files), but that is optional here.

Another subtlety is encoding. If your repo includes files with unusual encodings, wc still counts newlines, but other tools in your pipeline might choke or misinterpret characters. For a stats tool, you can usually ignore encoding because you are counting newlines, not parsing content. Still, you should be aware that some files may not be valid UTF-8, and your report should not assume they are.

It is also worth considering counting stability across refactors. For example, if you want to compare counts over time, you must ensure you are including the same file sets. Changing the extension list or prune rules will create artificial deltas. A robust tool records its selection rules in the report header so that future comparisons are meaningful.

A practical pitfall is line-ending differences. Files with Windows-style CRLF still count as one line per newline, but if you later add logic to strip blank lines or comments, CR characters can interfere. If you extend the tool, normalize line endings or ensure your filters handle \r. Another common pitfall is generated code: if your repo includes generated files (protobufs, build artifacts), counts can jump dramatically. Decide whether generated files are in scope and document that choice clearly.

How this fits into the project

This project uses aggregation to produce per-language file and line counts plus overall totals.

Definitions & key terms

  • aggregation: combining many values into a summary (sum, count).
  • physical lines: literal newline-delimited lines in a file.
  • batching: processing many files per command invocation.
  • file selection: the set of files included in stats.

Mental model diagram (ASCII)

file list -> wc -l -> parse totals -> report

How it works (step-by-step)

  1. Build a safe file list per extension.
  2. Use wc -l to count lines.
  3. Parse totals from wc output.
  4. Aggregate results and render report.
  5. Invariant: counts are computed from the same file list used for all metrics.
  6. Failure modes: parsing the wrong wc line, or mismatched file selection across metrics.

Minimal concrete example

find . -name '*.py' -type f -exec wc -l {} + | tail -1
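
Note that tail -1 assumes wc ran exactly once over multiple files: with a single file wc prints no total line at all, and on very large file lists find may invoke wc several times, producing several total lines. A more robust sketch sums the per-file counts and skips the totals:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf '1\n2\n' > "$tmp/a.py"
printf '3\n'    > "$tmp/b.py"

# Summing per-file counts (and skipping every "total" line) stays correct
# even if find splits the file list across several wc invocations.
total=$(find "$tmp" -name '*.py' -type f -exec wc -l {} + |
        awk '$2 != "total" { s += $1 } END { print s }')
echo "$total"
```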

Common misconceptions

  • “wc -l already gives totals” -> True, but you must parse correctly.
  • “All lines are equal” -> Not always; document what you count.
  • “Multiple find commands are fine” -> Risky if predicates differ.

Check-your-understanding questions

  1. Why does wc -l output include filenames?
  2. Why might two counts differ if you run two separate find commands?
  3. What is the difference between physical and logical lines?
  4. Why is batching faster than per-file execution?

Check-your-understanding answers

  1. When given filename arguments, wc labels each count with its filename so per-file results can be distinguished.
  2. Small predicate differences can include different files.
  3. Physical lines are literal; logical lines exclude blanks/comments.
  4. Fewer process invocations reduce overhead.

Real-world applications

  • Repo statistics for engineering dashboards.
  • Estimating effort for migrations.
  • Monitoring codebase growth over time.

Where you will apply it

You will apply these counting pipelines in the Counter component (Section 4.2) and in Phase 2 of the implementation.

References

  • The Linux Command Line (Shotts), Chapter 20
  • man wc
  • man awk

Key insights

Aggregation is reliable only when the file list is consistent and well-defined.

Summary

Counts are easy to compute but easy to get wrong. Define the file list and parse totals carefully.

Homework/Exercises to practice the concept

  1. Count lines in .py files and verify totals manually for a small fixture.
  2. Write an awk script that totals counts per extension.
  3. Compare per-file wc vs batched wc performance.

Solutions to the homework/exercises

  1. Use a small fixture repo and sum counts by hand.
  2. awk -F. '{ext=$NF; counts[ext]+=1} END {for (e in counts) print e, counts[e]}' (note: a path without a dot reports the whole path as its "extension").
  3. Use time to compare two methods.

2.2 Deterministic Ordering and Timestamp Analysis

Fundamentals

A stats report must be reproducible. Sorting with a fixed locale (LC_ALL=C) ensures consistent ordering. When listing recent files, you must format timestamps consistently and sort by time. find -printf '%T+ %p' produces sortable ISO-like timestamps. Without deterministic ordering, stats reports cannot be diffed reliably.

Determinism is also about secondary ordering rules. If two file types have the same count, the order should still be stable. That means adding a secondary sort key (like the extension name) when you sort by count. This makes diffs clean and prevents spurious ordering changes that look like real data changes.

Deep Dive into the concept

Determinism in reporting has multiple dimensions. First, ordering: the same data should produce the same output order. Because filesystem traversal order is not guaranteed, you must sort explicitly. LC_ALL=C sets a bytewise collation order so the sort is stable across locales. Without this, systems with different locale settings can produce different output, which is disastrous for CI or auditing.

Second, timestamp formatting. find -printf '%T+' produces a timestamp with high precision, which sorts lexicographically. This is ideal for reporting “most recently modified” files. But you must decide whether to display the full timestamp or truncate to seconds for readability. The report should document what precision is used.

Third, a deterministic report should include a scan time or run ID. This makes it clear when the data was collected. If you do not include scan time, you can confuse readers when stats appear to change unexpectedly due to file modifications.

Fourth, when counting files by extension, you should order the output alphabetically or by count. Sorting by count is useful for ranking, but the ties should be broken by name for stability. This is a subtle but important detail if you want reproducible output.

Finally, consider that timestamps can change during the scan. A file modified while the scan is running may appear in an unexpected position. This is acceptable but should be documented. For full determinism, you would need snapshotting, which is out of scope. The practical solution is to accept small race conditions but make the output as stable as possible through sorting.

Another important point is to normalize time zones. If you print timestamps in local time, reports created on machines in different time zones will differ even if the files are identical. To avoid this, choose UTC (for example by running find with TZ=UTC) or include the time zone offset in the timestamp: find -printf '%TY-%Tm-%TdT%TH:%TM:%TS%Tz' can include the offset. Note that the seconds and offset directives are %TS and %Tz; a bare %S or %z means something else to find. This makes your report more portable and easier to compare across environments.
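
A small sketch of time-zone normalization, assuming GNU find and GNU touch (the fixture file and its mtime are made up):

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
touch -d '2026-01-01 12:00:00 UTC' "$tmp/a.py"   # GNU touch: fixed UTC mtime

# TZ=UTC makes the %T directives render in UTC on every machine.
# GNU find's %TS prints fractional seconds, so truncate to whole seconds.
stamp=$(TZ=UTC find "$tmp" -name a.py \
        -printf '%TY-%Tm-%TdT%TH:%TM:%TS\n' | cut -c1-19)
echo "$stamp"
```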

When sorting recent files, you should also decide whether to include symlinks and directories. A stats engine focused on code metrics usually excludes directories and may choose to include symlinks as paths rather than following them. This decision affects the output and should be documented to avoid confusion about why some files appear or do not appear.

Deterministic ordering also benefits from explicit tie-breaking in every summary section. If you sort by count, add a secondary key for file extension. If you sort by time, add a secondary key for path to make ordering stable when timestamps are equal (common when files are generated in batches). These tiny details make diffs clean and prevent churn in CI reports that would otherwise show spurious changes.
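
The tie-breaking rule can be demonstrated directly with sort; the per-extension counts here are hypothetical, with a deliberate tie between py and js:

```shell
# Primary key: count (field 2), numeric, descending.
# Secondary key: extension (field 1), bytewise ascending, to break ties.
out=$(printf 'py 12\nsh 7\njs 12\n' |
      LC_ALL=C sort -k2,2nr -k1,1)
echo "$out"
```

Without the -k1,1 tie-break, the relative order of py and js would depend on input order rather than on the data.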

You can also include a lightweight hash of the report inputs, such as a hash of the file list or a count of files scanned, in the header. This doesn’t make the report deterministic by itself, but it provides a checksum for sanity. If two reports have the same hash but different ordering, you know the difference is a formatting issue rather than a data change. This small addition can save a lot of debugging time when reports are generated on different machines.

How this fits into the project

This project uses sorted outputs for counts and for the “top modified files” section, making reports stable and diffable.

Definitions & key terms

  • determinism: stable output given the same input.
  • collation: the ordering rules used by sort.
  • timestamp precision: the granularity of time data.
  • race condition: file changes during scan affecting results.

Mental model diagram (ASCII)

raw stats -> sort (LC_ALL=C) -> deterministic report

How it works (step-by-step)

  1. Generate raw counts and timestamp lines.
  2. Apply LC_ALL=C sort.
  3. Format and render report with a scan header.
  4. Invariant: the same input always produces the same ordering.
  5. Failure modes: locale drift, missing secondary sort keys, or time zone mismatch.

Minimal concrete example

find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5

Common misconceptions

  • “Filesystem order is stable” -> False; never rely on it.
  • “Sort is the same everywhere” -> False; locale affects collation.
  • “Timestamps are immutable” -> False; files can change mid-scan.

Check-your-understanding questions

  1. Why do we set LC_ALL=C?
  2. How does find -printf '%T+' help with sorting by time?
  3. Why should reports include scan time?
  4. How do you make ordering deterministic when counts tie?

Check-your-understanding answers

  1. It ensures predictable bytewise collation.
  2. The format is lexicographically sortable.
  3. It documents when data was collected.
  4. Add a secondary sort key like filename.

Real-world applications

  • Repo metrics dashboards.
  • Codebase change monitoring.
  • Release readiness reporting.

Where you will apply it

You will apply deterministic ordering in the Reporter and Recent components (Section 4.2) and in Phase 3 of the implementation.

References

  • man sort
  • The Linux Command Line (Shotts), Chapter 20

Key insights

Determinism is what makes a report trustworthy over time.

Summary

Sorting with fixed locale and consistent timestamp formatting makes stats reports reproducible and auditable.

Homework/Exercises to practice the concept

  1. Run the stats tool twice and compare outputs.
  2. Change locale and observe differences without LC_ALL=C.
  3. Build a sorted list of recent files with a fixed format.

Solutions to the homework/exercises

  1. diff report1.txt report2.txt should be stable.
  2. LANG=de_DE.UTF-8 sort vs LC_ALL=C sort.
  3. find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5.

3. Project Specification

3.1 What You Will Build

A CLI tool that reports counts of files and lines by language and lists the most recently modified files, with deterministic ordering and clear headers.

3.2 Functional Requirements

  1. File counts: count files by extension.
  2. Line counts: count total lines per extension.
  3. Recent files: list top N modified files.
  4. Deterministic output: sorted and stable.
  5. Exclusions: prune vendor directories.

3.3 Non-Functional Requirements

  • Performance: handle large repos efficiently.
  • Reliability: a report is still produced when minor, non-fatal errors occur (exit code 1).
  • Usability: clear report format.

3.4 Example Usage / Output

$ ./stats_engine.sh ~/Projects/app

3.5 Data Formats / Schemas / Protocols

Report format:

SCAN_TIME=2026-01-01T12:00:00
Python files: 84 (22410 lines)
JavaScript files: 41 (9380 lines)
Top 5 recent files:
  2025-12-31T22:10:05 src/api/auth.py

3.6 Edge Cases

  • Repos with no files of a given type.
  • Large files affecting performance.
  • Files modified during scan.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

./stats_engine.sh ./fixtures/repo

3.7.2 Golden Path Demo (Deterministic)

Use a fixture repo and fixed scan time 2026-01-01T12:00:00.

3.7.3 If CLI: exact terminal transcript

$ ./stats_engine.sh ./fixtures/repo
[2026-01-01T12:00:00] TARGET=./fixtures/repo
[2026-01-01T12:00:00] REPORT=stats_2026-01-01.txt
[2026-01-01T12:00:00] DONE

$ cat stats_2026-01-01.txt
Python files: 3 (120 lines)
JavaScript files: 2 (80 lines)
Top 3 recent files:
  2025-12-31T22:10:05 src/api/auth.py
  2025-12-31T20:01:00 src/db/schema.sql

Failure demo (no target):

$ ./stats_engine.sh /no/such/path
[2026-01-01T12:00:00] ERROR: target not found
EXIT_CODE=2

Exit codes:

  • 0: success
  • 1: partial success (minor errors)
  • 2: invalid arguments or missing path

4. Solution Architecture

4.1 High-Level Design

find + prune -> file list -> counts -> sort -> report

4.2 Key Components

Component Responsibility Key Decisions
Selector build file lists prune vendor dirs
Counter count lines/files batched wc
Reporter render stats stable ordering
Recent list top modified sort by timestamp

4.3 Data Structures (No Full Code)

counts: map[ext]{files, lines}

4.4 Algorithm Overview

Key Algorithm: Repo Stats

  1. Build file lists by extension.
  2. Count files and lines per extension.
  3. Generate recent file list.
  4. Sort and render report.

Complexity Analysis:

  • Time: O(n log n) due to sorting
  • Space: O(n) for file lists
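
Three of the four steps can be sketched as one function (the recent-files step follows the same pruned-traversal pattern). The extension list, prune rules, and fixture are illustrative; the real stats_engine.sh would take the target from "$1" and map extensions to language names:

```shell
set -eu

stats_engine() {
  target="$1"
  [ -d "$target" ] || { echo "ERROR: target not found" >&2; return 2; }

  list="$(mktemp)"
  # Step 1: one pruned traversal; every metric reads the same list.
  find "$target" \( -name node_modules -o -name .git \) -prune -o \
       -type f \( -name '*.py' -o -name '*.js' \) -print > "$list"

  # Steps 2 and 4: batched wc, one awk pass, deterministic ordering.
  xargs wc -l < "$list" | awk '
    $2 == "total" { next }
    { n = split($2, p, "."); files[p[n]]++; lines[p[n]] += $1 }
    END { for (e in files)
            printf "%s files: %d (%d lines)\n", e, files[e], lines[e] }' |
    LC_ALL=C sort
  rm -f "$list"
}

# Demo on a throwaway fixture.
tmp="$(mktemp -d)"
printf 'a\nb\n' > "$tmp/app.py"
printf 'x\n'    > "$tmp/app.js"
mkdir "$tmp/node_modules"
printf 'junk\n' > "$tmp/node_modules/dep.js"
report=$(stats_engine "$tmp")
echo "$report"
rm -rf "$tmp"
```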

5. Implementation Guide

5.1 Development Environment Setup

# Requires find, wc, sort, awk

5.2 Project Structure

project-root/
├── stats_engine.sh
├── fixtures/
│   └── repo/
└── README.md

5.3 The Core Question You’re Answering

“How do I build a reliable, reproducible codebase report using only Unix tools?”

5.4 Concepts You Must Understand First

  1. Aggregation and accurate counting
  2. Deterministic sorting and timestamps

5.5 Questions to Guide Your Design

  1. Which file types should be included?
  2. How will you handle vendor directories?
  3. How will you present the top modified files?

5.6 Thinking Exercise

Design a pipeline that counts Python files, sums their lines, and lists the five most recently modified files.

5.7 The Interview Questions They’ll Ask

  1. “Why is uniq wrong without sort?”
  2. “How do you count lines across many files safely?”
  3. “How do you avoid scanning vendor directories?”

5.8 Hints in Layers

Hint 1: Count files

find . -name '*.py' -type f | wc -l

Hint 2: Count lines

find . -name '*.py' -type f -exec wc -l {} + | tail -1

Hint 3: Recent files

find . -type f -printf '%T+ %p\n' | LC_ALL=C sort -r | head -5
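
Hint 4: Exclude vendor directories

The hints above scan everything; requirement 5 (exclusions) needs -prune. A runnable sketch, assuming node_modules and .git as the exclusion list:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/src" "$tmp/node_modules"
touch "$tmp/src/app.py" "$tmp/node_modules/dep.py"

# -prune stops descent into matched directories; -print fires only on the
# other branch of the -o, so dep.py never appears in the output.
found=$(find "$tmp" \( -name node_modules -o -name .git \) -prune -o \
             -type f -name '*.py' -print)
echo "$found"
```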

5.9 Books That Will Help

Topic Book Chapter
Text processing The Linux Command Line (Shotts) Ch. 20
Find The Linux Command Line (Shotts) Ch. 17
Pipelines Effective Shell (Kerr) Ch. 6

5.10 Implementation Phases

Phase 1: Foundation (1-2 days)

Goals:

  • Build file lists by extension
  • Exclude vendor directories

Tasks:

  1. Implement prune logic.
  2. Define extension list.

Checkpoint: file lists are correct.

Phase 2: Core Functionality (2-3 days)

Goals:

  • Count files and lines

Tasks:

  1. Implement batched wc.
  2. Parse totals and format counts.

Checkpoint: counts match manual checks.

Phase 3: Polish & Edge Cases (1-2 days)

Goals:

  • Deterministic output
  • Recent files list

Tasks:

  1. Add LC_ALL=C sort.
  2. Render report header with scan time.

Checkpoint: report is stable across runs.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
File list source find vs git ls-files find + prune works for any repo
Line counting per-file vs batched batched faster
Output ordering by name vs by count by count with tie-break useful and stable

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests parsing totals wc output parsing
Integration Tests full report fixture repo
Edge Case Tests empty repo no files case

6.2 Critical Test Cases

  1. Empty repo: report still produced.
  2. Vendor dir: excluded from counts.
  3. Large file: counts still correct.
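
The determinism property from Section 2.2 is also easy to check automatically: run twice, compare byte for byte. A sketch, with a stub report function standing in for the real ./stats_engine.sh:

```shell
set -eu
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
touch "$tmp/b.py" "$tmp/a.py"

report() {  # stand-in for ./stats_engine.sh "$1"; swap in the real script
  find "$1" -type f -name '*.py' | LC_ALL=C sort
}

run1=$(report "$tmp")
run2=$(report "$tmp")
[ "$run1" = "$run2" ] && echo "deterministic"
```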

6.3 Test Data

fixtures/repo/src/app.py
fixtures/repo/src/app.js

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Wrong totals line counts off parse wc total correctly
Unstable output diff noise use LC_ALL=C sort
Vendor files included huge counts prune directories

7.2 Debugging Strategies

  • Print intermediate file lists.
  • Validate counts on a small fixture.
  • Use set -x to trace steps.

7.3 Performance Traps

Scanning large repos without pruning can take minutes. Always exclude vendor and build directories.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a --include flag for custom extensions.
  • Output JSON in addition to text.

8.2 Intermediate Extensions

  • Count blank lines and comment lines separately.
  • Add per-directory statistics.

8.3 Advanced Extensions

  • Integrate with git to track churn and blame.
  • Export metrics to Prometheus.

9. Real-World Connections

9.1 Industry Applications

  • Engineering metrics dashboards.
  • Audit of codebase growth.
  • Pre-migration assessments.

9.2 Related Tools

  • cloc: code line counting tool.
  • tokei: code statistics by language.

9.3 Interview Relevance

  • Aggregation and counting accuracy.
  • Deterministic reporting.
  • Safe traversal and pruning.

10. Resources

10.1 Essential Reading

  • The Linux Command Line (Shotts), Chapter 20
  • Effective Shell (Kerr), Chapter 6

10.2 Video Resources

  • “Counting Lines of Code” (YouTube)
  • “Deterministic Reports” (talk)

10.3 Tools & Documentation

  • man wc
  • man sort

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain batched wc output.
  • I can explain why sort order matters.
  • I can explain mtime semantics.

11.2 Implementation

  • Report includes counts and recent files.
  • Vendor directories are excluded.
  • Output is deterministic.

11.3 Growth

  • I can propose a metric improvement.
  • I documented one performance risk.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • File and line counts reported.
  • Recent files list included.

Full Completion:

  • Deterministic output with scan header.
  • Exclusion list implemented.

Excellence (Going Above & Beyond):

  • Per-directory stats and churn metrics.
  • Export to JSON or metrics system.