Project 1: Digital Census
Build a deterministic filesystem inventory tool that turns raw metadata into a queryable CSV and summary report.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | 4-8 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 2 - “data janitor” vibes |
| Business Potential | Medium (compliance, asset inventory) |
| Prerequisites | Shell basics, permissions, CSV basics |
| Key Topics | inodes, timestamps, find predicates, deterministic reports, CSV escaping |
1. Learning Objectives
By completing this project, you will:
- Explain how inode metadata is distinct from filenames and file contents.
- Build a `find` expression that is safe, predictable, and prunes irrelevant trees.
- Produce a CSV inventory with stable ordering and explicit schema.
- Generate summary statistics that are reproducible across runs.
- Handle permission errors without breaking the report.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Inodes and File Metadata
Fundamentals
An inode is the kernel data structure that describes a file on disk. The filename you see is only a directory entry that points to an inode; the inode itself holds metadata such as owner, group, mode bits, size, timestamps, and link count. When find reports size or ownership, it is reading inode fields, not file contents. This is why metadata queries are fast even for huge files. The inode model also explains why deleting a name does not immediately free disk space: the link count must reach zero and there must be no open file descriptors. Understanding inodes gives you a correct mental model for why find results are stable even when content changes, and why timestamps sometimes change without data modification.
At the system-call level, tools like find and stat call lstat() or stat() and receive a struct that includes these fields. That struct is the interface between user space and the filesystem’s internal representation. Your census report is effectively a serialized view of that struct. If you understand what fields are and are not guaranteed to be present (for example, birth time on some filesystems), you can design a schema that is correct on both Linux and macOS. In other words, the inode model isn’t just theory: it is the contract your report is built on.
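To make that contract concrete, here is a minimal sketch (assuming GNU coreutils `stat`; the path is only an example) that prints the inode fields this census exports:

```bash
# Print selected inode fields for one file (GNU coreutils stat).
# %i inode, %h link count, %U owner, %G group, %a octal mode, %s size, %y mtime
stat -c 'inode=%i links=%h owner=%U group=%G mode=%a size=%s mtime=%y' ./reports/monthly.txt
# macOS/BSD stat uses -f with different format letters (see man 1 stat).
```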
Deep Dive into the concept
Inode metadata is central to Unix filesystems. Each inode is identified by a number that is unique within a filesystem. A directory is itself a file whose contents map names to inode numbers. The mapping is not stored in the inode; it is stored in the parent directory. This separation allows hard links, because multiple directory entries can refer to the same inode. The link count is stored in the inode and is decremented when a directory entry is removed. Data blocks are reclaimed only when the link count is zero and no process holds the file open.
The inode also stores timestamps. mtime reflects the last content modification. ctime reflects the last metadata change, such as permissions or ownership updates. atime reflects last access, but may be disabled or updated lazily by mount options (like noatime or relatime). Many people assume ctime is creation time, but it is not. Some filesystems provide birth time, but it is not guaranteed to be visible through portable interfaces. This matters when you build a census report: you must document exactly which timestamp you are exporting and what it means.
The st_mode field combines file type and permission bits. You can tell if a file is a regular file, directory, symlink, block device, or socket from those bits. The setuid, setgid, and sticky bits also live in st_mode and affect execution and deletion behaviors. find -perm reads those bits directly. For an inventory tool, this is the difference between flagging a world-writable directory and ignoring a benign regular file.
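As a hedged illustration (GNU `find` assumed for `-printf`; `$TARGET` is a placeholder), this is the kind of permission-bit query the "world-writable" summary relies on:

```bash
# Flag world-writable files and directories.
# -perm -0002 is true when the other-write bit is set, regardless of the remaining mode bits.
find "$TARGET" -xdev \( -type f -o -type d \) -perm -0002 -printf '%m %y %p\n'
```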
There is a deeper operational implication: inode metadata can change independently from file contents. When a file is renamed, its inode does not change, but the directory entry does. When you change ownership, ctime changes and mtime does not. This is why a deterministic report must be explicit about the data you rely on. If you want to track data changes, you need to rely on mtime; if you want to track ownership changes, you need to include ctime. A census tool may choose to export both, but it must document how to interpret them.
Finally, inode numbers are only unique within a filesystem. If you cross mount points, the same inode number might refer to a different file. This is why find -xdev is important in audit contexts: it keeps the inventory on a single device. Once you treat inodes as the ground truth and filenames as labels, you can reason about odd behaviors like deleted-but-open files, log rotation, and dangling hard links. That mental model is essential to building a correct inventory.
To go even deeper, understand how path resolution works: each pathname component is looked up in a directory’s mapping table, yielding an inode number, which then becomes the next lookup context. If a component is a symlink and the caller uses stat() instead of lstat(), the kernel resolves the link and returns the target’s inode. That means a census tool that follows symlinks can unintentionally inventory files outside the intended tree, and it can also double-count the same underlying file via multiple symlink paths. The safe default is to not follow symlinks and to record symlinks as their own inode entries. If you need to record the target, you should add an explicit field and document the risk of cycles.
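A minimal sketch of that safe default (GNU `find` assumed; `$TARGET` is a placeholder): record symlinks as their own entries with the target in an explicit field, rather than following them:

```bash
# find does not dereference symlinks by default (-P), so each link is its own inode entry.
# %l prints the link target; it is empty for anything that is not a symlink.
find "$TARGET" -type l -printf 'symlink,%p,%l\n'
```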
Also consider inode caching and stale metadata. Modern filesystems cache inode metadata in memory, so repeated stats are fast, but on network filesystems (NFS, SMB) metadata can be stale or inconsistent due to caching and clock skew. This means a census report should be treated as a snapshot, not as a perfect ground truth across time. If you scan a directory while files are being modified, you can observe a mix of old and new metadata. Your script should therefore record a scan time and accept that the snapshot is approximate. This is normal in production audits.
Another important detail is link count vs. apparent file count. A census that counts each path as a file will overcount when hard links exist. If your environment uses hard links (for example, package managers or deduplicated storage), you may want to include inode numbers to detect duplicates and possibly deduplicate counts. This is an advanced feature, but understanding it matters if you are using the census for storage accounting. It is also the key to explaining why deleting a file sometimes does not free space: the inode still has a positive link count or is held open by a process. Your report can include nlink to help debug this.
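A hedged example of using the link count for exactly this kind of debugging (GNU `find`; `$TARGET` is a placeholder):

```bash
# List hard-linked files: %i is the inode number, %n the link count.
# Sorting by inode groups the paths that share the same underlying data.
find "$TARGET" -xdev -type f -links +1 -printf '%i %n %p\n' | sort -n
```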
How this fits into the project
You will read inode metadata for every file you inventory. The report schema in Section 3.5 mirrors inode fields, and the “world-writable” summary depends on permission bits in st_mode.
Definitions & key terms
- inode: metadata record for a file (size, mode, owner, timestamps, link count).
- directory entry: a name-to-inode mapping stored in a directory file.
- link count: number of directory entries pointing to the same inode.
- mtime: last content modification timestamp.
- ctime: last metadata change timestamp.
- atime: last access timestamp.
Mental model diagram (ASCII)
Directory file (names -> inode numbers)
+--------------------------+
| notes.txt -> inode 1201 |
| report.csv -> inode 1202 |
+------------+-------------+
|
v
inode 1201
(mode, uid, gid, size, times)
|
v
data blocks
How it works (step-by-step)
- `find` visits a path and calls `lstat()` to get inode metadata.
- The kernel returns fields like size, mode, uid, gid, and timestamps.
- `find -printf` formats those fields into your CSV row.
- You sort the output for determinism, then compute summaries.
- If `lstat()` fails (permission denied), you log the path and continue.
- Invariant: each CSV row corresponds to exactly one inode snapshot at scan time.
- Failure modes: permission denied, stale NFS metadata, or a file being deleted between stat and output.
Minimal concrete example
# Show inode number and timestamps for one file
ls -li ./reports/monthly.txt
stat ./reports/monthly.txt
Common misconceptions
- “ctime is creation time” -> False. ctime is last metadata change.
- “Deleting a filename deletes the file immediately” -> False if another hard link or open descriptor exists.
- “inode numbers are global” -> False; they are only unique per filesystem.
Check-your-understanding questions
- Why can two different names refer to the same file data?
- Which timestamp changes when you run `chmod 600 file`?
- Why can `df` and `du` show different disk usage?
- What does `find -xdev` protect you from in audits?
Check-your-understanding answers
- Both names are directory entries pointing to the same inode (hard links).
- `ctime` changes; `mtime` does not.
- `df` reports allocated blocks at the filesystem level (including space held by deleted-but-open files), while `du` sums only what it can reach through directory entries.
- It prevents crossing into other filesystems where inode numbers differ.
Real-world applications
- Compliance inventories of file ownership and permission posture.
- Forensic timelines based on metadata changes.
- Detecting orphaned or duplicated data via link counts.
Where you will apply it
- In this project: see §3.5 (schema design) and §5.4 (concept prerequisites).
- Also used in: P06-system-janitor.md and P08-forensic-analyzer.md.
References
- The Linux Programming Interface (Kerrisk), Chapter 15 (File attributes)
- `man 2 stat`
- `man 7 inode`
Key insights
Metadata is authoritative; filenames are just pointers to inodes.
Summary
Treating inode metadata as the primary source of truth lets you design a census report that is stable, interpretable, and correct.
Homework/Exercises to practice the concept
- Create a file, hard-link it twice, and verify the link count.
- Change permissions and observe which timestamps change.
- Find files with more than one hard link in a test directory.
Solutions to the homework/exercises
- `echo hi > a; ln a b; ln a c; ls -li a b c`
- `chmod 600 a; stat a` (ctime updates, mtime unchanged)
- `find . -type f -links +1 -ls`
2.2 Find Expressions, Traversal, and Pruning
Fundamentals
find is both a traversal engine and a predicate evaluator. It walks a directory tree depth-first and evaluates a boolean expression for each path. Expressions are built from tests like -name and -type, and actions like -print and -exec. Order matters because evaluation is left-to-right and uses short-circuit logic. This means a poorly ordered expression can be incorrect or slow. Pruning (-prune) lets you skip entire subtrees, which is critical when you want to avoid .git, node_modules, or mounted volumes. A census tool that does not control traversal will be slow and unpredictable.
At a practical level, find is your query planner. The way you structure tests is analogous to a database query: if you filter early, you reduce work later. If you misplace -prune or forget parentheses, the query still runs but returns the wrong result. For a census tool, this can produce false confidence because the report “looks” correct but silently skipped or included paths. Understanding the evaluation model is the difference between a reliable audit and a misleading one.
Deep Dive into the concept
The find expression language is deceptively simple but full of edge cases. Tests evaluate to true or false; actions usually return true and have side effects. -print is implied if no action is specified, but the moment you add an action, you must explicitly include -print or -printf. This is one of the most common pitfalls in audit scripts.
Operator precedence is another trap. -a (AND) binds tighter than -o (OR), so A -o B -a C is parsed as A -o (B -a C) unless you add parentheses. Because the shell interprets parentheses, you must escape them as \( ... \) or quote them. If your pruning logic is wrong, you will still traverse subtrees you intended to skip, which can explode runtime and include sensitive data.
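A small sketch of the precedence trap (hypothetical patterns, shown only to contrast the two parses):

```bash
# Without grouping: parsed as  -name '*.log'  OR  ( -name '*.tmp' AND -size +1M )
find . -name '*.log' -o -name '*.tmp' -size +1M

# With escaped parentheses: the size test applies to both name patterns
find . \( -name '*.log' -o -name '*.tmp' \) -size +1M
```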
Traversal order is depth-first by default. If you need to delete directories, you want -depth so files are processed before directories, but for an inventory report you usually want default order and a full traversal. To keep the census within a device boundary, -xdev limits traversal to a single filesystem. This avoids crossing into /proc, network mounts, or external drives where metadata semantics differ.
find provides rich metadata output via -printf. Each % token maps to a field: %s size, %u user, %g group, %m mode, %TY-%Tm-%Td date fields, and %p path. These come from the same inode metadata described earlier. You can emit CSV directly, but you must handle delimiter collisions. The path field can contain commas or newlines; if you need strict CSV, you may choose a safer delimiter or implement escaping.
Performance and correctness depend on test ordering. Place cheap tests (like -type or -name) before expensive actions. If you are pruning, place -prune early in the expression with an -o that continues evaluation for non-pruned paths. A reliable inventory command typically looks like this:
find root \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '...'
This form ensures that the prune takes effect before any expensive work. Once you internalize find as a boolean expression evaluator, you can reason about correctness instead of trial-and-error.
There is also a subtle interaction between -prune and -o. The common pruning pattern works because -prune returns true, causing the OR branch to short-circuit, which prevents descending into the directory. If you remove the -o, you still prune but you also stop evaluating the rest of the expression, so you produce no output for non-pruned paths. This is why the template -prune -o -type f -print is canonical. It encodes the control flow explicitly. Once you see this, you can build more complex filters with confidence.
The -path and -name tests also have nuanced semantics. -name matches only the basename, so -name '*.log' does not match /var/log/app/error.log.gz unless the basename matches exactly. -path matches the whole path, including directories, and its globbing can match /. This is often used for pruning paths. If your pruning pattern is too broad, you might exclude files you intended to include. A safe approach is to make prune paths explicit and anchored (for example -path './node_modules') and to test the command on a fixture tree before running on production data.
Time predicates are another source of surprises. -mtime +7 means “more than 7 * 24 hours ago,” but -mtime 7 means “between 7 and 8 days ago,” because find rounds down to whole days. For a census, you typically want exact timestamps rather than day buckets. That is why -printf is used to export exact timestamps and why -mmin or -newermt are used for precise cutoff logic. If you mix these predicates without understanding them, your selection window will be wrong.
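A hedged comparison of the day-bucket and precise forms (`-newermt` is a GNU find extension; the dates are arbitrary examples):

```bash
find . -type f -mtime +7                                      # more than 7*24 hours ago (whole-day buckets)
find . -type f -mmin -60                                      # modified within the last hour
find . -type f -newermt '2026-01-01' ! -newermt '2026-01-08'  # modified inside an exact date window
```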
Finally, portability matters. GNU find and BSD find differ in supported -printf directives and in the semantics of -regex and -regextype. For a project that should work on macOS and Linux, document which implementation you expect, or detect and use gfind when available. A census tool should be explicit about its dependencies because the output format can differ subtly between platforms.
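One way to handle this, sketched under the assumption that macOS users install findutils via Homebrew (so GNU find is available as `gfind`):

```bash
# Prefer GNU find when available so -printf behaves identically on macOS and Linux.
if command -v gfind >/dev/null 2>&1; then
  FIND=gfind
else
  FIND=find
fi
"$FIND" --version 2>/dev/null | head -1   # GNU find prints a version string; BSD find does not support --version
```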
How this fits into the project
Your census tool uses pruning to skip irrelevant paths and uses -printf to emit the structured inventory rows that the summary report consumes.
Definitions & key terms
- predicate/test: a condition that returns true or false (e.g., `-type f`).
- action: a side effect like `-print` or `-exec`.
- prune: skip an entire subtree during traversal.
- short-circuit: stop evaluation early once the truth value is known.
- depth-first traversal: `find` visits a directory before its children by default.
Mental model diagram (ASCII)
root/
  .git/        <- prune
  src/
    a.txt
    b.txt
Expression:
(path .git) -prune OR (type f) -printf
How it works (step-by-step)
- `find` visits `root/.git` and sees that it matches the prune path.
- `-prune` returns true, so the `-o` short-circuits the rest of the expression.
- `find` does not descend into `.git`.
- For `root/src/a.txt`, the prune test is false.
- `-type f` is true, so `-printf` emits a CSV row.
- Invariant: prune rules must be evaluated before expensive actions.
- Failure modes: misplaced parentheses, a missing `-o`, or incorrect glob patterns.
Minimal concrete example
find . \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '%s,%u,%g,%m,%TY-%Tm-%Td,%p\n'
Common misconceptions
- “Parentheses are optional” -> False; you must group `-o` expressions explicitly.
- “-print is always implied” -> False when any action is present.
- “-name uses regex” -> False; it uses glob patterns.
Check-your-understanding questions
- Why must you escape parentheses in a `find` command?
- When is `-print` implied, and when is it not?
- How does `-prune` change traversal behavior?
- Why does test ordering affect performance?
Check-your-understanding answers
- Because the shell interprets parentheses unless escaped.
- It is implied only when no action is present.
- It prevents descending into the matched directory subtree.
- Cheap tests filter early, preventing expensive actions later.
Real-world applications
- Large-scale file inventories in build servers.
- Compliance scans that avoid vendor directories.
- Fast metadata reports for incident response.
Where you will apply it
- In this project: see §3.2 (functional requirements) and §5.10 (implementation phases).
- Also used in: P05-the-pipeline.md and P07-stats-engine.md.
References
- `man find`
- GNU findutils manual, “Expressions” section
- The Linux Command Line (Shotts), Chapter 17
Key insights
find is a boolean expression engine; traversal control is a first-class design decision.
Summary
A correct census depends on correct traversal. If you can reason about pruning and expression order, your tool will be both safe and fast.
Homework/Exercises to practice the concept
- Write a `find` command that skips `.git` and `node_modules`.
- Create a directory tree and verify `-maxdepth` vs `-mindepth`.
- Demonstrate the difference between `-name` and `-path`.
Solutions to the homework/exercises
- `find . \( -path './.git' -o -path './node_modules' \) -prune -o -type f -print`
- `find . -maxdepth 2 -type f` vs `find . -mindepth 2 -type f`
- `find . -name '*.log'` (basename only) vs `find . -path './logs/*.log'`
2.3 Deterministic Reporting and CSV Safety
Fundamentals
A report is only useful if it is reproducible. Filesystem traversal order is not guaranteed, so you must explicitly sort your output. Locale can also change sort order, so using LC_ALL=C ensures deterministic collation. When writing CSV, you must handle delimiter collisions: paths may contain commas or newlines. A census tool should either implement strict CSV escaping or choose a safer delimiter (like tab) and document the schema. Determinism is the difference between a report you can diff over time and one you cannot trust.
Determinism also means consistent schema decisions. If you sometimes emit mtime in local time and other times in UTC, your report will be inconsistent even if the sort order is stable. You must define exactly how timestamps are formatted and whether you normalize them (e.g., ISO-8601). A deterministic census tool is one that produces the same output for the same input, regardless of the machine or locale. That level of predictability is what turns a one-off script into an audit artifact.
Deep Dive into the concept
Determinism has multiple layers. First, the order of rows in your CSV must be stable. find does not guarantee order, and filesystem traversal can vary by OS or filesystem. Sorting by path is the simplest way to stabilize output, but you must ensure sorting behavior is consistent across platforms. LC_ALL=C forces byte-order sorting, which is predictable. Without it, a path containing uppercase letters or accented characters can reorder unexpectedly depending on locale settings.
Second, timestamps must be normalized. If you print timestamps with seconds, repeated runs will differ for files that are modified between runs. For a census snapshot, this is expected. However, if you want to compare two runs, you must know exactly which fields are volatile. A best practice is to include both raw timestamps and a run header that records the scan time so you can reason about changes. Some teams also include a content hash, but that is outside the scope of a metadata census.
Third, CSV requires escaping. Standard CSV uses commas as field separators and double quotes to escape fields that contain commas, quotes, or newlines. A naive -printf will break if any path contains a comma or newline. You can solve this by using a tab delimiter, or by post-processing with a CSV escape function. In Bash, a simple escape can replace " with "" and wrap the field in quotes. If you choose a different delimiter (like pipe), you must ensure it does not appear in paths or you must escape it.
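A minimal sketch of that escape step (hypothetical helper name; note that command substitution strips trailing newlines, so truly pathological filenames still need care):

```bash
# RFC 4180-style quoting for a single field: double embedded quotes, then wrap in quotes.
csv_escape() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

csv_escape 'logs/file,with,comma.txt'   # -> "logs/file,with,comma.txt"
```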
Fourth, summary reports must be reproducible. Counts of file types or permissions should be computed from the sorted inventory rather than from a second traversal to avoid inconsistencies. This means your pipeline should produce a single source of truth (the CSV) and derive all summaries from it. That design creates a clean audit trail.
Finally, deterministic output enables automation. If your report is stable, you can store it in git, diff it across days, or feed it into compliance workflows. If it is not stable, you will produce noise. The small discipline of stable sorting and explicit schema definitions turns an ad-hoc script into an auditable tool.
There is also the problem of numeric and date sorting within CSV. If you need to sort by size or time, you must use numeric sort flags (sort -n or sort -t, -k1,1n for numeric columns). Sorting lexicographically by size will produce incorrect ordering because "100" sorts before "20". A census report might need multiple outputs: one sorted by path for diffability, and another sorted by size for the “top N” list. The key is to document which sort order applies to which output so that users interpret the results correctly.
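As a hedged sketch (assuming the schema from §3.5 with a header row, `size_bytes` in column 1 and `path` in column 6, and no commas inside paths), both views can be derived from the same CSV:

```bash
# Top 5 by size (numeric, descending) for the "top N" view
tail -n +2 census.csv | sort -t, -k1,1nr | head -5

# Totals for the summary, derived from the same file
awk -F, 'NR > 1 { total += $1; n++ }
         END { printf "Total files: %d\nTotal size bytes: %d\n", n, total }' census.csv
```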
CSV escaping deserves special attention. The RFC 4180 standard requires fields containing commas, quotes, or newlines to be wrapped in double quotes, and embedded quotes to be doubled. If you do not implement this, your report can break when loaded into spreadsheets or analysis tools. That is why some audit tools choose to emit TSV (tab-separated values) instead: tabs are rare in filenames, making escaping less frequent. If you stick with CSV, you can post-process the path column with a small escape function. The choice should be explicit and recorded in the README or report header.
Determinism can also be improved by capturing the environment context. For example, include uname -a, tool versions, and the find flavor (GNU vs BSD) in the report header. This is important when you compare reports across machines. If a report differs, you need to know whether the difference is due to real filesystem changes or to differences in tooling. The more metadata you include, the easier it is to debug discrepancies.
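A small sketch of such a header (the file name is hypothetical; `find --version` prints nothing on BSD find, which is itself useful information):

```bash
{
  printf '# scan_time=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '# host=%s\n'      "$(uname -a)"
  printf '# find=%s\n'      "$(find --version 2>/dev/null | head -1)"
} > census_header.txt
```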
Lastly, consider the trade-off between determinism and performance. Sorting a million-line CSV can be expensive and may require disk spill (sort uses temporary files). For very large inventories, you may need to tune sort with -T to set the temporary directory or --parallel to speed it up. This is not required for small projects, but it becomes important in production-scale audits.
How this fits into the project
You will explicitly sort your census output, define a CSV schema in Section 3.5, and generate summary statistics from the same dataset to avoid inconsistencies.
Definitions & key terms
- determinism: ability to reproduce identical output given the same inputs.
- collation order: how strings are ordered during sorting.
- CSV escaping: quoting rules for fields containing delimiters or quotes.
- schema: explicit definition of column meanings and types.
Mental model diagram (ASCII)
raw traversal -> unsorted rows -> sort -> stable CSV -> summaries
How it works (step-by-step)
- Emit raw rows from `find -printf`.
- Set `LC_ALL=C` to fix the sort order.
- Sort rows by path or size as needed.
- Write the sorted CSV to disk.
- Build summaries from the CSV (not from a second traversal).
- Invariant: summaries are derived from a single source CSV.
- Failure modes: locale-dependent ordering, unescaped delimiters, or inconsistent timestamp formats.
Minimal concrete example
LC_ALL=C find . -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n' \
| sort > census.csv
Common misconceptions
- “`find` always outputs in path order” -> False; order is filesystem-dependent.
- “CSV is just comma-separated” -> False; it requires escaping.
- “Sorting is optional” -> False if you want deterministic diffs.
Check-your-understanding questions
- Why can two runs of the same `find` command output rows in different orders?
- What does `LC_ALL=C` change in a pipeline?
- How do you safely represent a filename containing a comma in CSV?
- Why should summaries be derived from a single source dataset?
Check-your-understanding answers
- Traversal order depends on filesystem and directory entry order.
- It forces bytewise collation for predictable sorting.
- Wrap the field in quotes and escape embedded quotes.
- It prevents inconsistencies between two independent traversals.
Real-world applications
- Compliance audits stored in version control for change tracking.
- Inventory diffs for large monorepos and build caches.
- Incident response reports that must be reproducible.
Where you will apply it
- In this project: see §3.5 (schema) and §3.7 (golden output).
- Also used in: P07-stats-engine.md.
References
- RFC 4180 (CSV format)
- The Linux Command Line (Shotts), Chapter 20
- `man sort`
Key insights
Determinism is a feature. Without it, audits are noise.
Summary
Stable sorting, explicit schema, and safe delimiter handling turn a list of files into a reliable dataset.
Homework/Exercises to practice the concept
- Create two files with names that differ only in case and observe the sort order with and without `LC_ALL=C`.
- Create a filename containing a comma and test how your CSV handles it.
- Produce a summary of the top 5 largest files from a CSV without re-running `find`.
Solutions to the homework/exercises
- `LC_ALL=C printf 'a\nB\n' | sort` vs `printf 'a\nB\n' | sort`
- `touch 'file,with,comma.txt'`, then check the CSV output for quoting.
- `sort -rn -t, -k1,1 census.csv | head -5`
3. Project Specification
3.1 What You Will Build
A CLI script that traverses a target directory, collects inode metadata for each file, and outputs:
- a deterministic CSV inventory file
- a plain-text summary report (top sizes, risky permissions, counts by type)
- an error log of inaccessible paths
Included features:
- traversal pruning of irrelevant directories
- explicit CSV schema with stable ordering
- summary statistics derived from the CSV
Excluded features:
- content hashing
- content scanning or regex matching
- network or distributed filesystem inventory
3.2 Functional Requirements
- Inventory CSV: output `size_bytes,owner,group,mode,mtime,path` for each file.
- Pruning: skip user-configured directories (`.git`, `node_modules`, etc.).
- Deterministic order: sorted output with fixed collation.
- Summary report: top 10 largest files, total file count, and world-writable files.
- Error handling: permission errors captured to a log file.
3.3 Non-Functional Requirements
- Performance: handle 100k files in under 1 minute on a typical laptop SSD.
- Reliability: never abort on a single permission error.
- Usability: clear CLI flags and a human-readable summary.
3.4 Example Usage / Output
$ ./census.sh ~/Projects --exclude .git --exclude node_modules
[+] Target: /Users/alice/Projects
[+] CSV: census_2026-01-01.csv
[+] Summary: census_summary.txt
[+] Errors: census_errors.txt
3.5 Data Formats / Schemas / Protocols
CSV schema (comma-separated, UTF-8, sorted by path):
size_bytes,owner,group,mode,mtime,path
1048576,alice,staff,0644,2025-12-31T20:11:04,./logs/app.log
Summary report schema:
Total files: 12482
Total size bytes: 912381234
Top 10 largest files:
10485760 ./db/backup.sql
World-writable files:
./tmp/unsafe.txt
3.6 Edge Cases
- Filenames with commas or newlines.
- Directories without read permissions.
- Symlink loops (only if `-L` is used, which this tool avoids by default).
- Files modified during the scan (timestamps may change mid-run).
3.7 Real World Outcome
A deterministic inventory snapshot suitable for audits and diffs.
3.7.1 How to Run (Copy/Paste)
./census.sh /Users/alice/Projects --exclude .git --exclude node_modules
3.7.2 Golden Path Demo (Deterministic)
Assume a fixed test dataset under ./fixtures/census and a frozen timestamp of 2026-01-01T12:00:00 recorded in the report header.
3.7.3 If CLI: exact terminal transcript
$ ./census.sh ./fixtures/census --exclude .git
[2026-01-01T12:00:00] TARGET=./fixtures/census
[2026-01-01T12:00:00] CSV=census_2026-01-01.csv
[2026-01-01T12:00:00] SUMMARY=census_summary.txt
[2026-01-01T12:00:00] ERRORS=census_errors.txt
[2026-01-01T12:00:00] FILES=6
[2026-01-01T12:00:00] DONE
$ cat census_2026-01-01.csv
size_bytes,owner,group,mode,mtime,path
4096,alice,staff,0755,2025-12-28T18:02:11,./fixtures/census/bin/tool
2048,alice,staff,0644,2025-12-29T10:22:12,./fixtures/census/logs/app.log
12,alice,staff,0644,2025-12-30T09:00:00,./fixtures/census/notes/todo.txt
$ cat census_summary.txt
Total files: 6
Total size bytes: 6156
Top 3 largest files:
4096 ./fixtures/census/bin/tool
2048 ./fixtures/census/logs/app.log
World-writable files:
(none)
Failure demo (missing path):
$ ./census.sh /no/such/path
[2026-01-01T12:00:00] ERROR: target does not exist
EXIT_CODE=2
Exit codes:
- 0: success
- 1: partial success with errors logged
- 2: invalid arguments or missing path
4. Solution Architecture
4.1 High-Level Design
+----------+ +-----------------+ +------------------+
| args | --> | find/printf CSV | --> | sort + summary |
+----------+ +-----------------+ +------------------+
\ |
\--> error log --------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| CLI parser | parse target and excludes | support multiple --exclude flags |
| Scanner | run find with prune and -printf | default -P, no symlink following |
| Sorter | ensure deterministic CSV | LC_ALL=C, sort by path |
| Summarizer | compute totals, top N, perms | derive from CSV only |
4.3 Data Structures (No Full Code)
# Conceptual fields per row
size_bytes, owner, group, mode, mtime, path
4.4 Algorithm Overview
Key Algorithm: Inventory Pipeline
- Build prune expression from excludes.
- Run `find` with `-type f -printf`.
- Pipe to `sort` with a fixed locale.
- Write the CSV and compute the summary with `awk`.
Complexity Analysis:
- Time: O(n log n) due to sorting.
- Space: O(n) for CSV output.
5. Implementation Guide
5.1 Development Environment Setup
# macOS users may install GNU findutils for consistent -printf behavior
brew install findutils
5.2 Project Structure
project-root/
├── census.sh
├── fixtures/
│ └── census/
├── output/
│ ├── census_YYYY-MM-DD.csv
│ └── census_summary.txt
└── README.md
5.3 The Core Question You’re Answering
“How do I turn raw filesystem metadata into a reliable, queryable inventory report?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Inodes and timestamps (what changes and why)
- `find` pruning and expression ordering
- Deterministic output and CSV safety
5.5 Questions to Guide Your Design
- Which directories must be excluded by default?
- How will you escape paths that contain commas?
- What is the minimum schema that still supports audits?
- Do you treat permission errors as warnings or failures?
5.6 Thinking Exercise
Sketch a pipeline that outputs a sorted CSV and then computes the top 5 largest files. Identify where you would log errors.
5.7 The Interview Questions They’ll Ask
- “Why is `ctime` not creation time?”
- “How does `find` decide which paths to traverse?”
- “What is the difference between `-print` and `-printf`?”
- “How do you make a report deterministic?”
5.8 Hints in Layers
Hint 1: Start with -printf
find "$TARGET" -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n'
Hint 2: Add pruning
find "$TARGET" \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '...'
Hint 3: Sort deterministically
LC_ALL=C sort > "$CSV"
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File metadata | The Linux Programming Interface (Kerrisk) | Ch. 15 |
| Find expressions | The Linux Command Line (Shotts) | Ch. 17 |
| Sorting and reporting | The Linux Command Line (Shotts) | Ch. 20 |
5.10 Implementation Phases
Phase 1: Foundation (1-2 hours)
Goals:
- Parse CLI args and build exclude list
- Emit raw CSV rows with `find -printf`
Tasks:
- Implement target validation and exit codes.
- Build the prune expression from `--exclude` flags (a sketch follows the checkpoint below).
Checkpoint: command prints valid CSV for a small directory.
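A hedged sketch of the prune-expression task above (Bash arrays; variable names are hypothetical and argument validation is omitted):

```bash
# Collect the target and repeated --exclude flags.
target=""
excludes=()
while [ $# -gt 0 ]; do
  case $1 in
    --exclude) excludes+=("$2"); shift 2 ;;
    *)         target=$1; shift ;;
  esac
done

# Turn the exclude list into:  ( -path T/a -o -path T/b ) -prune -o
prune_args=()
if [ ${#excludes[@]} -gt 0 ]; then
  prune_args=( '(' -path "$target/${excludes[0]}" )
  for dir in "${excludes[@]:1}"; do
    prune_args+=( -o -path "$target/$dir" )
  done
  prune_args+=( ')' -prune -o )
fi

find "$target" "${prune_args[@]}" -type f \
  -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n'
```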
Phase 2: Core Functionality (2-3 hours)
Goals:
- Deterministic output
- Summary report generation
Tasks:
- Pipe output through `LC_ALL=C sort`.
- Compute totals and top N using `awk`.
Checkpoint: CSV and summary are generated with stable ordering.
Phase 3: Polish & Edge Cases (1-2 hours)
Goals:
- Error logging
- CSV safety and documentation
Tasks:
- Capture permission errors to an error log (see the sketch after the checkpoint).
- Document delimiter rules and limitations.
Checkpoint: tool finishes even with permission errors.
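A minimal sketch of that error policy (placeholders `$TARGET` and `$CSV`; exit codes follow §3.7.3):

```bash
# Permission errors go to the error log instead of aborting the scan.
find "$TARGET" -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n' \
  2> census_errors.txt | LC_ALL=C sort > "$CSV"

if [ -s census_errors.txt ]; then
  exit 1   # partial success: CSV produced, errors logged
fi
exit 0
```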
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| CSV delimiter | comma, tab, pipe | comma with escaping | standard format, compatible with spreadsheets |
| Sorting key | path, size | path | stable, enables diffs |
| Error policy | fail-fast, log and continue | log and continue | audits should be best-effort |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | verify parsing and formatting | arg parser, delimiter escaping |
| Integration Tests | validate pipeline output | fixtures directory inventory |
| Edge Case Tests | handle weird filenames | commas, newlines, permission errors |
6.2 Critical Test Cases
- Missing path: exit code 2 and clear error.
- Comma in filename: CSV must remain parseable.
- Permission denied: error logged, exit code 1, report still produced.
6.3 Test Data
fixtures/census/
notes/todo.txt
logs/app.log
bin/tool
"file,with,comma.txt"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing prune | huge runtime | add -prune for vendor dirs |
| Unstable output | diffs change every run | add LC_ALL=C sort |
| Broken CSV | commas break columns | escape or change delimiter |
7.2 Debugging Strategies
- Use `set -x` to trace pipeline stages.
- Compare raw `find` output with the sorted output.
- Inspect the error log to identify permission issues.
7.3 Performance Traps
Scanning large vendor directories or network mounts can dominate runtime. Always prune and use -xdev where appropriate.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--json` output mode using `jq`.
- Add a `--max-size` filter to exclude large binaries.
8.2 Intermediate Extensions
- Track `ctime` and `atime` in addition to `mtime`.
- Add an `--xdev` flag to restrict the scan to a single device.
8.3 Advanced Extensions
- Compare two census reports and generate a diff summary (a starting sketch follows this list).
- Store reports in SQLite for historical queries.
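For the report-diff extension, a hedged starting point (assuming both CSVs use the §3.5 schema with path in column 6; the dated filenames are examples):

```bash
# Paths present in only one of the two reports (added or removed files).
comm -3 <(tail -n +2 census_2026-01-01.csv | cut -d, -f6 | LC_ALL=C sort) \
        <(tail -n +2 census_2026-01-02.csv | cut -d, -f6 | LC_ALL=C sort)
```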
9. Real-World Connections
9.1 Industry Applications
- Compliance audits for permissions and ownership.
- Pre-migration inventories for large storage systems.
- Build cache analysis and cleanup planning.
9.2 Related Open Source Projects
- findutils: reference implementation for `find`.
- ripgrep: content scanning at scale (out of scope but inspirational).
9.3 Interview Relevance
- File metadata and permission bits.
- Deterministic reporting and data quality.
- Safe traversal strategies.
10. Resources
10.1 Essential Reading
- The Linux Programming Interface (Kerrisk) - Chapter 15 (File attributes)
- The Linux Command Line (Shotts) - Chapters 17 and 20
10.2 Video Resources
- “Linux Filesystem Basics” (YouTube) - inode and metadata overview
- “Effective Shell Pipelines” (conference talk) - determinism tips
10.3 Tools & Documentation
- `man find`
- `man stat`
- `man sort`
10.4 Related Projects in This Series
- P02-log-hunter.md - switches to content scanning.
- P07-stats-engine.md - extends reporting to code stats.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain inode vs filename.
- I can explain why `ctime` is not creation time.
- I can describe how `find` evaluates expressions.
11.2 Implementation
- CSV output matches schema.
- Summary report is generated from CSV.
- Permission errors do not abort the run.
11.3 Growth
- I can identify one improvement to reporting quality.
- I documented at least one edge case discovered.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- CSV file generated with required columns.
- Summary report produced with top 10 largest files.
- Errors logged without crashing.
Full Completion:
- All minimum criteria plus:
- Deterministic ordering validated across two runs.
- CSV handles commas safely.
Excellence (Going Above & Beyond):
- Two reports can be diffed with a clean change summary.
- Report includes optional device restriction and timestamp normalization.