Project 1: Digital Census
Build a deterministic filesystem inventory tool that turns raw metadata into a queryable CSV and summary report.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | 4-8 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 2 - “data janitor” vibes |
| Business Potential | Medium (compliance, asset inventory) |
| Prerequisites | Shell basics, permissions, CSV basics |
| Key Topics | inodes, timestamps, find predicates, deterministic reports, CSV escaping |
1. Learning Objectives
By completing this project, you will:
- Explain how inode metadata is distinct from filenames and file contents.
- Build a `find` expression that is safe, predictable, and prunes irrelevant trees.
- Produce a CSV inventory with stable ordering and explicit schema.
- Generate summary statistics that are reproducible across runs.
- Handle permission errors without breaking the report.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Inodes and File Metadata
Fundamentals
An inode is the kernel data structure that describes a file on disk. The filename you see is only a directory entry that points to an inode; the inode itself holds metadata such as owner, group, mode bits, size, timestamps, and link count. When find reports size or ownership, it is reading inode fields, not file contents. This is why metadata queries are fast even for huge files. The inode model also explains why deleting a name does not immediately free disk space: the link count must reach zero and there must be no open file descriptors. Understanding inodes gives you a correct mental model for why find results are stable even when content changes, and why timestamps sometimes change without data modification.
At the system-call level, tools like find and stat call lstat() or stat() and receive a struct that includes these fields. That struct is the interface between user space and the filesystem’s internal representation. Your census report is effectively a serialized view of that struct. If you understand what fields are and are not guaranteed to be present (for example, birth time on some filesystems), you can design a schema that is correct on both Linux and macOS. In other words, the inode model isn’t just theory: it is the contract your report is built on.
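To make that contract concrete, here is a minimal sketch (assuming GNU coreutils `stat`; the path is only an example) that prints the inode fields this census exports:

```bash
# Print selected inode fields for one file (GNU coreutils stat).
# %i inode, %h link count, %U owner, %G group, %a octal mode, %s size, %y mtime
stat -c 'inode=%i links=%h owner=%U group=%G mode=%a size=%s mtime=%y' ./reports/monthly.txt
# macOS/BSD stat uses -f with different format letters (see man 1 stat).
```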
Deep Dive into the concept
Inode metadata is central to Unix filesystems. Each inode is identified by a number that is unique within a filesystem. A directory is itself a file whose contents map names to inode numbers. The mapping is not stored in the inode; it is stored in the parent directory. This separation allows hard links, because multiple directory entries can refer to the same inode. The link count is stored in the inode and is decremented when a directory entry is removed. Data blocks are reclaimed only when the link count is zero and no process holds the file open.
The inode also stores timestamps. mtime reflects the last content modification. ctime reflects the last metadata change, such as permissions or ownership updates. atime reflects last access, but may be disabled or updated lazily by mount options (like noatime or relatime). Many people assume ctime is creation time, but it is not. Some filesystems provide birth time, but it is not guaranteed to be visible through portable interfaces. This matters when you build a census report: you must document exactly which timestamp you are exporting and what it means.
The st_mode field combines file type and permission bits. You can tell if a file is a regular file, directory, symlink, block device, or socket from those bits. The setuid, setgid, and sticky bits also live in st_mode and affect execution and deletion behaviors. find -perm reads those bits directly. For an inventory tool, this is the difference between flagging a world-writable directory and ignoring a benign regular file.
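As a hedged illustration (GNU `find` assumed for `-printf`; `$TARGET` is a placeholder), this is the kind of permission-bit query the "world-writable" summary relies on:

```bash
# Flag world-writable files and directories.
# -perm -0002 is true when the other-write bit is set, regardless of the remaining mode bits.
find "$TARGET" -xdev \( -type f -o -type d \) -perm -0002 -printf '%m %y %p\n'
```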
There is a deeper operational implication: inode metadata can change independently from file contents. When a file is renamed, its inode does not change, but the directory entry does. When you change ownership, ctime changes and mtime does not. This is why a deterministic report must be explicit about the data you rely on. If you want to track data changes, you need to rely on mtime; if you want to track ownership changes, you need to include ctime. A census tool may choose to export both, but it must document how to interpret them.
Finally, inode numbers are only unique within a filesystem. If you cross mount points, the same inode number might refer to a different file. This is why find -xdev is important in audit contexts: it keeps the inventory on a single device. Once you treat inodes as the ground truth and filenames as labels, you can reason about odd behaviors like deleted-but-open files, log rotation, and dangling hard links. That mental model is essential to building a correct inventory.
To go even deeper, understand how path resolution works: each pathname component is looked up in a directory’s mapping table, yielding an inode number, which then becomes the next lookup context. If a component is a symlink and the caller uses stat() instead of lstat(), the kernel resolves the link and returns the target’s inode. That means a census tool that follows symlinks can unintentionally inventory files outside the intended tree, and it can also double-count the same underlying file via multiple symlink paths. The safe default is to not follow symlinks and to record symlinks as their own inode entries. If you need to record the target, you should add an explicit field and document the risk of cycles.
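A minimal sketch of that safe default (GNU `find` assumed; `$TARGET` is a placeholder): record symlinks as their own entries with the target in an explicit field, rather than following them:

```bash
# find does not dereference symlinks by default (-P), so each link is its own inode entry.
# %l prints the link target; it is empty for anything that is not a symlink.
find "$TARGET" -type l -printf 'symlink,%p,%l\n'
```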
Also consider inode caching and stale metadata. Modern filesystems cache inode metadata in memory, so repeated stats are fast, but on network filesystems (NFS, SMB) metadata can be stale or inconsistent due to caching and clock skew. This means a census report should be treated as a snapshot, not as a perfect ground truth across time. If you scan a directory while files are being modified, you can observe a mix of old and new metadata. Your script should therefore record a scan time and accept that the snapshot is approximate. This is normal in production audits.
Another important detail is link count vs. apparent file count. A census that counts each path as a file will overcount when hard links exist. If your environment uses hard links (for example, package managers or deduplicated storage), you may want to include inode numbers to detect duplicates and possibly deduplicate counts. This is an advanced feature, but understanding it matters if you are using the census for storage accounting. It is also the key to explaining why deleting a file sometimes does not free space: the inode still has a positive link count or is held open by a process. Your report can include nlink to help debug this.
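A hedged example of using the link count for exactly this kind of debugging (GNU `find`; `$TARGET` is a placeholder):

```bash
# List hard-linked files: %i is the inode number, %n the link count.
# Sorting by inode groups the paths that share the same underlying data.
find "$TARGET" -xdev -type f -links +1 -printf '%i %n %p\n' | sort -n
```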
How this fits into the project
You will read inode metadata for every file you inventory. The report schema in Section 3.5 mirrors inode fields, and the “world-writable” summary depends on permission bits in st_mode.
Definitions & key terms
- inode: metadata record for a file (size, mode, owner, timestamps, link count).
- directory entry: a name-to-inode mapping stored in a directory file.
- link count: number of directory entries pointing to the same inode.
- mtime: last content modification timestamp.
- ctime: last metadata change timestamp.
- atime: last access timestamp.
Mental model diagram (ASCII)
Directory file (names -> inode numbers)
+--------------------------+
| notes.txt -> inode 1201 |
| report.csv -> inode 1202 |
+------------+-------------+
|
v
inode 1201
(mode, uid, gid, size, times)
|
v
data blocks
How it works (step-by-step)
- `find` visits a path and calls `lstat()` to get inode metadata.
- The kernel returns fields like size, mode, uid, gid, and timestamps.
- `find -printf` formats those fields into your CSV row.
- You sort the output for determinism, then compute summaries.
- If `lstat()` fails (permission denied), you log the path and continue.
- Invariant: each CSV row corresponds to exactly one inode snapshot at scan time.
- Failure modes: permission denied, stale NFS metadata, or a file being deleted between stat and output.
Minimal concrete example
# Show inode number and timestamps for one file
ls -li ./reports/monthly.txt
stat ./reports/monthly.txt
Common misconceptions
- “ctime is creation time” -> False. ctime is last metadata change.
- “Deleting a filename deletes the file immediately” -> False if another hard link or open descriptor exists.
- “inode numbers are global” -> False; they are only unique per filesystem.
Check-your-understanding questions
- Why can two different names refer to the same file data?
- Which timestamp changes when you run `chmod 600 file`?
- Why can `df` and `du` show different disk usage?
- What does `find -xdev` protect you from in audits?
Check-your-understanding answers
- Both names are directory entries pointing to the same inode (hard links).
- `ctime` changes; `mtime` does not.
- `df` reports allocated blocks at the filesystem level (including space held by deleted-but-open files), while `du` sums only what it can reach through directory entries.
- It prevents crossing into other filesystems where inode numbers differ.
Real-world applications
- Compliance inventories of file ownership and permission posture.
- Forensic timelines based on metadata changes.
- Detecting orphaned or duplicated data via link counts.
Where you will apply it
- In this project: see §3.5 (schema design) and §5.4 (concept prerequisites).
- Also used in: P06-system-janitor.md and P08-forensic-analyzer.md.
References
- The Linux Programming Interface (Kerrisk), Chapter 15 (File attributes)
- `man 2 stat`
- `man 7 inode`
Key insights
Metadata is authoritative; filenames are just pointers to inodes.
Summary
Treating inode metadata as the primary source of truth lets you design a census report that is stable, interpretable, and correct.
Homework/Exercises to practice the concept
- Create a file, hard-link it twice, and verify the link count.
- Change permissions and observe which timestamps change.
- Find files with more than one hard link in a test directory.
Solutions to the homework/exercises
- `echo hi > a; ln a b; ln a c; ls -li a b c`
- `chmod 600 a; stat a` (ctime updates, mtime unchanged)
- `find . -type f -links +1 -ls`
2.2 Find Expressions, Traversal, and Pruning
Fundamentals
find is both a traversal engine and a predicate evaluator. It walks a directory tree depth-first and evaluates a boolean expression for each path. Expressions are built from tests like -name and -type, and actions like -print and -exec. Order matters because evaluation is left-to-right and uses short-circuit logic. This means a poorly ordered expression can be incorrect or slow. Pruning (-prune) lets you skip entire subtrees, which is critical when you want to avoid .git, node_modules, or mounted volumes. A census tool that does not control traversal will be slow and unpredictable.
At a practical level, find is your query planner. The way you structure tests is analogous to a database query: if you filter early, you reduce work later. If you misplace -prune or forget parentheses, the query still runs but returns the wrong result. For a census tool, this can produce false confidence because the report “looks” correct but silently skipped or included paths. Understanding the evaluation model is the difference between a reliable audit and a misleading one.
Deep Dive into the concept
The find expression language is deceptively simple but full of edge cases. Tests evaluate to true or false; actions usually return true and have side effects. -print is implied if no action is specified, but the moment you add an action, you must explicitly include -print or -printf. This is one of the most common pitfalls in audit scripts.
Operator precedence is another trap. -a (AND) binds tighter than -o (OR), so A -o B -a C is parsed as A -o (B -a C) unless you add parentheses. Because the shell interprets parentheses, you must escape them as \( ... \) or quote them. If your pruning logic is wrong, you will still traverse subtrees you intended to skip, which can explode runtime and include sensitive data.
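A small sketch of the precedence trap (hypothetical patterns, shown only to contrast the two parses):

```bash
# Without grouping: parsed as  -name '*.log'  OR  ( -name '*.tmp' AND -size +1M )
find . -name '*.log' -o -name '*.tmp' -size +1M

# With escaped parentheses: the size test applies to both name patterns
find . \( -name '*.log' -o -name '*.tmp' \) -size +1M
```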
Traversal order is depth-first by default. If you need to delete directories, you want -depth so files are processed before directories, but for an inventory report you usually want default order and a full traversal. To keep the census within a device boundary, -xdev limits traversal to a single filesystem. This avoids crossing into /proc, network mounts, or external drives where metadata semantics differ.
find provides rich metadata output via -printf. Each % token maps to a field: %s size, %u user, %g group, %m mode, %TY-%Tm-%Td date fields, and %p path. These come from the same inode metadata described earlier. You can emit CSV directly, but you must handle delimiter collisions. The path field can contain commas or newlines; if you need strict CSV, you may choose a safer delimiter or implement escaping.
Performance and correctness depend on test ordering. Place cheap tests (like -type or -name) before expensive actions. If you are pruning, place -prune early in the expression with an -o that continues evaluation for non-pruned paths. A reliable inventory command typically looks like this:
find root \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '...'
This form ensures that the prune takes effect before any expensive work. Once you internalize find as a boolean expression evaluator, you can reason about correctness instead of trial-and-error.
There is also a subtle interaction between -prune and -o. The common pruning pattern works because -prune returns true, causing the OR branch to short-circuit, which prevents descending into the directory. If you remove the -o, you still prune but you also stop evaluating the rest of the expression, so you produce no output for non-pruned paths. This is why the template -prune -o -type f -print is canonical. It encodes the control flow explicitly. Once you see this, you can build more complex filters with confidence.
The -path and -name tests also have nuanced semantics. -name matches only the basename, so -name '*.log' does not match /var/log/app/error.log.gz unless the basename matches exactly. -path matches the whole path, including directories, and its globbing can match /. This is often used for pruning paths. If your pruning pattern is too broad, you might exclude files you intended to include. A safe approach is to make prune paths explicit and anchored (for example -path './node_modules') and to test the command on a fixture tree before running on production data.
Time predicates are another source of surprises. -mtime +7 means “more than 7 * 24 hours ago,” but -mtime 7 means “between 7 and 8 days ago,” because find rounds down to whole days. For a census, you typically want exact timestamps rather than day buckets. That is why -printf is used to export exact timestamps and why -mmin or -newermt are used for precise cutoff logic. If you mix these predicates without understanding them, your selection window will be wrong.
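A hedged comparison of the day-bucket and precise forms (`-newermt` is a GNU find extension; the dates are arbitrary examples):

```bash
find . -type f -mtime +7                                      # more than 7*24 hours ago (whole-day buckets)
find . -type f -mmin -60                                      # modified within the last hour
find . -type f -newermt '2026-01-01' ! -newermt '2026-01-08'  # modified inside an exact date window
```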
Finally, portability matters. GNU find and BSD find differ in supported -printf directives and in the semantics of -regex and -regextype. For a project that should work on macOS and Linux, document which implementation you expect, or detect and use gfind when available. A census tool should be explicit about its dependencies because the output format can differ subtly between platforms.
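One way to handle this, sketched under the assumption that macOS users install findutils via Homebrew (so GNU find is available as `gfind`):

```bash
# Prefer GNU find when available so -printf behaves identically on macOS and Linux.
if command -v gfind >/dev/null 2>&1; then
  FIND=gfind
else
  FIND=find
fi
"$FIND" --version 2>/dev/null | head -1   # GNU find prints a version string; BSD find does not support --version
```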
How this fits into the project
Your census tool uses pruning to skip irrelevant paths and uses -printf to emit the structured inventory rows that the summary report consumes.
Definitions & key terms
- predicate/test: a condition that returns true or false (e.g., `-type f`).
- action: a side effect like `-print` or `-exec`.
- prune: skip an entire subtree during traversal.
- short-circuit: stop evaluation early once the truth value is known.
- depth-first traversal: `find` visits a directory before its children by default.
Mental model diagram (ASCII)
root/
  .git/        <- prune
  src/
    a.txt
    b.txt
Expression:
(path .git) -prune OR (type f) -printf
How it works (step-by-step)
- `find` visits `root/.git` and sees that it matches the prune path.
- `-prune` returns true, so the `-o` short-circuits the rest of the expression.
- `find` does not descend into `.git`.
- For `root/src/a.txt`, the prune test is false.
- `-type f` is true, so `-printf` emits a CSV row.
- Invariant: prune rules must be evaluated before expensive actions.
- Failure modes: misplaced parentheses, a missing `-o`, or incorrect glob patterns.
Minimal concrete example
find . \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '%s,%u,%g,%m,%TY-%Tm-%Td,%p\n'
Common misconceptions
- “Parentheses are optional” -> False; you must group `-o` expressions explicitly.
- “-print is always implied” -> False when any action is present.
- “-name uses regex” -> False; it uses glob patterns.
Check-your-understanding questions
- Why must you escape parentheses in a `find` command?
- When is `-print` implied, and when is it not?
- How does `-prune` change traversal behavior?
- Why does test ordering affect performance?
Check-your-understanding answers
- Because the shell interprets parentheses unless escaped.
- It is implied only when no action is present.
- It prevents descending into the matched directory subtree.
- Cheap tests filter early, preventing expensive actions later.
Real-world applications
- Large-scale file inventories in build servers.
- Compliance scans that avoid vendor directories.
- Fast metadata reports for incident response.
Where you will apply it
- In this project: see §3.2 (functional requirements) and §5.10 (implementation phases).
- Also used in: P05-the-pipeline.md and P07-stats-engine.md.
References
- `man find`
- GNU findutils manual, “Expressions” section
- The Linux Command Line (Shotts), Chapter 17
Key insights
find is a boolean expression engine; traversal control is a first-class design decision.
Summary
A correct census depends on correct traversal. If you can reason about pruning and expression order, your tool will be both safe and fast.
Homework/Exercises to practice the concept
- Write a `find` command that skips `.git` and `node_modules`.
- Create a directory tree and verify `-maxdepth` vs `-mindepth`.
- Demonstrate the difference between `-name` and `-path`.
Solutions to the homework/exercises
- `find . \( -path './.git' -o -path './node_modules' \) -prune -o -type f -print`
- `find . -maxdepth 2 -type f` vs `find . -mindepth 2 -type f`
- `find . -name '*.log'` (basename only) vs `find . -path './logs/*.log'`
2.3 Deterministic Reporting and CSV Safety
Fundamentals
A report is only useful if it is reproducible. Filesystem traversal order is not guaranteed, so you must explicitly sort your output. Locale can also change sort order, so using LC_ALL=C ensures deterministic collation. When writing CSV, you must handle delimiter collisions: paths may contain commas or newlines. A census tool should either implement strict CSV escaping or choose a safer delimiter (like tab) and document the schema. Determinism is the difference between a report you can diff over time and one you cannot trust.
Determinism also means consistent schema decisions. If you sometimes emit mtime in local time and other times in UTC, your report will be inconsistent even if the sort order is stable. You must define exactly how timestamps are formatted and whether you normalize them (e.g., ISO-8601). A deterministic census tool is one that produces the same output for the same input, regardless of the machine or locale. That level of predictability is what turns a one-off script into an audit artifact.
Deep Dive into the concept
Determinism has multiple layers. First, the order of rows in your CSV must be stable. find does not guarantee order, and filesystem traversal can vary by OS or filesystem. Sorting by path is the simplest way to stabilize output, but you must ensure sorting behavior is consistent across platforms. LC_ALL=C forces byte-order sorting, which is predictable. Without it, a path containing uppercase letters or accented characters can reorder unexpectedly depending on locale settings.
Second, timestamps must be normalized. If you print timestamps with seconds, repeated runs will differ for files that are modified between runs. For a census snapshot, this is expected. However, if you want to compare two runs, you must know exactly which fields are volatile. A best practice is to include both raw timestamps and a run header that records the scan time so you can reason about changes. Some teams also include a content hash, but that is outside the scope of a metadata census.
Third, CSV requires escaping. Standard CSV uses commas as field separators and double quotes to escape fields that contain commas, quotes, or newlines. A naive -printf will break if any path contains a comma or newline. You can solve this by using a tab delimiter, or by post-processing with a CSV escape function. In Bash, a simple escape can replace " with "" and wrap the field in quotes. If you choose a different delimiter (like pipe), you must ensure it does not appear in paths or you must escape it.
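A minimal sketch of that escape step (hypothetical helper name; note that command substitution strips trailing newlines, so truly pathological filenames still need care):

```bash
# RFC 4180-style quoting for a single field: double embedded quotes, then wrap in quotes.
csv_escape() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

csv_escape 'logs/file,with,comma.txt'   # -> "logs/file,with,comma.txt"
```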
Fourth, summary reports must be reproducible. Counts of file types or permissions should be computed from the sorted inventory rather than from a second traversal to avoid inconsistencies. This means your pipeline should produce a single source of truth (the CSV) and derive all summaries from it. That design creates a clean audit trail.
Finally, deterministic output enables automation. If your report is stable, you can store it in git, diff it across days, or feed it into compliance workflows. If it is not stable, you will produce noise. The small discipline of stable sorting and explicit schema definitions turns an ad-hoc script into an auditable tool.
There is also the problem of numeric and date sorting within CSV. If you need to sort by size or time, you must use numeric sort flags (sort -n or sort -t, -k1,1n for numeric columns). Sorting lexicographically by size will produce incorrect ordering because "100" sorts before "20". A census report might need multiple outputs: one sorted by path for diffability, and another sorted by size for the “top N” list. The key is to document which sort order applies to which output so that users interpret the results correctly.
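As a hedged sketch (assuming the schema from §3.5 with a header row, `size_bytes` in column 1 and `path` in column 6, and no commas inside paths), both views can be derived from the same CSV:

```bash
# Top 5 by size (numeric, descending) for the "top N" view
tail -n +2 census.csv | sort -t, -k1,1nr | head -5

# Totals for the summary, derived from the same file
awk -F, 'NR > 1 { total += $1; n++ }
         END { printf "Total files: %d\nTotal size bytes: %d\n", n, total }' census.csv
```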
CSV escaping deserves special attention. The RFC 4180 standard requires fields containing commas, quotes, or newlines to be wrapped in double quotes, and embedded quotes to be doubled. If you do not implement this, your report can break when loaded into spreadsheets or analysis tools. That is why some audit tools choose to emit TSV (tab-separated values) instead: tabs are rare in filenames, making escaping less frequent. If you stick with CSV, you can post-process the path column with a small escape function. The choice should be explicit and recorded in the README or report header.
Determinism can also be improved by capturing the environment context. For example, include uname -a, tool versions, and the find flavor (GNU vs BSD) in the report header. This is important when you compare reports across machines. If a report differs, you need to know whether the difference is due to real filesystem changes or to differences in tooling. The more metadata you include, the easier it is to debug discrepancies.
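A small sketch of such a header (the file name is hypothetical; `find --version` prints nothing on BSD find, which is itself useful information):

```bash
{
  printf '# scan_time=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '# host=%s\n'      "$(uname -a)"
  printf '# find=%s\n'      "$(find --version 2>/dev/null | head -1)"
} > census_header.txt
```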
Lastly, consider the trade-off between determinism and performance. Sorting a million-line CSV can be expensive and may require disk spill (sort uses temporary files). For very large inventories, you may need to tune sort with -T to set the temporary directory or --parallel to speed it up. This is not required for small projects, but it becomes important in production-scale audits.
How this fits into the project
You will explicitly sort your census output, define a CSV schema in Section 3.5, and generate summary statistics from the same dataset to avoid inconsistencies.
Definitions & key terms
- determinism: ability to reproduce identical output given the same inputs.
- collation order: how strings are ordered during sorting.
- CSV escaping: quoting rules for fields containing delimiters or quotes.
- schema: explicit definition of column meanings and types.
Mental model diagram (ASCII)
raw traversal -> unsorted rows -> sort -> stable CSV -> summaries
How it works (step-by-step)
- Emit raw rows from `find -printf`.
- Set `LC_ALL=C` to fix the sort order.
- Sort rows by path or size as needed.
- Write the sorted CSV to disk.
- Build summaries from the CSV (not from a second traversal).
- Invariant: summaries are derived from a single source CSV.
- Failure modes: locale-dependent ordering, unescaped delimiters, or inconsistent timestamp formats.
Minimal concrete example
LC_ALL=C find . -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n' \
| sort > census.csv
Common misconceptions
- “`find` always outputs in path order” -> False; order is filesystem-dependent.
- “CSV is just comma-separated” -> False; it requires escaping.
- “Sorting is optional” -> False if you want deterministic diffs.
Check-your-understanding questions
- Why can two runs of the same `find` command output rows in different orders?
- What does `LC_ALL=C` change in a pipeline?
- How do you safely represent a filename containing a comma in CSV?
- Why should summaries be derived from a single source dataset?
Check-your-understanding answers
- Traversal order depends on filesystem and directory entry order.
- It forces bytewise collation for predictable sorting.
- Wrap the field in quotes and escape embedded quotes.
- It prevents inconsistencies between two independent traversals.
Real-world applications
- Compliance audits stored in version control for change tracking.
- Inventory diffs for large monorepos and build caches.
- Incident response reports that must be reproducible.
Where you will apply it
- In this project: see §3.5 (schema) and §3.7 (golden output).
- Also used in: P07-stats-engine.md.
References
- RFC 4180 (CSV format)
- The Linux Command Line (Shotts), Chapter 20
- `man sort`
Key insights
Determinism is a feature. Without it, audits are noise.
Summary
Stable sorting, explicit schema, and safe delimiter handling turn a list of files into a reliable dataset.
Homework/Exercises to practice the concept
- Create two files with names that differ only in case and observe the sort order with and without `LC_ALL=C`.
- Create a filename containing a comma and test how your CSV handles it.
- Produce a summary of the top 5 largest files from a CSV without re-running `find`.
Solutions to the homework/exercises
- `LC_ALL=C printf 'a\nB\n' | sort` vs `printf 'a\nB\n' | sort`
- `touch 'file,with,comma.txt'`, then check the CSV output for quoting.
- `sort -rn -t, -k1,1 census.csv | head -5`
3. Project Specification
3.1 What You Will Build
A CLI script that traverses a target directory, collects inode metadata for each file, and outputs:
- a deterministic CSV inventory file
- a plain-text summary report (top sizes, risky permissions, counts by type)
- an error log of inaccessible paths
Included features:
- traversal pruning of irrelevant directories
- explicit CSV schema with stable ordering
- summary statistics derived from the CSV
Excluded features:
- content hashing
- content scanning or regex matching
- network or distributed filesystem inventory
3.2 Functional Requirements
- Inventory CSV: output `size_bytes,owner,group,mode,mtime,path` for each file.
- Pruning: skip user-configured directories (`.git`, `node_modules`, etc.).
- Deterministic order: sorted output with fixed collation.
- Summary report: top 10 largest files, total file count, and world-writable files.
- Error handling: permission errors captured to a log file.
3.3 Non-Functional Requirements
- Performance: handle 100k files in under 1 minute on a typical laptop SSD.
- Reliability: never abort on a single permission error.
- Usability: clear CLI flags and a human-readable summary.
3.4 Example Usage / Output
$ ./census.sh ~/Projects --exclude .git --exclude node_modules
[+] Target: /Users/alice/Projects
[+] CSV: census_2026-01-01.csv
[+] Summary: census_summary.txt
[+] Errors: census_errors.txt
3.5 Data Formats / Schemas / Protocols
CSV schema (comma-separated, UTF-8, sorted by path):
size_bytes,owner,group,mode,mtime,path
1048576,alice,staff,0644,2025-12-31T20:11:04,./logs/app.log
Summary report schema:
Total files: 12482
Total size bytes: 912381234
Top 10 largest files:
10485760 ./db/backup.sql
World-writable files:
./tmp/unsafe.txt
3.6 Edge Cases
- Filenames with commas or newlines.
- Directories without read permissions.
- Symlink loops (only if `-L` is used, which this tool avoids by default).
- Files modified during the scan (timestamps may change mid-run).
3.7 Real World Outcome
A deterministic inventory snapshot suitable for audits and diffs.
3.7.1 How to Run (Copy/Paste)
./census.sh /Users/alice/Projects --exclude .git --exclude node_modules
3.7.2 Golden Path Demo (Deterministic)
Assume a fixed test dataset under ./fixtures/census and a frozen timestamp of 2026-01-01T12:00:00 recorded in the report header.
3.7.3 If CLI: exact terminal transcript
$ ./census.sh ./fixtures/census --exclude .git
[2026-01-01T12:00:00] TARGET=./fixtures/census
[2026-01-01T12:00:00] CSV=census_2026-01-01.csv
[2026-01-01T12:00:00] SUMMARY=census_summary.txt
[2026-01-01T12:00:00] ERRORS=census_errors.txt
[2026-01-01T12:00:00] FILES=6
[2026-01-01T12:00:00] DONE
$ cat census_2026-01-01.csv
size_bytes,owner,group,mode,mtime,path
4096,alice,staff,0755,2025-12-28T18:02:11,./fixtures/census/bin/tool
2048,alice,staff,0644,2025-12-29T10:22:12,./fixtures/census/logs/app.log
12,alice,staff,0644,2025-12-30T09:00:00,./fixtures/census/notes/todo.txt
$ cat census_summary.txt
Total files: 6
Total size bytes: 6156
Top 3 largest files:
4096 ./fixtures/census/bin/tool
2048 ./fixtures/census/logs/app.log
World-writable files:
(none)
Failure demo (missing path):
$ ./census.sh /no/such/path
[2026-01-01T12:00:00] ERROR: target does not exist
EXIT_CODE=2
Exit codes:
- 0: success
- 1: partial success with errors logged
- 2: invalid arguments or missing path
4. Solution Architecture
4.1 High-Level Design
+----------+ +-----------------+ +------------------+
| args | --> | find/printf CSV | --> | sort + summary |
+----------+ +-----------------+ +------------------+
\ |
\--> error log --------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| CLI parser | parse target and excludes | support multiple --exclude flags |
| Scanner | run find with prune and -printf | default -P, no symlink following |
| Sorter | ensure deterministic CSV | LC_ALL=C, sort by path |
| Summarizer | compute totals, top N, perms | derive from CSV only |
4.3 Data Structures (No Full Code)
# Conceptual fields per row
size_bytes, owner, group, mode, mtime, path
4.4 Algorithm Overview
Key Algorithm: Inventory Pipeline
- Build prune expression from excludes.
- Run `find` with `-type f -printf`.
- Pipe to `sort` with a fixed locale.
- Write the CSV and compute the summary with `awk`.
Complexity Analysis:
- Time: O(n log n) due to sorting.
- Space: O(n) for CSV output.
5. Implementation Guide
5.1 Development Environment Setup
# macOS users may install GNU findutils for consistent -printf behavior
brew install findutils
5.2 Project Structure
project-root/
├── census.sh
├── fixtures/
│ └── census/
├── output/
│ ├── census_YYYY-MM-DD.csv
│ └── census_summary.txt
└── README.md
5.3 The Core Question You’re Answering
“How do I turn raw filesystem metadata into a reliable, queryable inventory report?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Inodes and timestamps (what changes and why)
- `find` pruning and expression ordering
- Deterministic output and CSV safety
5.5 Questions to Guide Your Design
- Which directories must be excluded by default?
- How will you escape paths that contain commas?
- What is the minimum schema that still supports audits?
- Do you treat permission errors as warnings or failures?
5.6 Thinking Exercise
Sketch a pipeline that outputs a sorted CSV and then computes the top 5 largest files. Identify where you would log errors.
5.7 The Interview Questions They’ll Ask
- “Why is `ctime` not creation time?”
- “How does `find` decide which paths to traverse?”
- “What is the difference between `-print` and `-printf`?”
- “How do you make a report deterministic?”
5.8 Hints in Layers
Hint 1: Start with -printf
find "$TARGET" -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n'
Hint 2: Add pruning
find "$TARGET" \( -path './.git' -o -path './node_modules' \) -prune -o -type f -printf '...'
Hint 3: Sort deterministically
LC_ALL=C sort > "$CSV"
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File metadata | The Linux Programming Interface (Kerrisk) | Ch. 15 |
| Find expressions | The Linux Command Line (Shotts) | Ch. 17 |
| Sorting and reporting | The Linux Command Line (Shotts) | Ch. 20 |
5.10 Implementation Phases
Phase 1: Foundation (1-2 hours)
Goals:
- Parse CLI args and build exclude list
- Emit raw CSV rows with `find -printf`
Tasks:
- Implement target validation and exit codes.
- Build the prune expression from `--exclude` flags (a sketch follows the checkpoint below).
Checkpoint: command prints valid CSV for a small directory.
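A hedged sketch of the prune-expression task above (Bash arrays; variable names are hypothetical and argument validation is omitted):

```bash
# Collect the target and repeated --exclude flags.
target=""
excludes=()
while [ $# -gt 0 ]; do
  case $1 in
    --exclude) excludes+=("$2"); shift 2 ;;
    *)         target=$1; shift ;;
  esac
done

# Turn the exclude list into:  ( -path T/a -o -path T/b ) -prune -o
prune_args=()
if [ ${#excludes[@]} -gt 0 ]; then
  prune_args=( '(' -path "$target/${excludes[0]}" )
  for dir in "${excludes[@]:1}"; do
    prune_args+=( -o -path "$target/$dir" )
  done
  prune_args+=( ')' -prune -o )
fi

find "$target" "${prune_args[@]}" -type f \
  -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n'
```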
Phase 2: Core Functionality (2-3 hours)
Goals:
- Deterministic output
- Summary report generation
Tasks:
- Pipe output through `LC_ALL=C sort`.
- Compute totals and top N using `awk`.
Checkpoint: CSV and summary are generated with stable ordering.
Phase 3: Polish & Edge Cases (1-2 hours)
Goals:
- Error logging
- CSV safety and documentation
Tasks:
- Capture permission errors to an error log (see the sketch after the checkpoint).
- Document delimiter rules and limitations.
Checkpoint: tool finishes even with permission errors.
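A minimal sketch of that error policy (placeholders `$TARGET` and `$CSV`; exit codes follow §3.7.3):

```bash
# Permission errors go to the error log instead of aborting the scan.
find "$TARGET" -type f -printf '%s,%u,%g,%m,%TY-%Tm-%TdT%TH:%TM:%TS,%p\n' \
  2> census_errors.txt | LC_ALL=C sort > "$CSV"

if [ -s census_errors.txt ]; then
  exit 1   # partial success: CSV produced, errors logged
fi
exit 0
```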
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| CSV delimiter | comma, tab, pipe | comma with escaping | standard format, compatible with spreadsheets |
| Sorting key | path, size | path | stable, enables diffs |
| Error policy | fail-fast, log and continue | log and continue | audits should be best-effort |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | verify parsing and formatting | arg parser, delimiter escaping |
| Integration Tests | validate pipeline output | fixtures directory inventory |
| Edge Case Tests | handle weird filenames | commas, newlines, permission errors |
6.2 Critical Test Cases
- Missing path: exit code 2 and clear error.
- Comma in filename: CSV must remain parseable.
- Permission denied: error logged, exit code 1, report still produced.
6.3 Test Data
fixtures/census/
notes/todo.txt
logs/app.log
bin/tool
"file,with,comma.txt"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Missing prune | huge runtime | add -prune for vendor dirs |
| Unstable output | diffs change every run | add LC_ALL=C sort |
| Broken CSV | commas break columns | escape or change delimiter |
7.2 Debugging Strategies
- Use `set -x` to trace pipeline stages.
- Compare raw `find` output with the sorted output.
- Inspect the error log to identify permission issues.
7.3 Performance Traps
Scanning large vendor directories or network mounts can dominate runtime. Always prune and use -xdev where appropriate.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--json` output mode using `jq`.
- Add a `--max-size` filter to exclude large binaries.
8.2 Intermediate Extensions
- Track `ctime` and `atime` in addition to `mtime`.
- Add an `--xdev` flag to restrict the scan to a single device.
8.3 Advanced Extensions
- Compare two census reports and generate a diff summary (a starting sketch follows this list).
- Store reports in SQLite for historical queries.
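For the report-diff extension, a hedged starting point (assuming both CSVs use the §3.5 schema with path in column 6; the dated filenames are examples):

```bash
# Paths present in only one of the two reports (added or removed files).
comm -3 <(tail -n +2 census_2026-01-01.csv | cut -d, -f6 | LC_ALL=C sort) \
        <(tail -n +2 census_2026-01-02.csv | cut -d, -f6 | LC_ALL=C sort)
```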
9. Real-World Connections
9.1 Industry Applications
- Compliance audits for permissions and ownership.
- Pre-migration inventories for large storage systems.
- Build cache analysis and cleanup planning.
9.2 Related Open Source Projects
- findutils: reference implementation for `find`.
- ripgrep: content scanning at scale (out of scope but inspirational).
9.3 Interview Relevance
- File metadata and permission bits.
- Deterministic reporting and data quality.
- Safe traversal strategies.
10. Resources
10.1 Essential Reading
- The Linux Programming Interface (Kerrisk) - Chapter 15 (File attributes)
- The Linux Command Line (Shotts) - Chapters 17 and 20
10.2 Video Resources
- “Linux Filesystem Basics” (YouTube) - inode and metadata overview
- “Effective Shell Pipelines” (conference talk) - determinism tips
10.3 Tools & Documentation
- `man find`
- `man stat`
- `man sort`
10.4 Related Projects in This Series
- P02-log-hunter.md - switches to content scanning.
- P07-stats-engine.md - extends reporting to code stats.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain inode vs filename.
- I can explain why `ctime` is not creation time.
- I can describe how `find` evaluates expressions.
11.2 Implementation
- CSV output matches schema.
- Summary report is generated from CSV.
- Permission errors do not abort the run.
11.3 Growth
- I can identify one improvement to reporting quality.
- I documented at least one edge case discovered.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- CSV file generated with required columns.
- Summary report produced with top 10 largest files.
- Errors logged without crashing.
Full Completion:
- All minimum criteria plus:
- Deterministic ordering validated across two runs.
- CSV handles commas safely.
Excellence (Going Above & Beyond):
- Two reports can be diffed with a clean change summary.
- Report includes optional device restriction and timestamp normalization.