Project 5: The Pipeline
Build a safe, null-delimited batch pipeline that selects files by metadata, archives them, and produces a summary report.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 week |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python |
| Coolness Level | Level 5 - “pipeline architect” |
| Business Potential | High (automation at scale) |
| Prerequisites | find predicates, xargs, tar |
| Key Topics | null-delimited pipelines, batching, archiving, reporting |
1. Learning Objectives
By completing this project, you will:
- Build a null-delimited pipeline that never breaks on filenames.
- Choose between `-exec ... +` and `xargs -0` for batching.
- Archive selected files safely with `tar --null -T -`.
- Generate a deterministic report with counts and sizes.
- Validate archive integrity and handle errors.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Null-Delimited Pipelines and Safe Filename Handling
Fundamentals
Unix tools assume newline-delimited input, but filenames can contain newlines, spaces, and tabs. This makes naive pipelines unsafe. The safe solution is to use the NUL byte as a delimiter. find -print0 emits NUL-separated paths, and tools like xargs -0 or tar --null -T - consume them safely. A pipeline that uses NUL end-to-end is robust against weird filenames, which is essential for automation and cleanup tasks.
The reason this matters is not just convenience; it is correctness and safety. A pipeline that silently drops or splits filenames is a data corruption tool. Once you accept that filenames are arbitrary byte sequences, the only reliable delimiter is NUL, because it cannot appear in filenames. Designing around this fact is the hallmark of a production-safe pipeline.
Deep Dive into the concept
Filenames are arbitrary byte sequences, except for NUL and /. Newlines are allowed, which breaks line-based parsing. If you pipe find output into xargs without -0, a filename containing a newline will be treated as multiple filenames. This can cause the pipeline to skip or mis-handle files, leading to incomplete archives or, worse, accidental deletion of the wrong file. In automation, this is unacceptable.
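To see the failure concretely, here is a minimal demonstration (assuming GNU find and coreutils; the scratch path is illustrative):

```bash
# Create a scratch file whose name contains a real newline.
mkdir -p /tmp/nul-demo && cd /tmp/nul-demo
touch $'report\n2026.log' plain.log

# Line-based counting splits the newline filename into two records.
find . -type f -print | wc -l                  # prints 3 -- wrong

# NUL-based counting: one NUL byte per path, so count the NUL bytes.
find . -type f -print0 | tr -cd '\0' | wc -c   # prints 2 -- correct
```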
The safe pipeline uses NUL as the delimiter because NUL cannot appear in filenames. find -print0 is the canonical producer. Consumers include xargs -0, grep -Z, and tar --null -T -. Some tools also use -0 or --null flags. The idea is to preserve exact byte sequences for each path. This is not just about correctness; it also prevents security issues when filenames contain leading dashes or shell metacharacters. Using -- in commands further protects against path injection.
A subtlety is that not every tool supports NUL input. For those that do not, you can use -exec ... + as a safe alternative because find passes filenames directly to the command without parsing. -exec ... + batches filenames efficiently, similar to xargs, but does not require NUL delimiting. The trade-off is that you have less control over batching and may not be able to interleave other pipeline stages easily. For this project, we use find -print0 | tar --null -T - to maintain a fully safe pipeline.
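Side by side, the two styles look like this; a minimal sketch (the `*.log` filter and `ls -l` action are illustrative):

```bash
# find batches the arguments itself; no delimiter parsing ever happens.
find /var/log -type f -name '*.log' -exec ls -l -- {} +

# xargs -0 achieves the same batching, but lets you insert pipeline stages.
find /var/log -type f -name '*.log' -print0 | xargs -0 ls -l --
```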
Another subtlety is quoting and `--` usage. Even with NUL-delimited input, commands like rm or tar can misinterpret filenames that start with - as flags. The correct approach is to pass `--` to end option parsing in tools that support it. For example: `xargs -0 rm --`, or `tar --null -T -` (where `-T` is the short form of `--files-from`). The key is to ensure filenames cannot change command semantics.
Finally, your pipeline should be documented as a safety contract: every stage preserves null delimiters until the final action. If you insert a tool that does not preserve NUL (like sed), you have broken the safety chain. This should be part of your design review and testing.
There is also a common misconception that quoting fixes everything. It does not. Quoting protects against shell expansion when you invoke commands, but it does nothing to fix a broken pipeline that has already split filenames on whitespace. The problem is not just the shell; it is the delimiter. This is why tools like xargs have -0, and why tar accepts --null lists. If you cannot guarantee null-delimited flow, you cannot guarantee correctness.
In addition, consider how your pipeline handles filenames that start with - or contain --. Even with NUL delimiters, some tools interpret such names as options unless you pass -- to end option parsing. The pattern xargs -0 rm -- is not optional; it is part of safe design. This is a subtle but critical invariant that prevents option injection.
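A minimal demonstration of option injection, safe to run only in a scratch directory:

```bash
mkdir -p /tmp/inject-demo && cd /tmp/inject-demo
touch -- -rf important.txt

# The glob expands to: rm -rf important.txt
# '-rf' is parsed as flags, so important.txt is deleted (recursively, forced).
rm *            # DANGEROUS

# With '--', every expanded name is treated as a plain filename.
touch -- -rf important.txt
rm -- *
```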
Finally, null-delimited pipelines make it easier to reason about performance. They allow you to batch work safely without the overhead of spawning a process per file, and they allow you to stream data without loading everything into memory. This is essential for large filesystem operations where the file list may contain millions of entries.
How this fits into the project
This project is about building a null-delimited pipeline. Every file selection and archiving step depends on safe delimiter handling.
Definitions & key terms
- null-delimited: using NUL (`\0`) as the separator between records.
- `-print0`: `find` option to emit NUL-separated paths.
- `xargs -0`: consume NUL-separated input safely.
- option injection: filenames that look like flags and change command behavior.
Mental model diagram (ASCII)
```text
find -print0 -> [NUL paths] -> xargs -0 -> tar --null -T -
```
How it works (step-by-step)
- `find` emits NUL-delimited paths.
- `xargs -0` or `tar --null -T -` reads the exact filenames.
- Files are archived without path-parsing errors.
- A summary is generated from the same file list.
- Invariant: delimiters are NUL from selection to final action.
- Failure modes: inserting a line-based tool, or omitting `--` for option safety.
Minimal concrete example
```bash
find /var/log -type f -print0 | tar --null -T - -czf archive.tar.gz
```
Common misconceptions
- “Spaces are the only issue” -> False; newlines are worse.
- “Quotes in xargs fix it” -> False; only NUL-delimited input is safe.
- “Filename injection is theoretical” -> False; it is a real risk.
Check-your-understanding questions
- Why is NUL a safe delimiter for filenames?
- What happens if a filename contains a newline in a line-based pipeline?
- Why is `--` important when deleting files?
- When would you choose `-exec ... +` over `xargs`?
Check-your-understanding answers
- NUL cannot appear in filenames, so it uniquely separates paths.
- The filename is split into multiple records, causing wrong processing.
- It prevents filenames starting with `-` from being treated as flags.
- When the command does not support NUL input or you want simpler safety.
Real-world applications
- Safe cleanup jobs for large temp directories.
- Archiving logs without missing weird filenames.
- Security-sensitive pipelines in production.
Where you will apply it
- In this project: see §3.2 (requirements) and §5.10 (phases).
- Also used in: P06-system-janitor.md.
References
- `man find`
- `man xargs`
- GNU tar manual (null input)
Key insights
Safety in pipelines is a property you design, not a property you assume.
Summary
Null-delimited pipelines are the only reliable way to process arbitrary filenames.
Homework/Exercises to practice the concept
- Create a filename with a newline and compare `-print` vs `-print0`.
- Build a pipeline with `xargs -0` and verify it handles spaces.
- Demonstrate why `--` is needed with a filename like `-rf`.
Solutions to the homework/exercises
- `touch $'bad\nname'`, then compare the output of `find . -print` and `find . -print0 | od -c`.
- `find . -print0 | xargs -0 ls -l`.
- `touch -- -rf; rm -- -rf`.
2.2 Batching, Archiving, and Integrity Reporting
Fundamentals
Batch processing means applying an action to many files efficiently. find -exec ... + and xargs both batch arguments to reduce process overhead. Once files are selected, you can archive them with tar, preserving the exact list. Integrity is verified by comparing counts and sizes before and after the archive. A safe pipeline must also produce a report that documents what was archived and why.
The important point is that batching changes the cost model. Without batching, every file means a new process, which is expensive. With batching, you amortize that cost. But batching also increases the stakes: if your selection is wrong, you will process many files at once. This is why the selection stage must be exact before you batch and archive.
Deep Dive into the concept
Batching reduces overhead. Running one command per file is slow when thousands of files are involved. find -exec ... + groups as many files as possible into a single command invocation. xargs does the same but provides more flexibility in pipeline composition. The choice depends on the rest of the pipeline: -exec is safer and simpler, while xargs is more composable.
Archiving with tar has its own semantics. tar accepts a list of files and stores them in an archive with directory structure. When you combine tar with a file list, you must decide if you want absolute paths or relative paths. For reproducibility, relative paths are usually better. You also need to decide whether to preserve ownership and permissions (--preserve-permissions) and whether to follow symlinks (-h). For a log archive, you usually do not follow symlinks because you want to archive the links as-is.
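A minimal sketch for relative paths: run find from inside the tree, so tar stores app/error.log rather than /var/log/app/error.log (the output path is illustrative):

```bash
# The subshell keeps the directory change local to the pipeline.
( cd /var/log && find . -type f -print0 \
    | tar --null -T - -czf /tmp/logs_archive.tar.gz )
```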
Integrity checking is not optional. A pipeline that archives files should verify that the number of files in the archive matches the selection count and that the total size is within expected bounds. You can use tar -tf to list the archive contents, count them, and compare with the selection. You can also compute checksums for a subset of files. The goal is to detect missing files or pipeline failures early.
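A count-comparison sketch along these lines (variable names are illustrative; note the two caveats in the comments):

```bash
# Caveat 1: re-running find can race against concurrent changes; a real
# script should capture the selection list once and reuse it.
# Caveat 2: tar -tf output is line-based, so a filename containing a
# newline inflates the archive count.
selected=$(find /var/log -type f -size +5M -mtime -7 -print0 | tr -cd '\0' | wc -c)
archived=$(tar -tzf archive.tar.gz | wc -l)

if [ "$selected" -ne "$archived" ]; then
    echo "integrity mismatch: selected=$selected archived=$archived" >&2
    exit 2
fi
```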
Reporting is a design element. The report should include the selection criteria (size, mtime), the number of files selected, the total size, the archive name, and the largest files included. This report is useful for audits and for future debugging. It should be deterministic and include a fixed timestamp or run ID.
Finally, error handling: a pipeline can fail at any stage. If tar fails, you should not mark the run as successful. If a file disappears between selection and archiving, you should log it but continue, as long as the report reflects the partial success. Exit codes should reflect these distinctions.
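One bash-specific trap: by default a pipeline's exit status is that of its last command, so a failing find goes unnoticed. A sketch of making failures visible (`$target` and `$archive` are illustrative):

```bash
set -o pipefail   # any failing stage now fails the whole pipeline

if ! find "$target" -type f -print0 | tar --null -T - -czf "$archive"; then
    echo "archive creation failed" >&2
    exit 2
fi
```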
There is also a determinism aspect. If you want reproducible archives, you should consider whether file ordering, timestamps, and metadata are preserved. GNU tar has options like --sort=name and --mtime that can make archives deterministic, which is valuable for reproducible builds. Even if you do not implement this now, be aware that archive ordering can affect checksums and diffs.
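If you experiment with this, a GNU tar sketch might look like the following (the fixed mtime matches this project's run timestamp; filelist.nul is an illustrative NUL-delimited list; compression layers can still embed metadata, so verify determinism with checksums):

```bash
# Normalize entry order, timestamps, and ownership inside the archive.
tar --sort=name --mtime='2026-01-01 12:00:00' \
    --owner=0 --group=0 --numeric-owner \
    --null -T filelist.nul -czf archive.tar.gz
```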
Integrity reporting can go beyond counts. You can verify that the sum of file sizes in the archive matches the selection, or generate a hash manifest for the archived files. This is especially useful if the archive will be transferred or stored for long periods. Think of integrity checks as part of the pipeline output, not as an afterthought.
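A hash-manifest sketch that reuses the same NUL-delimited selection (the manifest filename is illustrative):

```bash
# One checksum line per selected file; xargs batches the arguments safely.
find /var/log -type f -size +5M -mtime -7 -print0 \
    | xargs -0 sha256sum -- > manifest.sha256

# Later, after transfer or restore:
sha256sum -c manifest.sha256
```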
Finally, the selection criteria should be recorded in the report in plain language. “size > 5M and mtime < 7 days” is more useful than a raw find command. This makes the report usable by non-shell experts and provides an audit trail that can be reviewed later.
One more advanced consideration is batch size and argument limits. xargs and -exec ... + will split batches to respect system limits on command-line length. This means your pipeline should not assume a single invocation. If you log actions, include the total counts rather than relying on per-batch outputs. Similarly, if you use tar, be aware that extremely large lists can slow down archive creation; you may need to segment archives by size or time window to keep operations manageable. These constraints are normal in production and should be acknowledged in your design.
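With GNU xargs you can inspect those limits and force smaller batches to watch the splitting happen:

```bash
# Show the command-line length limits xargs will respect on this system.
xargs -r --show-limits < /dev/null

# Force batches of at most 500 arguments; expect multiple invocations.
find . -type f -print0 | xargs -0 -n 500 ls -l -- > /dev/null
```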
How this fits into the project
This project relies on batching for performance and on archiving plus integrity reporting as the main outcome.
Definitions & key terms
- batching: grouping multiple file arguments into one command run.
- archive: a bundled file, often with metadata preserved.
- integrity check: verification that output matches expected input.
- selection criteria: rules for which files are included.
Mental model diagram (ASCII)
```text
select -> batch -> archive -> verify -> report
```
How it works (step-by-step)
- Select files with `find` predicates.
- Batch the file list into `tar` using `--null -T -`.
- Create the archive with a fixed name.
- Verify counts with `tar -tf` and compare.
- Write a summary report with totals.
- Invariant: the report is derived from the exact selection list.
- Failure modes: files disappearing mid-run, archive count mismatch, or partial tar failure.
Minimal concrete example
```bash
find /var/log -type f -size +5M -mtime -7 -print0 | tar --null -T - -czf archive.tar.gz
```
Common misconceptions
- “Tar always captures everything” -> False if input list is wrong.
- “Verification is optional” -> False in automation.
- “Absolute paths are harmless” -> They can be unsafe when extracting.
Check-your-understanding questions
- Why is batching important for performance?
- What is the risk of archiving absolute paths?
- How can you verify archive completeness?
- Why should selection criteria be written in the report?
Check-your-understanding answers
- It reduces process creation overhead.
- Extracting can overwrite files in unexpected locations.
- Compare the input count to the `tar -tf` count.
- So audits can reproduce and trust the selection.
Real-world applications
- Automated log archiving in production.
- Backup pipelines for compliance data.
- Pre-release bundling of artifacts.
Where you will apply it
- In this project: see §3.7 (golden output) and §5.10 (phases).
- Also used in: P01-digital-census.md.
References
- The Linux Command Line (Shotts), Chapter 18
- `man tar`
- Effective Shell (Kerr), Chapter 6
Key insights
A pipeline is only complete when its output is verified and reported.
Summary
Batching and integrity checks are what transform a pipeline into a reliable automation tool.
Homework/Exercises to practice the concept
- Create an archive from a list of files and verify the count matches.
- Try archiving with absolute paths and observe the extraction warning.
- Add a report line that shows total selected size.
Solutions to the homework/exercises
- `tar -tf archive.tar.gz | wc -l` vs. the selection count.
- Use `tar -tf` to inspect the stored paths.
- Use `awk '{sum+=$1} END {print sum}'` on the size list.
3. Project Specification
3.1 What You Will Build
A CLI tool that selects files by metadata (size, time), archives them safely with a null-delimited pipeline, and produces a deterministic summary report with counts and sizes.
3.2 Functional Requirements
- Selection: filter files by size and modification time.
- Safety: use null-delimited paths end-to-end.
- Archiving: create a compressed tar archive.
- Integrity: verify archive contents count.
- Reporting: produce a summary report with totals and top files.
3.3 Non-Functional Requirements
- Performance: handle thousands of files efficiently.
- Reliability: partial failures are reported, not hidden.
- Usability: clear output and log paths.
3.4 Example Usage / Output
```console
$ ./pipeline.sh /var/log --min-size 5M --days 7
```
3.5 Data Formats / Schemas / Protocols
Report format:
```text
ARCHIVE=archive_2026-01-01.tar.gz
FILES=38
TOTAL_SIZE_BYTES=812000000
LARGEST=/var/log/app/error.log (220 MB)
```
3.6 Edge Cases
- Files deleted between selection and archiving.
- Filenames with newlines and spaces.
- Permission errors on some files.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
```bash
./pipeline.sh /var/log --min-size 5M --days 7
```
3.7.2 Golden Path Demo (Deterministic)
Use a fixed fixture tree and a fixed timestamp of 2026-01-01T12:00:00 in the report header.
3.7.3 If CLI: exact terminal transcript
```console
$ ./pipeline.sh ./fixtures/logs --min-size 1M --days 7
[2026-01-01T12:00:00] TARGET=./fixtures/logs
[2026-01-01T12:00:00] CRITERIA=size>1M mtime<7d
[2026-01-01T12:00:00] ARCHIVE=archive_2026-01-01.tar.gz
[2026-01-01T12:00:00] REPORT=pipeline_report.txt
[2026-01-01T12:00:00] FILES=3
[2026-01-01T12:00:00] DONE
$ cat pipeline_report.txt
Files archived: 3
Total size: 6 MB
Largest file: ./fixtures/logs/error.log (3 MB)
```
Failure demo (no files selected):
```console
$ ./pipeline.sh ./fixtures/logs --min-size 1G --days 1
[2026-01-01T12:00:00] NO_FILES_SELECTED
EXIT_CODE=1
```
Exit codes:
- 0: success with archive
- 1: no files selected
- 2: invalid args or archive failure
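A sketch of how pipeline.sh might map the no-files case to its exit code (variable names are illustrative):

```bash
# Count selections by counting NUL bytes; avoids any line-based parsing.
count=$(find "$target" -type f -size +"$min_size" -mtime -"$days" -print0 \
          | tr -cd '\0' | wc -c)

if [ "$count" -eq 0 ]; then
    echo "[$run_ts] NO_FILES_SELECTED"
    exit 1
fi
```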
4. Solution Architecture
4.1 High-Level Design
```text
find -> null list -> tar archive -> verify -> report
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Selector | build `find` predicates | size and time filters |
| Pipeline | null-delimited flow | -print0 + tar --null |
| Archiver | create compressed archive | relative paths |
| Verifier | compare counts | tar -tf |
4.3 Data Structures (No Full Code)
```text
report: {files, total_size, largest_file}
```
4.4 Algorithm Overview
Key Algorithm: Safe Archive Pipeline
- Select files with `find` predicates.
- Pass the NUL-delimited list to `tar`.
- Create the archive and verify the count.
- Generate report from selection data.
Complexity Analysis:
- Time: O(n log n) for sorting sizes.
- Space: O(n) for report input.
5. Implementation Guide
5.1 Development Environment Setup
```bash
# Requires find, tar, sort
```
5.2 Project Structure
```text
project-root/
├── pipeline.sh
├── fixtures/
│   └── logs/
└── README.md
```
5.3 The Core Question You’re Answering
“How do I build a safe, scalable pipeline that will not break on real-world filenames?”
5.4 Concepts You Must Understand First
- Null-delimited pipelines
- Batching and archiving integrity
5.5 Questions to Guide Your Design
- What selection criteria define the batch?
- How will you verify the archive is complete?
- How will you handle missing files during archiving?
5.6 Thinking Exercise
Design a pipeline that selects files by size and age, archives them, and writes a summary of counts and total size.
5.7 The Interview Questions They’ll Ask
- “Why should you use `-print0` with `xargs`?”
- “What is the difference between `-exec ... +` and `xargs`?”
- “How do you validate a tar archive?”
5.8 Hints in Layers
Hint 1: Selection
```bash
find /var/log -type f -size +5M -mtime -7 -print0
```
Hint 2: Archive
```bash
find /var/log -type f -size +5M -mtime -7 -print0 | tar --null -T - -czf archive.tar.gz
```
Hint 3: Report
```bash
find /var/log -type f -size +5M -mtime -7 -printf '%s %p\n' | sort -rn | head -1
```
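Note that Hint 3 is line-based and will misreport a filename containing a newline; a NUL-safe variant, assuming GNU sort and head (both support `-z` / `--zero-terminated`):

```bash
find /var/log -type f -size +5M -mtime -7 -printf '%s %p\0' \
    | sort -z -rn | head -z -n 1 | tr '\0' '\n'
```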
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Pipelines | Effective Shell (Kerr) | Ch. 6 |
| Archiving | The Linux Command Line (Shotts) | Ch. 18 |
| Find basics | The Linux Command Line (Shotts) | Ch. 17 |
5.10 Implementation Phases
Phase 1: Foundation (1-2 days)
Goals:
- Build selection predicates
- Generate null-delimited list
Tasks:
- Parse flags for size and time.
- Verify `find -print0` output.
Checkpoint: list contains correct files.
Phase 2: Core Functionality (2-3 days)
Goals:
- Archive and verify
Tasks:
- Create tar archive from list.
- Verify counts with `tar -tf`.
Checkpoint: archive count matches selection.
Phase 3: Polish & Edge Cases (1-2 days)
Goals:
- Reporting and error handling
Tasks:
- Generate report with totals and largest file.
- Handle no-file and partial-failure cases.
Checkpoint: report is deterministic.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Pipeline style | xargs vs -exec | -print0 piped to tar --null -T - | safe and simple |
| Paths in archive | absolute vs relative | relative | safe extraction |
| Error policy | fail-fast vs partial | partial with log | realistic in real systems |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | flag parsing | size and time args |
| Integration Tests | full pipeline | fixture archive |
| Edge Case Tests | weird filenames | newline paths |
6.2 Critical Test Cases
- No files selected: exit code 1.
- Filename with newline: archive still correct.
- Deleted file during run: reported in log.
6.3 Test Data
```text
fixtures/logs/error.log
fixtures/logs/info.log
```
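A sketch for generating this fixture tree, plus one hostile filename for the edge-case tests (sizes are illustrative and only loosely match the golden transcript):

```bash
mkdir -p fixtures/logs
dd if=/dev/zero of=fixtures/logs/error.log bs=1M count=3 status=none
dd if=/dev/zero of=fixtures/logs/info.log  bs=1M count=2 status=none
touch fixtures/logs/$'weird\nname.log'
```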
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Using newline delimiters | missing files | switch to -print0 |
| Archiving absolute paths | unsafe extraction | use relative paths |
| No verification | missing files unnoticed | compare counts |
7.2 Debugging Strategies
- Run `tar -tf` to inspect the archive.
- Compare the selection count to the archive count.
- Use `set -x` to trace pipeline stages.
7.3 Performance Traps
Large selections can produce huge archives. Consider limiting size or time windows.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `--dry-run` mode.
- Add a `--max-files` cap.
8.2 Intermediate Extensions
- Add checksum generation for archived files.
- Support incremental archives.
8.3 Advanced Extensions
- Support parallel compression.
- Create a retention policy for old archives.
9. Real-World Connections
9.1 Industry Applications
- Log archiving for compliance.
- Batch processing in data pipelines.
- Safe cleanup and backup jobs.
9.2 Related Open Source Projects
- `rsync`: incremental file transfers.
- `borgbackup`: deduplicating backups.
9.3 Interview Relevance
- Safe pipelines and filename handling.
- Archiving and integrity checking.
- Performance trade-offs in batch jobs.
10. Resources
10.1 Essential Reading
- Effective Shell (Kerr), Chapter 6
- The Linux Command Line (Shotts), Chapter 18
10.2 Video Resources
- “Unix Pipelines at Scale” (conference talk)
- “Tar and Archiving” (YouTube)
10.3 Tools & Documentation
- `man find`
- `man tar`
10.4 Related Projects in This Series
- P01-digital-census.md - metadata inventory
- P06-system-janitor.md - cleanup automation
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why null delimiters are required.
- I can explain the difference between `xargs` and `-exec`.
- I can explain how to verify a tar archive.
11.2 Implementation
- Pipeline is null-delimited end-to-end.
- Archive is created and verified.
- Report includes counts and sizes.
11.3 Growth
- I documented one edge case found during testing.
- I can explain this project in an interview.
- I can propose a performance optimization.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Safe selection pipeline and archive creation.
- Summary report with counts and sizes.
Full Completion:
- Integrity verification implemented.
- Deterministic output and fixed report header.
Excellence (Going Above & Beyond):
- Checksums for archived files.
- Incremental archive support.