Project 5: Find and Organize Photo/Video Files
Build a safe media organizer that discovers photos and videos, extracts creation dates, and moves files into a clean
YYYY/MM folder structure without losing data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 12 to 18 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | Level 2: Micro-SaaS / Pro Tool |
| Prerequisites | Shell scripting, file permissions, basic metadata concepts |
| Key Topics | Safe file discovery, EXIF metadata, idempotent moves, conflict handling |
1. Learning Objectives
By completing this project, you will:
- Safely traverse large directory trees with arbitrary filenames.
- Extract creation dates from EXIF metadata and fall back to filesystem times.
- Design a deterministic folder layout and avoid overwriting files.
- Implement dry-run and logging for safety.
- Build a tool that can be trusted with real data.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: Safe File Discovery and Filename Handling
Fundamentals
File discovery is deceptively hard. Filenames can contain spaces, tabs, quotes, and even newlines. A script that uses for f in $(find ...) will break in real-world photo libraries. The safe approach is to use find -print0 and consume the results with a NUL-aware loop or xargs -0. This guarantees that each filename is treated as a single unit, regardless of its contents. A media organizer must be robust against all filenames because user photo libraries are messy and unpredictable.
Deep Dive into the Concept
Unix tools traditionally separate records with newlines, which is fine for simple text but dangerous for filenames. A filename can legally contain a newline, which means that a newline-separated list is ambiguous. find -print0 emits a NUL character after each filename, and xargs -0 or while IFS= read -r -d '' can read them safely. NUL is the one byte that can never appear in a path, which is why this is the gold standard for filename-safe scripting. The downside is that you must use tools that support NUL delimiters, but for this project that is acceptable because find, xargs, and Bash's read builtin all support it.
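The ambiguity is easy to reproduce. The following is a minimal sketch, assuming Bash and a find that supports -print0; collect_files is a hypothetical helper name used only for illustration. It creates one file with a space and one with an embedded newline, then compares the NUL-delimited view with the newline-delimited view.

```bash
#!/usr/bin/env bash
# Sketch: collect paths from find without splitting on whitespace or newlines.
# collect_files is a hypothetical helper name, not part of any tool.

collect_files() {
    local root=$1 f
    files=()                      # global array filled by this function
    while IFS= read -r -d '' f; do
        files+=("$f")
    done < <(find "$root" -type f -print0)
}

# Demo in a throwaway directory: one name has a space, one has a newline.
demo=$(mktemp -d)
touch "$demo/a b.jpg" "$demo/$(printf 'line1\nline2').jpg"

collect_files "$demo"
nul_count=${#files[@]}                                   # NUL-separated view
newline_count=$(( $(find "$demo" -type f | wc -l) ))     # newline-separated view

printf 'NUL-delimited entries: %d\n' "$nul_count"
printf 'newline-delimited lines: %d\n' "$newline_count"
rm -rf -- "$demo"
```

The NUL-delimited loop sees two entries, while the newline-based count reports three lines for the same two files.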
Another issue is quoting. When you pass filenames to commands like mv or cp, you must quote the variable: mv -- "$src" "$dest". The -- signals the end of options, which is important if a filename begins with -. Without --, a filename like -rf could be misinterpreted as an option, leading to destructive behavior. This is a crucial safety pattern in file manipulation scripts.
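As a small illustration of the quoting and end-of-options pattern, here is a sketch in which safe_move is a hypothetical wrapper name. It moves a file whose name looks like an option, working inside a temporary directory so nothing real is touched.

```bash
#!/usr/bin/env bash
# Sketch: move a file whose name starts with a dash.
# safe_move is a hypothetical wrapper name.

safe_move() {
    local src=$1 dest=$2
    # "--" ends option parsing, so "$src" is always treated as a path.
    mv -- "$src" "$dest"
}

demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    touch -- '-n.jpg'            # a legal filename that looks like an option
    safe_move '-n.jpg' 'renamed.jpg'
    ls
)
rm -rf -- "$demo"
```

Without the -- marker, mv would try to parse -n.jpg as a cluster of option letters instead of a source path.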
Directory traversal itself can be large. find can traverse millions of files; it will cross filesystem boundaries by default and can descend into mounted volumes you did not intend to include. Use -xdev if you want to stay on the same filesystem. You can also restrict depth with -maxdepth and filter by extensions with -iname. For media organization, you should explicitly list allowed extensions to avoid moving unrelated files. This is a safety boundary.
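One way to keep the extension allow-list in a single place is to build the find expression from an array. This is a sketch under the assumption of Bash arrays; find_media and MEDIA_EXTS are hypothetical names, and the extension list mirrors the one used elsewhere in this project.

```bash
#!/usr/bin/env bash
# Sketch: build the find expression from an allow-list of extensions.
# find_media and MEDIA_EXTS are hypothetical names.

MEDIA_EXTS=(jpg png mov mp4)

find_media() {
    local root=$1 ext
    local -a expr=()
    for ext in "${MEDIA_EXTS[@]}"; do
        # Join the tests with -o; skip the -o before the first one.
        if [ "${#expr[@]}" -gt 0 ]; then
            expr+=(-o)
        fi
        expr+=(-iname "*.${ext}")
    done
    # -xdev: stay on the starting filesystem.
    # No -L is given, so symlinks are not followed.
    find "$root" -xdev -type f \( "${expr[@]}" \) -print0
}
```

The output is NUL-delimited, so it can be piped straight into a while IFS= read -r -d '' loop or xargs -0.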
You must also decide whether to follow symlinks. find does not follow symlinks by default, which is safer; if you follow symlinks, you can end up with duplicate files or infinite loops. For this project, you should not follow symlinks. Document this choice. If you want to support symlinks, you must add cycle detection and clear warnings.
Finally, consider logging and dry runs. A script that moves files should provide a --dry-run mode that prints what it would do without changing anything. This is essential for trust. The output should include source and destination paths, and it should be deterministic so you can compare runs. Dry-run mode is also a test mechanism; it allows you to validate your logic before touching real data.
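A common way to implement dry-run is to route every mutating command through one wrapper. The sketch below assumes Bash (for printf %q); DRY_RUN and run are hypothetical names chosen for this example.

```bash
#!/usr/bin/env bash
# Sketch: route every mutating command through one wrapper.
# DRY_RUN and run are hypothetical names.

DRY_RUN=${DRY_RUN:-0}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        # %q prints each argument shell-quoted, so odd names stay readable.
        printf 'DRY-RUN:'
        printf ' %q' "$@"
        printf '\n'
    else
        "$@"
    fi
}
```

With this in place, run mv -- "$src" "$dest" either performs the move or prints the exact command it would have executed, which keeps dry-run output and real behavior from drifting apart.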
How This Fits in Projects
This concept informs §3.2 (Functional Requirements) and §5.8 (Hints in Layers). It is directly applied in §5.2 (Project Structure) and §6.2 (Critical Test Cases). Safe file handling is also critical in P03 Automated Backup Script.
Definitions & Key Terms
- NUL delimiter: A zero byte used to separate filenames safely.
- -print0: find option that outputs NUL-separated names.
- --: End-of-options marker for commands.
- Traversal: Recursively walking a directory tree.
- Dry run: Execute logic without making changes.
Mental Model Diagram (ASCII)
find -print0 -> NUL-safe loop -> quoted file ops
How It Works (Step-by-Step)
- Use find with explicit extensions.
- Emit filenames with -print0.
- Read with read -r -d '' or xargs -0.
- For each file, quote paths and use -- in commands.
- Log actions and optionally dry-run.
Invariants: never split filenames on whitespace; always quote paths. Failure modes: accidental option injection, broken filenames, crossing unwanted filesystems.
Minimal Concrete Example
find "$src" -type f -print0 | while IFS= read -r -d '' f; do
    printf 'Found: %s\n' "$f"
done
Common Misconceptions
- “Filenames never contain spaces.” -> They often do.
- “xargs is always safe.” -> Not without -0.
- “mv will treat filenames safely.” -> Only if you use -- and quoting.
Check-Your-Understanding Questions
- Why is find ... | xargs unsafe without -0?
- What does -- protect against in mv?
- Why might you use -xdev in find?
Check-Your-Understanding Answers
- Because filenames with spaces or newlines will be split incorrectly.
- It prevents filenames starting with - from being treated as options.
- To avoid crossing filesystem boundaries and moving unintended files.
Real-World Applications
- Photo library organization
- Large file cleanup scripts
- Backup and migration tooling
Where You’ll Apply It
- Project 5: §3.2, §5.2, §6.2
- Also used in: P03 Automated Backup Script
References
- Effective Shell (Kerr), Ch. 8
- The Linux Command Line (Shotts), Ch. 17
Key Insight
Filename safety is the difference between a reliable script and a data-loss event.
Summary
Safe file discovery uses NUL delimiters and careful quoting to avoid surprises.
Homework/Exercises to Practice the Concept
- Create files with spaces and verify your loop handles them.
- Create a file named -test.jpg and ensure mv -- handles it safely.
- Use find -xdev to restrict traversal to one filesystem.
Solutions to the Homework/Exercises
- touch "a b.jpg"; find . -print0 | while IFS= read -r -d '' f; do echo "$f"; done
- touch -- -test.jpg; mv -- -test.jpg safe.jpg
- find / -xdev -maxdepth 1 | head
Concept 2: Metadata-Driven Organization and Idempotent Moves
Fundamentals
Media files often contain metadata that records when they were created. EXIF metadata for photos (and sometimes videos) includes tags like CreateDate. A media organizer should use this metadata to place files into a YYYY/MM directory structure. But metadata can be missing or incorrect, so the script must define a fallback strategy, such as using filesystem modification time. Idempotence is critical: running the script twice should not move files again or overwrite existing files. This concept teaches you how to extract metadata safely, handle conflicts, and design a deterministic organization scheme.
Deep Dive into the Concept
EXIF metadata is embedded in image files and contains camera settings, timestamps, and more. exiftool is the most reliable CLI tool for extracting this data. It can output specific tags, and you can format dates with -d. For example, exiftool -d "%Y/%m" -CreateDate -s -s -s file.jpg prints the creation year and month if available. But not all files contain EXIF data; scans and edited images often lack CreateDate. Videos may use different tags. A robust organizer should attempt multiple tags or fall back to filesystem times. This fallback must be documented so the user knows what to expect.
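Trying several tags in order could look like the sketch below. exif_year_month is a hypothetical helper name, and the tag order (DateTimeOriginal, then CreateDate, then MediaCreateDate) is an assumption for illustration rather than a recommendation from exiftool itself. The function expects paths as produced by find, which are prefixed with the search root and therefore do not start with a dash.

```bash
#!/usr/bin/env bash
# Sketch: try several date tags in order and return the first non-empty value.
# exif_year_month is a hypothetical helper; the tag order is an assumption.

exif_year_month() {
    local file=$1 tag value
    for tag in DateTimeOriginal CreateDate MediaCreateDate; do
        # -s -s -s prints the value only; -d formats date/time tags.
        # Warnings go to stderr and are discarded here.
        value=$(exiftool -d '%Y/%m' "-$tag" -s -s -s "$file" 2>/dev/null)
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1    # no date tag found; the caller decides on a fallback
}
```

A caller might write ym=$(exif_year_month "$file") || ym='' and then apply its documented fallback.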
Date parsing must be deterministic. Use a fixed format like YYYY/MM, and avoid locale-dependent formats. If you use filesystem mtime, be aware that copying files can change it, so it may not reflect the true creation date. The script should therefore prefer EXIF when present and only fall back to mtime if missing. If both are missing or invalid, the script should skip the file and log it. This prevents silent misplacement.
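The preference order and the "skip if invalid" rule can be sketched as follows. The helper names are hypothetical, the accepted year range (19xx or 20xx) is an assumption made for this example, and the mtime lookup assumes GNU date, where -r FILE reads the file's modification time.

```bash
#!/usr/bin/env bash
# Sketch: accept a date only if it looks like YYYY/MM, else try mtime, else give up.
# is_year_month, mtime_year_month, and resolve_year_month are hypothetical helpers.

is_year_month() {
    # Assumption: a four-digit year starting with 19 or 20, month 01 to 12.
    [[ $1 =~ ^(19|20)[0-9]{2}/(0[1-9]|1[0-2])$ ]]
}

mtime_year_month() {
    # Assumption: GNU date, where -r FILE means "use the file's mtime".
    # BSD/macOS date interprets -r differently, so adapt this there.
    date -r "$1" +'%Y/%m' 2>/dev/null
}

resolve_year_month() {
    local file=$1 candidate=$2      # candidate: value read from metadata, may be empty
    if is_year_month "$candidate"; then
        printf '%s\n' "$candidate"
    elif candidate=$(mtime_year_month "$file") && is_year_month "$candidate"; then
        printf '%s\n' "$candidate"
    else
        return 1                    # caller should skip the file and log it
    fi
}
```

Validating the metadata value before trusting it guards against placeholder dates such as 0000/00 ending up as directory names.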
Conflict handling is another critical topic. If two files have the same name and destination directory, a naive mv will overwrite one of them. That is unacceptable. You must define a conflict policy: skip duplicates, add a numeric suffix, or store duplicates in a separate folder. For this project, a safe approach is to use mv -n (no clobber) and log conflicts, or to append a suffix like _1, _2. Because mv -n can skip a file without reporting it, check for the destination explicitly if you need to log the conflict. The important part is that the policy is deterministic and documented. For example, you can compute a hash of the file contents and use it in the filename to avoid collisions, but that is an advanced extension.
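The suffix policy can be sketched as a function that returns the first free name. unique_dest is a hypothetical helper name, and the exact suffix format (_1, _2 inserted before the extension) is one possible choice among several.

```bash
#!/usr/bin/env bash
# Sketch: pick a destination name that does not exist yet by appending _1, _2, ...
# unique_dest is a hypothetical helper name.

unique_dest() {
    local dir=$1 name=$2
    local stem=$name ext='' candidate n=1
    # Split "IMG.jpg" into stem "IMG" and extension ".jpg"; leave dotfiles
    # and extensionless names untouched.
    if [[ $name == ?*.* ]]; then
        stem=${name%.*}
        ext=.${name##*.}
    fi
    candidate=$dir/$name
    # -e is false for dangling symlinks, so test -L as well.
    while [ -e "$candidate" ] || [ -L "$candidate" ]; do
        candidate=$dir/${stem}_${n}${ext}
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}
```

There is a small window between the existence check and the actual move, so pairing this with mv -n keeps the no-overwrite guarantee even if another process creates the same name in between.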
Idempotence is the property that running the script twice yields the same result. If you move files into a target structure, the second run should see that the files are already in place and either skip them or detect that the source is empty. This is easiest if you only operate on the source directory and never on the destination, and if you use a dry-run to confirm intended actions. Logging is part of idempotence: your script should record what it moved and what it skipped, so you can audit results. A manifest file listing source and destination paths is a practical safety tool.
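When the destination lives inside the source tree (for example ~/Pictures/organized under ~/Pictures), discovery has to exclude it, or a second run will pick up the files that were just organized. A minimal sketch, with find_unorganized as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch: skip the destination tree during discovery so a second run
# does not rediscover files that were already organized.
# find_unorganized is a hypothetical helper name.

find_unorganized() {
    local src=$1 dest=${2%/}    # drop one trailing slash so -path can match
    # -prune stops find from descending into "$dest"; every other regular
    # file is printed NUL-terminated.
    # Caveats: "$dest" must be spelled with the same prefix as "$src"
    # (both absolute, or both relative), and -path treats it as a glob
    # pattern, so characters like * or [ in the path need escaping.
    find "$src" -path "$dest" -prune -o -type f -print0
}
```

The explicit -print0 after -o matters: it applies only to the non-pruned branch, so the destination directory itself is not printed.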
Finally, consider directory creation. Use mkdir -p to create the destination YYYY/MM path. This is safe if the path already exists. Use umask to control permissions, and avoid running as root unless necessary. When moving files, use mv -- with quoted variables. If you choose to copy instead of move, you must decide how to handle cleanup and verification. For this project, moving is acceptable if you trust your script and have backups; copying is safer if you are cautious. Provide a --copy flag as an optional extension if you want a safety-first approach.
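If you go the copy route, verification can be as simple as a byte-for-byte comparison before the source is considered safe to delete. The sketch below uses copy_verified as a hypothetical helper name, and its return codes (1 for failure, 2 for an existing target) are choices made for this example only.

```bash
#!/usr/bin/env bash
# Sketch of a copy-then-verify transfer. copy_verified is a hypothetical helper.

copy_verified() {
    local src=$1 dest_dir=$2
    local target=$dest_dir/${src##*/}
    mkdir -p -- "$dest_dir" || return 1
    # Refuse to touch an existing target instead of relying on cp flags.
    if [ -e "$target" ] || [ -L "$target" ]; then
        return 2
    fi
    # -p preserves mode and timestamps so mtime stays meaningful.
    cp -p -- "$src" "$target" || return 1
    # cmp -s is silent and succeeds only if both files are byte-identical.
    cmp -s -- "$src" "$target" || return 1
    # Deleting the source is left to the caller, after this returns 0.
}
```

Keeping the delete step outside the function makes it easy to offer a --copy mode that never removes anything.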
How This Fits in Projects
This concept defines the core functionality in §3.2 (Functional Requirements) and the real-world demos in §3.7. It also informs conflict handling in §7.1 and extensions in §8. The same conflict-avoidance patterns apply to P03 Automated Backup Script.
Definitions & Key Terms
- EXIF: Metadata standard for photos and some videos.
- CreateDate: EXIF tag for capture time.
- Idempotence: Running the script multiple times yields the same result.
- Conflict: Two files mapping to the same destination name.
- Manifest: A log of source and destination paths.
Mental Model Diagram (ASCII)
file -> metadata date -> YYYY/MM -> conflict policy -> move/copy
How It Works (Step-by-Step)
- For each file, attempt to read CreateDate with exiftool.
- If missing, fall back to filesystem mtime.
- Build destination directory YYYY/MM.
- Resolve naming conflicts deterministically.
- Move or copy file and log action.
Invariants: EXIF preferred over mtime; no overwrites. Failure modes: missing metadata, duplicate filenames, incorrect date formats.
Minimal Concrete Example
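# Prefer the embedded CreateDate; the result is empty if the tag is absent.
ym=$(exiftool -d "%Y/%m" -CreateDate -s -s -s "$file" 2>/dev/null)
# Fall back to filesystem mtime (GNU date syntax).
[ -z "$ym" ] && ym=$(date -r "$file" +"%Y/%m")
mkdir -p -- "$dest/$ym"
# Copy without clobbering an existing file of the same name.
cp -n -- "$file" "$dest/$ym/"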
Common Misconceptions
- “All media files have EXIF.” -> Many do not.
- “mtime equals capture date.” -> It can change when files are copied or edited.
- “mv is safe by default.” -> It overwrites unless you prevent it.
Check-Your-Understanding Questions
- Why should EXIF be preferred over mtime?
- What is a safe conflict policy for duplicate filenames?
- How do you make the script idempotent?
Check-Your-Understanding Answers
- EXIF reflects capture time; mtime can change when files are copied.
- Use mv -n or append a suffix and log the conflict.
- Skip files already moved and avoid overwriting.
Real-World Applications
- Photo library cleanup after a trip
- Migrating files from multiple devices
- Building a searchable, chronological archive
Where You’ll Apply It
- Project 5: §3.2, §3.7, §7.1
- Also used in: P03 Automated Backup Script
References
- Effective Shell (Kerr), Ch. 8-9
- The Linux Command Line (Shotts), Ch. 17
Key Insight
A safe organizer is defined by its conflict policy and metadata strategy.
Summary
Metadata-driven organization requires deterministic rules and careful file operations.
Homework/Exercises to Practice the Concept
- Use exiftool to read CreateDate from two photos.
- Create two files with the same name in different folders and test your conflict policy.
- Run your script twice in dry-run mode and compare outputs.
Solutions to the Homework/Exercises
- exiftool -CreateDate -s -s -s photo.jpg
- mkdir -p a b; touch a/img.jpg b/img.jpg; simulate moves with suffixes.
- ./organize_media.sh --dry-run src dest > run1.txt; ./organize_media.sh --dry-run src dest > run2.txt; diff run1.txt run2.txt
3. Project Specification
3.1 What You Will Build
You will build organize_media.sh, a script that discovers media files, extracts their creation date, and moves them into a YYYY/MM directory tree under a destination folder. It will handle filenames safely, avoid overwriting files, log skipped items, and support --dry-run. It will not perform image conversion, duplicate detection via hashing, or advanced metadata repair.
3.2 Functional Requirements
- Discovery: Search for media files by extension (.jpg, .png, .mov, .mp4).
- Metadata: Extract CreateDate via exiftool; fallback to mtime if missing.
- Destination: Create YYYY/MM directories under the destination root.
- Safety: Never overwrite existing files; use a conflict policy.
- Dry run: Print planned moves without making changes.
- Logging: Record moved and skipped files.
- Exit codes: 0 success, 1 usage error, 2 missing dependency, 3 runtime error.
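The exit-code contract can be sketched with two small helpers. require_cmd, parse_args, and the EX_* constants are hypothetical names; the argument grammar (two positional paths plus an optional --dry-run in any position) is an assumption based on the usage shown in this guide. Note that this simple parser rejects a source path that itself starts with a dash.

```bash
#!/usr/bin/env bash
# Sketch: map failure classes to the exit codes listed above.
# require_cmd, parse_args, and the EX_* names are hypothetical.

readonly EX_OK=0 EX_USAGE=1 EX_DEPENDENCY=2 EX_RUNTIME=3

require_cmd() {
    # command -v succeeds if the name resolves to something runnable.
    if ! command -v "$1" >/dev/null 2>&1; then
        printf 'ERROR: %s not found. Please install it first.\n' "$1" >&2
        return "$EX_DEPENDENCY"
    fi
}

parse_args() {
    DRY_RUN=0 SRC='' DEST=''
    local arg
    for arg in "$@"; do
        case $arg in
            --dry-run) DRY_RUN=1 ;;
            -*) printf 'Unknown option: %s\n' "$arg" >&2
                return "$EX_USAGE" ;;
            *)  if [ -z "$SRC" ]; then
                    SRC=$arg
                elif [ -z "$DEST" ]; then
                    DEST=$arg
                else
                    return "$EX_USAGE"      # too many positional arguments
                fi ;;
        esac
    done
    [ -n "$SRC" ] && [ -n "$DEST" ] || return "$EX_USAGE"
}
```

A main script could then start with parse_args "$@" || exit $? followed by require_cmd exiftool || exit $?, reserving EX_RUNTIME for failures during the actual moves.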
3.3 Non-Functional Requirements
- Reliability: Safe for large libraries with spaces and special characters.
- Usability: Clear output and summary counts.
- Performance: Streaming file discovery; avoid loading lists into memory.
3.4 Example Usage / Output
$ ./organize_media.sh ~/Pictures ~/Pictures/organized
Moved: IMG_1234.jpg -> 2025/12/IMG_1234.jpg
Moved: trip.mov -> 2024/06/trip.mov
Skipped: oldscan.png (no metadata)
Summary: moved=2 skipped=1 conflicts=0
3.5 Data Formats / Schemas / Protocols
Directory layout:
organized/
├── 2024/
│   └── 06/
│       └── trip.mov
└── 2025/
    └── 12/
        └── IMG_1234.jpg
Log format:
[ISO8601] ACTION src -> dest
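A logger that produces this format could be sketched as follows. log_timestamp and log_action are hypothetical helper names. Honoring SOURCE_DATE_EPOCH is what makes timestamps reproducible in tests, and the -d "@SECONDS" form assumes GNU date.

```bash
#!/usr/bin/env bash
# Sketch: emit "[ISO8601] ACTION src -> dest" lines.
# log_timestamp and log_action are hypothetical helper names.

log_timestamp() {
    if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
        # Assumption: GNU date, which accepts "@SECONDS" with -d.
        date -u -d "@${SOURCE_DATE_EPOCH}" +'%Y-%m-%dT%H:%M:%SZ'
    else
        date -u +'%Y-%m-%dT%H:%M:%SZ'
    fi
}

log_action() {
    local action=$1 src=$2 dest=${3:-}
    if [ -n "$dest" ]; then
        printf '[%s] %s %s -> %s\n' "$(log_timestamp)" "$action" "$src" "$dest"
    else
        # Actions such as SKIPPED have no destination.
        printf '[%s] %s %s\n' "$(log_timestamp)" "$action" "$src"
    fi
}
```

For example, with SOURCE_DATE_EPOCH=1767225600 set, log_action MOVED a.jpg 2025/12/a.jpg prints a line stamped 2026-01-01T00:00:00Z.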
3.6 Edge Cases
- File has no EXIF and no readable mtime.
- Two files map to the same destination name.
- Destination path not writable.
- Filenames containing spaces or leading dashes.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
cd /path/to/organizer
chmod +x organize_media.sh
TZ=UTC SOURCE_DATE_EPOCH=1767225600 ./organize_media.sh ~/Pictures ~/Pictures/organized --dry-run
3.7.2 Golden Path Demo (Deterministic)
Use TZ=UTC and SOURCE_DATE_EPOCH to make log timestamps deterministic for tests.
3.7.3 If CLI: Exact Terminal Transcript
$ ./organize_media.sh ~/Pictures ~/Pictures/organized
Moved: IMG_1234.jpg -> 2025/12/IMG_1234.jpg
Moved: trip.mov -> 2024/06/trip.mov
Skipped: oldscan.png (no metadata)
Summary: moved=2 skipped=1 conflicts=0
Exit code: 0
3.7.4 Failure Demo (Missing Dependency)
$ ./organize_media.sh ~/Pictures ~/Pictures/organized
ERROR: exiftool not found. Please install it first.
Exit code: 2
4. Solution Architecture
4.1 High-Level Design
find -> safe loop -> metadata -> dest path -> conflict check -> move -> log
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Finder | Discover media files | Use find -print0 |
| Metadata Reader | Extract date | EXIF first, mtime fallback |
| Organizer | Build dest path | YYYY/MM structure |
| Conflict Resolver | Prevent overwrites | mv -n or suffix |
| Logger | Record actions | ISO 8601 timestamps |
4.3 Data Structures (No Full Code)
moved=0
skipped=0
conflicts=0
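These counters only survive if the loop that updates them runs in the current shell. With find ... | while ..., Bash normally runs the loop in a subshell and the increments are lost when it exits. The sketch below reads from process substitution instead; count_media is a hypothetical name, the "move" is a stand-in that only counts, and ${f,,} assumes Bash 4 or later.

```bash
#!/usr/bin/env bash
# Sketch: keep counters alive across the discovery loop.
# count_media is a hypothetical helper name.

count_media() {
    local root=$1 f
    moved=0 skipped=0 conflicts=0
    # Reading from <(...) keeps the loop in the current shell, so the
    # counters still hold their values after "done".
    while IFS= read -r -d '' f; do
        case ${f,,} in                      # lowercase for matching (Bash 4+)
            *.jpg|*.png|*.mov|*.mp4) moved=$((moved + 1)) ;;   # stand-in for a real move
            *)                       skipped=$((skipped + 1)) ;;
        esac
    done < <(find "$root" -type f -print0)
    printf 'Summary: moved=%d skipped=%d conflicts=%d\n' "$moved" "$skipped" "$conflicts"
}
```

The summary line uses the same shape as the example output in §3.4.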
4.4 Algorithm Overview
Key Algorithm: Per-File Organization
- Read file path safely.
- Extract metadata date.
- Determine destination directory.
- Check for conflicts.
- Move file or skip with log.
Complexity Analysis:
- Time: O(n) for n files.
- Space: O(1) streaming processing.
5. Implementation Guide
5.1 Development Environment Setup
which find exiftool
5.2 Project Structure
organize-media/
├── organize_media.sh
├── logs/
└── tests/
    └── fixtures/
5.3 The Core Question You’re Answering
“How can I safely refactor a chaotic directory into a clean structure without losing data?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Safe file discovery and quoting (Concept 1)
- Metadata extraction and conflict handling (Concept 2)
5.5 Questions to Guide Your Design
- Which extensions will you include?
- What is your fallback when EXIF is missing?
- How will you handle duplicates?
- Should you copy first or move directly?
5.6 Thinking Exercise
If a file has EXIF date 2012 but its filename suggests 2019, which do you trust and why? What is the least risky choice?
5.7 The Interview Questions They’ll Ask
- “Why use -print0 with find?”
- “How do you prevent overwriting files?”
- “What is your fallback when metadata is missing?”
- “How do you make the script idempotent?”
5.8 Hints in Layers
Hint 1: Safe discovery
find "$src" -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.mov" -o -iname "*.mp4" \) -print0
Hint 2: Metadata date
date=$(exiftool -d "%Y/%m" -CreateDate -s -s -s "$file")
Hint 3: Conflict handling
if [ -e "$dest_file" ]; then
    conflicts=$((conflicts+1))
    echo "Conflict: $dest_file" >&2
    continue
fi
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Finding files | The Linux Command Line | Ch. 17 |
| Shell scripting safety | The Linux Command Line | Ch. 24 |
| Robust scripts | Effective Shell | Ch. 6-9 |
5.10 Implementation Phases
Phase 1: Foundation (3 hours)
Goals:
- Discover files safely
- Print planned destination paths
Tasks:
- Implement find -print0 loop.
- Print src -> dest mappings.
Checkpoint: Output shows correct paths for sample files.
Phase 2: Core Functionality (4 hours)
Goals:
- Extract metadata and move files
Tasks:
- Use exiftool to read dates.
- Create destination directories.
- Move files with conflict checks.
Checkpoint: Files moved to correct YYYY/MM directories.
Phase 3: Polish & Edge Cases (3 hours)
Goals:
- Dry run and logging
- Handle missing metadata
Tasks:
- Add --dry-run option.
- Implement fallback to mtime.
- Log skipped files.
Checkpoint: Dry-run produces no changes and logs are correct.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Metadata source | EXIF vs mtime | EXIF then mtime | Most accurate when present |
| Conflict policy | overwrite, skip, suffix | skip + log | safest default |
| Move vs copy | move, copy | move | avoids duplicates by default |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Metadata parsing | EXIF date extraction |
| Integration Tests | Full run on sample tree | moved/skipped counts |
| Edge Case Tests | Files with spaces, duplicates | safe handling |
6.2 Critical Test Cases
- File with EXIF: moved to correct YYYY/MM.
- File without EXIF: fallback to mtime or skipped.
- Duplicate filename: conflict logged, no overwrite.
- Spaces in filename: handled without error.
6.3 Test Data
Pictures/
  "IMG 1234.jpg"
  "-weird.jpg"
  oldscan.png
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Using plain xargs | Files with spaces break | Use -print0 and xargs -0 |
| Overwriting files | Files with duplicate names silently disappear | Use mv -n or a suffix policy |
| Missing metadata | Files skipped unexpectedly | Add mtime fallback |
7.2 Debugging Strategies
- Log every decision with a clear reason (moved, skipped, conflict).
- Run with --dry-run before actual moves.
7.3 Performance Traps
- Running exiftool on huge libraries is slow; consider batching or caching as an extension.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add --copy mode.
- Add support for .heic files.
8.2 Intermediate Extensions
- Add hash-based duplicate detection.
- Produce a manifest CSV of moves.
8.3 Advanced Extensions
- Add parallel metadata extraction.
- Support video metadata tags beyond CreateDate.
9. Real-World Connections
9.1 Industry Applications
- Digital asset management: organizing large media libraries.
- Forensics: reconstructing timelines from metadata.
9.2 Related Open Source Projects
- ExifTool: industry standard metadata extractor.
- PhotoPrism: photo organization and management system.
9.3 Interview Relevance
- File safety: robust handling of real-world filenames.
- Automation: building idempotent, safe workflows.
10. Resources
10.1 Essential Reading
- The Linux Command Line by William E. Shotts - Ch. 17, 24
- Effective Shell by Dave Kerr - Ch. 8-9
10.2 Video Resources
- “EXIF Metadata Basics” (any digital photography course)
10.3 Tools & Documentation
- exiftool: man exiftool
- find: man find
10.4 Related Projects in This Series
- Project 3: Automated Backup Script: safe file handling and logging
- Project 2: Log File Analyzer: deterministic parsing and reporting
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why -print0 is necessary.
- I can describe my conflict policy.
- I can explain why EXIF is preferred over mtime.
11.2 Implementation
- Script moves files into YYYY/MM directories.
- No overwrites occur.
- Dry-run mode works and produces identical planned output on repeated runs.
11.3 Growth
- I can run the script on a real library confidently.
- I can explain the safety features to someone else.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Files are organized into YYYY/MM directories.
- Filenames with spaces are handled safely.
- Conflicts are detected and not overwritten.
Full Completion:
- Dry-run mode and logging implemented.
- Fallback for missing metadata works.
Excellence (Going Above & Beyond):
- Hash-based duplicate detection and manifest export.
- Parallel processing with stable output ordering.