Project 5: Find and Organize Photo/Video Files

Build a safe media organizer that discovers photos and videos, extracts creation dates, and moves files into a clean YYYY/MM folder structure without losing data.

Quick Reference

Attribute                         | Value
Difficulty                        | Level 3: Advanced
Time Estimate                     | 12 to 18 hours
Main Programming Language         | Bash
Alternative Programming Languages | Python, Go
Coolness Level                    | Level 3: Genuinely Clever
Business Potential                | Level 2: Micro-SaaS / Pro Tool
Prerequisites                     | Shell scripting, file permissions, basic metadata concepts
Key Topics                        | Safe file discovery, EXIF metadata, idempotent moves, conflict handling

1. Learning Objectives

By completing this project, you will:

  1. Safely traverse large directory trees with arbitrary filenames.
  2. Extract creation dates from EXIF metadata and fall back to filesystem times.
  3. Design a deterministic folder layout and avoid overwriting files.
  4. Implement dry-run and logging for safety.
  5. Build a tool that can be trusted with real data.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Safe File Discovery and Filename Handling

Fundamentals

File discovery is deceptively hard. Filenames can contain spaces, tabs, quotes, and even newlines. A script that uses for f in $(find ...) will break in real-world photo libraries. The safe approach is to use find -print0 and consume the results with a NUL-aware loop or xargs -0. This guarantees that each filename is treated as a single unit, regardless of its contents. A media organizer must be robust against all filenames because user photo libraries are messy and unpredictable.

Deep Dive into the Concept

Unix tools traditionally separate records with newlines, which is fine for simple text but dangerous for filenames. A filename can legally contain a newline, which means that a newline-separated list is ambiguous. find -print0 emits a NUL character after each filename, and xargs -0 or while IFS= read -r -d '' can read them safely. This is the gold standard for filename-safe scripting. The downside is that you must use tools that support NUL delimiters, but for this project that is acceptable because find, xargs, and Bash's read builtin all support it (read -d '' is a Bash feature, not POSIX sh).
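
To make the ambiguity concrete, the following sketch counts the files in a directory two ways. The function names count_by_newline and count_by_nul are hypothetical and exist only for this illustration.

```shell
# Count by splitting find's default output on newlines (unsafe):
# a newline inside a filename is counted as a record boundary.
count_by_newline() {
  find "$1" -type f | wc -l
}

# Count the NUL terminators that -print0 emits, one per file (safe):
# NUL is the one byte that can never appear inside a filename.
count_by_nul() {
  find "$1" -type f -print0 | tr -cd '\000' | wc -c
}
```

For a directory that holds a single file whose name contains one newline, count_by_newline reports 2 while count_by_nul reports 1.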

Another issue is quoting. When you pass filenames to commands like mv or cp, you must quote the variable: mv -- "$src" "$dest". The -- signals the end of options, which is important if a filename begins with -. Without --, a filename like -rf could be misinterpreted as an option, leading to destructive behavior. This is a crucial safety pattern in file manipulation scripts.
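
A minimal sketch of that pattern is shown here. The helper name safe_move is an assumption made for the example, not something the project requires.

```shell
# Hypothetical helper: move one file without letting its name act as an option.
safe_move() {
  local src=$1 dest=$2
  # The quotes keep spaces and other special characters intact;
  # -- stops option parsing before the two paths.
  mv -- "$src" "$dest"
}
```

Calling safe_move "-rf file.jpg" "renamed file.jpg" renames the file, whereas an unquoted mv without -- would parse -rf as an option. Prefixing relative paths with ./ is a complementary defence for commands that do not understand --.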

Directory trees can be very large. find can traverse millions of files, and by default it crosses filesystem boundaries, so it can descend into mounted volumes you did not intend to include. Use -xdev to stay on the same filesystem. You can also restrict depth with -maxdepth and filter by extension with -iname. For media organization, explicitly list the allowed extensions so that unrelated files are never moved. This is a safety boundary.
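
These options can be combined into one discovery helper, sketched below. The name find_media is hypothetical, and the extension list is the one this project uses elsewhere.

```shell
# Hypothetical helper: print media files under a root as NUL-terminated paths.
find_media() {
  local root=$1
  # -xdev keeps find on the starting filesystem; -iname matches extensions
  # case-insensitively; -print0 terminates every path with a NUL byte.
  find "$root" -xdev -type f \
    \( -iname '*.jpg' -o -iname '*.png' -o -iname '*.mov' -o -iname '*.mp4' \) \
    -print0
}
```

The output is meant for a NUL-aware consumer such as xargs -0 or a read -d '' loop, not for direct display.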

You must also decide whether to follow symlinks. find does not follow symlinks by default, which is safer; if you follow symlinks, you can end up with duplicate files or infinite loops. For this project, you should not follow symlinks. Document this choice. If you want to support symlinks, you must add cycle detection and clear warnings.

Finally, consider logging and dry runs. A script that moves files should provide a --dry-run mode that prints what it would do without changing anything. This is essential for trust. The output should include source and destination paths, and it should be deterministic so you can compare runs. Dry-run mode is also a test mechanism; it allows you to validate your logic before touching real data.
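
One way to implement a dry run is to route every mutating command through a single wrapper, as in this sketch. DRY_RUN and run_cmd are illustrative names, and printf %q is a Bash feature.

```shell
# Hypothetical dry-run switch: 1 means "print only", 0 means "execute".
DRY_RUN=${DRY_RUN:-0}

run_cmd() {
  if [ "$DRY_RUN" -eq 1 ]; then
    # Print the command with shell quoting so the plan is unambiguous.
    printf 'DRY-RUN:'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}
```

With DRY_RUN=1, run_cmd mv -- "$src" "$dest" prints the planned move and leaves both paths untouched; with DRY_RUN=0 it runs the same command for real.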

How This Fits in Projects

This concept informs §3.2 (Functional Requirements) and §5.8 (Hints in Layers). It is directly applied in §5.2 (Project Structure) and §6.2 (Critical Test Cases). Safe file handling is also critical in P03 Automated Backup Script.

Definitions & Key Terms

  • NUL delimiter: A zero byte used to separate filenames safely.
  • -print0: find option that outputs NUL-separated names.
  • --: End-of-options marker for commands.
  • Traversal: Recursively walking a directory tree.
  • Dry run: Execute logic without making changes.

Mental Model Diagram (ASCII)

find -print0 -> NUL-safe loop -> quoted file ops

How It Works (Step-by-Step)

  1. Use find with explicit extensions.
  2. Emit filenames with -print0.
  3. Read with read -r -d '' or xargs -0.
  4. For each file, quote paths and use -- in commands.
  5. Log actions and optionally dry-run.

Invariants: never split filenames on whitespace; always quote paths. Failure modes: accidental option injection, broken filenames, crossing unwanted filesystems.

Minimal Concrete Example

find "$src" -type f -print0 | while IFS= read -r -d '' f; do
  printf 'Found: %s\n' "$f"
done

Common Misconceptions

  • “Filenames never contain spaces.” -> They often do.
  • “xargs is always safe.” -> Not without -0.
  • “mv will treat filenames safely.” -> Only if you use -- and quoting.

Check-Your-Understanding Questions

  1. Why is find ... | xargs unsafe without -0?
  2. What does -- protect against in mv?
  3. Why might you use -xdev in find?

Check-Your-Understanding Answers

  1. Because filenames with spaces or newlines will be split incorrectly.
  2. It prevents filenames starting with - from being treated as options.
  3. To avoid crossing filesystem boundaries and moving unintended files.

Real-World Applications

  • Photo library organization
  • Large file cleanup scripts
  • Backup and migration tooling

References

  • Effective Shell (Kerr), Ch. 8
  • The Linux Command Line (Shotts), Ch. 17

Key Insight

Filename safety is the difference between a reliable script and a data-loss event.

Summary

Safe file discovery uses NUL delimiters and careful quoting to avoid surprises.

Homework/Exercises to Practice the Concept

  1. Create files with spaces and verify your loop handles them.
  2. Create a file named -test.jpg and ensure mv -- handles it safely.
  3. Use find -xdev to restrict traversal to one filesystem.

Solutions to the Homework/Exercises

  1. touch "a b.jpg"; find . -print0 | while IFS= read -r -d '' f; do echo "$f"; done
  2. touch -- -test.jpg; mv -- -test.jpg safe.jpg
  3. find / -xdev -maxdepth 1 | head

Concept 2: Metadata-Driven Organization and Idempotent Moves

Fundamentals

Media files often contain metadata that records when they were created. EXIF metadata for photos (and sometimes videos) includes tags like CreateDate. A media organizer should use this metadata to place files into a YYYY/MM directory structure. But metadata can be missing or incorrect, so the script must define a fallback strategy, such as using filesystem modification time. Idempotence is critical: running the script twice should not move files again or overwrite existing files. This concept teaches you how to extract metadata safely, handle conflicts, and design a deterministic organization scheme.

Deep Dive into the Concept

EXIF metadata is embedded in image files and contains camera settings, timestamps, and more. exiftool is the most reliable CLI tool for extracting this data. It can output specific tags, and you can format dates with -d. For example, exiftool -d "%Y/%m" -CreateDate -s -s -s file.jpg prints the creation year and month if available. But not all files contain EXIF data; scans and edited images often lack CreateDate. Videos may use different tags. A robust organizer should attempt multiple tags or fall back to filesystem times. This fallback must be documented so the user knows what to expect.

Date parsing must be deterministic. Use a fixed format like YYYY/MM, and avoid locale-dependent formats. If you use filesystem mtime, be aware that copying files can change it, so it may not reflect the true creation date. The script should therefore prefer EXIF when present and only fall back to mtime if missing. If both are missing or invalid, the script should skip the file and log it. This prevents silent misplacement.
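
The preference order and the validation step might look like the sketch below. The function name resolve_year_month is hypothetical, trying DateTimeOriginal before CreateDate is an assumption rather than a project requirement, and date -r FILE assumes GNU date.

```shell
# Hypothetical helper: print YYYY/MM for a file, or return 1 if no usable date.
resolve_year_month() {
  local file=$1 ym=''
  local re='^[0-9]{4}/(0[1-9]|1[0-2])$'
  if command -v exiftool >/dev/null 2>&1; then
    # -s3 prints bare values, one per line, for whichever tags are present.
    ym=$(exiftool -d '%Y/%m' -s3 -DateTimeOriginal -CreateDate "$file" 2>/dev/null | head -n 1)
  fi
  # Anything that is not exactly YYYY/MM (empty, zeroed, unparsed) is rejected
  # and replaced by the filesystem modification time.
  if ! [[ $ym =~ $re ]]; then
    ym=$(LC_ALL=C date -r "$file" +%Y/%m 2>/dev/null) || ym=''
  fi
  [[ $ym =~ $re ]] || return 1
  printf '%s\n' "$ym"
}
```

A caller can then write ym=$(resolve_year_month "$file") and treat a non-zero status as "skip and log".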

Conflict handling is another critical topic. If two files have the same name and destination directory, a naive mv will overwrite one of them. That is unacceptable. You must define a conflict policy: skip duplicates, add a numeric suffix, or store duplicates in a separate folder. For this project, a safe approach is to use mv -n (no clobber) and log conflicts, or to append a suffix like _1, _2. Because mv -n can skip a file without reporting it, test for the destination yourself if you want the conflict to appear in the log. The important part is that the policy is deterministic and documented. A more advanced alternative is to compute a hash of the file contents and include it in the filename to avoid collisions, but that is an extension beyond the core project.
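
A suffix policy can be sketched as a small function that returns the first free name. unique_dest is a hypothetical name, and placing the suffix before the extension is an assumption.

```shell
# Hypothetical helper: print a destination path that does not exist yet,
# appending _1, _2, ... before the extension when the name is taken.
unique_dest() {
  local dir=$1 name=$2
  local stem=$name ext='' candidate n=1
  # Split "photo.jpg" into stem "photo" and ext ".jpg"; dotfiles and
  # extensionless names keep their full name as the stem.
  if [[ $name == ?*.* ]]; then
    stem=${name%.*}
    ext=.${name##*.}
  fi
  candidate=$dir/$name
  # -L also catches dangling symlinks, which -e reports as missing.
  while [ -e "$candidate" ] || [ -L "$candidate" ]; do
    candidate=$dir/${stem}_$n$ext
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}
```

The check and the later move are two separate steps, so keeping mv -n on the actual move is a sensible backstop if another process could write to the destination at the same time.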

Idempotence is the property that running the script twice yields the same result. If you move files into a target structure, the second run should see that the files are already in place and either skip them or detect that the source is empty. This is easiest if you only operate on the source directory and never on the destination, and if you use a dry-run to confirm intended actions. Logging is part of idempotence: your script should record what it moved and what it skipped, so you can audit results. A manifest file listing source and destination paths is a practical safety tool.
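
The usage examples in this project place the destination inside the source tree (~/Pictures/organized under ~/Pictures), so a second run would rediscover files that were already organized unless discovery skips that subtree. The sketch below shows one way to do so with -prune; find_outside_dest is a hypothetical name.

```shell
# Hypothetical helper: list files under src while skipping the dest subtree.
# Both paths should be given in the same form (for example, both absolute),
# because -path compares against the path exactly as find builds it.
# Note that -path treats its argument as a glob pattern, so a destination
# containing *, ? or [ would need escaping.
find_outside_dest() {
  local src=$1 dest=$2
  find "$src" -path "$dest" -prune -o -type f -print0
}
```

With this in place, a second run over the same source sees only files that have not been organized yet.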

Finally, consider directory creation. Use mkdir -p to create the destination YYYY/MM path. This is safe if the path already exists. Use umask to control permissions, and avoid running as root unless necessary. When moving files, use mv -- with quoted variables. If you choose to copy instead of move, you must decide how to handle cleanup and verification. For this project, moving is acceptable if you trust your script and have backups; copying is safer if you are cautious. Provide a --copy flag as an optional extension if you want a safety-first approach.
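
For the optional --copy extension, a copy can be verified before anything is removed. This is only a sketch: copy_verified is a hypothetical name, and byte-for-byte comparison with cmp is one possible verification choice.

```shell
# Hypothetical helper for a --copy mode: create the directory, copy, verify.
copy_verified() {
  local src=$1 dest_dir=$2
  local dest=$dest_dir/${src##*/}
  mkdir -p -- "$dest_dir" || return 1
  # Refuse to touch an existing destination explicitly instead of relying
  # on cp -n, which can skip the copy silently.
  [ -e "$dest" ] && return 1
  # -p preserves the modification time, which matters for the mtime fallback.
  cp -p -- "$src" "$dest" || return 1
  # cmp -s is silent and exits 0 only when the files are byte-identical.
  cmp -s -- "$src" "$dest"
}
```

Only after copy_verified succeeds would a cautious script consider deleting the source.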

How This Fits in Projects

This concept defines the core functionality in §3.2 (Functional Requirements) and the real-world demos in §3.7. It also informs conflict handling in §7.1 and extensions in §8. The same conflict-avoidance patterns apply to P03 Automated Backup Script.

Definitions & Key Terms

  • EXIF: Metadata standard for photos and some videos.
  • CreateDate: EXIF tag for capture time.
  • Idempotence: Running the script multiple times yields the same result.
  • Conflict: Two files mapping to the same destination name.
  • Manifest: A log of source and destination paths.

Mental Model Diagram (ASCII)

file -> metadata date -> YYYY/MM -> conflict policy -> move/copy

How It Works (Step-by-Step)

  1. For each file, attempt to read CreateDate with exiftool.
  2. If missing, fall back to filesystem mtime.
  3. Build destination directory YYYY/MM.
  4. Resolve naming conflicts deterministically.
  5. Move or copy file and log action.

Invariants: EXIF preferred over mtime; no overwrites. Failure modes: missing metadata, duplicate filenames, incorrect date formats.

Minimal Concrete Example

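# Try EXIF first; -s -s -s prints only the value, and errors are discarded.
ym=$(exiftool -d "%Y/%m" -CreateDate -s -s -s "$file" 2>/dev/null)
# Fall back to the filesystem modification time when no CreateDate exists.
[ -z "$ym" ] && ym=$(date -r "$file" +"%Y/%m")
mkdir -p -- "$dest/$ym"
# -n refuses to overwrite an existing destination file.
cp -n -- "$file" "$dest/$ym/"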

Common Misconceptions

  • “All media files have EXIF.” -> Many do not.
  • “mtime equals capture date.” -> It can change when files are copied or edited.
  • “mv is safe by default.” -> It overwrites unless you prevent it.

Check-Your-Understanding Questions

  1. Why should EXIF be preferred over mtime?
  2. What is a safe conflict policy for duplicate filenames?
  3. How do you make the script idempotent?

Check-Your-Understanding Answers

  1. EXIF reflects capture time; mtime can change when files are copied.
  2. Use mv -n or append a suffix and log the conflict.
  3. Skip files already moved and avoid overwriting.

Real-World Applications

  • Photo library cleanup after a trip
  • Migrating files from multiple devices
  • Building a searchable, chronological archive

References

  • Effective Shell (Kerr), Ch. 8-9
  • The Linux Command Line (Shotts), Ch. 17

Key Insight

A safe organizer is defined by its conflict policy and metadata strategy.

Summary

Metadata-driven organization requires deterministic rules and careful file operations.

Homework/Exercises to Practice the Concept

  1. Use exiftool to read CreateDate from two photos.
  2. Create two files with the same name in different folders and test your conflict policy.
  3. Run your script twice in dry-run mode and compare outputs.

Solutions to the Homework/Exercises

  1. exiftool -CreateDate -s -s -s photo.jpg
  2. mkdir -p a b; touch a/img.jpg b/img.jpg; simulate moves with suffixes
  3. ./organize_media.sh --dry-run src dest > run1.txt; ./organize_media.sh --dry-run src dest > run2.txt; diff run1.txt run2.txt

3. Project Specification

3.1 What You Will Build

You will build organize_media.sh, a script that discovers media files, extracts their creation date, and moves them into a YYYY/MM directory tree under a destination folder. It will handle filenames safely, avoid overwriting files, log skipped items, and support --dry-run. It will not perform image conversion, duplicate detection via hashing, or advanced metadata repair.

3.2 Functional Requirements

  1. Discovery: Search for media files by extension (.jpg, .png, .mov, .mp4).
  2. Metadata: Extract CreateDate via exiftool; fallback to mtime if missing.
  3. Destination: Create YYYY/MM directories under the destination root.
  4. Safety: Never overwrite existing files; use a conflict policy.
  5. Dry run: Print planned moves without making changes.
  6. Logging: Record moved and skipped files.
  7. Exit codes: 0 success, 1 usage error, 2 missing dependency, 3 runtime error.
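
The exit codes and the command-line shape could be wired up roughly as follows. The variable names and parse_args are assumptions for illustration; the sketch accepts --dry-run in any position because the examples in this guide place it last.

```shell
# Exit codes from the list above; the names are illustrative only.
EXIT_OK=0 EXIT_USAGE=1 EXIT_MISSING_DEP=2 EXIT_RUNTIME=3

# Parse "SRC DEST [--dry-run]". Sets SRC, DEST and DRY_RUN, and returns
# EXIT_USAGE on bad input. Paths that start with "-" are rejected as
# unknown options, so callers should pass them as ./-name.
parse_args() {
  SRC='' DEST='' DRY_RUN=0
  local positional=() arg
  for arg in "$@"; do
    case $arg in
      --dry-run) DRY_RUN=1 ;;
      -*) printf 'Unknown option: %s\n' "$arg" >&2; return "$EXIT_USAGE" ;;
      *) positional+=("$arg") ;;
    esac
  done
  if [ "${#positional[@]}" -ne 2 ]; then
    printf 'Usage: organize_media.sh SRC DEST [--dry-run]\n' >&2
    return "$EXIT_USAGE"
  fi
  SRC=${positional[0]} DEST=${positional[1]}
}
```

A script would typically call parse_args "$@" || exit $? near the top, before any file is touched.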

3.3 Non-Functional Requirements

  • Reliability: Safe for large libraries with spaces and special characters.
  • Usability: Clear output and summary counts.
  • Performance: Streaming file discovery; avoid loading lists into memory.

3.4 Example Usage / Output

$ ./organize_media.sh ~/Pictures ~/Pictures/organized
Moved: IMG_1234.jpg -> 2025/12/IMG_1234.jpg
Moved: trip.mov -> 2024/06/trip.mov
Skipped: oldscan.png (no metadata)

Summary: moved=2 skipped=1 conflicts=0

3.5 Data Formats / Schemas / Protocols

Directory layout:

organized/
├── 2024/
│   └── 06/
│       └── trip.mov
└── 2025/
    └── 12/
        └── IMG_1234.jpg

Log format:

[ISO8601] ACTION src -> dest

3.6 Edge Cases

  • File has no EXIF and no readable mtime.
  • Two files map to the same destination name.
  • Destination path not writable.
  • Filenames containing spaces or leading dashes.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cd /path/to/organizer
chmod +x organize_media.sh
TZ=UTC SOURCE_DATE_EPOCH=1767225600 ./organize_media.sh ~/Pictures ~/Pictures/organized --dry-run

3.7.2 Golden Path Demo (Deterministic)

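Set TZ=UTC and SOURCE_DATE_EPOCH to make log timestamps deterministic for tests. The date command does not read SOURCE_DATE_EPOCH on its own, so the script must use the variable explicitly when it is set.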
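
A logger that honours SOURCE_DATE_EPOCH might be sketched like this. log_action is a hypothetical name, and the -d "@SECONDS" form assumes GNU date.

```shell
# Hypothetical logger: prints "[ISO8601] ACTION src -> dest".
log_action() {
  local action=$1 src=$2 dest=$3 stamp
  if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
    # Fixed clock for reproducible test output (GNU date syntax).
    stamp=$(date -u -d "@$SOURCE_DATE_EPOCH" +%Y-%m-%dT%H:%M:%SZ)
  else
    stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  fi
  printf '[%s] %s %s -> %s\n' "$stamp" "$action" "$src" "$dest"
}
```

With SOURCE_DATE_EPOCH=1767225600, log_action MOVED IMG_1234.jpg 2025/12/IMG_1234.jpg prints [2026-01-01T00:00:00Z] MOVED IMG_1234.jpg -> 2025/12/IMG_1234.jpg.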

3.7.3 If CLI: Exact Terminal Transcript

$ ./organize_media.sh ~/Pictures ~/Pictures/organized
Moved: IMG_1234.jpg -> 2025/12/IMG_1234.jpg
Moved: trip.mov -> 2024/06/trip.mov
Skipped: oldscan.png (no metadata)

Summary: moved=2 skipped=1 conflicts=0
Exit code: 0

3.7.4 Failure Demo (Missing Dependency)

$ ./organize_media.sh ~/Pictures ~/Pictures/organized
ERROR: exiftool not found. Please install it first.
Exit code: 2
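
A dependency check that produces this behaviour can be as small as the sketch below; require_cmd is a hypothetical helper name.

```shell
# Hypothetical dependency check; returns 2 to match the "missing dependency"
# exit code in the functional requirements.
require_cmd() {
  local cmd=$1
  if ! command -v "$cmd" >/dev/null 2>&1; then
    printf 'ERROR: %s not found. Please install it first.\n' "$cmd" >&2
    return 2
  fi
}
```

Calling require_cmd exiftool || exit $? before discovery starts ensures the script fails before it has moved anything.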

4. Solution Architecture

4.1 High-Level Design

find -> safe loop -> metadata -> dest path -> conflict check -> move -> log

4.2 Key Components

Component         | Responsibility       | Key Decisions
Finder            | Discover media files | Use find -print0
Metadata Reader   | Extract date         | EXIF first, mtime fallback
Organizer         | Build dest path      | YYYY/MM structure
Conflict Resolver | Prevent overwrites   | mv -n or suffix
Logger            | Record actions       | ISO 8601 timestamps

4.3 Data Structures (No Full Code)

moved=0
skipped=0
conflicts=0
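
These counters only survive the loop if the loop runs in the current shell. In find ... | while ..., Bash runs the loop in a subshell by default and the increments are lost when the pipeline ends. The sketch below feeds the loop through process substitution instead; tally is a hypothetical function, and "moved" here only means "would be moved".

```shell
moved=0
skipped=0
conflicts=0

# Hypothetical tally loop: classifies files by extension and updates the
# global counters in the current shell.
tally() {
  local src=$1 f
  while IFS= read -r -d '' f; do
    case ${f##*.} in
      [Jj][Pp][Gg]|[Pp][Nn][Gg]|[Mm][Oo][Vv]|[Mm][Pp]4) moved=$((moved + 1)) ;;
      *) skipped=$((skipped + 1)) ;;
    esac
  done < <(find "$src" -type f -print0)
  printf 'Summary: moved=%d skipped=%d conflicts=%d\n' "$moved" "$skipped" "$conflicts"
}
```

After tally returns, moved and skipped still hold their final values, which is what the summary line depends on.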

4.4 Algorithm Overview

Key Algorithm: Per-File Organization

  1. Read file path safely.
  2. Extract metadata date.
  3. Determine destination directory.
  4. Check for conflicts.
  5. Move file or skip with log.

Complexity Analysis:

  • Time: O(n) for n files.
  • Space: O(1) streaming processing.

5. Implementation Guide

5.1 Development Environment Setup

command -v find exiftool

5.2 Project Structure

organize-media/
├── organize_media.sh
├── logs/
└── tests/
    └── fixtures/

5.3 The Core Question You’re Answering

“How can I safely refactor a chaotic directory into a clean structure without losing data?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Safe file discovery and quoting (Concept 1)
  2. Metadata extraction and conflict handling (Concept 2)

5.5 Questions to Guide Your Design

  1. Which extensions will you include?
  2. What is your fallback when EXIF is missing?
  3. How will you handle duplicates?
  4. Should you copy first or move directly?

5.6 Thinking Exercise

If a file has EXIF date 2012 but its filename suggests 2019, which do you trust and why? What is the least risky choice?

5.7 The Interview Questions They’ll Ask

  1. “Why use -print0 with find?”
  2. “How do you prevent overwriting files?”
  3. “What is your fallback when metadata is missing?”
  4. “How do you make the script idempotent?”

5.8 Hints in Layers

Hint 1: Safe discovery

find "$src" -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.mov" -o -iname "*.mp4" \) -print0

Hint 2: Metadata date

date=$(exiftool -d "%Y/%m" -CreateDate -s -s -s "$file")

Hint 3: Conflict handling

if [ -e "$dest_file" ]; then
  conflicts=$((conflicts+1))
  echo "Conflict: $dest_file" >&2
  continue
fi

5.9 Books That Will Help

Topic                  | Book                   | Chapter
Finding files          | The Linux Command Line | Ch. 17
Shell scripting safety | The Linux Command Line | Ch. 24
Robust scripts         | Effective Shell        | Ch. 6-9

5.10 Implementation Phases

Phase 1: Foundation (3 hours)

Goals:

  • Discover files safely
  • Print planned destination paths

Tasks:

  1. Implement find -print0 loop.
  2. Print src -> dest mappings.

Checkpoint: Output shows correct paths for sample files.

Phase 2: Core Functionality (4 hours)

Goals:

  • Extract metadata and move files

Tasks:

  1. Use exiftool to read dates.
  2. Create destination directories.
  3. Move files with conflict checks.

Checkpoint: Files moved to correct YYYY/MM directories.

Phase 3: Polish & Edge Cases (3 hours)

Goals:

  • Dry run and logging
  • Handle missing metadata

Tasks:

  1. Add --dry-run option.
  2. Implement fallback to mtime.
  3. Log skipped files.

Checkpoint: Dry-run produces no changes and logs are correct.

5.11 Key Implementation Decisions

Decision        | Options                 | Recommendation  | Rationale
Metadata source | EXIF vs mtime           | EXIF then mtime | Most accurate when present
Conflict policy | overwrite, skip, suffix | skip + log      | safest default
Move vs copy    | move, copy              | move            | avoids duplicates by default

6. Testing Strategy

6.1 Test Categories

Category          | Purpose                       | Examples
Unit Tests        | Metadata parsing              | EXIF date extraction
Integration Tests | Full run on sample tree       | moved/skipped counts
Edge Case Tests   | Files with spaces, duplicates | safe handling

6.2 Critical Test Cases

  1. File with EXIF: moved to correct YYYY/MM.
  2. File without EXIF: fallback to mtime or skipped.
  3. Duplicate filename: conflict logged, no overwrite.
  4. Spaces in filename: handled without error.

6.3 Test Data

Pictures/
  "IMG 1234.jpg"
  "-weird.jpg"
  oldscan.png
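
A fixture tree like this can be created by a short helper, sketched here. make_fixtures is a hypothetical name, and using empty files is an assumption that keeps the fixtures free of EXIF data.

```shell
# Hypothetical fixture builder for the test tree above.
make_fixtures() {
  local root=$1
  mkdir -p -- "$root/Pictures" || return 1
  # Empty files are enough to exercise filename handling; they carry no EXIF,
  # so they also exercise the fallback path.
  touch -- "$root/Pictures/IMG 1234.jpg" \
           "$root/Pictures/-weird.jpg" \
           "$root/Pictures/oldscan.png"
}
```

Running make_fixtures "$(mktemp -d)" gives each test run a fresh, disposable tree.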

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall           | Symptom                    | Solution
Using plain xargs | Files with spaces break    | Use -print0 and xargs -0
Overwriting files | Missing duplicates         | Use -n or suffix policy
Missing metadata  | Files skipped unexpectedly | Add mtime fallback

7.2 Debugging Strategies

  • Log every decision with a clear reason (moved, skipped, conflict).
  • Run with --dry-run before actual moves.

7.3 Performance Traps

  • Running exiftool on huge libraries is slow; consider batching or caching as an extension.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --copy mode.
  • Add support for .heic files.

8.2 Intermediate Extensions

  • Add hash-based duplicate detection.
  • Produce a manifest CSV of moves.

8.3 Advanced Extensions

  • Add parallel metadata extraction.
  • Support video metadata tags beyond CreateDate.

9. Real-World Connections

9.1 Industry Applications

  • Digital asset management: organizing large media libraries.
  • Forensics: reconstructing timelines from metadata.

9.2 Related Tools and Projects

  • ExifTool: industry standard metadata extractor.
  • PhotoPrism: photo organization and management system.

9.3 Interview Relevance

  • File safety: robust handling of real-world filenames.
  • Automation: building idempotent, safe workflows.

10. Resources

10.1 Essential Reading

  • The Linux Command Line by William E. Shotts - Ch. 17, 24
  • Effective Shell by Dave Kerr - Ch. 8-9

10.2 Video Resources

  • “EXIF Metadata Basics” (any digital photography course)

10.3 Tools & Documentation

  • exiftool: man exiftool
  • find: man find

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why -print0 is necessary.
  • I can describe my conflict policy.
  • I can explain why EXIF is preferred over mtime.

11.2 Implementation

  • Script moves files into YYYY/MM directories.
  • No overwrites occur.
  • Dry-run mode makes no changes and prints the same plan on repeated runs.

11.3 Growth

  • I can run the script on a real library confidently.
  • I can explain the safety features to someone else.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Files are organized into YYYY/MM directories.
  • Filenames with spaces are handled safely.
  • Conflicts are detected and not overwritten.

Full Completion:

  • Dry-run mode and logging implemented.
  • Fallback for missing metadata works.

Excellence (Going Above & Beyond):

  • Hash-based duplicate detection and manifest export.
  • Parallel processing with stable output ordering.