Project 3: Automated Backup Script

Build a reliable backup script that creates timestamped archives, enforces retention, and runs safely under cron without surprises.

Quick Reference

Attribute                          Value
Difficulty                         Level 2: Intermediate
Time Estimate                      10 to 14 hours
Main Programming Language          Bash
Alternative Programming Languages  Python, Go, Ruby
Coolness Level                     Level 2: Practical but Valuable
Business Potential                 Level 3: Service & Support
Prerequisites                      Shell basics, basic filesystem knowledge
Key Topics                         Filesystem hierarchy, permissions, tar/gzip, cron, retention policies

1. Learning Objectives

By completing this project, you will:

  1. Design a deterministic, timestamped backup naming scheme.
  2. Create archives with correct permissions and metadata.
  3. Implement retention logic safely and predictably.
  4. Write a cron-safe script with explicit paths and clear logs.
  5. Validate backups with basic integrity checks.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Filesystem Hierarchy, Permissions, and Metadata

Fundamentals

Linux organizes files in a predictable hierarchy. Knowing where data lives and how permissions are enforced is essential for backup automation. A backup script touches directories that might contain sensitive data, and it must respect ownership and permissions while preserving metadata.

Files are referenced by paths, but the kernel uses inodes to store metadata like owner, permissions, and timestamps. This is why backups should preserve metadata: it is part of the data. A script that ignores permissions might create backups that cannot be restored correctly. Understanding the difference between file permissions and directory permissions is also crucial: directory execute permission controls traversal, while read permission controls listing. These nuances affect whether your script can read data at all and whether it can write backups to the destination.

Deep Dive into the Concept

The filesystem is a single rooted tree starting at /. Standard locations are defined by convention: /etc for configuration, /var for variable data, /home for user data, /srv for service data. When backing up, you must decide what to include and exclude. Backing up /home might be appropriate for user data, but backing up /proc or /sys is nonsensical because those are virtual filesystems. find -xdev helps you avoid crossing filesystem boundaries, which is critical if your source directory contains mounted volumes or network filesystems.
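To make the boundary-crossing point concrete, here is a small sketch; the paths are examples, and --one-file-system is the GNU tar counterpart to find -xdev:

# stay on one filesystem; do not descend into /proc, /sys, or other mounts
find /home -xdev -type f | head
# the same guarantee while archiving (GNU tar)
tar --one-file-system -czf /tmp/home.tar.gz /home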

Permissions are represented by three triplets: user, group, and other. Each has read, write, and execute bits. For directories, read allows listing, write allows creating or deleting entries, and execute allows traversing into the directory. This is why you can see a directory but still get “permission denied” when you try to access a file within it: execute permission is required for traversal. Backup scripts must either run with sufficient permissions or skip unreadable files and report them. Running as root simplifies access but introduces risk; a safer pattern is to back up only the data you have permission to read.
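A two-minute demonstration of the read-versus-execute distinction, safe to run in any scratch directory:

mkdir demo && touch demo/file
chmod 644 demo     # read bit only: listing names works, traversal does not
ls demo            # prints names (may warn that attributes are unreadable)
cat demo/file      # fails with "Permission denied": no execute bit to traverse
chmod 755 demo     # restore traversal before cleaning up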

Inodes are the underlying metadata records. Filenames are directory entries pointing to inodes, and multiple names can point to the same inode (hard links). This matters when you archive: a naive file-by-file copy can duplicate hard-linked data, while GNU tar detects hard links within a single run and stores the data only once. Symbolic links are different: they store a path, not data, so your archive must decide whether to follow symlinks or store them as links. This decision affects restore behavior. A backup script should make this choice explicit (e.g., tar -h follows symlinks; the default is to archive the link itself).
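A small sketch of the two behaviors using GNU tar; the filenames are illustrative:

echo data > target && ln -s target link
tar -czf links-kept.tar.gz link        # default: stores the symlink itself
tar -czhf links-followed.tar.gz link   # -h/--dereference: stores the target's data
tar -tzvf links-kept.tar.gz            # listing shows "link -> target"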

Timestamps matter for retention and restoration. Files have at least three timestamps: modification time (mtime), status change time (ctime), and access time (atime). When you build retention logic, operate on the backup filenames you generate, not on the source files: retention should be based on backup creation time, not on file mtime, which avoids deleting backups that contain old data. The date and stat commands expose these timestamps; find -mtime filters on mtime and should be applied to backup archives, not to source directories. For reproducibility, consider using TZ=UTC and ISO 8601 timestamps in filenames.
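To see all three timestamps side by side (GNU coreutils stat; the format string is just an example):

stat -c 'mtime=%y ctime=%z atime=%x  %n' somefile
date -u +"%Y-%m-%dT%H:%M:%SZ"   # ISO 8601 in UTC, suitable for log lines and filenames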

The umask controls default permissions for new files created by your script. If your script runs under cron, it may inherit a different umask than your interactive shell. This can lead to backups that are unreadable by the expected user. A robust script sets umask explicitly, such as umask 077 for private backups. If you are backing up shared data, you might use umask 027 to allow group access. Document this choice, because it is a security decision.
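A quick check of how umask shapes new files, using the values from this section:

umask 077
touch private.log && ls -l private.log   # -rw------- : owner only
umask 027
touch shared.log && ls -l shared.log     # -rw-r----- : group may read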

Finally, consider storage locations. Backups should not be stored in the same directory being backed up, because that creates recursion and can explode archive size. Place backups in a separate destination, ideally on another filesystem. If the backup destination is a network mount, be aware of permissions and ownership mapping. Your script should verify destination existence and writability before starting. These checks are part of responsible automation and prevent partially written archives.

How This Fits in Projects

This concept is required for §3.2 (Functional Requirements) where you define backup locations and for §5.4 (Concepts You Must Understand First). It also informs the retention policy in §5.10 and the debugging advice in §7.1. The same filesystem knowledge is used in P05 Find and Organize Media Files.

Definitions & Key Terms

  • Inode: Metadata record for a file.
  • Umask: Mask controlling default permissions for new files.
  • Hard link: Multiple filenames pointing to one inode.
  • Symlink: A file containing a path to another file.
  • Mount point: Location where a filesystem is attached.

Mental Model Diagram (ASCII)

/path/name -> directory entry -> inode -> data blocks
                                   |
                                   v
                        permissions/timestamps

How It Works (Step-by-Step)

  1. Identify source and destination directories.
  2. Check permissions for reading source and writing destination.
  3. Decide how to handle symlinks and special files.
  4. Create archive preserving metadata.
  5. Store archive in destination with secure permissions.

Invariants: backups should be outside the source tree; permissions must be preserved.
Failure modes: permission denied, recursive backups, unreadable archives.

Minimal Concrete Example

umask 077                        # new files readable by owner only
[ -r "$src" ]  || { echo "cannot read source: $src" >&2; exit 1; }
[ -w "$dest" ] || { echo "cannot write dest: $dest" >&2; exit 1; }

Common Misconceptions

  • “Directories are just files.” -> Directory permissions behave differently.
  • “Backup location can be inside source tree.” -> This causes recursive growth.
  • “Permissions don’t matter if I can read files now.” -> Restores may fail later.

Check-Your-Understanding Questions

  1. Why is execute permission required on a directory to access its contents?
  2. What is the difference between a hard link and a symlink in backups?
  3. Why should backups be stored outside the source directory?

Check-Your-Understanding Answers

  1. Execute permission allows directory traversal; without it you cannot access entries.
  2. Hard links point to the same inode; symlinks store a path and can break if moved.
  3. Storing backups inside the source causes recursion and growth.

Real-World Applications

  • User home directory backups
  • Configuration snapshots for servers
  • Incident response data collection

Where You’ll Apply It

See “How This Fits in Projects” above: §3.2, §5.4, §5.10, and P05 Find and Organize Media Files.

References

  • How Linux Works (Ward), Ch. 2
  • The Linux Command Line (Shotts), Ch. 8-9

Key Insight

Backups are only as good as the metadata you preserve.

Summary

Filesystem hierarchy and permissions dictate what you can back up and how safely you can restore it.

Homework/Exercises to Practice the Concept

  1. Compare ls -l and stat for a file and identify permissions and timestamps.
  2. Create a hard link and a symlink, then remove the original file and observe behavior.
  3. Use find -xdev and verify it stays within one filesystem.

Solutions to the Homework/Exercises

  1. ls -l file; stat file
  2. ln file hardlink; ln -s file symlink; rm file; ls -l hardlink symlink
  3. find / -xdev -path '/proc/*' | head   # prints nothing: /proc is a separate filesystem

Concept 2: Archiving, Retention, and Scheduling

Fundamentals

A backup is more than a copy; it is a repeatable process with verifiable outcomes. Archiving tools like tar package multiple files into a single file while preserving permissions and timestamps. Compression (gzip) reduces storage size. Retention policies define how many backups to keep and when to delete older ones. Scheduling tools like cron ensure backups run automatically, but cron runs in a minimal environment and therefore requires explicit paths and clear logging. Understanding how tar, retention, and cron interact is essential to building a script you can trust.

Deep Dive into the Concept

tar is the standard Unix archiver. When you run tar -czf backup.tar.gz /path, tar reads files, records metadata, and writes a compressed archive. Tar preserves permissions, ownership, timestamps, and symlinks by default. If you want to exclude files, you can use --exclude patterns. Tar can also read a list of files from stdin with --files-from -, which is useful if you want to filter files with find. For backups, this lets you create precise archives without including unwanted paths.
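A hedged sketch of the find-to-tar pipeline described above, assuming $src and $archive are set as elsewhere in this project (GNU tar):

# NUL-separated names survive spaces and newlines in paths
find "$src" -xdev -type f -print0 |
  tar --null --files-from=- -czf "$archive"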

Compression is a trade-off between CPU and storage. gzip is fast and widely supported; xz compresses more but is slower. The default for this project is gzip because it is common and sufficiently efficient. Integrity verification is important: you can list archive contents with tar -tf and compute a checksum with sha256sum to confirm the archive exists and is stable. This simple verification step catches the common mistake of creating empty or partial archives.
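A minimal verification step, assuming $archive was just created; exit code 3 matches the archive-failure code in §3.2:

tar -tzf "$archive" > /dev/null || { echo "archive unreadable" >&2; exit 3; }
sha256sum "$archive" > "$archive.sha256"   # record a checksum next to the archive
sha256sum -c "$archive.sha256"             # re-verify later, e.g. before a restore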

Retention policies are a safety mechanism. The simplest policy is “keep the last N days”. You can implement this by naming backups with timestamps (e.g., backup-YYYY-MM-DD_HH-MM-SS.tar.gz) and deleting files older than N days using find -mtime +N -delete. This is safe if you limit the search to the backup directory and match the filename prefix. The key is to avoid a delete command that could target the wrong path. Always compute the path and then delete within that path; never run find / -delete with a variable that could be empty. Defensive scripting here prevents catastrophic mistakes.
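A defensive version of the retention step; the guard refuses to run if $dest is empty or the root directory:

case "$dest" in
  ""|/) echo "refusing retention on unsafe dest: '$dest'" >&2; exit 2 ;;
esac
find "$dest" -maxdepth 1 -name 'backup-*.tar.gz' -mtime +"$keep_days" -delete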

Scheduling with cron introduces a new environment. Cron does not run with your interactive shell’s PATH, so commands like tar and date might not be found unless you use absolute paths or set PATH in the script. Cron also runs with a minimal environment, so variables like HOME or TZ may be unset. A robust script sets PATH and uses explicit binaries (/usr/bin/tar, /bin/date). Logging is critical: cron will email output if configured, but you should also write logs to a file with timestamps. This makes it possible to audit backups after the fact.
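A script preamble that makes the cron environment explicit; the paths are typical for Linux, so verify them with which on your system:

#!/bin/sh
PATH=/usr/bin:/bin; export PATH   # cron's PATH is minimal; set it explicitly
TZ=UTC; export TZ                 # consistent timestamps regardless of host config
umask 077                         # private backups by default (see Concept 1)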

Finally, consider idempotence. Running your script twice in a row should produce two different archives (timestamps ensure uniqueness), but should not corrupt previous backups. Your script should fail fast if the destination does not exist or is not writable. It should also support a --dry-run mode that prints what it would do without making changes. This allows safe testing in production environments. A deterministic backup naming scheme (using UTC time and SOURCE_DATE_EPOCH for tests) makes your outputs predictable and testable. When you combine these ideas, you get a backup script that is safe, repeatable, and suitable for automation.
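One common dry-run pattern is a wrapper that either prints or executes each action; a sketch (the run helper is an illustrative name, not a standard utility):

dry_run=0
[ "${1:-}" = "--dry-run" ] && dry_run=1
run() {
  if [ "$dry_run" -eq 1 ]; then echo "DRY RUN: $*"; else "$@"; fi
}
run /usr/bin/tar -czf "$archive" "$src"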

How This Fits in Projects

This concept drives §3 (Project Specification), especially the retention and cron requirements, and §5.10 (Implementation Phases). It is also used in the failure demos in §3.7.4. The logging and scheduling ideas also apply to P04 System Resource Monitor.

Definitions & Key Terms

  • Archive: A single file containing multiple files and metadata.
  • Retention policy: Rules for how long to keep backups.
  • Cron: Scheduler that runs commands at specified times.
  • Dry run: A mode that prints actions without making changes.
  • Checksum: A hash used to verify file integrity.

Mental Model Diagram (ASCII)

Source -> tar+gzip -> archive -> retention cleanup -> log
                       ^
                       |
                    cron schedule

How It Works (Step-by-Step)

  1. Generate a timestamped filename.
  2. Create archive with tar and gzip.
  3. Verify archive exists and optionally list contents.
  4. Delete backups older than retention threshold.
  5. Log actions with timestamps.
  6. Schedule via cron with absolute paths.

Invariants: backups are uniquely named; retention only targets the backup directory.
Failure modes: cron PATH issues, deleting wrong files, partial archives.

Minimal Concrete Example

ts=$(date -u +"%Y-%m-%d_%H-%M-%S")   # UTC timestamp gives unique, sortable names
archive="$dest/backup-$ts.tar.gz"
/usr/bin/tar -czf "$archive" "$src"
/usr/bin/find "$dest" -maxdepth 1 -name "backup-*.tar.gz" -mtime +7 -delete

Common Misconceptions

  • “Cron runs like my shell.” -> It uses a minimal environment.
  • “Deleting by age is safe anywhere.” -> Only safe within a known directory.
  • “If tar exits 0, the archive is valid.” -> You should still verify size or list.

Check-Your-Understanding Questions

  1. Why should you use absolute paths in cron jobs?
  2. How can a retention command delete the wrong files?
  3. Why is a dry-run mode useful?

Check-Your-Understanding Answers

  1. Cron has a minimal PATH; absolute paths ensure the correct binaries are used.
  2. If the backup directory variable is empty, find could run on /.
  3. It allows safe validation without making changes.

Real-World Applications

  • Nightly backups of project directories
  • Snapshotting configuration before deployments
  • Archiving logs with retention

Where You’ll Apply It

See “How This Fits in Projects” above: §3 (retention and cron requirements), §5.10, §3.7.4, and P04 System Resource Monitor.

References

  • How Linux Works (Ward), Ch. 7
  • The Linux Command Line (Shotts), Ch. 18

Key Insight

Automation is only reliable when retention and scheduling are explicit and testable.

Summary

Archiving, retention, and cron combine into a safe, repeatable backup workflow.

Homework/Exercises to Practice the Concept

  1. Create a tar archive of a small folder and list its contents.
  2. Write a find command that deletes files older than 7 days in a test folder.
  3. Create a cron entry that runs a script every day at 1:00 AM.

Solutions to the Homework/Exercises

  1. tar -czf test.tar.gz folder; tar -tf test.tar.gz | head
  2. find /tmp/backups -name "backup-*.tar.gz" -mtime +7 -delete
  3. 0 1 * * * /path/to/backup.sh >> /path/to/backup.log 2>&1

3. Project Specification

3.1 What You Will Build

You will build backup.sh, a script that takes a source directory and destination directory, creates a timestamped tar.gz archive, logs the operation, and deletes archives older than a configured retention period. It will support a --dry-run mode and be safe to run under cron. It will not implement incremental backups or remote replication by default, but you may add optional rsync sync as an extension.

3.2 Functional Requirements

  1. Arguments: Accept source and destination paths.
  2. Validation: Fail clearly if source does not exist or destination is not writable.
  3. Archive: Create a timestamped tar.gz archive of the source.
  4. Retention: Delete archives older than N days (configurable).
  5. Logging: Append timestamped lines to a log file in the destination.
  6. Dry run: Print intended actions without modifying files.
  7. Exit codes: 0 success, 1 usage error, 2 validation failure, 3 archive failure.

3.3 Non-Functional Requirements

  • Reliability: Script must be idempotent and safe to run multiple times.
  • Security: Use a secure umask (default 077).
  • Usability: Clear usage and log output.

3.4 Example Usage / Output

$ ./backup.sh ~/Projects /mnt/backups/projects --keep-days 7
[2025-12-31T01:00:00Z] Backing up /home/user/Projects
[2025-12-31T01:00:04Z] Created /mnt/backups/projects/backup-2025-12-31_01-00-00.tar.gz
[2025-12-31T01:00:04Z] Retention: deleted 2 old backups

3.5 Data Formats / Schemas / Protocols

Archive naming:

backup-YYYY-MM-DD_HH-MM-SS.tar.gz

Log line format:

[ISO8601] message

3.6 Edge Cases

  • Destination directory missing or read-only.
  • Source directory contains symlinks or special files.
  • Retention command deletes the wrong files if path is empty.
  • Cron environment missing PATH.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cd /path/to/backup
chmod +x backup.sh
TZ=UTC SOURCE_DATE_EPOCH=1767142800 ./backup.sh ~/Projects /mnt/backups/projects --keep-days 7

3.7.2 Golden Path Demo (Deterministic)

Use TZ=UTC and SOURCE_DATE_EPOCH=1767142800 (2025-12-31T01:00:00Z) to produce stable timestamps in logs.
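GNU date does not read SOURCE_DATE_EPOCH on its own, so the script must honor it explicitly; a minimal sketch of how the timestamp might be derived:

if [ -n "${SOURCE_DATE_EPOCH:-}" ]; then
  ts=$(date -u -d "@$SOURCE_DATE_EPOCH" +"%Y-%m-%d_%H-%M-%S")  # fixed time for tests
else
  ts=$(date -u +"%Y-%m-%d_%H-%M-%S")                           # real time in production
fi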

3.7.3 If CLI: Exact Terminal Transcript

$ TZ=UTC SOURCE_DATE_EPOCH=1767142800 ./backup.sh ~/Projects /mnt/backups/projects --keep-days 7
[2025-12-31T01:00:00Z] Backing up /home/user/Projects
[2025-12-31T01:00:03Z] Created /mnt/backups/projects/backup-2025-12-31_01-00-00.tar.gz
[2025-12-31T01:00:03Z] Retention: deleted 1 old backups
Exit code: 0

$ echo $?
0

3.7.4 Failure Demo (Bad Input)

$ ./backup.sh /no/such/path /mnt/backups/projects
ERROR: source not found: /no/such/path
Exit code: 2

4. Solution Architecture

4.1 High-Level Design

validate -> archive -> verify -> retention -> log -> exit

4.2 Key Components

Component   Responsibility        Key Decisions
Validator   Check source/dest     Fail fast on missing paths
Archiver    Create tar.gz         Use UTC timestamps
Retention   Delete old archives   Restrict to dest folder
Logger      Append log entries    ISO 8601 timestamps

4.3 Data Structures (No Full Code)

src=""
dest=""
keep_days=7
log_file="$dest/backup.log"

4.4 Algorithm Overview

Key Algorithm: Backup Run

  1. Validate args and paths.
  2. Build timestamped filename.
  3. Create archive with tar.
  4. Verify archive exists and size > 0.
  5. Run retention cleanup.
  6. Log results and exit.

Complexity Analysis:

  • Time: O(n) for archiving all files.
  • Space: O(1) extra besides archive output.

5. Implementation Guide

5.1 Development Environment Setup

which tar gzip find date

5.2 Project Structure

backup/
├── backup.sh
├── README.md
└── tests/
    └── fixtures/

5.3 The Core Question You’re Answering

“How do I create repeatable backups that I can schedule and trust?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Filesystem permissions and metadata (Concept 1)
  2. Archiving and cron scheduling (Concept 2)

5.5 Questions to Guide Your Design

  1. What should happen if the destination is full?
  2. How will you avoid overwriting backups?
  3. How do you ensure cron uses the correct PATH?
  4. What retention period is reasonable for your storage budget?

5.6 Thinking Exercise

You have 30 GB of data and 50 GB of backup storage. If each daily backup is 5 GB, how many days can you keep? How will you enforce that in your script?

5.7 The Interview Questions They’ll Ask

  1. “Why use tar instead of copying files directly?”
  2. “How do you make cron jobs reliable?”
  3. “How would you verify backups are valid?”
  4. “What is a safe retention policy?”

5.8 Hints in Layers

Hint 1: Timestamped name

ts=$(date -u +"%Y-%m-%d_%H-%M-%S")
archive="$dest/backup-$ts.tar.gz"

Hint 2: Validation

[ -d "$src" ] || { echo "missing source" >&2; exit 2; }
[ -d "$dest" ] || { echo "missing dest" >&2; exit 2; }

Hint 3: Retention

find "$dest" -name "backup-*.tar.gz" -mtime +"$keep_days" -delete

5.9 Books That Will Help

Topic              Book                     Chapter
Filesystem basics  How Linux Works          Ch. 2
Archiving          The Linux Command Line   Ch. 18
Scheduling         How Linux Works          Ch. 7

5.10 Implementation Phases

Phase 1: Foundation (3 hours)

Goals:

  • Parse arguments
  • Create a tar.gz archive

Tasks:

  1. Implement validation.
  2. Create a basic archive.

Checkpoint: Archive exists and contains files.

Phase 2: Core Functionality (3 hours)

Goals:

  • Add logging and retention

Tasks:

  1. Append log lines with timestamps.
  2. Delete old backups by age.

Checkpoint: Old backups are cleaned up correctly.

Phase 3: Polish & Edge Cases (2 hours)

Goals:

  • Dry-run mode
  • Cron-safe environment

Tasks:

  1. Add --dry-run support.
  2. Set PATH and umask explicitly.

Checkpoint: Script runs under cron without errors.

5.11 Key Implementation Decisions

Decision          Options                   Recommendation   Rationale
Timestamp format  local time vs UTC         UTC              Avoid timezone confusion
Retention method  count vs age              age (-mtime)     Simpler and reliable
Symlink handling  follow vs archive links   archive links    Safer restores

6. Testing Strategy

6.1 Test Categories

Category           Purpose                  Examples
Unit Tests         Validate path checks     missing source, read-only dest
Integration Tests  Run full backup          create archive, verify contents
Edge Case Tests    Symlinks and empty dirs  ensure they are preserved

6.2 Critical Test Cases

  1. Valid run: Archive created and logged.
  2. Missing source: Exits 2 with error.
  3. Retention: Old archives removed.
  4. Dry run: No files created.

6.3 Test Data

/tmp/backup-test/src
/tmp/backup-test/dest

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                        Symptom                                Solution
PATH issues in cron            Script works manually, fails in cron   Use absolute paths
Retention deletes wrong files  Missing variable expansion             Validate dest before find
Empty archives                 Wrong source path                      Log source path before tar

7.2 Debugging Strategies

  • Log every step to a file with timestamps.
  • Use tar -tf to inspect archives after creation.

7.3 Performance Traps

  • Compressing huge directories during peak hours can slow systems; schedule off-hours.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --exclude patterns.
  • Add --keep-days argument.

8.2 Intermediate Extensions

  • Add rsync sync to remote host.
  • Add checksum verification.

8.3 Advanced Extensions

  • Implement incremental backups with tar --listed-incremental (see the sketch after this list).
  • Add email notifications on failure.
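If you attempt the incremental extension, GNU tar's snapshot-file mechanism is the usual starting point; a minimal sketch with illustrative filenames:

# level 0 (full) backup; tar records file state in the snapshot file
tar --listed-incremental=snap.file -czf full.tar.gz "$src"
# later runs with the same snapshot file archive only what changed
tar --listed-incremental=snap.file -czf inc-1.tar.gz "$src"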

9. Real-World Connections

9.1 Industry Applications

  • Server backups: nightly snapshots of configuration and data.
  • Developer workflows: backup of code and notes.

9.2 Related Tools

  • restic: modern backup tool with encryption and dedup.
  • borg: efficient backup with incremental storage.

9.3 Interview Relevance

  • Reliability: designing safe automation.
  • Filesystem knowledge: permissions and metadata handling.

10. Resources

10.1 Essential Reading

  • How Linux Works by Brian Ward - Ch. 2, 7
  • The Linux Command Line by William E. Shotts - Ch. 18

10.2 Video Resources

  • “Backups 101” (any sysadmin basics course)

10.3 Tools & Documentation

  • tar: man tar
  • cron: man 5 crontab

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain how tar preserves metadata.
  • I can describe why cron needs absolute paths.
  • I can explain how retention works with find -mtime.

11.2 Implementation

  • Script creates timestamped archives.
  • Retention deletes only backups.
  • Logs are clear and timestamped.

11.3 Growth

  • I can restore a backup and verify integrity.
  • I can explain my retention policy.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Script accepts source/destination arguments and creates a tar.gz archive.
  • Logs actions with timestamps.
  • Exits non-zero on invalid input.

Full Completion:

  • All minimum criteria plus:
  • Retention policy implemented and safe.
  • Dry-run mode supported.

Excellence (Going Above & Beyond):

  • Remote sync and checksum verification.
  • Incremental backups with documented restore steps.