Project 4: Automated Backup System with Timers
Build a timer-driven backup system that replaces cron and adds persistence, jitter, integrity checks, and failure notifications.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 6-12 hours |
| Main Programming Language | Bash |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 1: Practical and Useful |
| Business Potential | Level 3: Service and Support |
| Prerequisites | Shell scripting, tar/rsync basics, systemd unit files |
| Key Topics | systemd timers, backup integrity, logging and OnFailure |
1. Learning Objectives
By completing this project, you will:
- Replace cron with systemd timers using OnCalendar and Persistent.
- Add jitter (RandomizedDelaySec) to prevent thundering herd.
- Implement daily incrementals and weekly full backups.
- Verify backups using checksums and restore tests.
- Trigger alerts on failure with OnFailure units.
2. All Theory Needed (Per-Concept Breakdown)
Concept 1: systemd Timer Semantics
Fundamentals
systemd timers are first-class units that schedule services. Unlike cron, timers are part of the systemd dependency graph and provide better observability. A timer can trigger a service on calendar schedules, at boot, or at fixed intervals. The Persistent=true option runs missed jobs on the next boot, which is crucial for laptops and intermittently powered systems. RandomizedDelaySec adds jitter so that large fleets do not run at the same moment. AccuracySec controls how precise the scheduler is. Understanding these options lets you build reliable schedules that are robust to reboots and load spikes. Timers also provide clear audit trails via systemctl list-timers.
Deep Dive into the Concept
OnCalendar uses a rich calendar syntax that supports ranges, lists, and names. It can express schedules like Mon..Fri 02:00, *-*-01 01:00 (first day of the month), or daily. A single timer can include multiple OnCalendar lines, which act as OR conditions. Timers also support OnBootSec and OnUnitActiveSec, which are relative schedules (e.g., run 10 minutes after boot or 24 hours after last run). These options enable advanced scheduling patterns that cron cannot express easily.
Persistent=true is the defining feature for reliability. systemd records the last trigger time in /var/lib/systemd/timers/. If the machine was off or asleep when the timer would have fired, systemd runs the service as soon as possible after boot. This ensures that critical tasks are not skipped. In a fleet, this behavior reduces the risk of silent backup gaps, but it can also produce a startup spike. Pairing Persistent=true with RandomizedDelaySec spreads the load after boot.
RandomizedDelaySec adds a random delay to each scheduled run. This is essential for distributed systems. If 500 machines all run backups at 02:00, they can overload storage or networks. Jitter spreads the load across a window. AccuracySec is the scheduler’s precision. Larger values allow the kernel and systemd to coalesce timers, reducing wakeups and power usage. For backups, a precision of minutes is often fine, and it saves power on laptops.
Timers are associated with service units. When the timer triggers, systemd starts the associated service. Failures are logged in journald and can trigger OnFailure units. You can inspect timers with systemctl list-timers to see the last and next run. This observability is a huge advantage over cron, where the schedule is a file and the runtime history is opaque.
Finally, timers are still units. They can be enabled, disabled, and ordered. You can specify dependencies such as After=network.target to ensure that backups only run when the network is available. This makes timers part of the same dependency system you use for services.
Time zones and clock changes are another subtlety. systemd evaluates calendar expressions in the system’s local time zone unless you specify otherwise. During DST transitions, a scheduled time may occur twice or not at all. systemd handles these cases by choosing the next valid time, but you should be aware of it when auditing schedules. Use systemd-analyze calendar to preview upcoming trigger times and verify that your schedule behaves as expected. For deterministic testing, consider fixed UTC schedules or mock time in a test environment. These details matter when you need reliable, predictable backups across environments.
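For example, you can preview calendar expressions with systemd-analyze calendar before installing the timer; the --iterations flag (available in recent systemd versions) prints several upcoming elapse times. Exact output depends on your clock and time zone:

```
$ systemd-analyze calendar "Mon..Fri 02:00"
$ systemd-analyze calendar --iterations=3 "*-*-01 01:00"
```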
How this fits into the project
Timers are the control plane of this backup system. You will use them in Section 3.2, Section 3.7, and Section 5.10 Phase 2.
Definitions & key terms
- OnCalendar -> Calendar-based schedule expression.
- Persistent -> Run missed jobs on next boot.
- RandomizedDelaySec -> Jitter added after scheduled time.
- AccuracySec -> Scheduling precision.
- Timer unit -> systemd unit that triggers a service.
Mental model diagram (ASCII)
schedule -> backup.timer -> backup.service -> backup.sh
                 |
                 +--> persistence store (last run time)
How it works (step-by-step)
- systemd evaluates the timer schedule and computes the next trigger.
- If the system was off, Persistent schedules a catch-up run.
- RandomizedDelaySec adds jitter.
- The timer starts the backup service.
- The service logs to journald.
Invariants: A timer always triggers its linked service; last run time is stored.
Failure modes: Timer not enabled, clock misconfiguration, or missing dependency targets.
Minimal concrete example
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=30m
AccuracySec=5m
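For context, a complete unit pair might look like the following sketch. The two files are shown together; the script path is an illustrative choice, not fixed by systemd:

```ini
# /etc/systemd/system/backup.timer
[Unit]
Description=Nightly backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=30m
AccuracySec=5m

[Install]
WantedBy=timers.target

# /etc/systemd/system/backup.service
[Unit]
Description=Run backup script
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
```

Because the timer and service share the name backup, no explicit Unit= line is needed in the [Timer] section.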
Common misconceptions
- “Timers are just cron with new syntax” -> They integrate with systemd and persistence.
- “Persistent runs missed jobs forever” -> It runs at most once on boot.
Check-your-understanding questions
- Why is RandomizedDelaySec valuable in fleets?
- What happens if a machine is off during a scheduled run?
- How does AccuracySec affect power usage?
- Why is list-timers useful?
Check-your-understanding answers
- It spreads load to avoid thundering herd.
- With Persistent=true, the job runs after boot.
- Lower precision allows wakeup coalescing and saves power.
- It shows last and next run times for auditing.
Real-world applications
- Backup scheduling on laptops and servers.
- Log cleanup and update tasks with jitter.
Where you’ll apply it
- This project: Section 3.2, Section 3.7, Section 5.10 Phase 2.
- Also used in: P03-socket-activated-server.md as an activation parallel.
References
- systemd.timer documentation.
Key insights
Timers provide persistence and observability that cron lacks.
Summary
Use OnCalendar, Persistent, and RandomizedDelaySec to build reliable schedules.
Homework/exercises to practice the concept
- Create a timer that runs every 5 minutes.
- Add jitter and observe next run time.
- Disable and re-enable a timer and compare list-timers output.
Solutions to the homework/exercises
- Set OnCalendar=*:0/5.
- Add RandomizedDelaySec=2m and inspect list-timers.
- Use systemctl disable --now, then systemctl enable --now.
Concept 2: Backup Integrity, Incrementals, and Retention
Fundamentals
Backups are only useful if they are restorable. A reliable system must include integrity checks, retention policies, and restore validation. Incremental backups capture changes since the last full backup and reduce storage usage. A common pattern is daily incrementals and weekly full backups. Checksums verify that data was not corrupted. Retention policies define how many backups to keep and prevent disks from filling up. This concept makes your backup system trustworthy rather than merely scheduled. It also forces you to think about naming, metadata, and reproducibility.
Deep Dive into the Concept
An incremental strategy reduces storage by reusing unchanged data. rsync --link-dest is a common approach: it creates a new snapshot directory where unchanged files are hard-linked to the previous snapshot. Each snapshot appears complete but consumes space only for changed files. This simplifies restores because you can copy a snapshot directory without applying patch chains. Another approach is tar incremental archives with a snapshot file that tracks changed files. This works but is harder to restore and manage.
Integrity checks should happen at two levels. First, verify that the backup command succeeded (exit code, size thresholds). Second, compute a checksum manifest (for example, SHA256 hashes of archives or snapshot metadata) and store it alongside the backup. On restore, verify checksums before extracting data. For critical systems, also perform periodic restore tests on sample files. This is often neglected but is essential for real reliability.
Retention policies must be explicit. A simple rule: keep 7 daily incrementals and 4 weekly full backups. Implement cleanup by deleting older snapshots beyond those limits. You should also ensure cleanup occurs after a successful backup, not before, to avoid losing all backups if a run fails.
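The retention rule above can be sketched as follows. Date-named snapshot directories sort lexically in chronological order; the keep-count and layout are illustrative, and a real script would also distinguish daily incrementals from weekly fulls:

```shell
# Retention sketch: keep the newest KEEP snapshot directories,
# deleting older ones only when more than KEEP exist.
set -euo pipefail

prune_snapshots() {   # usage: prune_snapshots <backup-dir> <keep>
  local dir=$1 keep=$2
  # Newest first: date-named directories sort lexically by age.
  mapfile -t snaps < <(find "$dir" -mindepth 1 -maxdepth 1 -type d | sort -r)
  # Guardrail: if we have KEEP or fewer snapshots, delete nothing.
  (( ${#snaps[@]} > keep )) || return 0
  local old
  for old in "${snaps[@]:keep}"; do
    rm -rf -- "$old"
  done
}

# Demo on a throwaway directory with ten date-named snapshots.
demo=$(mktemp -d)
for d in 2026-01-01 2026-01-02 2026-01-03 2026-01-04 2026-01-05 \
         2026-01-06 2026-01-07 2026-01-08 2026-01-09 2026-01-10; do
  mkdir "$demo/$d"
done
prune_snapshots "$demo" 7
ls "$demo"    # the seven newest remain: 2026-01-04 .. 2026-01-10
```

Note that pruning runs only after a successful backup in the real flow, matching the rule above.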
Data consistency is another consideration. If you back up a live database by copying files, you may get a corrupted snapshot. In this project, you can focus on file-level backups and note that databases require special tools (e.g., pg_dump). But your script should be structured so that a database dump step could be inserted later.
Metadata is also part of integrity. Store a small manifest file with snapshot name, timestamp, source path, and backup tool version. This makes restores reproducible and helps with audits. If you compress archives, record the compression method and level so restores can be automated. If you encrypt backups, ensure key management is documented and test restores regularly. Finally, consider the trade-off between snapshot-style backups and archive-style backups: snapshots are easy to browse, while archives are easy to transfer. Your script can support both or choose one and document why.
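As a sketch of such a manifest, the following writes a small key=value file into a snapshot directory. The field names (snapshot, created, source, tool, compression) are illustrative choices, not a standard format:

```shell
# Sketch: record snapshot metadata alongside the backed-up files.
set -euo pipefail

snapshot_dir=$(mktemp -d)    # stands in for /backups/inc/2026-01-06

cat > "$snapshot_dir/manifest.txt" <<EOF
snapshot=$(basename "$snapshot_dir")
created=$(date -u +%Y-%m-%dT%H:%M:%SZ)
source=/data
tool=rsync
compression=none
EOF

cat "$snapshot_dir/manifest.txt"
```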
Finally, space estimation matters. For reliability, estimate disk usage before running. If the destination filesystem is near full, the backup should fail gracefully and emit a clear error, rather than corrupting existing snapshots. This ties into your failure handling and alerting logic.
How this fits into the project
The backup script and retention policy are central to this project. You will implement them in Section 3.2 and Section 5.10 Phase 2.
Definitions & key terms
- Incremental backup -> Captures changes since last full backup.
- Snapshot -> Point-in-time view of files.
- Retention policy -> Rules for how many backups to keep.
- Checksum -> Hash used to verify integrity.
- Restore test -> Validation that a backup can be used.
Mental model diagram (ASCII)
Weekly Full: F1 ----- F2 ----- F3
Daily Inc: i i i i i i i i i
Restore: pick F2 + its incrementals
How it works (step-by-step)
- Determine last full snapshot.
- Create new snapshot via rsync with link-dest.
- Generate checksum manifest.
- Enforce retention by deleting old snapshots.
Invariants: Each snapshot is self-contained; manifests match snapshot contents.
Failure modes: disk full, corrupted snapshots, or retention deleting too aggressively.
Minimal concrete example
rsync -a --delete --link-dest=/backups/latest /data/ /backups/2026-01-01/
find /backups/2026-01-01 -type f -exec sha256sum {} + > /backups/manifests/2026-01-01.sha256
Common misconceptions
- “Incrementals are always small” -> They grow if many files change.
- “Backup success means data is safe” -> Without checksum, you do not know.
Check-your-understanding questions
- Why use hard links for incrementals?
- How do you verify a backup without restoring everything?
- What happens if retention is not enforced?
- Why should cleanup happen after a successful backup?
Check-your-understanding answers
- It saves space by reusing unchanged files.
- Verify checksums and perform a sample restore.
- Disk fills and backups fail.
- It avoids deleting the last good backup if the new one fails.
Real-world applications
- File server snapshots.
- Developer laptop backup systems.
Where you’ll apply it
- This project: Section 3.2, Section 5.10 Phase 2, Section 7.1.
- Also used in: P05-systemd-controlled-development-environment-manager.md for persistence patterns.
References
- rsync manual.
- GNU tar incremental backup documentation.
Key insights
A backup is trustworthy only when integrity and retention are enforced.
Summary
Incrementals plus checksums and retention equal a real backup system.
Homework/exercises to practice the concept
- Create a snapshot with rsync --link-dest.
- Modify one file and run again; compare disk usage.
- Verify a checksum manifest with sha256sum -c.
Solutions to the homework/exercises
- Use a new snapshot directory and --link-dest.
- Only changed files should increase disk usage.
- Run sha256sum -c backup.sha256.
Concept 3: Logging, OnFailure, and Alerting
Fundamentals
A scheduled backup is useless if failures are silent. systemd integrates logging through journald and supports OnFailure units that run when a service fails. Your backup service should log structured messages about start time, duration, bytes copied, and errors. When the backup fails, systemd should trigger an alert service that sends a notification or writes an error summary. Clear exit codes allow operators to see why the backup failed. This concept is about observability and reliability, not just scheduling. It also reduces mean time to detection in real incidents.
Deep Dive into the Concept
When a systemd service exits with a non-zero status, systemd marks the unit as failed. If the unit has OnFailure=backup-alert.service, systemd starts the alert service immediately. This allows a clean separation: the backup script focuses on data handling, while the alert unit handles notification. The alert unit could send email, push to Slack, or write to a monitoring log. For a learning project, a simple alert that writes to /var/log/backup-alerts.log is enough, but the system should make it easy to extend.
Logging should be structured and consistent. Use systemd-cat or prefix messages with clear tags. Include key metrics: start time, end time, duration, bytes copied, and checksum result. This provides data for debugging slow backups or failures. For example, log a message like BACKUP_OK bytes=219430912 duration=32s rather than just “Backup complete”.
Exit codes should be meaningful. Define distinct exit codes for disk full, source missing, and checksum mismatch. This allows the alert service to display a human-friendly message. It also makes test cases deterministic: you can simulate a disk full error and expect exit code 5.
Permissions also matter. If your backup runs as root, logs will be accessible to root. If it runs as a user service, logs may be restricted. For a system-level backup, root is typical. Ensure that permissions are explicit and documented.
Finally, consider log rate limiting. If your backup script logs too frequently, journald may rate limit. Keep log volume low and meaningful. In your dashboard project (Project 1), you can verify that log messages appear correctly and that failures are visible.
Alerting should be layered. A local log file is good for debugging, but you also want an external signal for operational awareness. You can implement a simple backup-alert.service that sends mail or writes to a webhook. Provide the last failure reason in a small state file so the alert service can include it in the message. Distinguish between transient errors (network outage) and permanent errors (missing source). This allows different escalation paths. Over time, these details turn a simple backup script into an operationally reliable system.
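One possible shape for that layered alerting is a small helper script that a backup-alert.service unit runs via ExecStart (e.g. ExecStart=/usr/local/bin/backup-alert.sh). The state-file and log paths named in the comments are illustrative; the demo below uses temp files:

```shell
# backup-alert.sh sketch. In the real system the state file would be
# something like /var/lib/backup/last-error (written by backup.sh) and
# the log /var/log/backup-alerts.log.
set -euo pipefail

send_alert() {   # usage: send_alert <state-file> <alert-log>
  local reason
  reason=$(cat "$1" 2>/dev/null || echo "unknown")
  printf '%s BACKUP_ALERT reason=%s\n' "$(date -Is)" "$reason" >> "$2"
}

# Demo: the backup script recorded a failure reason in its state file.
state=$(mktemp); log=$(mktemp)
echo "source_missing" > "$state"
send_alert "$state" "$log"
cat "$log"
```

Keeping the reason in a state file lets the alert unit include it in the message without re-running any backup logic.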
How this fits into the project
Alerting and logging are required for real reliability. You will use them in Section 3.2, Section 3.7, and Section 5.10 Phase 3.
Definitions & key terms
- OnFailure -> Unit that runs when another unit fails.
- Structured log -> Log with key/value fields.
- Exit code -> Numeric code indicating failure reason.
- Alert unit -> A service that sends notifications.
Mental model diagram (ASCII)
backup.service -> fails -> OnFailure=backup-alert.service
      |
      +--> journald logs (success/failure)
How it works (step-by-step)
- Backup script logs start metadata.
- Script exits with success or error code.
- systemd marks unit success/failure.
- OnFailure unit triggers on failure.
- Alert service sends notification.
Invariants: Non-zero exit means failed; OnFailure always triggers on failure.
Failure modes: silent errors if exit codes are wrong, or alert service misconfigured.
Minimal concrete example
[Unit]
OnFailure=backup-alert.service
Common misconceptions
- “Logs are enough” -> Without alerts, failures are silent.
- “OnFailure is only for system units” -> It works for user units too.
Check-your-understanding questions
- How do you trigger an OnFailure unit?
- Why use journald instead of a plain file?
- What should exit codes communicate?
- Why keep logs structured?
Check-your-understanding answers
- Exit non-zero from the main service.
- journald provides structured querying and rotation.
- The failure category, not just “failed”.
- Structured logs are easier to search and parse.
Real-world applications
- Automated alerts for failed backups, deployments, or scheduled jobs.
Where you’ll apply it
- This project: Section 3.7, Section 5.10 Phase 3, Section 7.
- Also used in: P01-service-health-dashboard.md for log viewing.
References
- systemd.service documentation (OnFailure).
- systemd-cat manual.
Key insights
Backups are reliable only if failures are visible and actionable.
Summary
Use journald for structured logs and OnFailure for alerts.
Homework/exercises to practice the concept
- Write a service that always fails and confirm OnFailure runs.
- Use systemd-cat to tag log lines.
- Filter logs by priority.
Solutions to the homework/exercises
- Use ExecStart=/bin/false and check the unit status.
- Run echo test | systemd-cat -t backup.
- Filter with journalctl -u backup.service -p err.
Concept 4: Backup Integrity, Atomicity, and Verification
Fundamentals
A backup that cannot be trusted is worse than no backup. Integrity means that a backup is complete, uncorrupted, and restorable. Atomicity means that each backup snapshot is internally consistent; you should not capture a half-written file or a partially updated database. Verification ensures that your backups can actually be restored. These concepts are not unique to systemd, but they are essential to building a reliable timer-driven backup system. systemd timers provide scheduling, but they do not guarantee that your data is consistent or recoverable. That responsibility is yours. If you understand integrity, atomicity, and verification, you will design a backup system that operators can trust.
Deep Dive into the Concept
Integrity begins with defining what data must be backed up and what constitutes a successful backup. For file-based backups, integrity means capturing all files in scope with correct contents and metadata (permissions, ownership, timestamps). For databases, integrity means capturing a consistent snapshot, often via a database-specific dump or snapshot mechanism. A naive tar of a live database directory can produce a corrupt backup. The rule is: use application-aware tools where consistency matters, and consider filesystem snapshots when possible.
Atomicity is about ensuring that a single backup run produces a coherent state. One technique is to stage the backup into a temporary directory and then atomically move it into place when complete. Another technique is to use filesystem snapshots (e.g., LVM, btrfs, ZFS) to capture a point-in-time view, then back up from the snapshot. Even without snapshots, you can reduce inconsistency by ordering operations (e.g., stop a service briefly, flush buffers, or use application-specific freeze commands). In a timer-based system, this should be encoded in the service unit, which can perform pre-backup hooks to quiesce services and post-backup hooks to resume them.
Verification is the most neglected part of backup systems. At a minimum, you should validate that the archive or sync completes without errors and that checksums match. For file archives, you can compute a manifest of SHA-256 hashes and store it alongside the backup. For rsync-based backups, you can run a follow-up verification pass with --checksum or compare manifest files. For database dumps, verification can include loading the dump into a test database or at least checking that the dump is non-empty and has the expected schema markers. In a timer-driven system, you can schedule a weekly verification job (another timer) that checks recent backups and alerts on failures.
Retention policies are part of integrity: you must ensure that backups are rotated correctly so you do not accidentally delete your last good backup. A robust system keeps at least one known-good full backup and a chain of incrementals. When deleting old backups, use deterministic rules (e.g., keep last 7 daily, last 4 weekly, last 12 monthly). This should be implemented carefully, with guardrails to prevent deleting everything due to a bug or clock issue. A good practice is to require that at least one backup remains before deletion proceeds.
Finally, backups must be observable. Logs should record backup size, duration, checksum verification results, and any skipped files. If a backup fails due to disk space, you need that to be explicit. For integrity, it is not enough to say “backup complete”; you need to say “backup complete, 2.1G, checksum verified”. This information becomes the basis for audit and incident response.
How this fits into the project
This concept informs your backup script design, log output, and validation strategy. You will apply it in Section 3.2 (Functional Requirements: integrity checks), Section 3.5 (Data formats: manifest schema), Section 5.4 (Concepts you must understand first), and Section 6.2 (Critical test cases). It also affects your failure demo in Section 3.7.3.
Definitions & key terms
- Integrity -> Assurance that backup data is complete and uncorrupted.
- Atomicity -> A backup snapshot represents a single consistent point in time.
- Verification -> Steps that confirm a backup is restorable or valid.
- Manifest -> A file listing checksums and metadata for backup contents.
- Retention policy -> Rules for how long backups are kept and rotated.
Mental model diagram (ASCII)
Live data -> snapshot/stage -> archive -> verify -> rotate
                 |                |           |          |
              quiesce        atomic move    hashes    retention
How it works (step-by-step)
- Quiesce or snapshot the data source.
- Copy data to a staging directory.
- Create archive or sync to destination.
- Generate checksum manifest and verify.
- Atomically move completed backup into place.
- Apply retention policy and log results.
Invariants: A backup is only marked successful after verification.
Failure modes: Inconsistent snapshots, partial archives, or silent corruption.
Minimal concrete example
# Create a manifest of SHA-256 checksums
find /backups/latest -type f -print0 | xargs -0 sha256sum > /backups/latest/manifest.sha256
# Verify later
sha256sum -c /backups/latest/manifest.sha256
Common misconceptions
- “If the backup command exits 0 it is valid” -> It might still be incomplete or corrupt.
- “File copy is enough for databases” -> Many databases require consistent snapshot tools.
- “Verification is too expensive” -> Periodic verification is cheaper than data loss.
Check-your-understanding questions
- Why is a database directory copy often an invalid backup?
- How does atomic move improve backup reliability?
- What is the minimal verification you should perform after a backup?
- How can retention policies accidentally delete all backups?
Check-your-understanding answers
- The database files can change during copy, producing an inconsistent snapshot.
- It ensures only fully completed backups are visible as “latest”.
- Validate the archive and checksum manifest or equivalent.
- If rules are based on a wrong clock or bug, they may remove everything.
Real-world applications
- Production backup systems for databases and file servers.
- Compliance-driven retention and audit trails.
- Disaster recovery planning and restore drills.
Where you’ll apply it
- This project: Section 3.2, Section 3.5, Section 5.4, Section 6.2.
- Also used in: P01-service-health-dashboard.md for log and audit reporting.
References
- “The Linux Command Line” archiving chapters.
- rsync and tar documentation.
- Database-specific backup guides (PostgreSQL pg_dump, MySQL mysqldump).
Key insights
A backup is only as good as its verification and restore path.
Summary
Integrity requires consistent snapshots, verification, and careful retention; timers only schedule the work.
Homework/exercises to practice the concept
- Create a backup manifest and verify it after modifying a file to see failure.
- Implement a retention script that keeps 7 daily and 4 weekly backups.
- Simulate a partial backup and ensure your script marks it as failed.
Solutions to the homework/exercises
- Change a file and rerun sha256sum -c; it should fail.
- Use date-based folder names and delete older ones after counting.
- Exit with non-zero status and emit a failure log line.
3. Project Specification
3.1 What You Will Build
A backup system that:
- Runs daily incremental backups and weekly full backups.
- Uses systemd timers with persistence and jitter.
- Verifies integrity with checksums and restore tests.
- Sends alerts on failures.
Included: backup script, timer/service units, retention logic, logging.
Excluded: full database snapshot support, multi-region replication.
3.2 Functional Requirements
- Timer Units: daily incremental + weekly full.
- Backup Script: run rsync or tar with manifest.
- Integrity Check: generate SHA256 manifest and verify.
- Retention: keep last 7 incrementals and 4 full backups.
- OnFailure: trigger alert service.
- Restore Test: validate a sample restore weekly.
3.3 Non-Functional Requirements
- Reliability: missed runs execute after reboot.
- Performance: run incrementals in under 10 minutes for 10GB.
- Usability: clear logs and exit codes.
3.4 Example Usage / Output
$ systemctl list-timers backup.timer
NEXT LAST UNIT
Thu 2026-01-02 02:00:00 UTC Wed 2026-01-01 02:00:00 UTC backup.timer
$ journalctl -u backup.service -n 2
Jan 01 02:00:01 host backup[1234]: BACKUP_OK bytes=219430912 duration=32s
3.5 Data Formats / Schemas / Protocols
Backup layout:
/backups/
full/2026-01-05/
inc/2026-01-06/
manifests/2026-01-06.sha256
3.6 Edge Cases
- Disk full during backup.
- Source directory missing.
- Missed run due to laptop sleep.
- Checksum mismatch on restore test.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
sudo cp backup.service backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now backup.timer
3.7.2 Golden Path Demo (Deterministic)
- Use fixed directories: /data -> /backups.
- Use BACKUP_FAKE_TIME=2026-01-01T02:00:00Z for deterministic log timestamps.
3.7.3 If CLI: exact terminal transcript
$ BACKUP_FAKE_TIME=2026-01-01T02:00:00Z sudo systemctl start backup.service
Jan 01 02:00:00 host backup[1001]: BACKUP_OK bytes=219430912 duration=32s
Failure demo:
$ sudo systemctl start backup.service
Jan 01 02:00:01 host backup[1001]: BACKUP_FAILED reason=source_missing
exit code: 6
Exit codes:
- 0: success
- 5: disk full
- 6: source missing
- 7: checksum mismatch
4. Solution Architecture
4.1 High-Level Design
backup.timer -> backup.service -> backup.sh
                |                  |
                |                  +--> checksum + retention
                +--> OnFailure=backup-alert.service
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Timer Units | Scheduling | OnCalendar + Persistent |
| Backup Script | Data copy + checksums | rsync vs tar |
| Alert Unit | Notify on failure | email vs log file |
4.3 Data Structures (No Full Code)
Manifest line format (sha256sum output): <sha256-hex>  <filename>
4.4 Algorithm Overview
Key Algorithm: Incremental Backup
- Determine last full snapshot.
- Run rsync with link-dest into new snapshot dir.
- Generate checksum manifest.
- Rotate old snapshots.
Complexity Analysis:
- Time: O(files changed)
- Space: O(changed data)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y rsync coreutils
5.2 Project Structure
backup/
├── backup.sh
├── backup.service
├── backup.timer
├── backup-alert.service
└── README.md
5.3 The Core Question You’re Answering
“How can I schedule reliable jobs that survive reboots and avoid stampedes?”
5.4 Concepts You Must Understand First
- Timer semantics and persistence.
- Incremental backup strategy.
- Failure notification with OnFailure.
5.5 Questions to Guide Your Design
- How will you verify that backups are restorable?
- How will you avoid all machines backing up at the same time?
- How do you handle disk-full errors?
5.6 Thinking Exercise
Design a weekly schedule for 100 machines such that backups spread across 2 hours.
5.7 The Interview Questions They’ll Ask
- “Why are systemd timers better than cron for laptops?”
- “What does Persistent=true do?”
- “How do you verify backup integrity?”
5.8 Hints in Layers
Hint 1: Write the backup script first.
Hint 2: Wrap it in a .service unit.
Hint 3: Add a .timer with persistence.
Hint 4: Add OnFailure alerts.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Shell scripting | “The Linux Command Line” | archive chapters |
| System admin | “How Linux Works” | scheduling sections |
| Reliability | “Site Reliability Engineering” | monitoring and alerting |
5.10 Implementation Phases
Phase 1: Foundation (2-3 hours)
Goals: backup script runs manually.
Checkpoint: backup.sh creates a snapshot.
Phase 2: Timers (2-3 hours)
Goals: schedule via systemd timers.
Checkpoint: systemctl list-timers shows next run.
Phase 3: Integrity and Alerts (2-4 hours)
Goals: checksums + OnFailure.
Checkpoint: forced failure triggers alert.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Backup tool | rsync vs tar | rsync + link-dest | incremental snapshots |
| Alerting | email vs log file | log + email | minimal and visible |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Script logic | checksum mismatch detection |
| Integration Tests | timer/service | Persistent run after reboot |
| Edge Case Tests | disk full | exit code 5 |
6.2 Critical Test Cases
- Missing source returns exit 6.
- Disk full triggers failure and OnFailure.
- RandomizedDelaySec changes next run time.
6.3 Test Data
source: /data
backup: /backups
7. Common Pitfalls and Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Timer not enabled | never runs | systemctl enable --now |
| No jitter | thundering herd | RandomizedDelaySec |
| No retention | disk fills | delete old snapshots |
7.2 Debugging Strategies
- systemctl list-timers for schedule inspection.
- journalctl -u backup.service for logs.
7.3 Performance Traps
Running full backups daily wastes time and storage.
8. Extensions and Challenges
8.1 Beginner Extensions
- Add config file for paths and retention.
- Add dry-run mode.
8.2 Intermediate Extensions
- Add remote sync (rsync over SSH).
- Encrypt backups with GPG.
8.3 Advanced Extensions
- Integrate with LVM snapshots.
- Add restore verification on a schedule.
9. Real-World Connections
9.1 Industry Applications
- Server backup automation for SMBs.
- Laptop backup for remote teams.
9.2 Related Open Source Projects
- borgbackup, restic (advanced backup systems).
9.3 Interview Relevance
- Explain why timers are more reliable than cron.
10. Resources
10.1 Essential Reading
- systemd timer docs.
- rsync manual.
10.2 Video Resources
- systemd scheduling talks.
10.3 Tools and Documentation
- systemctl, journalctl, and rsync.
10.4 Related Projects in This Series
- P01-service-health-dashboard.md for log monitoring.
- P03-socket-activated-server.md for activation parallels.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain Persistent=true.
- I can describe incremental backups.
- I can explain OnFailure.
11.2 Implementation
- Timers run on schedule.
- Checksums are generated and verified.
- Failures trigger alerts.
11.3 Growth
- I can restore a file from backup.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Timer triggers backups.
- Logs show success/failure.
Full Completion:
- Integrity checks and retention policies work.
Excellence (Going Above and Beyond):
- Remote, encrypted backups with periodic restore tests.