Project 17: Field-Ready Environmental Sentinel (Capstone)

Build a fully integrated field device that senses, captures images, publishes telemetry, serves a local dashboard, and self-heals under power constraints.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 4–6 weeks (part-time) |
| Main Programming Language | Python |
| Alternative Programming Languages | Go, Rust, C |
| Coolness Level | Very High |
| Business Potential | High |
| Prerequisites | Projects 1–16, Linux ops, networking |
| Key Topics | System integration, resource budgeting, reliability, telemetry pipelines |

1. Learning Objectives

By completing this project, you will:

  1. Integrate sensors, camera, MQTT, dashboard, and health monitoring into one device.
  2. Manage power, storage, and network reliability under real constraints.
  3. Implement a single health report that reflects system state.
  4. Validate device stability over multi-day runs.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: System Integration Under Resource Constraints

Fundamentals

A field device is a system of subsystems: sensors, storage, network, and user interfaces. Integration is the process of making these subsystems work together without interfering with each other. On a Pi Zero 2 W, resources are limited: CPU, memory, storage bandwidth, and power. A successful integration balances these resources and defines clear data flows, schedules, and failure boundaries. This capstone is less about any single subsystem and more about coordinating them safely.

Deep Dive into the concept

Subsystem integration starts with data flow. A sensor loop produces data, a camera pipeline produces images, and a telemetry pipeline publishes data to MQTT and a local dashboard. Each pipeline has its own timing requirements. If you run them all without coordination, they will compete for CPU and I/O and cause missed deadlines or dropped data. The correct approach is to design a schedule: for example, sensor reads every 10 seconds, camera captures every 60 seconds, MQTT publishes every 30 seconds, and health checks every 15 seconds. Each task should be non-blocking and use its own buffer.
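
A minimal sketch of such a schedule, assuming each task is a fast, non-blocking callable; the function names below are placeholders for the project's real subsystems:

import time

# Placeholder task callables; each must return quickly and never block.
def read_sensors():    pass
def capture_image():   pass
def publish_mqtt():    pass
def check_health():    pass

# Seconds between runs, matching the schedule described above.
SCHEDULE = {read_sensors: 10, capture_image: 60, publish_mqtt: 30, check_health: 15}

def run_scheduler():
    next_due = {task: 0.0 for task in SCHEDULE}           # next due time per task
    while True:
        now = time.monotonic()
        for task, interval in SCHEDULE.items():
            if now >= next_due[task]:
                try:
                    task()                                # errors stay inside the task boundary
                except Exception as exc:
                    print(f"{task.__name__} failed: {exc}")
                next_due[task] = now + interval
        time.sleep(0.5)                                   # coarse tick; fine for 10 s+ intervals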

Resource budgeting is the second pillar. The Pi Zero 2 W has limited CPU and a microSD card that can be easily saturated by writes. You must set explicit limits: maximum log size, image retention, MQTT queue size, and CPU usage. These limits should be enforced by code. For example, if the MQTT queue exceeds 1000 messages, drop the oldest. If disk usage exceeds 80%, delete old images. These policies prevent catastrophic failure.
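
A sketch of how two of these budgets could be enforced; the 1000-message and 80% figures mirror the examples above, and delete_oldest_images is a hypothetical retention helper:

from collections import deque
import shutil

MAX_QUEUE = 1000
telemetry_queue = deque(maxlen=MAX_QUEUE)     # deque drops the oldest entry once full

def enqueue(reading):
    telemetry_queue.append(reading)           # bounded: cannot grow past MAX_QUEUE

def disk_usage_fraction(path="/"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

# Applying the disk budget (path and helper below are hypothetical):
# if disk_usage_fraction("/home/pi/images") > 0.8:
#     delete_oldest_images()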

Integration also requires unified configuration. Each subsystem should read from a shared configuration file or environment variables to avoid conflicting settings. For example, the sensor loop and dashboard should agree on sampling interval; the timelapse pipeline and retention policy should agree on storage limits. A single configuration file makes deployments reproducible and reduces human error.
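
A minimal sketch of a shared configuration loader, assuming PyYAML is available (it is not in the §5.1 install line, so add it if you take this approach) and a config.yaml like the one in §5.2:

import yaml   # assumes PyYAML: pip install pyyaml

DEFAULTS = {
    "intervals": {"sensor": 10, "camera": 60, "mqtt": 30, "health": 15},
    "retain_days": 7,
    "max_disk_fraction": 0.8,
}

def load_config(path="config.yaml"):
    with open(path) as fh:
        user_cfg = yaml.safe_load(fh) or {}
    return {**DEFAULTS, **user_cfg}   # shallow merge: every subsystem reads the same dict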

Failure boundaries are crucial. If the camera fails, the rest of the system should continue. If Wi-Fi drops, the device should buffer telemetry and continue local logging. This requires explicit isolation: each subsystem runs independently and reports health to the central monitor. The health monitor then decides whether to restart a subsystem or keep running. This design prevents a single failure from cascading.
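
A sketch of process-level isolation using the standard multiprocessing module; sensor_loop and camera_loop stand in for the project's real subsystem entry points, and the restart policy here is deliberately simple (Concept 2 adds backoff):

import multiprocessing as mp
import time

def sensor_loop():                      # placeholder subsystem entry point
    while True:
        time.sleep(10)

def camera_loop():                      # placeholder subsystem entry point
    while True:
        time.sleep(60)

def supervise(targets):
    procs = {name: mp.Process(target=fn, name=name, daemon=True) for name, fn in targets.items()}
    for proc in procs.values():
        proc.start()
    while True:
        for name, proc in procs.items():
            if not proc.is_alive():     # a crash stays contained in its own process
                print(f"{name} exited ({proc.exitcode}); restarting")
                procs[name] = mp.Process(target=targets[name], name=name, daemon=True)
                procs[name].start()
        time.sleep(5)

if __name__ == "__main__":
    supervise({"sensor": sensor_loop, "camera": camera_loop})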

How this fits into the projects

This concept ties together all prior projects and appears in §3 (requirements) and §5.10 (implementation phases).

Definitions & key terms

  • Subsystem: Independent functional component (sensor, camera, network).
  • Resource budget: Explicit limit on CPU, memory, storage, or network use.
  • Failure boundary: A place where failure is contained and isolated.

Mental model diagram (ASCII)

Sensors -> Log -> MQTT
Camera  -> Storage -> Dashboard
Health Monitor -> Recovery Actions

How it works (step-by-step, with invariants and failure modes)

  1. Define schedules for each subsystem.
  2. Enforce resource limits for storage and queues.
  3. Isolate failures with independent processes.
  4. Report health and recover when needed.

Failure modes:

  • Unbounded logging -> disk full.
  • Blocking camera -> sensor delays.
  • Network outages -> data loss without buffering.

Minimal concrete example

# Retention policy sketch; disk_usage and delete_oldest_images stand in for the
# project's own helpers (see resource budgeting above).
if disk_usage > 0.8:
    delete_oldest_images()

Common misconceptions

  • “Subsystems can share resources without coordination.” They will compete.
  • “One big loop is simpler.” It couples failures and timing.

Check-your-understanding questions

  1. Why do you need explicit resource budgets?
  2. What happens if the camera pipeline blocks the main loop?
  3. Why isolate subsystems into independent processes?

Check-your-understanding answers

  1. Without budgets, resource use grows until failure.
  2. Sensor data and MQTT publishing will lag or fail.
  3. To prevent a single failure from crashing the entire system.

Real-world applications

  • Remote environmental monitoring, agriculture sensors, and field cameras.

Where you’ll apply it

In §3 (functional requirements) and §5.10 (implementation phases), where the subsystem schedule and resource budgets become concrete.

References

  • “Making Embedded Systems” — system integration
  • SRE practices for reliability

Key insights

Integration succeeds when resource limits and failure boundaries are explicit.

Summary

A field device is a coordinated set of subsystems; schedules and budgets keep it stable.

Homework/Exercises to practice the concept

  1. Create a schedule table for all subsystems.
  2. Define storage and queue limits.
  3. Simulate a camera failure and ensure other subsystems continue.

Solutions to the homework/exercises

  1. Assign intervals and durations for each task.
  2. Enforce limits in code (max files, max queue).
  3. Kill camera process and verify sensor loop continues.

Concept 2: Reliability, Observability, and Field Operations

Fundamentals

Field devices must run unattended, so they need observability (logs, metrics, health reports) and reliable recovery paths. Reliability means that the device can handle failures and continue operating. Observability means you can understand what happened from logs and telemetry. Field operations include safe updates, rollback, and clear “ready” signals. This concept ensures your device is not just functional, but operationally trustworthy.

Deep Dive into the concept

Reliability is about managing known failure modes: power loss, network loss, storage exhaustion, and software crashes. Each failure should have a defined recovery. For power loss, ensure the filesystem is robust (Project 12) and logs are flushed. For network loss, buffer MQTT data and retry. For storage exhaustion, enforce retention policies. For software crashes, use systemd to restart services with backoff (Project 16).
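
If a subsystem is supervised from your own code rather than by systemd, a simple exponential backoff between restart attempts might look like the following sketch (restart_service is a placeholder callable):

import time

def restart_with_backoff(restart_service, max_delay=300.0):
    delay = 1.0
    while True:
        try:
            restart_service()                  # placeholder callable that raises on failure
            return                             # restart succeeded; stop retrying
        except Exception as exc:
            print(f"restart failed: {exc}; retrying in {delay:.0f}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped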

Observability is achieved through structured logs and health endpoints. A structured log is machine-readable (e.g., JSON or CSV) and includes timestamps, severity, and subsystem identifiers. For this project, you should create a /health endpoint on the local dashboard that reports overall status and recent errors. This endpoint should be deterministic for tests (support fixed timestamps). The health report should include: uptime, battery estimate, disk usage, sensor status, camera status, MQTT connection state, and recovery count.
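
A minimal Flask sketch of such an endpoint (Flask is installed in §5.1); the field values are placeholders to be filled from each subsystem, and the FIXED_TIME override mirrors the deterministic demo in §3.7.2:

import os
from datetime import datetime, timezone
from flask import Flask, jsonify

app = Flask(__name__)

def current_timestamp():
    # Deterministic mode: honour FIXED_TIME so tests can pin the timestamp.
    return os.environ.get("FIXED_TIME") or datetime.now(timezone.utc).isoformat()

@app.route("/health")
def health():
    # Placeholder values; a real device gathers these from each subsystem.
    return jsonify({
        "device": "sentinel-01",
        "timestamp": current_timestamp(),
        "uptime_s": 0,
        "battery_pct": None,
        "disk_pct": 0,
        "sensors": "ok",
        "camera": "ok",
        "mqtt": "ok",
        "recoveries": 0,
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)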

Field operations include update strategy. A simple approach is to deploy updates via SSH and restart services. A more advanced approach uses A/B partitions or staged updates. For this project, document a safe update checklist: stop services, deploy files, restart services, verify health. Rollback should be possible by keeping the previous version. You should also include a “ready” signal: a file or HTTP endpoint that indicates the device is operational. This allows external systems to validate health after boot or update.
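
A minimal sketch of a file-based “ready” signal; the path below is a hypothetical choice and any writable location works:

from pathlib import Path

READY_FILE = Path("/run/sentinel/ready")          # hypothetical marker path

def mark_ready():
    READY_FILE.parent.mkdir(parents=True, exist_ok=True)
    READY_FILE.write_text("ok\n")                 # external tooling can poll for this file

def mark_not_ready():
    READY_FILE.unlink(missing_ok=True)            # clear the signal during updates or failures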

Testing for reliability requires long runs. A 72-hour run with simulated failures (power loss, network loss) demonstrates robustness. During the run, logs should show recoveries and no unbounded growth in storage usage. This is the key output of the capstone: a field-ready device that can survive in the real world.

How this fits into the projects

This concept integrates Projects 12, 13, and 16 and is reflected in §3.7 and §6.

Definitions & key terms

  • Observability: Ability to infer system state from outputs.
  • Health endpoint: API that reports device status.
  • Rollback: Reverting to a previous version after failure.

Mental model diagram (ASCII)

Logs + Metrics -> Health Report -> Recovery -> Stable Operation

How it works (step-by-step, with invariants and failure modes)

  1. Collect metrics from subsystems.
  2. Publish health report locally and via MQTT.
  3. Trigger recovery on failures with backoff.
  4. Verify readiness after recovery.

Failure modes:

  • No health visibility -> silent failure.
  • No rollback -> bad updates brick device.
  • No retention -> storage exhaustion.

Minimal concrete example

# Aggregated snapshot published to MQTT and shown on the local dashboard.
health = {"mqtt": "ok", "camera": "ok", "disk_pct": 72, "recoveries": 1}

Common misconceptions

  • “If it works once, it’s ready.” Field devices need endurance tests.
  • “Logs are enough.” Without structured health, logs are too noisy.

Check-your-understanding questions

  1. Why is a health endpoint better than logs alone?
  2. What is a safe update sequence for a field device?
  3. Why should you simulate failures before deployment?

Check-your-understanding answers

  1. Health endpoints provide a summary without manual log parsing.
  2. Stop services, deploy, restart, verify health, rollback if needed.
  3. Real failures reveal weaknesses that tests miss.

Real-world applications

  • Remote monitoring stations, edge AI devices, critical sensors.

Where you’ll apply it

In §3.7 (real-world outcome) and §6 (testing strategy), where health reporting and recovery behavior are exercised.

References

  • “Site Reliability Engineering” — health and monitoring

Key insights

Reliability is proven by recovery behavior, not by feature lists.

Summary

A field device must expose health, recover safely, and support safe updates.

Homework/Exercises to practice the concept

  1. Design a health JSON schema.
  2. Simulate network loss and verify buffering.
  3. Define a rollback checklist.

Solutions to the homework/exercises

  1. Include metrics and status for each subsystem.
  2. Disconnect Wi-Fi and check queued MQTT messages.
  3. Document steps and verify they restore previous version.

3. Project Specification

3.1 What You Will Build

A fully integrated environmental sentinel with sensors, camera, MQTT telemetry, local dashboard, and self-healing.

3.2 Functional Requirements

  1. Collect sensor data and log it locally.
  2. Capture images on schedule.
  3. Publish telemetry to MQTT.
  4. Serve a local dashboard with health status.
  5. Monitor and recover from failures.

3.3 Non-Functional Requirements

  • Performance: Sensor loop must meet interval ±5%.
  • Reliability: 72-hour run without unrecovered failure.
  • Usability: Health report accessible locally.

3.4 Example Usage / Output

$ ./sentinel_status
Device: sentinel-01  Uptime: 3d 4h
Battery: 62%  Runtime estimate: 18h
Sensors: Temp 21.9 C  Humidity 46%
Camera: last image 10:20:00
MQTT: connected  RSSI: -61 dBm
Health: OK  Recoveries: 1

3.5 Data Formats / Schemas / Protocols

Unified health JSON:

{"device":"sentinel-01","uptime":"3d4h","battery":62,"mqtt":"ok","camera":"ok","errors":1}

3.6 Edge Cases

  • Wi-Fi offline for 12 hours.
  • Camera failure mid-run.
  • SD card near full.

3.7 Real World Outcome

A field-ready device runs unattended, recovers from failures, and reports health consistently.

3.7.1 How to Run (Copy/Paste)

python3 sentinel.py --config config.yaml

3.7.2 Golden Path Demo (Deterministic)

export FIXED_TIME="2026-01-01T15:00:00Z"
python3 sentinel.py --simulate --healthy

Expected output:

[2026-01-01T15:00:00Z] Health OK, recoveries=0

3.7.3 Failure Demo (Deterministic)

python3 sentinel.py --simulate --camera-fail

Expected output:

[ERROR] Camera failure detected, restarting camera service

Exit code: 171

3.7.4 CLI Exit Codes

  • 0: Success
  • 170: Config invalid
  • 171: Critical subsystem failure

4. Solution Architecture

4.1 High-Level Design

Sensors -> Logger -> MQTT
Camera  -> Storage -> Dashboard
Health Monitor -> Recovery -> Report

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Sensor Service | Periodic readings | Interval and caching |
| Camera Service | Timelapse capture | Resolution + retention |
| MQTT Service | Publish telemetry | QoS + buffering |
| Dashboard | Local UI | Polling interval |
| Health Monitor | Recovery actions | Backoff policy |

4.3 Data Structures (No Full Code)

config = {"intervals": {"sensor":10, "camera":60}, "retain_days":7}

4.4 Algorithm Overview

Key Algorithm: Coordinated Scheduler

  1. Load config and start services.
  2. Run each subsystem on its own schedule.
  3. Aggregate health and expose status (see the sketch below).
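
Step 3 might look like this minimal sketch; the per-subsystem report shape is a hypothetical convention:

def aggregate_health(reports):
    """Combine per-subsystem reports into one status document."""
    overall = "ok" if all(r.get("status") == "ok" for r in reports.values()) else "degraded"
    return {"overall": overall, "subsystems": reports}

# Example: aggregate_health({"sensor": {"status": "ok"}, "camera": {"status": "error"}})
# returns {"overall": "degraded", "subsystems": {...}}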

Complexity Analysis:

  • Time: O(n) subsystems per cycle
  • Space: O(1) plus buffers

5. Implementation Guide

5.1 Development Environment Setup

pip install paho-mqtt flask

5.2 Project Structure

project-root/
├── sentinel.py
├── services/
│   ├── sensor.py
│   ├── camera.py
│   ├── mqtt.py
│   └── health.py
└── config.yaml

5.3 The Core Question You’re Answering

“Can you design a device that survives real-world conditions without babysitting?”

5.4 Concepts You Must Understand First

  1. System integration and scheduling.
  2. Power and storage budgeting.
  3. Reliability and recovery.

5.5 Questions to Guide Your Design

  1. What are your critical health metrics?
  2. How will you update safely in the field?

5.6 Thinking Exercise

Draw the data pipeline and mark where data can be lost.

5.7 The Interview Questions They’ll Ask

  1. What are top failure modes for field devices?
  2. How do you design safe update and rollback?
  3. How do you balance power, storage, network constraints?

5.8 Hints in Layers

Hint 1: Integrate subsystems one at a time.

Hint 2: Add a unified health endpoint.

Hint 3: Simulate failures before field deployment.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| System integration | Making Embedded Systems | Ch. 10 |
| Reliability | Site Reliability Engineering | Ch. 5 |

5.10 Implementation Phases

Phase 1: Subsystem integration (1 week)

  • Integrate sensors and logging.

Phase 2: Telemetry + dashboard (1 week)

  • Add MQTT and local UI.

Phase 3: Reliability (1 week)

  • Add health monitoring and recovery.

Phase 4: Field test (1 week)

  • Run 72-hour test with simulated failures.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Scheduling | Single loop / multi-service | Multi-service | Isolation |
| Buffering | Memory / Disk | Disk for telemetry | Survives reboots |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Service logic | Sensor read errors |
| Integration Tests | Full pipeline | MQTT + dashboard |
| Endurance Tests | 72-hour run | Failure simulations |

6.2 Critical Test Cases

  1. Wi-Fi loss -> MQTT buffering (see the test sketch after this list).
  2. Camera failure -> recovery without crash.
  3. Storage retention prevents disk overflow.
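
A sketch of how the buffering behavior behind test case 1 could be checked with pytest; the bounded deque mirrors the queue-budget sketch from Concept 1:

from collections import deque

def test_buffer_drops_oldest_when_full():
    queue = deque(maxlen=3)            # stands in for the telemetry buffer
    for i in range(5):
        queue.append(i)
    assert list(queue) == [2, 3, 4]    # oldest readings dropped, size stayed bounded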

6.3 Test Data

72-hour run with 60s camera interval

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| No retention | Disk full | Enforce limits |
| Blocking tasks | Missed schedules | Separate services |
| No health report | Hard to debug | Add health endpoint |

7.2 Debugging Strategies

  • Use structured logs per subsystem.
  • Simulate failures one at a time.

7.3 Performance Traps

  • Too frequent camera captures saturate storage.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add GPS module for location tagging.

8.2 Intermediate Extensions

  • Add cellular modem for remote connectivity.

8.3 Advanced Extensions

  • Implement A/B updates with rollback.

9. Real-World Connections

9.1 Industry Applications

  • Environmental monitoring stations and wildlife cameras.
  • balenaOS fleet management.

9.3 Interview Relevance

  • System integration and reliability are key interview themes.

10. Resources

10.1 Essential Reading

  • Raspberry Pi OS docs on services and networking.

10.2 Video Resources

  • Field device architecture talks.

10.3 Tools & Documentation

  • systemd docs, MQTT broker docs.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the integration strategy for all subsystems.
  • I can explain health and recovery logic.

11.2 Implementation

  • Device runs 72 hours without unrecovered failure.
  • Health report is consistent with telemetry.

11.3 Growth

  • I can present the project as a full system in interviews.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Sensors + MQTT + local dashboard working.

Full Completion:

  • 72-hour run with recovery logs.

Excellence (Going Above & Beyond):

  • A/B update system and remote rollout strategy.