Orchestrate Your Pipeline
Goal: Build a scheduled, observable pipeline that extracts data, transforms it with dbt, and publishes trusted outputs. You will learn orchestration concepts such as DAGs, retries, idempotency, and backfills. By the end, you will be able to reason about pipeline reliability and failure recovery.
Context and Problem
- Real-world scenario: A nightly pipeline must refresh analytics before business hours.
- Stakeholders and constraints: Analysts need fresh data; Ops needs reliable scheduling. Constraints include limited compute and dependency ordering.
- What happens if this system fails? Reports are stale, leading to missed business decisions.
Real World Outcome
- You see a DAG with tasks that succeed in order.
- Example DAG run log:
Task extract_data: SUCCESS (2024-06-02 00:01)
Task dbt_run: SUCCESS (2024-06-02 00:04)
Task dbt_test: SUCCESS (2024-06-02 00:06)
Task notify: SUCCESS (2024-06-02 00:06)
- You can trigger a backfill and verify results for a previous date.
Core Concepts
- DAG orchestration: Directed acyclic graphs that define task dependencies and execution order.
- Idempotency: Safe re-runs without duplicating data (a date-partitioned extract is sketched after this list).
- Retries and alerting: Handling transient failures.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
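To make idempotency concrete, here is a minimal sketch of a date-partitioned extract: re-running it for the same date overwrites that date's partition instead of appending. The SQLite source, the orders table, and the raw/orders output path are assumptions; substitute your own source and storage.

```python
import csv
import sqlite3
from pathlib import Path

def extract_orders(run_date: str, out_dir: str = "raw/orders") -> int:
    """Extract one day of orders into a date-partitioned file.

    Re-running for the same run_date overwrites the partition rather than
    appending, which keeps the task idempotent. The source database and the
    'orders' table are placeholders for your own warehouse or API.
    """
    partition = Path(out_dir) / f"ds={run_date}"
    partition.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect("source.db")  # assumed local source for this sketch
    rows = conn.execute(
        "SELECT order_id, customer_id, amount, created_at "
        "FROM orders WHERE date(created_at) = ?",
        (run_date,),
    ).fetchall()
    conn.close()

    # Overwrite the partition file: the same input date always produces the same output.
    out_file = partition / "orders.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "created_at"])
        writer.writerows(rows)
    return len(rows)
```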
Architecture
+---------------------+      +----------------+      +--------------------+
| Scheduler (Airflow) | ---> | Extract Script | ---> | dbt run + dbt test |
+---------------------+      +----------------+      +--------------------+
                                                               |
                                                               +---> notify
Data Model
- Uses warehouse models from Projects 3 and 5.
- Pipeline metadata table (optional): pipeline_runs(run_id, start_time, status, rows_loaded); a sketch for recording these runs follows the example below.
- Example record:
run_id=20240602_0000 start_time=2024-06-02T00:00:01Z status=SUCCESS rows_loaded=12417
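If you add the optional metadata table, one approach is to write a record at the end of every run. This is a minimal sketch that assumes a local SQLite file for metadata and derives run_id from the start time; adapt the connection and table location to your warehouse.

```python
import sqlite3
from datetime import datetime, timezone

def record_run(status: str, rows_loaded: int, db_path: str = "pipeline_meta.db") -> None:
    """Insert one pipeline_runs record; run_id is derived from the start time."""
    start_time = datetime.now(timezone.utc)
    run_id = start_time.strftime("%Y%m%d_%H%M")  # e.g. 20240602_0000

    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pipeline_runs (
               run_id TEXT PRIMARY KEY,
               start_time TEXT,
               status TEXT,
               rows_loaded INTEGER
           )"""
    )
    # INSERT OR REPLACE keeps re-recording the same run_id idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO pipeline_runs VALUES (?, ?, ?, ?)",
        (run_id, start_time.isoformat(), status, rows_loaded),
    )
    conn.commit()
    conn.close()
```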
Implementation Plan
- Choose an orchestrator (Airflow, Prefect, or Dagster).
- Create a DAG with tasks: extract, dbt run, dbt test, notify (a minimal Airflow sketch follows this list).
- Add retry rules and timeouts.
- Implement idempotent extract (date-partitioned files or tables).
- Add logging and alerts for failures.
- Schedule daily runs at 00:00.
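If you choose Airflow, the plan above might translate into a DAG like this minimal sketch (Airflow 2.x; the schedule argument assumes 2.4+). The extract.py script, the dbt project path, and the echo-based notify task are placeholders for your own extract logic and alerting.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 2,                                # retry transient failures
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=30),  # kill tasks that hang
}

with DAG(
    dag_id="nightly_analytics",
    start_date=datetime(2024, 6, 1),
    schedule="0 0 * * *",   # daily at 00:00
    catchup=False,          # backfills are triggered explicitly, not on deploy
    default_args=default_args,
) as dag:
    extract_data = BashOperator(
        task_id="extract_data",
        # {{ ds }} is the logical run date, which keeps the extract partition-aware
        bash_command="python extract.py --date {{ ds }}",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/analytics_dbt",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/analytics_dbt",
    )
    notify = BashOperator(
        task_id="notify",
        bash_command="echo 'nightly_analytics finished'",  # replace with Slack/email alerting
    )

    # notify only runs after tests pass, so failures stop the publish step
    extract_data >> dbt_run >> dbt_test >> notify
```

Passing the logical date into the extract is what makes backfills reproducible: re-running a past date re-creates exactly that date's partition.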
Validation and Data Quality
- Accuracy: dbt tests pass after each run.
- Completeness: Extract task loads expected row counts (a row-count check is sketched after this list).
- Consistency: Each run writes to the correct partition.
- Timeliness: Pipeline finishes before 06:00.
- Traceability: Store run metadata and logs.
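One way to automate the completeness check is a small task that counts rows in the run's partition and fails loudly when the count is below a threshold. The stg_orders table, load_date column, warehouse path, and the 1,000-row threshold below are assumptions.

```python
import sqlite3

def check_completeness(run_date: str, min_rows: int = 1000,
                       db_path: str = "warehouse.db") -> int:
    """Fail the task if the day's partition holds fewer rows than expected."""
    conn = sqlite3.connect(db_path)
    (row_count,) = conn.execute(
        "SELECT COUNT(*) FROM stg_orders WHERE load_date = ?", (run_date,)
    ).fetchone()
    conn.close()

    if row_count < min_rows:
        # Raising makes the orchestrator mark the task failed and trigger retries/alerts.
        raise ValueError(
            f"Completeness check failed for {run_date}: "
            f"{row_count} rows loaded, expected at least {min_rows}"
        )
    return row_count
```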
Failure Modes and Debugging
- Duplicate loads: Retries re-ingest the same data.
  - Symptom: row counts double. Fix: use partition overwrite or idempotent keys.
- Missing upstream data: Extract task runs before the source is ready.
  - Symptom: empty partitions. Fix: add sensor tasks or data availability checks (see the sensor sketch after this list).
- Partial failure: dbt run succeeds but dbt test fails.
  - Symptom: models are built but tests fail. Fix: block the publish/notify step until tests pass.
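To guard against extracting before the source is ready, you can gate the extract behind an availability check. The sketch below uses Airflow's FileSensor to wait for a marker file; the filepath is an assumption, and a SQL-based readiness query works just as well.

```python
from airflow.sensors.filesystem import FileSensor

# Gate the extract behind a data-availability check: wait for the source
# system to drop a marker file for the run's logical date before extracting.
wait_for_source = FileSensor(
    task_id="wait_for_source",
    filepath="/data/incoming/orders/{{ ds }}/_SUCCESS",  # assumed marker file path
    poke_interval=300,       # re-check every 5 minutes
    timeout=2 * 60 * 60,     # fail (and alert) if the source is not ready within 2 hours
    mode="reschedule",       # free the worker slot between checks
)

# Inside the DAG sketched earlier: wait_for_source >> extract_data
```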
Definition of Done
- DAG runs successfully on schedule for three consecutive days.
- Backfill works for a prior date without duplicates.
- Logs and alerts show task status and runtime.
- Data quality checks run automatically after transformations.
- Failures are recoverable with documented steps.