Build a Full Local Data Stack with Docker Compose

Goal: Assemble a full local data stack (warehouse, orchestrator, dbt, BI, API simulator) using Docker Compose. You will learn how services connect, how to manage configuration, and how to validate end-to-end pipelines. By the end, you can spin up a reproducible data platform on any machine.

Context and Problem

  • Real-world scenario: Teams need a consistent local environment to develop and test pipelines.
  • Stakeholders and constraints: Engineers need reproducibility; analysts need a realistic sandbox. Constraints include local compute limits and available network ports.
  • What happens if this system fails? Onboarding is slow and pipeline testing becomes unreliable.

Real World Outcome

  • You run docker compose up and all services become healthy.
  • Example service list:
    • Postgres or DuckDB container (warehouse)
    • Airflow or Prefect (orchestrator)
    • dbt container (transformations)
    • Metabase or Superset (BI)
    • Mock API service for ingestion
  • Example CLI transcript (the dbt container runs transformations on demand, so it may not appear as running in the ps output):
$ docker compose up -d
[+] Running 6/6
$ docker compose ps
NAME                 STATUS
warehouse             running
orchestrator          running
bi                    running
mock-api              running

Core Concepts

  • Service orchestration: Wiring containers together over a shared network and persisting state with volumes.
  • Configuration management: Environment variables and secrets kept out of the compose file.
  • Reproducibility: Deterministic builds with pinned image versions (see the fragment after this list).
  • Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
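
As a concrete illustration of pinning and environment-based configuration, a single service entry in the compose file might look like the fragment below. The image tag, variable names, and port mapping are assumptions for this sketch, not requirements.

warehouse:
  image: postgres:16.2                       # exact tag, never "latest", so rebuilds are deterministic
  environment:
    POSTGRES_USER: ${WAREHOUSE_USER}         # interpolated by Compose from a local .env file
    POSTGRES_PASSWORD: ${WAREHOUSE_PASSWORD} # keeps the secret out of version control
  ports:
    - "5433:5432"                            # remap the host port if 5432 is already taken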

Architecture

+-----------+     +-------------+     +-----------+
| Mock API  | --> | Ingest Task | --> | Warehouse |
+-----------+     +-------------+     +-----------+
                                        |
                                        +--> dbt models
                                        |
                                        +--> BI tool

Data Model

  • Warehouse schema from Projects 3 and 5.
  • Orchestrator metadata database (if required).
  • Mock API provides JSON records compatible with the ingestion scripts (a sample payload follows the example record).
  • Example record:
fact_sales: sales_key=5012 date_key=20240512 customer_key=92 product_key=18 quantity=3 revenue=76.50
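
Since the mock API serves JSON, the same record might arrive as a payload like the one below. The flat field layout is an assumption; a real endpoint may wrap records in an envelope or paginate them.

{
  "sales_key": 5012,
  "date_key": 20240512,
  "customer_key": 92,
  "product_key": 18,
  "quantity": 3,
  "revenue": 76.50
}

The ingestion task can then map these fields one-to-one onto the fact_sales columns.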

Implementation Plan

  1. Create a docker-compose.yml with pinned image versions (a skeleton is sketched after this list).
  2. Define services, networks, and volumes for persistence.
  3. Add environment variables for credentials and ports.
  4. Provide a make up target or a ./scripts/start.sh helper script.
  5. Seed the warehouse with sample data.
  6. Run the pipeline and validate dashboards.
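
To anchor step 1, here is a minimal docker-compose.yml skeleton. It is a sketch under stated assumptions, not a reference implementation: it assumes Postgres as the warehouse, Metabase as the BI tool, and locally built images for the mock API, orchestrator, and dbt; the service names, image tags, ports, and variable names are illustrative and should be replaced with whatever you actually pin.

services:
  warehouse:
    image: postgres:16.2                        # pin an exact tag for reproducibility
    environment:
      POSTGRES_USER: ${WAREHOUSE_USER}          # values come from a local .env file
      POSTGRES_PASSWORD: ${WAREHOUSE_PASSWORD}
      POSTGRES_DB: analytics
    volumes:
      - warehouse_data:/var/lib/postgresql/data # persist data across restarts
    ports:
      - "5432:5432"
    networks: [data_stack]

  mock-api:
    build: ./mock_api                           # local image serving sample JSON records
    ports:
      - "8000:8000"
    networks: [data_stack]

  orchestrator:
    build: ./orchestrator                       # Airflow or Prefect image with DAGs/flows baked in
    depends_on:
      - warehouse
      - mock-api
    networks: [data_stack]

  dbt:
    build: ./dbt                                # dbt project and profiles baked into the image
    depends_on:
      - warehouse
    networks: [data_stack]

  bi:
    image: metabase/metabase:v0.48.0            # example tag; verify and pin the release you use
    ports:
      - "3000:3000"
    depends_on:
      - warehouse
    networks: [data_stack]

volumes:
  warehouse_data:

networks:
  data_stack:

The plain depends_on entries only order container startup; the health-check variant under Failure Modes and Debugging shows how to wait until the warehouse is actually ready to accept connections.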

Validation and Data Quality

  • Accuracy: dbt tests pass after the pipeline run (a schema test sketch follows this list).
  • Completeness: ingestion loads expected row counts.
  • Consistency: services use the same schema versions.
  • Timeliness: end-to-end pipeline completes within a set SLA.
  • Traceability: logs available for each container and run.
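
For the accuracy check, dbt's built-in generic tests are enough to start with. A minimal schema.yml sketch, assuming the model is named fact_sales and using the column names from the example record:

# models/marts/schema.yml
version: 2

models:
  - name: fact_sales
    columns:
      - name: sales_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - not_null
      - name: revenue
        tests:
          - not_null

Running dbt test (or dbt build) as the last orchestrated step turns these into the pass/fail accuracy signal.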

Failure Modes and Debugging

  • Port conflicts: Services fail to start.
    • Symptom: “address already in use”. Fix by changing ports or stopping conflicting services.
  • Volume permission issues: Data not persisted.
    • Symptom: missing database files. Fix by adjusting volume permissions.
  • Service dependency timing: Orchestrator starts before the warehouse is ready.
    • Symptom: connection errors. Fix by adding health checks and depends_on conditions (see the sketch after this list).
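
One way to address the dependency-timing failure, assuming Postgres as the warehouse: give the warehouse a healthcheck and make dependents wait for it to pass rather than merely for the container to start. The interval and retry values are illustrative.

services:
  warehouse:
    image: postgres:16.2
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${WAREHOUSE_USER:-postgres}"]
      interval: 5s
      timeout: 5s
      retries: 10

  orchestrator:
    depends_on:
      warehouse:
        condition: service_healthy   # wait for the healthcheck to pass, not just for startup

This replaces the plain depends_on lists from the skeleton above, and docker compose ps will then report the warehouse as healthy once pg_isready succeeds.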

Definition of Done

  • docker compose up brings all services to a healthy state.
  • Pipeline runs end-to-end with no manual intervention.
  • BI dashboard shows expected metrics.
  • All services are documented with ports and credentials.
  • Environment is reproducible on a fresh machine.