Build a Full Local Data Stack with Docker Compose
Goal: Assemble a full local data stack (warehouse, orchestrator, dbt, BI, API simulator) using Docker Compose. You will learn how services connect, how to manage configuration, and how to validate end-to-end pipelines. By the end, you will be able to spin up a reproducible data platform on any machine.
Context and Problem
- Real-world scenario: Teams need a consistent local environment to develop and test pipelines.
- Stakeholders and constraints: Engineers need reproducibility; Analysts need a realistic sandbox. Constraints include local compute limits and network ports.
- What happens if this system fails? Onboarding is slow and pipeline testing becomes unreliable.
Real World Outcome
- You run docker compose up and all services become healthy.
- Example service list:
- Postgres or DuckDB container (warehouse)
- Airflow or Prefect (orchestrator)
- dbt container (transformations)
- Metabase or Superset (BI)
- Mock API service for ingestion
- Example CLI transcript:
$ docker compose up -d
[+] Running 6/6
$ docker compose ps
NAME           STATUS
warehouse      running
orchestrator   running
bi             running
mock-api       running
Core Concepts
- Service orchestration: Linking containers via shared networks and volumes (a compose sketch follows this list).
- Configuration management: Environment variables and secrets.
- Reproducibility: Deterministic builds with fixed versions.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
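These concepts map onto just a few lines of a compose file. A minimal sketch, assuming Postgres as the warehouse; the image tag, variable names, volume name, and network name are placeholders rather than required values:

services:
  warehouse:
    image: postgres:16.3                          # pinned tag for deterministic builds
    environment:
      POSTGRES_USER: ${WAREHOUSE_USER}            # read from .env rather than hard-coded
      POSTGRES_PASSWORD: ${WAREHOUSE_PASSWORD}
    volumes:
      - warehouse_data:/var/lib/postgresql/data   # named volume so data survives restarts
    networks:
      - data_stack                                # shared network that other services join

volumes:
  warehouse_data:

networks:
  data_stack: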
Architecture
+----------+     +-------------+     +-----------+
| Mock API | --> | Ingest Task | --> | Warehouse |
+----------+     +-------------+     +-----------+
                                           |
                                           +--> dbt models
                                           |
                                           +--> BI tool
Data Model
- Warehouse schema from Projects 3 and 5.
- Orchestrator metadata database (if required).
- Mock API provides JSON records compatible with the ingestion scripts (a sample payload follows below).
- Example record:
fact_sales: sales_key=5012, date_key=20240512, customer_key=92, product_key=18, quantity=3, revenue=76.50
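As a rough illustration, a single payload from the mock API could look like the record below; the field names simply mirror the fact_sales example above, and the real mock service may use a different shape:

{
  "sales_key": 5012,
  "date_key": 20240512,
  "customer_key": 92,
  "product_key": 18,
  "quantity": 3,
  "revenue": 76.50
}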
Implementation Plan
- Create a docker-compose.yml with pinned image versions (a skeleton follows this list).
- Define services, networks, and volumes for persistence.
- Add environment variables for credentials and ports.
- Provide a make up or ./scripts/start.sh helper.
- Seed the warehouse with sample data.
- Run the pipeline and validate dashboards.
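A sketch of the overall docker-compose.yml skeleton, assuming Postgres, Airflow, and Metabase as one possible combination of the services listed earlier; image tags, the mock-api build context, and port mappings are illustrative placeholders:

services:
  warehouse:
    image: postgres:16.3
    env_file: .env                        # credentials stay out of the compose file
    ports:
      - "5432:5432"
    volumes:
      - warehouse_data:/var/lib/postgresql/data
  mock-api:
    build: ./mock-api                     # small local image that serves JSON records
    ports:
      - "8080:8080"
  orchestrator:
    image: apache/airflow:2.9.2           # metadata DB and init steps omitted for brevity
    env_file: .env
    depends_on:
      - warehouse
      - mock-api
    ports:
      - "8081:8080"                       # Airflow UI on localhost:8081, clear of mock-api
  bi:
    image: metabase/metabase:v0.49.14
    depends_on:
      - warehouse
    ports:
      - "3000:3000"

volumes:
  warehouse_data:

dbt is usually invoked by the orchestrator rather than kept running, so it can be packaged as a short-lived service or as an image the orchestrator calls for each run.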
Validation and Data Quality
- Accuracy: dbt tests pass after the pipeline run (an example test file follows this list).
- Completeness: ingestion loads expected row counts.
- Consistency: services use the same schema versions.
- Timeliness: end-to-end pipeline completes within a set SLA.
- Traceability: logs available for each container and run.
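For the accuracy checks, a dbt schema file along these lines is one way to encode them; the model and column names follow the fact_sales example above and will differ if your warehouse schema does:

version: 2

models:
  - name: fact_sales
    columns:
      - name: sales_key
        tests:
          - unique            # surrogate key must not repeat
          - not_null
      - name: revenue
        tests:
          - not_null

Completeness (expected row counts) can be covered by a singular dbt test or checked by the orchestrator immediately after the load step.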
Failure Modes and Debugging
- Port conflicts: Services fail to start.
- Symptom: “address already in use”. Fix by changing ports or stopping conflicting services.
- Volume permission issues: Data not persisted.
- Symptom: missing database files. Fix by adjusting volume permissions.
- Service dependency timing: Orchestrator starts before warehouse is ready.
- Symptom: connection errors. Fix by adding health checks and depends_on conditions (see the snippet after this list).
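A sketch of the health-check pattern for the warehouse/orchestrator race, assuming a Postgres warehouse; pg_isready ships with the Postgres image, and the interval values are arbitrary:

services:
  warehouse:
    image: postgres:16.3
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER:-postgres}"]   # $$ defers expansion to the container shell
      interval: 5s
      timeout: 3s
      retries: 10
  orchestrator:
    depends_on:
      warehouse:
        condition: service_healthy        # wait for the health check, not just container start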
Definition of Done
- docker compose up brings all services to a healthy state.
- Pipeline runs end-to-end with no manual intervention.
- BI dashboard shows expected metrics.
- All services are documented with ports and credentials.
- Environment is reproducible on a fresh machine (a quick smoke test follows).
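One way to smoke-test the reproducibility item; this assumes the Compose v2 CLI, where docker compose up supports --wait, and it removes local volumes, so only run it against disposable data:

# Hypothetical smoke test for a fresh clone of the repo.
docker compose config > /dev/null    # fails fast if the compose file or .env is invalid
docker compose down -v               # start from a clean slate (removes named volumes)
docker compose up -d --wait          # --wait blocks until services are running/healthy
docker compose ps                    # every service should now report a healthy status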