Automate Transformations with dbt
Goal: Build a versioned, testable transformation layer using dbt. You will learn how to express transformations as models, enforce data quality with tests, and document lineage. By the end, you can run
dbt runanddbt testwith confidence in repeatable results.
Context and Problem
- Real-world scenario: Manual SQL transformations are inconsistent and hard to maintain.
- Stakeholders and constraints: Analysts want stable models; Engineers want tests and version control. Constraints include limited compute and evolving business logic.
- What happens if this system fails? Data models drift and dashboards break.
Real World Outcome
- You run dbt and produce the warehouse models with tests.
- Example CLI transcript:
$ dbt run
Finished running 6 models in 12.3s
$ dbt test
PASS not_null_dim_customer_customer_key (124 rows)
PASS unique_dim_customer_customer_id (124 rows)
PASS relationships_fact_sales_customer_key (124 rows)
- Example generated model:
models/marts/fact_sales.sql
Core Concepts
- dbt models: SQL transformations as versioned files.
- Tests: Schema and data quality checks.
- Lineage graph: Dependency tracking between models.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
Architecture
+--------------------+ +------------------+ +-------------------+
| Staging tables | --> | dbt models | --> | Marts (fact/dims) |
+--------------------+ +------------------+ +-------------------+
| |
| +--> tests + docs
+--> sources
Data Model
- Source tables: stg_customers, stg_products, stg_orders
- Models:
stg_*models normalize raw datadim_customer,dim_product,dim_datefact_sales
- Example record:
dim_customer: customer_key=92 customer_id=C00091 customer_name="Aline Ortiz" signup_date=2023-02-14
Implementation Plan
- Initialize a dbt project and connect to your warehouse.
- Define sources for staging tables.
- Create staging models to clean and standardize fields.
- Build dimension and fact models using
ref(). - Add schema tests:
not_null,unique,relationships. - Add a custom test for revenue accuracy.
- Generate docs with
dbt docs generate.
Validation and Data Quality
- Accuracy: Custom test compares fact revenue to staging calculation.
- Completeness:
not_nulltests on required fields. - Consistency:
relationshipstests on foreign keys. - Timeliness: Model run time and freshness checks (if supported).
- Traceability: dbt lineage graph and model docs.
Failure Modes and Debugging
- Model drift: A renamed column breaks downstream models.
- Symptom: dbt compile errors. Fix by updating references and adding tests.
- Duplicate rows: Incorrect join in fact model.
- Symptom: revenue spikes. Fix by checking join keys and deduping.
- Test failures: Not null or relationship tests fail.
- Symptom: dbt test fails. Fix by tracing to staging model and source data.
Definition of Done
- dbt project builds all models with
dbt run. dbt testpasses for all schema and custom tests.- Lineage graph is generated and reviewed.
- Model SQL is committed to version control.
- Documentation explains each model and metric.