Set Up a Data Warehouse and Model Data
Goal: Design and build a local analytical warehouse with a star schema that supports fast, consistent analytics. You will learn dimensional modeling, surrogate keys, and the trade-offs between staging and modeled layers. By the end, you will be able to explain why your model supports accurate, scalable reporting.
Context and Problem
- Real-world scenario: The analytics team needs a durable warehouse to replace ad-hoc CSV analysis.
- Stakeholders and constraints: Analysts need fast queries; Finance needs consistent numbers; Engineering needs maintainable SQL. Constraints include limited compute and changing business rules.
- What happens if this system fails? Dashboards show inconsistent metrics and trust erodes.
Real-World Outcome
- You load raw CSVs into a staging schema and build a star schema in your warehouse.
- Example CLI transcript (DuckDB):
$ duckdb analytics.db < sql/01_load_staging.sql
$ duckdb analytics.db < sql/02_build_dimensions.sql
$ duckdb analytics.db < sql/03_build_fact_sales.sql
- Example query:
SELECT d.year, d.month, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
- Expected metrics:
- fact_sales row count equals the number of valid orders
- no null foreign keys in fact_sales
- sum(revenue) equals sum(price * quantity) from staging (see the reconciliation sketch below)
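A minimal reconciliation query for the last metric (a sketch; it assumes stg_orders carries product_id and quantity, and stg_products carries product_id and price, matching the Data Model section below):
-- Compare modeled revenue against the staging source.
SELECT
  (SELECT SUM(revenue) FROM fact_sales) AS fact_revenue,
  (SELECT SUM(o.quantity * p.price)
     FROM stg_orders o
     JOIN stg_products p ON o.product_id = p.product_id) AS staging_revenue;
The two values should be identical; any gap points at dropped rows, duplicate joins, or price drift.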
Core Concepts
- Dimensional modeling: Star schema with fact and dimension tables.
- Surrogate keys: Stable integer keys for fast joins and history tracking (sketched after this list).
- Staging vs modeled layers: Separation of raw data and curated data.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
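To make the surrogate-key idea concrete, here is one way to mint keys in DuckDB. This is a sketch: it assumes stg_customers exposes the columns listed under Data Model below.
-- A surrogate key is assigned by the warehouse and is independent of the
-- source system's natural key (customer_id).
CREATE TABLE dim_customer AS
SELECT
  ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,  -- surrogate key
  customer_id,                                               -- natural key
  customer_name,
  email,
  signup_date
FROM (
  SELECT DISTINCT customer_id, customer_name, email, signup_date
  FROM stg_customers
);
Because the fact table joins on the integer customer_key rather than the source's customer_id, you can later add history tracking (for example, slowly changing dimensions) without rewriting fact rows.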
Architecture
+-----------+     +----------------+     +---------------------+
| raw CSVs  | --> | staging tables | --> | star schema (marts) |
+-----------+     +----------------+     +---------------------+
                          |                          |
                          |                          +--> BI / queries
                          +--> data quality checks
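One lightweight way to enforce the staging/marts boundary is a schema per layer; a minimal sketch in DuckDB:
-- Separate schemas make the raw vs. curated boundary explicit.
CREATE SCHEMA IF NOT EXISTS staging;
CREATE SCHEMA IF NOT EXISTS marts;
Qualified names such as staging.stg_orders and marts.fact_sales then make the flow direction visible in every query.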
Data Model
- Staging tables:
- stg_customers, stg_products, stg_orders
- Dimension tables:
- dim_customer(customer_key, customer_id, customer_name, email, signup_date)
- dim_product(product_key, product_id, product_name, category, price)
- dim_date(date_key, date, day, month, quarter, year)
- Fact table (DDL sketched below):
- fact_sales(sales_key, date_key, customer_key, product_key, quantity, revenue)
- Example fact row:
(sales_key=5012, date_key=20240512, customer_key=92, product_key=18, quantity=3, revenue=76.50)
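A possible DDL for fact_sales (a sketch; the types and constraints are assumptions, not requirements):
CREATE TABLE fact_sales (
  sales_key    BIGINT PRIMARY KEY,
  date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
  customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
  product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
  quantity     INTEGER NOT NULL,
  revenue      DECIMAL(12, 2) NOT NULL  -- price at order time * quantity
);
The REFERENCES clauses only work if each dimension key is declared PRIMARY KEY or UNIQUE; if you build dimensions with CREATE TABLE AS instead, enforce integrity with the validation queries in the sections below.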
Implementation Plan
- Choose a warehouse (DuckDB or Postgres) and create a database.
- Load raw CSVs into staging tables with minimal transformation.
- Create dim tables, ensuring one row per natural key.
- Create a date dimension spanning your data range.
- Build fact_sales by joining staging orders to the dimensions (see the end-to-end sketch after this list).
- Add indexes or clustering keys for common queries.
- Validate row counts and revenue totals.
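The core of the plan, compressed into a runnable DuckDB sketch. The file path, the date range, and the stg_orders column names (order_date, customer_id, product_id, quantity) are assumptions; adjust them to your data:
-- 1. Load a raw CSV into staging with minimal transformation.
CREATE TABLE stg_orders AS
SELECT * FROM read_csv_auto('data/orders.csv');

-- 2. Generate a date dimension spanning the data range (range assumed).
CREATE TABLE dim_date AS
SELECT
  CAST(strftime(d, '%Y%m%d') AS INTEGER) AS date_key,
  CAST(d AS DATE)                        AS date,
  EXTRACT(day FROM d)                    AS day,
  EXTRACT(month FROM d)                  AS month,
  EXTRACT(quarter FROM d)                AS quarter,
  EXTRACT(year FROM d)                   AS year
FROM generate_series(DATE '2024-01-01', DATE '2024-12-31', INTERVAL 1 DAY) t(d);

-- 3. Build the fact by resolving natural keys to surrogate keys.
--    Revenue uses the current dim_product price; see Failure Modes for the
--    price-at-order caveat.
CREATE TABLE fact_sales AS
SELECT
  ROW_NUMBER() OVER ()  AS sales_key,
  dd.date_key,
  dc.customer_key,
  dp.product_key,
  o.quantity,
  o.quantity * dp.price AS revenue
FROM stg_orders o
JOIN dim_date     dd ON dd.date = o.order_date
JOIN dim_customer dc ON dc.customer_id = o.customer_id
JOIN dim_product  dp ON dp.product_id = o.product_id;

-- 4. Index the most common filter/join column.
CREATE INDEX idx_fact_sales_date ON fact_sales (date_key);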
Validation and Data Quality
- Accuracy: SUM(fact_sales.revenue) equals SUM(stg_orders.quantity * stg_products.price).
- Completeness: No nulls in fact foreign keys.
- Consistency: Each fact row references valid dimension keys.
- Timeliness: Date dimension covers all order dates.
- Traceability: Store lineage mapping in comments or a lineage table.
Sample SQL checks:
SELECT COUNT(*) FROM fact_sales WHERE customer_key IS NULL OR product_key IS NULL;
SELECT SUM(revenue) FROM fact_sales;
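A further consistency check verifies that every fact row resolves to a real dimension row (shown for dim_customer; repeat per dimension):
-- Zero rows expected; any hit means a dangling foreign key.
SELECT COUNT(*) AS orphan_rows
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;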
Failure Modes and Debugging
- Double-counting: Join explosions due to non-unique dimension keys.
- Symptom: fact row count > orders row count. Fix by deduplicating dimension natural keys (see the check after this list).
- Missing date keys: Date dimension not spanning full range.
- Symptom: null date_key. Fix by extending date dimension.
- Incorrect revenue: Price updated after order date.
- Symptom: revenue differs from source. Fix by capturing price-at-order in staging or using SCD.
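A quick probe for the double-counting mode, assuming customer_id is the natural key in dim_customer:
-- Any row returned means joins against dim_customer can fan out.
SELECT customer_id, COUNT(*) AS copies
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;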
Definition of Done
- Star schema exists with dimensions and fact table.
- Fact row count matches valid orders.
- All validation checks pass with zero null FKs.
- Typical analytical queries (for example, the monthly revenue rollup above) run with acceptable latency.
- Model is documented with a brief README or SQL comments.