Set Up a Data Warehouse and Model Data
Goal: Design and build a local analytical warehouse with a star schema that supports fast, consistent analytics. You will learn dimensional modeling, surrogate keys, and the trade-offs between staging and modeled layers. By the end, you will be able to explain why your model supports accurate, scalable reporting.
Context and Problem
- Real-world scenario: The analytics team needs a durable warehouse to replace ad-hoc CSV analysis.
- Stakeholders and constraints: Analysts need fast queries; Finance needs consistent numbers; Engineering needs maintainable SQL. Constraints include limited compute and changing business rules.
- What happens if this system fails? Dashboards show inconsistent metrics and trust erodes.
Real-World Outcome
- You load raw CSVs into a staging schema and build a star schema in your warehouse.
- Example CLI transcript (DuckDB):
$ duckdb analytics.db < sql/01_load_staging.sql
$ duckdb analytics.db < sql/02_build_dimensions.sql
$ duckdb analytics.db < sql/03_build_fact_sales.sql
- Example query:
SELECT d.year, d.month, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
- Expected metrics:
- fact_sales row count equals the number of valid orders
- no null foreign keys in fact_sales
- sum(revenue) equals sum(price * quantity) from staging (see the reconciliation sketch below)
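A minimal reconciliation query for the last metric (a sketch; it assumes stg_orders carries product_id and quantity, and stg_products carries product_id and price, matching the Data Model section below):
-- Compare modeled revenue against the staging source.
SELECT
  (SELECT SUM(revenue) FROM fact_sales) AS fact_revenue,
  (SELECT SUM(o.quantity * p.price)
     FROM stg_orders o
     JOIN stg_products p ON o.product_id = p.product_id) AS staging_revenue;
The two values should be identical; any gap points at dropped rows, duplicate joins, or price drift.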
Core Concepts
- Dimensional modeling: Star schema with fact and dimension tables.
- Surrogate keys: Stable integer keys for fast joins and history tracking (sketched after this list).
- Staging vs modeled layers: Separation of raw data and curated data.
- Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
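To make the surrogate-key idea concrete, here is one way to mint keys in DuckDB. This is a sketch: it assumes stg_customers exposes the columns listed under Data Model below.
-- A surrogate key is assigned by the warehouse and is independent of the
-- source system's natural key (customer_id).
CREATE TABLE dim_customer AS
SELECT
  ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,  -- surrogate key
  customer_id,                                               -- natural key
  customer_name,
  email,
  signup_date
FROM (
  SELECT DISTINCT customer_id, customer_name, email, signup_date
  FROM stg_customers
);
Because the fact table joins on the integer customer_key rather than the source's customer_id, you can later add history tracking (for example, slowly changing dimensions) without rewriting fact rows.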
Architecture
+-----------+     +----------------+     +---------------------+
| raw CSVs  | --> | staging tables | --> | star schema (marts) |
+-----------+     +----------------+     +---------------------+
                          |                          |
                          |                          +--> BI / queries
                          +--> data quality checks
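One lightweight way to enforce the staging/marts boundary is a schema per layer; a minimal sketch in DuckDB:
-- Separate schemas make the raw vs. curated boundary explicit.
CREATE SCHEMA IF NOT EXISTS staging;
CREATE SCHEMA IF NOT EXISTS marts;
Qualified names such as staging.stg_orders and marts.fact_sales then make the flow direction visible in every query.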
Data Model
- Staging tables:
- stg_customers, stg_products, stg_orders
- Dimension tables:
- dim_customer(customer_key, customer_id, customer_name, email, signup_date)
- dim_product(product_key, product_id, product_name, category, price)
- dim_date(date_key, date, day, month, quarter, year)
- Fact table (DDL sketched below):
- fact_sales(sales_key, date_key, customer_key, product_key, quantity, revenue)
- Example fact row:
(sales_key=5012, date_key=20240512, customer_key=92, product_key=18, quantity=3, revenue=76.50)
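A possible DDL for fact_sales (a sketch; the types and constraints are assumptions, not requirements):
CREATE TABLE fact_sales (
  sales_key    BIGINT PRIMARY KEY,
  date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
  customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
  product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
  quantity     INTEGER NOT NULL,
  revenue      DECIMAL(12, 2) NOT NULL  -- price at order time * quantity
);
The REFERENCES clauses only work if each dimension key is declared PRIMARY KEY or UNIQUE; if you build dimensions with CREATE TABLE AS instead, enforce integrity with the validation queries in the sections below.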
Implementation Plan
- Choose a warehouse (DuckDB or Postgres) and create a database.
- Load raw CSVs into staging tables with minimal transformation.
- Create dim tables, ensuring one row per natural key.
- Create a date dimension spanning your data range.
- Build fact_sales by joining staging orders to the dimensions (see the end-to-end sketch after this list).
- Add indexes or clustering keys for common queries.
- Validate row counts and revenue totals.
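The core of the plan, compressed into a runnable DuckDB sketch. The file path, the date range, and the stg_orders column names (order_date, customer_id, product_id, quantity) are assumptions; adjust them to your data:
-- 1. Load a raw CSV into staging with minimal transformation.
CREATE TABLE stg_orders AS
SELECT * FROM read_csv_auto('data/orders.csv');

-- 2. Generate a date dimension spanning the data range (range assumed).
CREATE TABLE dim_date AS
SELECT
  CAST(strftime(d, '%Y%m%d') AS INTEGER) AS date_key,
  CAST(d AS DATE)                        AS date,
  EXTRACT(day FROM d)                    AS day,
  EXTRACT(month FROM d)                  AS month,
  EXTRACT(quarter FROM d)                AS quarter,
  EXTRACT(year FROM d)                   AS year
FROM generate_series(DATE '2024-01-01', DATE '2024-12-31', INTERVAL 1 DAY) t(d);

-- 3. Build the fact by resolving natural keys to surrogate keys.
--    Revenue uses the current dim_product price; see Failure Modes for the
--    price-at-order caveat.
CREATE TABLE fact_sales AS
SELECT
  ROW_NUMBER() OVER ()  AS sales_key,
  dd.date_key,
  dc.customer_key,
  dp.product_key,
  o.quantity,
  o.quantity * dp.price AS revenue
FROM stg_orders o
JOIN dim_date     dd ON dd.date = o.order_date
JOIN dim_customer dc ON dc.customer_id = o.customer_id
JOIN dim_product  dp ON dp.product_id = o.product_id;

-- 4. Index the most common filter/join column.
CREATE INDEX idx_fact_sales_date ON fact_sales (date_key);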
Validation and Data Quality
- Accuracy: SUM(fact_sales.revenue) equals SUM(stg_orders.quantity * stg_products.price).
- Completeness: No nulls in fact foreign keys.
- Consistency: Each fact row references valid dimension keys.
- Timeliness: Date dimension covers all order dates.
- Traceability: Store lineage mapping in comments or a lineage table.
Sample SQL checks:
SELECT COUNT(*) FROM fact_sales WHERE customer_key IS NULL OR product_key IS NULL;
SELECT SUM(revenue) FROM fact_sales;
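A further consistency check verifies that every fact row resolves to a real dimension row (shown for dim_customer; repeat per dimension):
-- Zero rows expected; any hit means a dangling foreign key.
SELECT COUNT(*) AS orphan_rows
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;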
Failure Modes and Debugging
- Double-counting: Join explosions due to non-unique dimension keys.
- Symptom: fact row count > orders row count. Fix by deduplicating dimension natural keys (see the check after this list).
- Missing date keys: Date dimension not spanning full range.
- Symptom: null date_key. Fix by extending date dimension.
- Incorrect revenue: Price updated after order date.
- Symptom: revenue differs from source. Fix by capturing price-at-order in staging or using SCD.
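A quick probe for the double-counting mode, assuming customer_id is the natural key in dim_customer:
-- Any row returned means joins against dim_customer can fan out.
SELECT customer_id, COUNT(*) AS copies
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;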
Definition of Done
- Star schema exists with dimensions and fact table.
- Fact row count matches valid orders.
- All validation checks pass with zero null FKs.
- Typical analytical queries (for example, the monthly revenue rollup above) run with acceptable latency.
- Model is documented with a brief README or SQL comments.