Set Up a Data Warehouse and Model Data

Goal: Design and build a local analytical warehouse with a star schema that supports fast, consistent analytics. You will learn dimensional modeling, surrogate keys, and the trade-offs between staging and modeled layers. By the end, you can explain why your model supports accurate, scalable reporting.

Context and Problem

  • Real-world scenario: The analytics team needs a durable warehouse to replace ad-hoc CSV analysis.
  • Stakeholders and constraints: Analysts need fast queries; Finance needs consistent numbers; Engineering needs maintainable SQL. Constraints include limited compute and changing business rules.
  • What happens if this system fails? Dashboards show inconsistent metrics and trust erodes.

Real World Outcome

  • You load raw CSVs into a staging schema and build a star schema in your warehouse.
  • Example CLI transcript (DuckDB):
$ duckdb analytics.db < sql/01_load_staging.sql
$ duckdb analytics.db < sql/02_build_dimensions.sql
$ duckdb analytics.db < sql/03_build_fact_sales.sql
  • Example query:
SELECT d.year, d.month, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
  • Expected metrics:
    • fact_sales row count equals the number of valid orders in staging
    • no NULL foreign keys in fact_sales (date_key, customer_key, product_key)
    • SUM(revenue) in fact_sales equals SUM(price * quantity) from staging

Core Concepts

  • Dimensional modeling: Star schema with fact and dimension tables.
  • Surrogate keys: Stable integer keys for fast joins and history tracking (sketched after this list).
  • Staging vs modeled layers: Separation of raw data and curated data.
  • Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
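
To make the surrogate-key idea concrete, here is a minimal sketch (table and column names anticipate the Data Model section below, trimmed to two attributes): the dimension assigns a stable integer key per natural key, and facts join on that key instead of the source identifier.

-- One row per natural key (customer_id), with a generated surrogate key.
CREATE TABLE dim_customer AS
SELECT
  ROW_NUMBER() OVER (ORDER BY customer_id) AS customer_key,  -- surrogate key
  customer_id,                                               -- natural key
  customer_name
FROM (SELECT DISTINCT customer_id, customer_name FROM stg_customers);

In a production warehouse you would typically use a sequence or identity column instead of ROW_NUMBER so that keys stay stable across rebuilds.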

Architecture

+-------------------+     +-------------------+     +---------------------+
| raw CSVs          | --> | staging tables    | --> | star schema (marts) |
+-------------------+     +-------------------+     +---------------------+
                                    |                          |
                                    |                          +--> BI / queries
                                    +--> data quality checks

Data Model

  • Staging tables:
    • stg_customers, stg_products, stg_orders
  • Dimension tables:
    • dim_customer(customer_key, customer_id, customer_name, email, signup_date)
    • dim_product(product_key, product_id, product_name, category, price)
    • dim_date(date_key, date, day, month, quarter, year)
  • Fact table:
    • fact_sales(sales_key, date_key, customer_key, product_key, quantity, revenue)
  • Example fact row:
(sales_key=5012, date_key=20240512, customer_key=92, product_key=18, quantity=3, revenue=76.50)
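
A minimal DDL sketch for one dimension and the fact (DuckDB/Postgres-flavored; dim_product and dim_date are declared analogously):

CREATE TABLE dim_customer (
  customer_key  INTEGER PRIMARY KEY,  -- surrogate key
  customer_id   VARCHAR NOT NULL,     -- natural key from the source system
  customer_name VARCHAR,
  email         VARCHAR,
  signup_date   DATE
);

CREATE TABLE fact_sales (
  sales_key    INTEGER PRIMARY KEY,
  date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
  customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
  product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
  quantity     INTEGER NOT NULL,
  revenue      DECIMAL(12, 2) NOT NULL
);

The NOT NULL and REFERENCES constraints enforce the completeness and consistency checks from the Validation section at write time rather than after the fact.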

Implementation Plan

  1. Choose warehouse (DuckDB or Postgres) and create a database.
  2. Load raw CSVs into staging tables with minimal transformation.
  3. Create dim tables, ensuring one row per natural key.
  4. Create a date dimension spanning your data range.
  5. Build fact_sales by joining staging orders to the dimension tables (steps 2-5 are sketched after this list).
  6. Add indexes or clustering keys for common queries.
  7. Validate row counts and revenue totals.
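
A condensed sketch of steps 2 through 5 in DuckDB SQL, using CREATE TABLE AS for brevity. File paths, the 2024 date range, and the staging column names (order_id, order_date, customer_id, product_id, quantity) are assumptions; adapt them to your data.

-- Step 2: load raw CSVs into staging with minimal transformation.
CREATE TABLE stg_orders AS
SELECT * FROM read_csv_auto('data/orders.csv');

-- Step 3: one row per natural key, plus a generated surrogate key.
CREATE TABLE dim_product AS
SELECT
  ROW_NUMBER() OVER (ORDER BY product_id) AS product_key,
  product_id, product_name, category, price
FROM (SELECT DISTINCT product_id, product_name, category, price FROM stg_products);

-- Step 4: date dimension spanning the data range, with a yyyymmdd smart key.
CREATE TABLE dim_date AS
SELECT
  CAST(strftime(d, '%Y%m%d') AS INTEGER) AS date_key,
  CAST(d AS DATE)                        AS date,
  EXTRACT(day     FROM d)                AS day,
  EXTRACT(month   FROM d)                AS month,
  EXTRACT(quarter FROM d)                AS quarter,
  EXTRACT(year    FROM d)                AS year
FROM generate_series(TIMESTAMP '2024-01-01', TIMESTAMP '2024-12-31', INTERVAL 1 DAY) AS t(d);

-- Step 5: build the fact by resolving natural keys to surrogate keys.
-- Revenue uses the current dim price; see Failure Modes for the price-at-order caveat.
CREATE TABLE fact_sales AS
SELECT
  ROW_NUMBER() OVER (ORDER BY o.order_id) AS sales_key,
  dd.date_key,
  dc.customer_key,
  dp.product_key,
  o.quantity,
  o.quantity * dp.price                   AS revenue
FROM stg_orders o
JOIN dim_customer dc ON o.customer_id = dc.customer_id
JOIN dim_product  dp ON o.product_id  = dp.product_id
JOIN dim_date     dd ON o.order_date  = dd.date;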

Validation and Data Quality

  • Accuracy: SUM(fact_sales.revenue) equals SUM(quantity * price) computed by joining stg_orders to stg_products.
  • Completeness: No nulls in fact foreign keys.
  • Consistency: Each fact row references valid dimension keys.
  • Timeliness: Date dimension covers all order dates.
  • Traceability: Store lineage mapping in comments or a lineage table.

Sample SQL checks:

-- Completeness: expect 0 (checks all three foreign keys)
SELECT COUNT(*) FROM fact_sales WHERE date_key IS NULL OR customer_key IS NULL OR product_key IS NULL;
-- Accuracy: compare against SUM(quantity * price) in staging
SELECT SUM(revenue) FROM fact_sales;
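
The sketch below extends these into one check per quality dimension (the accuracy join assumes the staging column names used earlier):

-- Consistency: every fact row resolves to a real dimension row (expect 0).
SELECT COUNT(*)
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;

-- Accuracy: fact total vs. staging total (expect revenue_diff = 0).
SELECT
  (SELECT SUM(revenue) FROM fact_sales)
  - (SELECT SUM(o.quantity * p.price)
     FROM stg_orders o
     JOIN stg_products p ON o.product_id = p.product_id) AS revenue_diff;

-- Timeliness: order dates missing from dim_date (expect 0).
SELECT COUNT(*)
FROM stg_orders o
LEFT JOIN dim_date d ON o.order_date = d.date
WHERE d.date_key IS NULL;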

Failure Modes and Debugging

  • Double-counting: Join explosions due to non-unique dimension keys.
    • Symptom: fact row count exceeds the orders row count. Fix by deduplicating dimension natural keys (see the queries after this list).
  • Missing date keys: Date dimension not spanning full range.
    • Symptom: null date_key. Fix by extending date dimension.
  • Incorrect revenue: Price updated after the order date.
    • Symptom: revenue differs from the source totals. Fix by capturing the price at order time in staging or by tracking price history with a slowly changing dimension (SCD).
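
Two sketch queries that surface the first two failure modes before they corrupt the fact build (dimension and staging names as above):

-- Join-explosion risk: natural keys appearing more than once in a dimension.
-- Any rows returned here mean fact_sales will double-count on the join.
SELECT customer_id, COUNT(*) AS n
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Missing date keys: order dates with no matching dim_date row.
SELECT DISTINCT o.order_date
FROM stg_orders o
LEFT JOIN dim_date d ON o.order_date = d.date
WHERE d.date_key IS NULL;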

Definition of Done

  • Star schema exists with dimensions and fact table.
  • Fact row count matches valid orders.
  • All validation checks pass with zero null FKs.
  • Typical analytical queries (such as the monthly revenue query above) run at interactive speed.
  • Model is documented with a brief README or SQL comments.