Implementing Slowly Changing Dimensions (SCD Type 2)

Goal: Track historical changes to dimension data by implementing SCD Type 2 logic. You will learn how to preserve history, define validity windows, and prevent overlapping records. By the end, you can explain and verify how changes are captured over time.

Context and Problem

  • Real-world scenario: Customer attributes (status, tier, region) change, but analytics must preserve historical truth.
  • Stakeholders and constraints: Finance needs accurate historical reporting; Sales needs current segmentation. Constraints include late-arriving data and backfills.
  • What happens if this system fails? Historical reports become incorrect and audits fail.

Real World Outcome

  • A dimension table contains multiple rows per customer, each with validity windows.
  • Example record set:
customer_key,customer_id,tier,valid_from,valid_to,is_current
101,C00091,Silver,2023-02-14,2024-04-30,false
214,C00091,Gold,2024-05-01,9999-12-31,true
  • Example query:
SELECT tier FROM dim_customer_scd2
WHERE customer_id='C00091' AND date('2024-03-15') BETWEEN valid_from AND valid_to;
-- Silver

Core Concepts

  • SCD Type 2: Preserve history with new rows for changes.
  • Validity windows: valid_from, valid_to, is_current.
  • Surrogate keys: unique row identifiers for history.
  • Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.

Architecture

+-------------------+     +-------------------+     +------------------------+
| stg_customers     | --> | change detection  | --> | dim_customer_scd2      |
+-------------------+     +-------------------+     +------------------------+
           |                         |
           +-- snapshot/history      +-- update valid_to

Data Model

  • dim_customer_scd2:
    • customer_key (surrogate)
    • customer_id (natural key)
    • tier
    • region
    • valid_from (date)
    • valid_to (date)
    • is_current (boolean)

Implementation Plan

  1. Identify change-tracked fields (tier, region, status).
  2. Compare incoming staging data to current records.
  3. If changes detected, close current record (set valid_to) and insert new row.
  4. If no changes, keep current row unchanged.
  5. Support backfills for late-arriving changes.

Validation and Data Quality

  • Accuracy: Changes recorded only when attributes differ.
  • Completeness: Every customer has at least one row.
  • Consistency: No overlapping validity windows per customer.
  • Timeliness: valid_from reflects change effective date.
  • Traceability: store source_run_id or batch_id.

Example checks:

SELECT customer_id FROM dim_customer_scd2
GROUP BY customer_id
HAVING SUM(CASE WHEN is_current THEN 1 ELSE 0 END) != 1;

Failure Modes and Debugging

  • Overlapping windows: Multiple current rows.
    • Symptom: duplicate current records. Fix by enforcing a unique is_current constraint.
  • Missed changes: Comparisons ignore a field.
    • Symptom: historical tiers never change. Fix by including all tracked fields in comparison.
  • Late-arriving data: Change arrives after a newer change.
    • Symptom: incorrect valid_to ordering. Fix by using effective dates and backfill logic.

Definition of Done

  • SCD table captures history for all tracked fields.
  • Exactly one current row per customer.
  • No overlapping validity windows.
  • Backfill scenario tested and documented.
  • Queries return correct historical values.