Build a Reverse ETL Sync

Goal: Push curated warehouse data back into operational systems (CRM) in a safe, incremental way. You will learn how to export, map, and synchronize records while preserving idempotency and auditability. By the end, you can prove that every synced record is correct and traceable.

Context and Problem

  • Real-world scenario: Marketing needs updated customer segments in a CRM to trigger campaigns.
  • Stakeholders and constraints: Growth teams need timely segments; Engineering needs safe writes. Constraints include API rate limits and schema mismatches.
  • What happens if this system fails? Campaigns target wrong users or miss high-value customers.

Real World Outcome

  • You run a sync job that updates CRM profiles with the latest segments.
  • Example CLI transcript:
$ python reverse_etl.py --segment gold_customers --dry-run
Would update 412 customers
Sample payload: {"id":"cust_001","segment":"gold","lifetime_value":12842.50}

$ python reverse_etl.py --segment gold_customers
Updated=409 Failed=3
  • Example API request:
PATCH /api/customers/cust_001
{"segment":"gold","lifetime_value":12842.50}

Core Concepts

  • Reverse ETL: Moving curated analytics data out of the warehouse and into the operational tools where teams act on it.
  • Idempotent updates: Applying the same update twice leaves the target unchanged, so replays and retries are safe.
  • Schema mapping: Aligning warehouse fields to API payload fields, including renames and type conversions.
  • Data quality dimensions: Accuracy, completeness, consistency, timeliness, traceability.
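Schema mapping and idempotency can be sketched together in a few lines: the same input row always produces the same payload, so replays are harmless. The warehouse field names follow the Data Model section; the CRM-side keys are assumptions for illustration.

```python
def map_row_to_payload(row: dict) -> dict:
    """Translate one warehouse export row into a CRM PATCH payload.

    Pure function of the input row: replaying it for the same row
    yields an identical payload, which keeps retries idempotent.
    """
    return {
        "segment": row["segment"],
        # Warehouse exports often deliver numerics as strings; normalize.
        "lifetime_value": round(float(row["lifetime_value"]), 2),
    }

row = {"customer_id": "cust_001", "segment": "gold",
       "lifetime_value": "12842.50", "last_order_date": "2024-05-28"}
print(map_row_to_payload(row))
```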

Architecture

+------------------+     +-------------------+     +------------------+
| Data Warehouse   | --> | Export + Mapper   | --> | CRM API          |
+------------------+     +-------------------+     +------------------+
           |                         |
           +--> sync_state table     +--> retry + dead-letter

Data Model

  • Export query output:
    • customer_id, segment, lifetime_value, last_order_date
  • Sync state table:
    • customer_id, last_synced_at, status, error_message
  • Example record:
sync_state: customer_id=cust_001 last_synced_at=2024-06-01T00:12:05Z status=SUCCESS error_message=
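The sync_state table above can be prototyped like this, using sqlite3 as a stand-in for whatever store the real pipeline uses. Column names follow the Data Model; the CHECK constraint on status is an added assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sync_state (
        customer_id    TEXT PRIMARY KEY,
        last_synced_at TEXT NOT NULL,
        status         TEXT NOT NULL CHECK (status IN ('SUCCESS', 'FAILED')),
        error_message  TEXT
    )
""")
conn.execute(
    "INSERT INTO sync_state VALUES (?, ?, ?, ?)",
    ("cust_001", "2024-06-01T00:12:05Z", "SUCCESS", None),
)
# Incremental syncs can filter exports on last_synced_at.
print(conn.execute("SELECT * FROM sync_state").fetchone())
```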

Implementation Plan

  1. Define a warehouse query for the segment.
  2. Map fields to CRM payload schema.
  3. Implement dry-run mode to preview updates.
  4. Send updates in batches with retry/backoff.
  5. Record sync results in sync_state table.
  6. Support incremental syncs based on last_synced_at.
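Steps 3 through 5 can be sketched as one loop. `send` is a hypothetical callable (not part of the plan above) that returns True on success; real code would wrap the CRM API client and also write each outcome to sync_state.

```python
def run_sync(rows, send, dry_run=False, batch_size=100):
    """Preview (dry-run) or send mapped rows in batches, tallying results."""
    updated = failed = 0
    if dry_run:
        # Step 3: preview what would change without writing anything.
        print(f"Would update {len(rows)} customers")
        return updated, failed
    for start in range(0, len(rows), batch_size):
        # Step 4: send in batches to respect API rate limits.
        for row in rows[start:start + batch_size]:
            if send(row):
                updated += 1
            else:
                failed += 1  # step 5: real code records this in sync_state
    print(f"Updated={updated} Failed={failed}")
    return updated, failed
```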

Validation and Data Quality

  • Accuracy: payload values match warehouse query results.
  • Completeness: all segment members are synced or logged as failed.
  • Consistency: one update per customer per run.
  • Timeliness: sync lag <= 24 hours.
  • Traceability: each update stored with run_id and response status.
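The completeness and consistency checks above can be expressed as set comparisons over the run's ID lists. The input names here are illustrative, not a fixed API.

```python
def check_run(exported_ids, synced_ids, failed_ids):
    """Run-level quality report over the IDs from one sync run."""
    exported = set(exported_ids)
    synced = set(synced_ids)
    failed = set(failed_ids)
    return {
        # Completeness: every segment member is synced or logged as failed.
        "completeness": exported == synced | failed,
        # Consistency: no customer in both outcomes, and no customer
        # updated more than once in this run.
        "consistency": not (synced & failed)
                       and len(list(synced_ids)) == len(synced),
    }
```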

Failure Modes and Debugging

  • API schema mismatch: CRM rejects payload.
    • Symptom: 400 errors. Fix by aligning field names and types.
  • Partial failures: Some records fail mid-run.
    • Symptom: inconsistent segment coverage. Fix by retrying only failed records.
  • Rate limits: API throttles updates.
    • Symptom: 429 responses. Fix by throttling and backoff.
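The three failure modes above suggest one dispatch rule: retry throttling and transient server errors with backoff, but return schema errors (400) immediately so the record can go to the dead-letter queue. A sketch, where `send_once` is a hypothetical callable returning an HTTP status code:

```python
import random
import time

def send_with_backoff(send_once, payload, max_attempts=5, base_delay=1.0):
    """Retry 429/5xx with exponential backoff plus jitter.

    Non-retryable statuses (e.g. 400 schema mismatch) are returned
    at once; the caller routes those to the dead-letter queue.
    """
    for attempt in range(max_attempts):
        status = send_once(payload)
        if status not in (429, 500, 502, 503):
            return status  # success, or a non-retryable error
        # Exponential backoff with jitter proportional to base_delay.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return status
```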

Definition of Done

  • Dry-run mode shows accurate payloads.
  • Successful run updates all eligible customers.
  • Failures are logged and retryable.
  • Sync state table records each update.
  • Data quality checks pass for accuracy and completeness.