Learn Change Data Capture (CDC): From Zero to WAL Master
Goal: Deeply understand the mechanics of Change Data Capture (CDC) by building systems that tail database transaction logs (Postgres WAL). You will move from simple polling to mastering logical decoding, exactly-once delivery, and streaming changes into Kafka. By the end, you’ll understand why “the log is the database” and how to build production-grade real-time data pipelines.
Why CDC Matters
Write-Ahead Logging (WAL) emerged decades ago to solve database durability. For most of that time, these logs were internal secrets of the database engine, used only for crash recovery. Change Data Capture (CDC) liberated these logs, turning the database’s internal history into a first-class event stream.
Modern architectures face the “Dual Write” problem: how do you update a database AND a search index (Elasticsearch) AND a cache (Redis) without them getting out of sync? CDC is the answer. It treats the database as the source of truth and asynchronously propagates every single bit of history to the rest of the world.
The Evolution of Capturing Changes
1. The “Stone Age”: Dual Writes
The application tries to write to two places at once. The Struggle: If the DB write succeeds but the Kafka write fails, you are inconsistent forever.
┌─────────────┐    1. SQL Update    ┌───────────┐
│ Application │────────────────────▶│ Database  │
└──────┬──────┘                     └───────────┘
       │            2. Send Event   ┌───────────┐
       └───────────────────────────▶│   Kafka   │  (What if this fails?)
                                    └───────────┘
2. The “Bronze Age”: Polling
The system periodically runs SELECT * FROM table WHERE updated_at > last_seen.
The Struggle: You can’t see DELETES (they are gone), and you put a heavy load on the DB.
┌───────────┐    SELECT WHERE...    ┌───────────┐
│  Poller   │──────────────────────▶│ Database  │
└───────────┘                       └───────────┘
3. The “Golden Age”: Log-Based CDC
We tail the Write-Ahead Log (WAL). The Outcome: Low overhead, zero missed changes (including deletes), and transactional integrity.
┌───────────┐    Read binary log    ┌───────────┐
│ CDC Agent │◀──────────────────────│ WAL Files │
└─────┬─────┘                       └───────────┘
      │           Stream Events     ┌───────────┐
      └────────────────────────────▶│   Kafka   │
                                    └───────────┘
Core Concept Analysis
1. The Write-Ahead Log (WAL)
Before Postgres changes a single byte in a data file, it writes the change to the WAL. It’s an append-only sequence of binary records. If the power goes out, Postgres replays this log to recover.
WAL SEGMENT FILE
[LSN 0/1A2B] -> BEGIN Transaction 101
[LSN 0/1A2C] -> INSERT INTO users (id, name) VALUES (1, 'Douglas')
[LSN 0/1A2D] -> COMMIT Transaction 101
[LSN 0/1A2E] -> BEGIN Transaction 102
...
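LSNs like `0/1A2B` are not opaque strings: they are two hex numbers, the high 32 bits and a byte offset into the WAL, so they can be compared numerically to order events. A minimal sketch of converting between the textual and integer forms (the helper names are my own):

```python
# An LSN such as '0/16B1970' encodes a 64-bit WAL position:
# value = (high 32 bits) << 32 | (low 32 bits, a byte offset).

def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN ('0/16B1970') to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def int_to_lsn(value: int) -> str:
    """Convert a 64-bit integer back to Postgres's textual LSN form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

# A later change always has a numerically larger LSN:
assert lsn_to_int("0/1A2C") > lsn_to_int("0/1A2B")
print(int_to_lsn(lsn_to_int("0/16B1970")))  # → 0/16B1970
```

You will use this conversion constantly: every "did I already process this change?" question in CDC is an LSN comparison.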
2. Logical Decoding vs. Physical Replication
- Physical Replication: Copies raw disk blocks. Great for clones, useless for humans.
- Logical Decoding: Translates binary WAL records into “Logical” changes (e.g., “User 1 was inserted”). This is what makes CDC possible.
3. Replication Slots (The Bookmark)
A replication slot is a pointer managed by Postgres. It ensures that Postgres never deletes WAL files until the CDC consumer has confirmed it read them. DANGER: If your CDC consumer stops running, the replication slot will hold onto WAL files forever, eventually filling up your disk and crashing the database!
WAL STREAM: [Record 1][Record 2][Record 3][Record 4][Record 5]
                                         ^
                                         │
                              REPLICATION SLOT (LSN 3)
                           (Postgres keeps Records 4-5)
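Because of this danger, a production system should watch how much WAL each slot is holding back. A hedged sketch, using the real `pg_replication_slots` view and `pg_wal_lsn_diff()` function; the DSN is a placeholder to adjust for your setup:

```python
def wal_retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL Postgres must keep between the slot and the WAL tip."""
    def to_int(lsn: str) -> int:
        high, low = lsn.split("/")
        return (int(high, 16) << 32) | int(low, 16)
    return to_int(current_lsn) - to_int(restart_lsn)

if __name__ == "__main__":
    import psycopg2  # the live query needs a database; the helper above does not
    with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT slot_name, active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
            FROM pg_replication_slots
        """)
        for slot_name, active, lag in cur.fetchall():
            # An inactive slot with growing lag_bytes is the disk-filling danger
            print(f"{slot_name}: active={active}, retained WAL = {lag} bytes")
```

Alert on any slot that is inactive while its lag keeps growing; that is the exact failure mode described above.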
4. Output Plugins
Postgres provides the “Decoding Engine,” but you need a “Plugin” to format the data:
- test_decoding: Human-readable text (good for debugging).
- wal2json: Converts changes to JSON.
- pgoutput: Postgres’s native binary format (most efficient, used by Debezium).
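You can try a plugin from plain SQL before writing any streaming client. `pg_create_logical_replication_slot()` and `pg_logical_slot_peek_changes()` are real Postgres functions; the slot name, table, and DSN below are assumptions for illustration:

```python
def create_slot_sql(slot: str, plugin: str = "test_decoding") -> str:
    """SQL that creates a logical slot bound to one output plugin.
    (Sketch only: never interpolate untrusted input into SQL.)"""
    return f"SELECT pg_create_logical_replication_slot('{slot}', '{plugin}')"

def peek_sql(slot: str, limit: int = 10) -> str:
    """SQL that reads decoded changes WITHOUT consuming them."""
    return (f"SELECT lsn, xid, data "
            f"FROM pg_logical_slot_peek_changes('{slot}', NULL, {limit})")

if __name__ == "__main__":
    import psycopg2
    with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
        cur.execute(create_slot_sql("peek_slot", "test_decoding"))
        cur.execute("INSERT INTO users (id, name) VALUES (42, 'Ada')")
        conn.commit()
        cur.execute(peek_sql("peek_slot"))
        for lsn, xid, data in cur.fetchall():
            print(lsn, xid, data)  # e.g. "table public.users: INSERT: ..."
```

Swap `test_decoding` for `wal2json` (if installed) and the same peek returns JSON instead of text: the decoding engine is identical, only the formatter changes.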
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| WAL & LSN | The Write-Ahead Log is the true history. The Log Sequence Number (LSN) is your unique pointer in time. |
| Logical Decoding | The mechanism that turns binary junk into “INSERT/UPDATE/DELETE” events. |
| Replication Slots | The “contract” between DB and Consumer. DB promises to keep logs; Consumer promises to read them. |
| Output Plugins | The formatter. You decide if you want JSON, Protocol Buffers, or raw text. |
| At-Least-Once | CDC usually guarantees you get the message, but you might get it twice if you crash and restart. |
Deep Dive Reading by Concept
Foundations of Storage & Streams
| Concept | Book & Chapter |
|---|---|
| Log-Structured Storage | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 3: “Storage and Retrieval” |
| Replication Mechanics | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 5: “Replication” |
| The Log as a DB | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 11: “Stream Processing” |
Postgres Specifics
| Concept | Book & Chapter |
|---|---|
| WAL Internals | PostgreSQL 14 Internals by Egor Rogov — Ch. 10: “WAL” |
| Logical Decoding | PostgreSQL Documentation — Ch. 49: “Logical Decoding” |
Essential Reading Order
- The “Why” (Day 1):
  - DDIA Ch. 11 (Stream Processing) - The section on “Change Data Capture”.
- The “How” (Theory) (Day 2):
  - DDIA Ch. 3 (Storage Engines) - Focus on “SSTables and LSM-Trees” to understand log-based storage.
- The “How” (Practice) (Day 3):
  - The projects below, starting with Project 1.
Project 1: The “Old Way” - Building a Manual Poller
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Node.js
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: SQL / Polling Patterns
- Software or Tool: PostgreSQL, Python
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (Ch. 11)
What you’ll build: A Python script that polls a database table every 5 seconds to find new or updated rows using an updated_at column and a last_seen_timestamp file.
Why it teaches CDC: You need to feel the pain of polling before you can appreciate the beauty of logs. You’ll realize that if a row is deleted, your poller never sees it. If a row is updated twice between polls, you lose the first update.
Core challenges you’ll face:
- Handling Deletes → maps to The visibility gap in query-based CDC
- Database Load → maps to The cost of repetitive SELECT queries
- Precision → maps to Milliseconds/Clock skew issues
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python, Postgres installed.
Real World Outcome
You’ll have a script that “detects” changes. You will perform a DELETE in SQL and watch your script stay completely silent—demonstrating the fatal flaw of polling.
Example Output:
$ python poller.py
[2025-12-28 10:00:00] Polling... No changes.
[2025-12-28 10:00:05] Polling... Found 1 change: User(id=5, name='Bob')
[2025-12-28 10:00:10] Polling... No changes.
# User deletes 'Bob' in psql...
[2025-12-28 10:00:15] Polling... No changes. (Bob is gone, but we don't know!)
The Core Question You’re Answering
“If a record disappears in the forest and no one is polling, does it make an event?”
Polling is “State-based” capture. You only see what the state is now. Log-based CDC is “Event-based” capture. You see what happened.
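The poller’s core loop fits in a few lines. A minimal sketch: the table and column names are assumptions, and the fetch function is injected so the bookmark logic stands on its own:

```python
import time

def poll_once(fetch_changes, last_seen):
    """One polling pass: ask for rows newer than last_seen, advance the bookmark."""
    rows = fetch_changes(last_seen)  # e.g. SELECT ... WHERE updated_at > %s
    if rows:
        last_seen = max(r["updated_at"] for r in rows)
    return rows, last_seen

if __name__ == "__main__":
    import psycopg2
    conn = psycopg2.connect("dbname=test")

    def fetch_changes(since):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, name, updated_at FROM users WHERE updated_at > %s",
                (since,))
            return [{"id": i, "name": n, "updated_at": t}
                    for i, n, t in cur.fetchall()]

    last_seen = "1970-01-01"
    while True:  # DELETEs will never show up here: the rows are simply gone
        rows, last_seen = poll_once(fetch_changes, last_seen)
        for r in rows:
            print("Found change:", r)
        time.sleep(5)
```

Note the two flaws baked into this design: deleted rows never match the `WHERE` clause, and two updates between polls collapse into one.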
Project 2: The WAL Inspector (Looking at the Binary Heart)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Bash
- Alternative Programming Languages: C (to parse binary)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Filesystems / Binary Data
- Software or Tool: pg_waldump
- Main Book: “PostgreSQL 14 Internals” by Egor Rogov (Ch. 10)
What you’ll build: You won’t “code” a tool yet; you’ll use the low-level pg_waldump utility to extract and interpret the raw binary WAL segments from your disk. You will map a SQL statement to its physical representation on disk.
Why it teaches CDC: This forces you to see that “Memory” and “Tables” are just abstractions. The WAL is the physical truth. You’ll see LSN numbers, XIDs (Transaction IDs), and Resource Managers (RMGRs).
Core challenges you’ll face:
- Finding the WAL directory → maps to Understanding $PGDATA layout
- Filtering Noise → maps to Separating background Postgres tasks from user data
- LSN Mapping → maps to Understanding how Postgres pointers work
Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1, root/postgres access to your DB files.
Real World Outcome
You will be able to point to a 16MB binary file on your disk and say: “Line 452 of this dump is where I changed my username.”
Example Output:
$ pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000001 | grep INSERT
rmgr: Heap len (rec/tot): 54/ 54, tx: 721, lsn: 0/16B1970, ...
desc: INSERT off 1 flags 0x00
# You are seeing the physical insertion of a row!
Project 3: The Custom Logical Decoder Client
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: C, Go, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Database Internals / Streaming Protocols
- Software or Tool: psycopg2 or psycopg3 (replication connection)
- Main Book: Postgres Docs (Ch. 49 - Logical Decoding)
What you’ll build: A Python script that opens a “Replication Connection” to Postgres, creates a “Replication Slot,” and uses the test_decoding plugin to stream text-based events.
Why it teaches CDC: This is the first time you interact with the Decoding Engine. You aren’t reading files from disk; you are asking Postgres to stream changes to you in real-time as they are decoded. You will have to handle “Feedback Messages” (Keepalives) to tell Postgres you are still alive.
Core challenges you’ll face:
- Logical Replication Protocol → maps to Streaming vs. Request/Response
- Slot Management → maps to Persistence of the consumer pointer
- Keepalives → maps to Preventing connection timeouts
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 2, basic understanding of Python context managers.
Real World Outcome
A terminal window that stays open and prints every SQL change as it happens, including DELETEs.
Example Output:
$ python logical_streamer.py
Listening on slot 'my_learning_slot'...
BEGIN 722
table public.users: INSERT: id[int]:1 name[text]:'Douglas'
COMMIT 722
BEGIN 723
table public.users: DELETE: id[int]:1
COMMIT 723
# Unlike Project 1, you saw the DELETE!
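A hedged sketch of the streamer using psycopg2’s replication support (`LogicalReplicationConnection`, `create_replication_slot`, `start_replication`, `consume_stream` are real psycopg2 API; the slot name and DSN are assumptions):

```python
def handle_payload(payload: str) -> str:
    """Extract the operation from a test_decoding line, e.g. 'INSERT'."""
    # Data lines look like: "table public.users: INSERT: id[int]:1 ..."
    if payload.startswith("table "):
        return payload.split(": ")[1].rstrip(":")
    return payload.split()[0]  # BEGIN / COMMIT lines

if __name__ == "__main__":
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=test",
        connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()
    cur.create_replication_slot("my_learning_slot",
                                output_plugin="test_decoding")
    cur.start_replication(slot_name="my_learning_slot", decode=True)

    def consume(msg):
        print(handle_payload(msg.payload))
        # Feedback tells Postgres it may recycle WAL up to this LSN;
        # forget this and the slot pins WAL forever (the disk-filling danger).
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)  # blocks, sending status updates in the loop
```

The `send_feedback` call is the "I'm still alive, and I've durably handled everything up to here" message the Hints section describes.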
Project 4: The Kafka Bridge (Shipping Events to the Log)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Message Queues / Event Streaming
- Software or Tool: Apache Kafka, Python
- Main Book: “Kafka: The Definitive Guide” (Ch. 3)
What you’ll build: A Python script that reads from the Postgres Replication Slot and publishes every change as a JSON message to a Kafka topic.
Why it teaches CDC: This is the “Capture” and “Transport” phase. You’ll learn how to map database tables to Kafka topics and how to represent a row change as a JSON payload that downstream apps can understand.
Core challenges you’ll face:
- Topic Mapping → maps to Table-per-topic vs. One-big-topic strategies
- JSON Schema → maps to Including ‘before’ and ‘after’ images
- Throughput → maps to Batching Kafka produce calls
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 3, Kafka running locally (use Docker).
Real World Outcome
A Kafka console consumer will scroll with JSON events every time you touch your database.
Example Output:
// kafka-console-consumer --topic db_users
{
"op": "u",
"ts_ms": 1703764800000,
"before": {"id": 1, "name": "Douglas"},
"after": {"id": 1, "name": "Douglas (Expert)"},
"source": {"lsn": "0/16B20A0", "xid": 725}
}
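The bridge boils down to two decisions: the envelope layout and the topic mapping. A sketch mirroring the JSON above; the `kafka-python` client, the `db_` prefix, and the field names are assumptions:

```python
import json
import time

def make_event(op, table, before, after, lsn, xid):
    """Build the JSON payload for one row change ('c'reate/'u'pdate/'d'elete)."""
    return {
        "op": op,
        "ts_ms": int(time.time() * 1000),
        "before": before,   # None for inserts
        "after": after,     # None for deletes
        "source": {"table": table, "lsn": lsn, "xid": xid},
    }

def topic_for(table: str, prefix: str = "db") -> str:
    """Table-per-topic mapping: public.users -> db_users."""
    return f"{prefix}_{table.split('.')[-1]}"

if __name__ == "__main__":
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    evt = make_event("u", "public.users",
                     {"id": 1, "name": "Douglas"},
                     {"id": 1, "name": "Douglas (Expert)"},
                     "0/16B20A0", 725)
    # Key by primary key so all changes to one row land in one partition,
    # preserving per-row ordering for downstream consumers.
    producer.send(topic_for("public.users"), key=b"1", value=evt)
    producer.flush()
```

Keying by primary key is the design choice that matters most here: Kafka only orders messages within a partition.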
Project 5: The Exactly-Once Producer (Mastering Idempotency)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Consistency / Transactional Messaging
- Software or Tool: Kafka, Postgres
- Main Book: “Designing Data-Intensive Applications” (Ch. 8 - Distributed Systems)
What you’ll build: You’ll upgrade your Project 4 bridge to handle crashes. If your script dies after sending to Kafka but before confirming to Postgres, it will send duplicates. You will implement a “State Store” (in a local file or Postgres table) to track the last LSN successfully sent to Kafka.
Why it teaches CDC: This is the hardest part of CDC. In a distributed system, “Exactly-Once” is a lie—what we really do is “At-Least-Once” with “Idempotency.” You’ll learn how to make your pipeline resume exactly where it left off.
Core challenges you’ll face:
- Atomic Commits → maps to The struggle of distributed snapshots
- Producer ID matching → maps to Kafka’s idempotent producer features
- LSN Persistence → maps to Where do you store your ‘bookmark’?
Difficulty: Expert Time estimate: 2 weeks Prerequisites: Project 4.
The Core Question You’re Answering
“If my CDC script crashes at 3:00 AM, how do I ensure I don’t lose data OR double-count sales when it restarts?”
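The answer starts with a durable bookmark. A minimal sketch of an LSN checkpoint file (the path is an assumption; the atomic-rename trick avoids a half-written file if you crash mid-save):

```python
import os
import tempfile
from typing import Optional

CHECKPOINT = "cdc_checkpoint.lsn"

def save_lsn(lsn: str, path: str = CHECKPOINT) -> None:
    """Write the LSN to a temp file, then atomically rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(lsn)
    os.replace(tmp, path)  # atomic: readers see the old or new LSN, never half

def load_lsn(path: str = CHECKPOINT) -> Optional[str]:
    """Return the last acknowledged LSN, or None on first run."""
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None
```

Save the LSN only after Kafka acknowledges the produce. If you crash between the acknowledgment and the save, you will re-send those events on restart, which is exactly why downstream consumers must be idempotent.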
Project 6: Snapshot + Stream (The “Full Sync” Problem)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Data Synchronization / Database Snapshots
- Software or Tool: Postgres, Kafka
- Main Book: Debezium Documentation (Architecture Section)
What you’ll build: A system that first performs a SELECT * on the entire table (the “Snapshot”) and then seamlessly switches to reading the WAL (the “Stream”) without missing any changes that happened during the snapshot.
Why it teaches CDC: Real CDC isn’t just about new changes; it’s about getting the initial state right. You’ll learn about “Consistent Snapshots” and how to merge static data with a live stream.
Core challenges you’ll face:
- Locking → maps to How to snapshot without blocking the app
- LSN Alignment → maps to Finding the exact WAL position where the snapshot ended
- Duplicate handling during handoff → maps to The ‘overlap’ period
Difficulty: Expert Time estimate: 2 weeks
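The handoff logic can be sketched without a database: record the LSN where the snapshot ended, then drop any streamed change at or below it, because the snapshot already captured that state. A pure-logic sketch (function names are my own):

```python
def lsn_int(lsn: str) -> int:
    """Textual LSN ('0/16B1970') to a comparable 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def merge_snapshot_and_stream(snapshot_rows, stream_events, snapshot_lsn):
    """Emit snapshot rows first, then only stream events past the snapshot LSN."""
    cutoff = lsn_int(snapshot_lsn)
    out = [("snapshot", row) for row in snapshot_rows]
    for lsn, event in stream_events:
        if lsn_int(lsn) > cutoff:  # overlap events are duplicates: skip them
            out.append(("stream", event))
    return out

merged = merge_snapshot_and_stream(
    [{"id": 1}], [("0/10", "dup insert"), ("0/30", "new update")], "0/20")
print(merged)  # the '0/10' event is dropped; '0/30' survives
```

The hard production problem this glosses over is obtaining `snapshot_lsn` correctly: the slot creation and the snapshot `SELECT` must refer to the same consistent point, which is what Postgres’s exported snapshots are for.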
Project 7: The Outbox Pattern (Transactional Messaging)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: SQL / Python
- Alternative Programming Languages: Java, C#
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Microservices / Patterns
- Software or Tool: Postgres, Kafka
- Main Book: “Microservices Patterns” by Chris Richardson
What you’ll build: You will change your application logic. Instead of capturing changes from the users table directly, you will write events to an outbox table within the same transaction as your business logic. Your CDC script will only tail the outbox table.
Why it teaches CDC: This is how elite engineering teams solve the distributed transaction problem. It ensures that an event is only sent if the database transaction succeeds.
Core challenges you’ll face:
- Atomic Writes → maps to Guaranteed delivery
- Event Cleaning → maps to Deleting processed outbox entries
- JSON Payload generation → maps to Structuring business events vs. raw row changes
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 4.
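The whole pattern is one transaction with two INSERTs. A sketch with a stand-in cursor so the shape is visible without a database; the `orders`/`outbox` table names and columns are assumptions:

```python
import json
import uuid

def place_order_with_outbox(cur, order):
    """Write the business row and its event atomically; the caller commits."""
    cur.execute(
        "INSERT INTO orders (id, amount) VALUES (%s, %s)",
        (order["id"], order["amount"]))
    # Same transaction: if either INSERT fails, both roll back, so an
    # event can never exist without its order (and vice versa).
    cur.execute(
        "INSERT INTO outbox (id, aggregate, type, payload) VALUES (%s, %s, %s, %s)",
        (str(uuid.uuid4()), "order", "OrderPlaced", json.dumps(order)))

class RecordingCursor:
    """Stand-in cursor for the sketch: records what would hit the DB."""
    def __init__(self):
        self.statements = []
    def execute(self, sql, params=None):
        self.statements.append((sql, params))

cur = RecordingCursor()
place_order_with_outbox(cur, {"id": 7, "amount": 99})
assert len(cur.statements) == 2  # one business write + one outbox write
```

Your CDC pipeline then tails only the `outbox` table, so downstream consumers see curated business events, never raw row images of internal tables.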
Project 8: Schema Evolution (The Schema Registry)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python / Avro
- Alternative Programming Languages: Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Engineering / Governance
- Software or Tool: Confluent Schema Registry, Avro
- Main Book: “Designing Data-Intensive Applications” (Ch. 4 - Encoding)
What you’ll build: A system that uses Avro instead of JSON for Kafka messages. It will integrate with a Schema Registry. When you add a column to your Postgres table, your CDC script must detect it and update the schema in the registry.
Why it teaches CDC: In production, raw JSON breaks. You need schemas to ensure that downstream consumers don’t crash when you change a database column.
Core challenges you’ll face:
- Backward Compatibility → maps to Ensuring old consumers still work
- Avro Serialization → maps to Binary encoding benefits
- Dynamic Schema Inference → maps to Mapping SQL types to Avro types
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 4.
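Dynamic schema inference starts with a type map. A deliberately partial sketch (real connectors cover many more Postgres types, numerics and timestamps included; the mapping and fallback choices here are assumptions):

```python
# Partial Postgres-to-Avro type mapping for illustration only.
PG_TO_AVRO = {
    "integer": "int",
    "bigint": "long",
    "text": "string",
    "boolean": "boolean",
    "double precision": "double",
}

def avro_schema(table: str, columns: dict) -> dict:
    """Build an Avro record schema; unknown Postgres types fall back to string."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # A ["null", X] union with a null default keeps the schema
            # backward compatible when a newly added column is absent
            # from rows written before the ALTER TABLE.
            {"name": col,
             "type": ["null", PG_TO_AVRO.get(pg_type, "string")],
             "default": None}
            for col, pg_type in columns.items()
        ],
    }

schema = avro_schema("users", {"id": "integer", "name": "text"})
print(schema["fields"][0])
```

When you detect a new column in the decoded stream, regenerate the schema and register the new version; the registry’s compatibility check is what saves your old consumers.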
Project 9: Debezium - The Industry Standard
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: YAML / JSON
- Alternative Programming Languages: Java (to extend)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Infrastructure / DevOps
- Software or Tool: Debezium, Kafka Connect, Docker
- Main Book: Debezium Official Documentation
What you’ll build: You will set up Debezium using Kafka Connect. You will configure the Postgres connector to do everything you built in Projects 1-8 automatically.
Why it teaches CDC: Now that you’ve built it from scratch, you’ll understand exactly what Debezium is doing. You’ll understand terms like slot.name, publication.name, and plugin.name because you’ve touched them at the protocol level.
Core challenges you’ll face:
- Kafka Connect Setup → maps to Managing distributed connectors
- Configuration Tuning → maps to LSN committed intervals
- Signal Tables → maps to Triggering ad-hoc snapshots
Difficulty: Intermediate Time estimate: 1 week Prerequisites: All previous projects.
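Once Kafka Connect is running, registering the connector is one JSON POST. A hedged example using property keys from the Debezium Postgres connector; hostnames, credentials, and the slot/publication/topic names are placeholders for your environment:

```json
{
  "name": "users-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "secret",
    "database.dbname": "test",
    "topic.prefix": "app",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub",
    "table.include.list": "public.users"
  }
}
```

Every line should now feel familiar: `slot.name` is your Project 3 replication slot, `plugin.name` is the output plugin from the Core Concepts, and `topic.prefix` drives the table-to-topic mapping you built in Project 4.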
Prepare to answer these:
- “Why can’t I just use a trigger to send events to Kafka?” (Hint: Performance, atomicity, and transaction rollbacks).
- “What happens to Postgres if your logical decoder script crashes and is down for 24 hours?” (Hint: Disk space/WAL retention).
- “What is the difference between a Publication and a Replication Slot?”
- “How does Postgres know which WAL records to send to which slot?”
- “Can you have two scripts reading from the same Replication Slot?” (Hint: No, one slot = one consumer).
Hints in Layers (Project 3)
Hint 1: The Connection
You cannot use a standard connection. With psycopg2, use connection_factory=psycopg2.extras.LogicalReplicationConnection (which sets the libpq replication=database option under the hood), or add replication=database to the connection string yourself.
Hint 2: The Plugin
When creating the slot, specify test_decoding. It’s built into Postgres and outputs simple strings. wal2json is better for production, but test_decoding is better for learning.
Hint 3: The Loop
You need a while True loop that calls read_message(). Some messages are “Primary Keepalive” messages from Postgres—you must answer them with a “Feedback” message containing the LSN you just read, or Postgres will think you are dead and close the connection.