Learn Change Data Capture (CDC): From Zero to WAL Master
Goal: Deeply understand the mechanics of Change Data Capture (CDC) by building systems that tail database transaction logs (Postgres WAL). You will move from simple polling to mastering logical decoding, exactly-once delivery, and streaming changes into Kafka. By the end, you’ll understand why “the log is the database” and how to build production-grade real-time data pipelines.
Why CDC Matters
Write-Ahead Logging (WAL) emerged decades ago to solve database durability. For most of that time, these logs were internal secrets of the database engine, used only for crash recovery. Change Data Capture (CDC) liberated these logs, turning the database’s internal history into a first-class event stream.
Modern architectures face the “Dual Write” problem: how do you update a database AND a search index (Elasticsearch) AND a cache (Redis) without them getting out of sync? CDC is the answer. It treats the database as the source of truth and asynchronously propagates every single bit of history to the rest of the world.
The Evolution of Capturing Changes
1. The “Stone Age”: Dual Writes
The application tries to write to two places at once. The Struggle: If the DB write succeeds but the Kafka write fails, you are inconsistent forever.
┌─────────────┐    1. SQL Update    ┌───────────┐
│ Application │────────────────────▶│ Database  │
└──────┬──────┘                     └───────────┘
       │            2. Send Event   ┌───────────┐
       └───────────────────────────▶│   Kafka   │  (What if this fails?)
                                    └───────────┘
2. The “Bronze Age”: Polling
The system periodically runs SELECT * FROM table WHERE updated_at > last_seen.
The Struggle: You can’t see DELETES (they are gone), and you put a heavy load on the DB.
┌───────────┐    SELECT WHERE...    ┌───────────┐
│  Poller   │──────────────────────▶│ Database  │
└───────────┘                       └───────────┘
3. The “Golden Age”: Log-Based CDC
We tail the Write-Ahead Log (WAL). The Outcome: Low overhead, zero missed changes (including deletes), and transactional integrity.
┌───────────┐    Read binary log    ┌───────────┐
│ CDC Agent │◀──────────────────────│ WAL Files │
└─────┬─────┘                       └───────────┘
      │           Stream Events     ┌───────────┐
      └────────────────────────────▶│   Kafka   │
                                    └───────────┘
Core Concept Analysis
1. The Write-Ahead Log (WAL)
Before Postgres changes a single byte in a data file, it writes the change to the WAL. It’s an append-only sequence of binary records. If the power goes out, Postgres replays this log to recover.
WAL SEGMENT FILE
[LSN 0/1A2B] -> BEGIN Transaction 101
[LSN 0/1A2C] -> INSERT INTO users (id, name) VALUES (1, 'Douglas')
[LSN 0/1A2D] -> COMMIT Transaction 101
[LSN 0/1A2E] -> BEGIN Transaction 102
...
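LSNs like `0/1A2B` are not opaque strings: they are two hex numbers, the high 32 bits and a byte offset into the WAL, so they can be compared numerically to order events. A minimal sketch of converting between the textual and integer forms (the helper names are my own):

```python
# An LSN such as '0/16B1970' encodes a 64-bit WAL position:
# value = (high 32 bits) << 32 | (low 32 bits, a byte offset).

def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN ('0/16B1970') to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def int_to_lsn(value: int) -> str:
    """Convert a 64-bit integer back to Postgres's textual LSN form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"

# A later change always has a numerically larger LSN:
assert lsn_to_int("0/1A2C") > lsn_to_int("0/1A2B")
print(int_to_lsn(lsn_to_int("0/16B1970")))  # → 0/16B1970
```

You will use this conversion constantly: every "did I already process this change?" question in CDC is an LSN comparison.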
2. Logical Decoding vs. Physical Replication
- Physical Replication: Copies raw disk blocks. Great for clones, useless for humans.
- Logical Decoding: Translates binary WAL records into “Logical” changes (e.g., “User 1 was inserted”). This is what makes CDC possible.
3. Replication Slots (The Bookmark)
A replication slot is a pointer managed by Postgres. It ensures that Postgres never deletes WAL files until the CDC consumer has confirmed it read them. DANGER: If your CDC consumer stops running, the replication slot will hold onto WAL files forever, eventually filling up your disk and crashing the database!
WAL STREAM: [Record 1][Record 2][Record 3][Record 4][Record 5]
                                         ^
                                         │
                              REPLICATION SLOT (LSN 3)
                           (Postgres keeps Records 4-5)
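Because of this danger, a production system should watch how much WAL each slot is holding back. A hedged sketch, using the real `pg_replication_slots` view and `pg_wal_lsn_diff()` function; the DSN is a placeholder to adjust for your setup:

```python
def wal_retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL Postgres must keep between the slot and the WAL tip."""
    def to_int(lsn: str) -> int:
        high, low = lsn.split("/")
        return (int(high, 16) << 32) | int(low, 16)
    return to_int(current_lsn) - to_int(restart_lsn)

if __name__ == "__main__":
    import psycopg2  # the live query needs a database; the helper above does not
    with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
        cur.execute("""
            SELECT slot_name, active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
            FROM pg_replication_slots
        """)
        for slot_name, active, lag in cur.fetchall():
            # An inactive slot with growing lag_bytes is the disk-filling danger
            print(f"{slot_name}: active={active}, retained WAL = {lag} bytes")
```

Alert on any slot that is inactive while its lag keeps growing; that is the exact failure mode described above.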
4. Output Plugins
Postgres provides the “Decoding Engine,” but you need a “Plugin” to format the data:
- test_decoding: Human-readable text (good for debugging).
- wal2json: Converts changes to JSON.
- pgoutput: Postgres’s native binary format (most efficient, used by Debezium).
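You can try a plugin from plain SQL before writing any streaming client. `pg_create_logical_replication_slot()` and `pg_logical_slot_peek_changes()` are real Postgres functions; the slot name, table, and DSN below are assumptions for illustration:

```python
def create_slot_sql(slot: str, plugin: str = "test_decoding") -> str:
    """SQL that creates a logical slot bound to one output plugin.
    (Sketch only: never interpolate untrusted input into SQL.)"""
    return f"SELECT pg_create_logical_replication_slot('{slot}', '{plugin}')"

def peek_sql(slot: str, limit: int = 10) -> str:
    """SQL that reads decoded changes WITHOUT consuming them."""
    return (f"SELECT lsn, xid, data "
            f"FROM pg_logical_slot_peek_changes('{slot}', NULL, {limit})")

if __name__ == "__main__":
    import psycopg2
    with psycopg2.connect("dbname=test") as conn, conn.cursor() as cur:
        cur.execute(create_slot_sql("peek_slot", "test_decoding"))
        cur.execute("INSERT INTO users (id, name) VALUES (42, 'Ada')")
        conn.commit()
        cur.execute(peek_sql("peek_slot"))
        for lsn, xid, data in cur.fetchall():
            print(lsn, xid, data)  # e.g. "table public.users: INSERT: ..."
```

Swap `test_decoding` for `wal2json` (if installed) and the same peek returns JSON instead of text: the decoding engine is identical, only the formatter changes.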
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| WAL & LSN | The Write-Ahead Log is the true history. The Log Sequence Number (LSN) is your unique pointer in time. |
| Logical Decoding | The mechanism that turns binary junk into “INSERT/UPDATE/DELETE” events. |
| Replication Slots | The “contract” between DB and Consumer. DB promises to keep logs; Consumer promises to read them. |
| Output Plugins | The formatter. You decide if you want JSON, Protocol Buffers, or raw text. |
| At-Least-Once | CDC usually guarantees you get the message, but you might get it twice if you crash and restart. |
Deep Dive Reading by Concept
Foundations of Storage & Streams
| Concept | Book & Chapter |
|---|---|
| Log-Structured Storage | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 3: “Storage and Retrieval” |
| Replication Mechanics | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 5: “Replication” |
| The Log as a DB | Designing Data-Intensive Applications by Martin Kleppmann — Ch. 11: “Stream Processing” |
Postgres Specifics
| Concept | Book & Chapter |
|---|---|
| WAL Internals | PostgreSQL 14 Internals by Egor Rogov — Ch. 10: “WAL” |
| Logical Decoding | PostgreSQL Documentation — Ch. 49: “Logical Decoding” |
Essential Reading Order
- The “Why” (Day 1):
  - DDIA Ch. 11 (Stream Processing) - The section on “Change Data Capture”.
- The “How” (Theory) (Day 2):
  - DDIA Ch. 3 (Storage Engines) - Focus on “SSTables and LSM-Trees” to understand log-based storage.
- The “How” (Practice) (Day 3):
  - The projects below, starting with Project 1.
Project 1: The “Old Way” - Building a Manual Poller
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Node.js
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: SQL / Polling Patterns
- Software or Tool: PostgreSQL, Python
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann (Ch. 11)
What you’ll build: A Python script that polls a database table every 5 seconds to find new or updated rows using an updated_at column and a last_seen_timestamp file.
Why it teaches CDC: You need to feel the pain of polling before you can appreciate the beauty of logs. You’ll realize that if a row is deleted, your poller never sees it. If a row is updated twice between polls, you lose the first update.
Core challenges you’ll face:
- Handling Deletes → maps to The visibility gap in query-based CDC
- Database Load → maps to The cost of repetitive SELECT queries
- Precision → maps to Milliseconds/Clock skew issues
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python, Postgres installed.
Real World Outcome
You’ll have a script that “detects” changes. You will perform a DELETE in SQL and watch your script stay completely silent—demonstrating the fatal flaw of polling.
Example Output:
$ python poller.py
[2025-12-28 10:00:00] Polling... No changes.
[2025-12-28 10:00:05] Polling... Found 1 change: User(id=5, name='Bob')
[2025-12-28 10:00:10] Polling... No changes.
# User deletes 'Bob' in psql...
[2025-12-28 10:00:15] Polling... No changes. (Bob is gone, but we don't know!)
The Core Question You’re Answering
“If a record disappears in the forest and no one is polling, does it make an event?”
Polling is “State-based” capture. You only see what the state is now. Log-based CDC is “Event-based” capture. You see what happened.
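The poller’s core loop fits in a few lines. A minimal sketch: the table and column names are assumptions, and the fetch function is injected so the bookmark logic stands on its own:

```python
import time

def poll_once(fetch_changes, last_seen):
    """One polling pass: ask for rows newer than last_seen, advance the bookmark."""
    rows = fetch_changes(last_seen)  # e.g. SELECT ... WHERE updated_at > %s
    if rows:
        last_seen = max(r["updated_at"] for r in rows)
    return rows, last_seen

if __name__ == "__main__":
    import psycopg2
    conn = psycopg2.connect("dbname=test")

    def fetch_changes(since):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, name, updated_at FROM users WHERE updated_at > %s",
                (since,))
            return [{"id": i, "name": n, "updated_at": t}
                    for i, n, t in cur.fetchall()]

    last_seen = "1970-01-01"
    while True:  # DELETEs will never show up here: the rows are simply gone
        rows, last_seen = poll_once(fetch_changes, last_seen)
        for r in rows:
            print("Found change:", r)
        time.sleep(5)
```

Note the two flaws baked into this design: deleted rows never match the `WHERE` clause, and two updates between polls collapse into one.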
Project 2: The WAL Inspector (Looking at the Binary Heart)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Bash
- Alternative Programming Languages: C (to parse binary)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Filesystems / Binary Data
- Software or Tool: pg_waldump
- Main Book: “PostgreSQL 14 Internals” by Egor Rogov (Ch. 10)
What you’ll build: You won’t “code” a tool yet; you’ll use the low-level pg_waldump utility to extract and interpret the raw binary WAL segments from your disk. You will map a SQL statement to its physical representation on disk.
Why it teaches CDC: This forces you to see that “Memory” and “Tables” are just abstractions. The WAL is the physical truth. You’ll see LSN numbers, XIDs (Transaction IDs), and Resource Managers (RMGRs).
Core challenges you’ll face:
- Finding the WAL directory → maps to Understanding $PGDATA layout
- Filtering Noise → maps to Separating background Postgres tasks from user data
- LSN Mapping → maps to Understanding how Postgres pointers work
Difficulty: Intermediate Time estimate: Weekend Prerequisites: Project 1, root/postgres access to your DB files.
Real World Outcome
You will be able to point to a 16MB binary file on your disk and say: “Line 452 of this dump is where I changed my username.”
Example Output:
$ pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000001 | grep INSERT
rmgr: Heap len (rec/tot): 54/ 54, tx: 721, lsn: 0/16B1970, ...
desc: INSERT off 1 flags 0x00
# You are seeing the physical insertion of a row!
Project 3: The Custom Logical Decoder Client
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: C, Go, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Database Internals / Streaming Protocols
- Software or Tool: psycopg2 or psycopg3 (replication connection)
- Main Book: Postgres Docs (Ch. 49 - Logical Decoding)
What you’ll build: A Python script that opens a “Replication Connection” to Postgres, creates a “Replication Slot,” and uses the test_decoding plugin to stream text-based events.
Why it teaches CDC: This is the first time you interact with the Decoding Engine. You aren’t reading files from disk; you are asking Postgres to stream changes to you in real-time as they are decoded. You will have to handle “Feedback Messages” (Keepalives) to tell Postgres you are still alive.
Core challenges you’ll face:
- Logical Replication Protocol → maps to Streaming vs. Request/Response
- Slot Management → maps to Persistence of the consumer pointer
- Keepalives → maps to Preventing connection timeouts
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 2, basic understanding of Python context managers.
Real World Outcome
A terminal window that stays open and prints every SQL change as it happens, including DELETEs.
Example Output:
$ python logical_streamer.py
Listening on slot 'my_learning_slot'...
BEGIN 722
table public.users: INSERT: id[int]:1 name[text]:'Douglas'
COMMIT 722
BEGIN 723
table public.users: DELETE: id[int]:1
COMMIT 723
# Unlike Project 1, you saw the DELETE!
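A hedged sketch of the streamer using psycopg2’s replication support (`LogicalReplicationConnection`, `create_replication_slot`, `start_replication`, `consume_stream` are real psycopg2 API; the slot name and DSN are assumptions):

```python
def handle_payload(payload: str) -> str:
    """Extract the operation from a test_decoding line, e.g. 'INSERT'."""
    # Data lines look like: "table public.users: INSERT: id[int]:1 ..."
    if payload.startswith("table "):
        return payload.split(": ")[1].rstrip(":")
    return payload.split()[0]  # BEGIN / COMMIT lines

if __name__ == "__main__":
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=test",
        connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()
    cur.create_replication_slot("my_learning_slot",
                                output_plugin="test_decoding")
    cur.start_replication(slot_name="my_learning_slot", decode=True)

    def consume(msg):
        print(handle_payload(msg.payload))
        # Feedback tells Postgres it may recycle WAL up to this LSN;
        # forget this and the slot pins WAL forever (the disk-filling danger).
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)  # blocks, sending status updates in the loop
```

The `send_feedback` call is the "I'm still alive, and I've durably handled everything up to here" message the Hints section describes.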
Project 4: The Kafka Bridge (Shipping Events to the Log)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Message Queues / Event Streaming
- Software or Tool: Apache Kafka, Python
- Main Book: “Kafka: The Definitive Guide” (Ch. 3)
What you’ll build: A Python script that reads from the Postgres Replication Slot and publishes every change as a JSON message to a Kafka topic.
Why it teaches CDC: This is the “Capture” and “Transport” phase. You’ll learn how to map database tables to Kafka topics and how to represent a row change as a JSON payload that downstream apps can understand.
Core challenges you’ll face:
- Topic Mapping → maps to Table-per-topic vs. One-big-topic strategies
- JSON Schema → maps to Including ‘before’ and ‘after’ images
- Throughput → maps to Batching Kafka produce calls
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 3, Kafka running locally (use Docker).
Real World Outcome
A Kafka console consumer will scroll with JSON events every time you touch your database.
Example Output:
// kafka-console-consumer --topic db_users
{
"op": "u",
"ts_ms": 1703764800000,
"before": {"id": 1, "name": "Douglas"},
"after": {"id": 1, "name": "Douglas (Expert)"},
"source": {"lsn": "0/16B20A0", "xid": 725}
}
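The bridge boils down to two decisions: the envelope layout and the topic mapping. A sketch mirroring the JSON above; the `kafka-python` client, the `db_` prefix, and the field names are assumptions:

```python
import json
import time

def make_event(op, table, before, after, lsn, xid):
    """Build the JSON payload for one row change ('c'reate/'u'pdate/'d'elete)."""
    return {
        "op": op,
        "ts_ms": int(time.time() * 1000),
        "before": before,   # None for inserts
        "after": after,     # None for deletes
        "source": {"table": table, "lsn": lsn, "xid": xid},
    }

def topic_for(table: str, prefix: str = "db") -> str:
    """Table-per-topic mapping: public.users -> db_users."""
    return f"{prefix}_{table.split('.')[-1]}"

if __name__ == "__main__":
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    evt = make_event("u", "public.users",
                     {"id": 1, "name": "Douglas"},
                     {"id": 1, "name": "Douglas (Expert)"},
                     "0/16B20A0", 725)
    # Key by primary key so all changes to one row land in one partition,
    # preserving per-row ordering for downstream consumers.
    producer.send(topic_for("public.users"), key=b"1", value=evt)
    producer.flush()
```

Keying by primary key is the design choice that matters most here: Kafka only orders messages within a partition.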
Project 5: The Exactly-Once Producer (Mastering Idempotency)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Java, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Consistency / Transactional Messaging
- Software or Tool: Kafka, Postgres
- Main Book: “Designing Data-Intensive Applications” (Ch. 8 - Distributed Systems)
What you’ll build: You’ll upgrade your Project 4 bridge to handle crashes. If your script dies after sending to Kafka but before confirming to Postgres, it will send duplicates. You will implement a “State Store” (in a local file or Postgres table) to track the last LSN successfully sent to Kafka.
Why it teaches CDC: This is the hardest part of CDC. In a distributed system, “Exactly-Once” is a lie—what we really do is “At-Least-Once” with “Idempotency.” You’ll learn how to make your pipeline resume exactly where it left off.
Core challenges you’ll face:
- Atomic Commits → maps to The struggle of distributed snapshots
- Producer ID matching → maps to Kafka’s idempotent producer features
- LSN Persistence → maps to Where do you store your ‘bookmark’?
Difficulty: Expert Time estimate: 2 weeks Prerequisites: Project 4.
The Core Question You’re Answering
“If my CDC script crashes at 3:00 AM, how do I ensure I don’t lose data OR double-count sales when it restarts?”
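The answer starts with a durable bookmark. A minimal sketch of an LSN checkpoint file (the path is an assumption; the atomic-rename trick avoids a half-written file if you crash mid-save):

```python
import os
import tempfile
from typing import Optional

CHECKPOINT = "cdc_checkpoint.lsn"

def save_lsn(lsn: str, path: str = CHECKPOINT) -> None:
    """Write the LSN to a temp file, then atomically rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(lsn)
    os.replace(tmp, path)  # atomic: readers see the old or new LSN, never half

def load_lsn(path: str = CHECKPOINT) -> Optional[str]:
    """Return the last acknowledged LSN, or None on first run."""
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None
```

Save the LSN only after Kafka acknowledges the produce. If you crash between the acknowledgment and the save, you will re-send those events on restart, which is exactly why downstream consumers must be idempotent.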
Project 6: Snapshot + Stream (The “Full Sync” Problem)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Data Synchronization / Database Snapshots
- Software or Tool: Postgres, Kafka
- Main Book: Debezium Documentation (Architecture Section)
What you’ll build: A system that first performs a SELECT * on the entire table (the “Snapshot”) and then seamlessly switches to reading the WAL (the “Stream”) without missing any changes that happened during the snapshot.
Why it teaches CDC: Real CDC isn’t just about new changes; it’s about getting the initial state right. You’ll learn about “Consistent Snapshots” and how to merge static data with a live stream.
Core challenges you’ll face:
- Locking → maps to How to snapshot without blocking the app
- LSN Alignment → maps to Finding the exact WAL position where the snapshot ended
- Duplicate handling during handoff → maps to The ‘overlap’ period
Difficulty: Expert Time estimate: 2 weeks
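The handoff logic can be sketched without a database: record the LSN where the snapshot ended, then drop any streamed change at or below it, because the snapshot already captured that state. A pure-logic sketch (function names are my own):

```python
def lsn_int(lsn: str) -> int:
    """Textual LSN ('0/16B1970') to a comparable 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def merge_snapshot_and_stream(snapshot_rows, stream_events, snapshot_lsn):
    """Emit snapshot rows first, then only stream events past the snapshot LSN."""
    cutoff = lsn_int(snapshot_lsn)
    out = [("snapshot", row) for row in snapshot_rows]
    for lsn, event in stream_events:
        if lsn_int(lsn) > cutoff:  # overlap events are duplicates: skip them
            out.append(("stream", event))
    return out

merged = merge_snapshot_and_stream(
    [{"id": 1}], [("0/10", "dup insert"), ("0/30", "new update")], "0/20")
print(merged)  # the '0/10' event is dropped; '0/30' survives
```

The hard production problem this glosses over is obtaining `snapshot_lsn` correctly: the slot creation and the snapshot `SELECT` must refer to the same consistent point, which is what Postgres’s exported snapshots are for.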
Project 7: The Outbox Pattern (Transactional Messaging)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: SQL / Python
- Alternative Programming Languages: Java, C#
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Microservices / Patterns
- Software or Tool: Postgres, Kafka
- Main Book: “Microservices Patterns” by Chris Richardson
What you’ll build: You will change your application logic. Instead of capturing changes from the users table directly, you will write events to an outbox table within the same transaction as your business logic. Your CDC script will only tail the outbox table.
Why it teaches CDC: This is how elite engineering teams solve the distributed transaction problem. It ensures that an event is only sent if the database transaction succeeds.
Core challenges you’ll face:
- Atomic Writes → maps to Guaranteed delivery
- Event Cleaning → maps to Deleting processed outbox entries
- JSON Payload generation → maps to Structuring business events vs. raw row changes
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 4.
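The whole pattern is one transaction with two INSERTs. A sketch with a stand-in cursor so the shape is visible without a database; the `orders`/`outbox` table names and columns are assumptions:

```python
import json
import uuid

def place_order_with_outbox(cur, order):
    """Write the business row and its event atomically; the caller commits."""
    cur.execute(
        "INSERT INTO orders (id, amount) VALUES (%s, %s)",
        (order["id"], order["amount"]))
    # Same transaction: if either INSERT fails, both roll back, so an
    # event can never exist without its order (and vice versa).
    cur.execute(
        "INSERT INTO outbox (id, aggregate, type, payload) VALUES (%s, %s, %s, %s)",
        (str(uuid.uuid4()), "order", "OrderPlaced", json.dumps(order)))

class RecordingCursor:
    """Stand-in cursor for the sketch: records what would hit the DB."""
    def __init__(self):
        self.statements = []
    def execute(self, sql, params=None):
        self.statements.append((sql, params))

cur = RecordingCursor()
place_order_with_outbox(cur, {"id": 7, "amount": 99})
assert len(cur.statements) == 2  # one business write + one outbox write
```

Your CDC pipeline then tails only the `outbox` table, so downstream consumers see curated business events, never raw row images of internal tables.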
Project 8: Schema Evolution (The Schema Registry)
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: Python / Avro
- Alternative Programming Languages: Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Engineering / Governance
- Software or Tool: Confluent Schema Registry, Avro
- Main Book: “Designing Data-Intensive Applications” (Ch. 4 - Encoding)
What you’ll build: A system that uses Avro instead of JSON for Kafka messages. It will integrate with a Schema Registry. When you add a column to your Postgres table, your CDC script must detect it and update the schema in the registry.
Why it teaches CDC: In production, raw JSON breaks. You need schemas to ensure that downstream consumers don’t crash when you change a database column.
Core challenges you’ll face:
- Backward Compatibility → maps to Ensuring old consumers still work
- Avro Serialization → maps to Binary encoding benefits
- Dynamic Schema Inference → maps to Mapping SQL types to Avro types
Difficulty: Advanced Time estimate: 1 week Prerequisites: Project 4.
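Dynamic schema inference starts with a type map. A deliberately partial sketch (real connectors cover many more Postgres types, numerics and timestamps included; the mapping and fallback choices here are assumptions):

```python
# Partial Postgres-to-Avro type mapping for illustration only.
PG_TO_AVRO = {
    "integer": "int",
    "bigint": "long",
    "text": "string",
    "boolean": "boolean",
    "double precision": "double",
}

def avro_schema(table: str, columns: dict) -> dict:
    """Build an Avro record schema; unknown Postgres types fall back to string."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # A ["null", X] union with a null default keeps the schema
            # backward compatible when a newly added column is absent
            # from rows written before the ALTER TABLE.
            {"name": col,
             "type": ["null", PG_TO_AVRO.get(pg_type, "string")],
             "default": None}
            for col, pg_type in columns.items()
        ],
    }

schema = avro_schema("users", {"id": "integer", "name": "text"})
print(schema["fields"][0])
```

When you detect a new column in the decoded stream, regenerate the schema and register the new version; the registry’s compatibility check is what saves your old consumers.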
Project 9: Debezium - The Industry Standard
- File: LEARN_CDC_POSTGRES_WAL_KAFKA.md
- Main Programming Language: YAML / JSON
- Alternative Programming Languages: Java (to extend)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Infrastructure / DevOps
- Software or Tool: Debezium, Kafka Connect, Docker
- Main Book: Debezium Official Documentation
What you’ll build: You will set up Debezium using Kafka Connect. You will configure the Postgres connector to do everything you built in Projects 1-8 automatically.
Why it teaches CDC: Now that you’ve built it from scratch, you’ll understand exactly what Debezium is doing. You’ll understand terms like slot.name, publication.name, and plugin.name because you’ve touched them at the protocol level.
Core challenges you’ll face:
- Kafka Connect Setup → maps to Managing distributed connectors
- Configuration Tuning → maps to LSN committed intervals
- Signal Tables → maps to Triggering ad-hoc snapshots
Difficulty: Intermediate Time estimate: 1 week Prerequisites: All previous projects.
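Once Kafka Connect is running, registering the connector is one JSON POST. A hedged example using property keys from the Debezium Postgres connector; hostnames, credentials, and the slot/publication/topic names are placeholders for your environment:

```json
{
  "name": "users-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "secret",
    "database.dbname": "test",
    "topic.prefix": "app",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub",
    "table.include.list": "public.users"
  }
}
```

Every line should now feel familiar: `slot.name` is your Project 3 replication slot, `plugin.name` is the output plugin from the Core Concepts, and `topic.prefix` drives the table-to-topic mapping you built in Project 4.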
Prepare to answer these:
- “Why can’t I just use a trigger to send events to Kafka?” (Hint: Performance, atomicity, and transaction rollbacks).
- “What happens to Postgres if your logical decoder script crashes and is down for 24 hours?” (Hint: Disk space/WAL retention).
- “What is the difference between a Publication and a Replication Slot?”
- “How does Postgres know which WAL records to send to which slot?”
- “Can you have two scripts reading from the same Replication Slot?” (Hint: No, one slot = one consumer).
Hints in Layers (Project 3)
Hint 1: The Connection
You cannot use a standard connection. With psycopg2, use connection_factory=psycopg2.extras.LogicalReplicationConnection (which sets the libpq replication=database option under the hood), or add replication=database to the connection string yourself.
Hint 2: The Plugin
When creating the slot, specify test_decoding. It’s built into Postgres and outputs simple strings. wal2json is better for production, but test_decoding is better for learning.
Hint 3: The Loop
You need a while True loop that calls read_message(). Some messages are “Primary Keepalive” messages from Postgres—you must answer them with a “Feedback” message containing the LSN you just read, or Postgres will think you are dead and close the connection.