Project 10: Logical Replication Pipeline
Stream changes from a publisher to a subscriber and validate lag.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4 |
| Time Estimate | 24-40 hours |
| Main Programming Language | SQL |
| Alternative Programming Languages | Shell for automation |
| Coolness Level | See REFERENCE.md |
| Business Potential | See REFERENCE.md |
| Prerequisites | PostgreSQL installed, psql available |
| Key Topics | Replication, Consistency |
1. Learning Objectives
- Explain the core ideas behind Logical Replication Pipeline and why they matter.
- Demonstrate the workflow with a repeatable PostgreSQL session transcript.
- Identify and explain key metrics or evidence from the run.
- Document findings in a short operational report.
2. All Theory Needed (Per-Concept Breakdown)
Backup, Recovery, and Replication
Fundamentals PostgreSQL durability depends on the write-ahead log and backup strategies. Base backups capture a consistent snapshot. WAL archiving allows point-in-time recovery (PITR). Replication can be physical (byte-level) or logical (row-level change streams). Logical replication uses publish and subscribe with replication identity. Understanding these mechanisms is essential for data safety and high availability. Without a tested backup and replication plan, you do not have a reliable system. Reliability is therefore an engineering discipline, not a feature toggle. It requires routine drills and documented procedures, plus clear recovery objectives. It is a core part of operational trust.
Deep Dive Backup and recovery in PostgreSQL revolve around WAL. The server writes all changes to WAL before data files, ensuring crash safety. For disaster recovery, you need base backups and WAL archives. A base backup is a full copy of the database cluster. WAL archives capture changes since that backup. By restoring the base backup and replaying WAL to a target timestamp, you can recover to a precise point in time. This is critical for protection against accidental deletes or corruption.
Replication builds on WAL. Physical replication streams WAL records to a standby, which replays them to keep a binary copy. This is ideal for high availability and read replicas. Logical replication instead extracts row-level changes and applies them on subscribers. It is more flexible: you can replicate selected tables, run different major versions, or filter data. The tradeoff is complexity and limitations: DDL changes are not automatically replicated and must be synchronized manually. Logical replication uses a publish/subscribe model where a publication defines which tables are replicated and a subscription pulls changes from the publisher.
Replication identities matter because logical replication needs a stable way to identify rows. Primary keys are the usual identity. Without them, updates and deletes cannot be replicated cleanly. This is why logical replication reinforces good schema design.
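A minimal sketch of inspecting and adjusting replication identity; the table name events is hypothetical:
-- 'd' = default (primary key), 'n' = nothing, 'f' = full row, 'i' = index
SELECT relname, relreplident FROM pg_class WHERE relname = 'events';
-- A table without a primary key can still replicate UPDATE/DELETE if you opt into
-- full-row identity, at the cost of larger WAL records:
ALTER TABLE events REPLICA IDENTITY FULL;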
Recovery is not just about restoring data. It is also about ensuring consistent configuration, roles, and extensions. A production-grade backup strategy includes configuration backups, version compatibility checks, and regular recovery drills. The PITR process should be tested, not just assumed to work.
Understanding backup and replication is essential for scaling and reliability. Many outages are not due to data loss but to slow or incorrect recovery. A skilled PostgreSQL engineer can restore a system quickly because they understand WAL, recovery timelines, and replication lag.
Replication introduces operational decisions. Physical replication provides a byte-for-byte copy and is ideal for failover, but it ties you to the same major version and architecture. Logical replication offers flexibility for upgrades and selective data sharing, but it requires careful DDL coordination and monitoring. For example, if you rename a column on the publisher, the subscriber may fail until the change is applied there too. These are not edge cases; they are common operational workflows. Backup strategies also vary by workload. For small databases, nightly base backups may be sufficient. For larger systems, continuous archiving and frequent incremental backups become necessary. You must also consider retention and encryption, because backups often contain the most sensitive data. Recovery time objectives (RTO) and recovery point objectives (RPO) should guide how you design your backup cadence and storage. PostgreSQL gives you the primitives; you must design the policy.
Replication lag is not just a number; it is a risk signal. A replica that is minutes behind can lead to stale reads or slow failovers. Monitoring lag and understanding its causes (network throughput, I/O contention, or long-running transactions on the replica) helps you decide whether to fail over or to wait. This is why replication monitoring is part of operational readiness, not just a dashboard metric.
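A hedged sketch of how lag might be observed, assuming PostgreSQL 10 or later; run the first query on the publisher and the second on the subscriber:
-- Publisher: approximate lag in bytes for each attached subscriber (walsender).
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
-- Subscriber: when changes were last received and applied.
SELECT subname, received_lsn, latest_end_time FROM pg_stat_subscription;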
Definitions and key terms
- Base backup: full snapshot of the database cluster.
- WAL archiving: storing WAL for recovery.
- PITR: point-in-time recovery.
- Logical replication: row-level change replication.
Mental model diagram
Base backup + WAL archive -> Restore -> Replay -> Target time
Publisher -> Publication -> Subscriber (logical replication)
How it works (step-by-step, include invariants and failure modes)
- Take base backup (invariant: consistent snapshot).
- Archive WAL continuously (failure: gaps break recovery).
- Restore and replay WAL (failure: wrong timeline or missing WAL).
- Configure replication (failure: missing replication identity).
Minimal concrete example (pseudocode, config, protocol transcript, or CLI output)
Backup workflow (pseudo; a concrete sketch follows below):
- base backup at T0
- WAL archive from T0 onward
- restore base + replay WAL to T1
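A concrete sketch of the same workflow with standard tooling; the paths and timestamp are illustrative assumptions:
# Base backup at T0.
pg_basebackup -D /backups/base -X stream -P
# Continuous archiving is enabled on the primary in postgresql.conf, for example:
#   archive_mode = on
#   archive_command = 'cp %p /backups/wal/%f'
# To recover to T1 (PostgreSQL 12+), restore the base backup, then set:
#   restore_command = 'cp /backups/wal/%f %p'
#   recovery_target_time = '2024-06-01 12:00:00'
# and create a recovery.signal file before starting the server.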
Common misconceptions
- “Logical replication copies schemas” -> DDL is not replicated automatically.
- “Backups are enough without testing” -> Untested backups are guesses.
Check-your-understanding questions
- Why is WAL archiving required for PITR?
- Why does logical replication need replication identity?
Check-your-understanding answers
- WAL provides the changes between the base backup and target time.
- It needs a stable key to identify rows for updates and deletes.
Real-world applications
- High availability failover using physical replication.
- Zero-downtime upgrades via logical replication.
Where you will apply it
- See Section 3.7 Real World Outcome and Section 5.10 Implementation Phases in this file.
- Also used in: P09-backup-pitr-drill
References
- PostgreSQL WAL reliability (postgresql.org/docs/current/wal.html)
- PostgreSQL PITR (postgresql.org/docs/current/continuous-archiving.html)
- PostgreSQL Logical Replication (postgresql.org/docs/current/logical-replication.html)
Key insight Reliability depends on WAL; replication and recovery are structured workflows, not ad-hoc fixes.
Summary PostgreSQL backup and replication are WAL-driven and must be tested to be trusted.
Homework/exercises to practice the concept
- Sketch a PITR timeline for a daily backup with hourly WAL archiving.
- Explain the difference between physical and logical replication.
Solutions to the homework/exercises
- Restore base backup, then replay WAL to the desired hour.
- Physical replicates bytes; logical replicates row-level changes.
Relational Model, Schemas, and Constraints
Fundamentals PostgreSQL is a relational database, which means data is stored in tables with rows and columns and related through keys. A schema is a namespace for organizing tables, views, and functions. Constraints enforce data integrity: primary keys ensure uniqueness, foreign keys enforce relationships, unique constraints prevent duplicates, and check constraints enforce domain rules. This model is about correctness first. If you internalize the relational model, you will design schemas that enforce business rules at the database level rather than only in application code. This is critical for preventing inconsistent data, especially in multi-user systems. It also supports auditing and reliable reporting.
Deep Dive The relational model treats data as sets of tuples, where each tuple conforms to a defined structure. In PostgreSQL, tables implement this model. A schema groups related objects and helps with organization and permission boundaries. The database enforces constraints at write time, which prevents invalid states from ever entering the database. For example, a foreign key constraint means you cannot insert a row that references a missing parent. This ensures referential integrity across the system.
Designing schemas is about expressing invariants explicitly. Primary keys identify rows; surrogate keys provide stable identifiers; natural keys reflect domain identity. You must decide which constraints belong in the database and which belong in application logic. PostgreSQL is strong at enforcing constraints, so the default posture should be to enforce as much as possible at the database level. Check constraints provide a powerful way to enforce domain logic, such as non-negative quantities or allowed status transitions.
Schema design also affects performance and maintainability. Normalization reduces redundancy and avoids update anomalies, but too much normalization can lead to complex joins. Denormalization can improve read performance but risks inconsistency. PostgreSQL supports both approaches, but you must choose based on workload patterns. It is common to keep core entities normalized while allowing denormalized read models for specific reporting use cases.
Constraints also interact with concurrency. For example, uniqueness constraints require index checks that can create contention under high write load. Foreign keys require lookups in referenced tables, which can be expensive if indexes are missing. Therefore, relational design cannot be separated from indexing strategy.
Understanding schemas and constraints prepares you to interpret error messages correctly, design safe migrations, and reason about how data integrity is enforced. It also lets you reason about why an insert fails or why a cascade update behaves a certain way. The relational model is not just theory; it is how PostgreSQL ensures data correctness in production systems.
Schema evolution is part of real systems. Adding a column is easy, but changing constraints or data types often requires careful backfills and validation steps. PostgreSQL supports adding constraints as NOT VALID and later validating them, which allows you to introduce integrity rules without blocking traffic. This is a key operational pattern: you evolve the schema while maintaining uptime. If you ignore this, you end up with migrations that lock tables or fail under load.
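A minimal sketch of that pattern, assuming a hypothetical orders table:
-- Add the constraint without scanning existing rows (only a brief lock is needed).
ALTER TABLE orders ADD CONSTRAINT orders_total_nonnegative CHECK (total >= 0) NOT VALID;
-- Later, validate existing rows; this scans the table but does not block writes.
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_nonnegative;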
Naming and dependency management also matter. Schema-qualified names prevent accidental collisions, and consistent naming helps with clarity and tooling. Dependencies between tables and views are tracked by PostgreSQL, which means dropping objects can have cascading effects. A good schema design includes explicit dependency awareness, and a migration plan that preserves compatibility. This is not just about SQL correctness; it is about long-term maintenance and developer sanity.
Another subtlety is search_path. PostgreSQL resolves unqualified names based on search_path, which can lead to confusing behavior if multiple schemas contain similarly named objects. Good practice is to either set search_path explicitly per application role or to always use schema-qualified names in critical paths. This reduces ambiguity and prevents accidental writes to the wrong schema.
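A short sketch of making name resolution explicit; the role and schema names are hypothetical:
-- Pin the search path for an application role.
ALTER ROLE app_user SET search_path = app, public;
-- Or avoid ambiguity entirely by schema-qualifying names in critical paths.
SELECT count(*) FROM app.orders;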
Definitions and key terms
- Schema: namespace for database objects.
- Primary key: unique identifier for rows.
- Foreign key: relationship constraint between tables.
- Check constraint: rule enforced on column values.
Mental model diagram
Schema
|
+-- table A (primary key)
|
+-- table B (foreign key -> A)
|
+-- constraints (unique, check)
How it works (step-by-step, include invariants and failure modes)
- Define tables and keys (invariant: primary key uniqueness).
- Insert rows (failure: constraint violation raises error).
- Update rows (failure: check constraint fails).
- Delete rows (failure: foreign key references prevent deletion unless cascaded).
Minimal concrete example (pseudocode, config, protocol transcript, or CLI output)
SQL-like pseudocode (a concrete version follows below):
TABLE <orders>
PRIMARY KEY (<order_id>)
FOREIGN KEY (<customer_id>) REFERENCES <customers>
CHECK (<total> >= 0)
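A hedged concrete version of the same shape in SQL; table and column names are illustrative:
CREATE TABLE customers (
  customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email       text NOT NULL UNIQUE
);
CREATE TABLE orders (
  order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL REFERENCES customers (customer_id),
  total       numeric NOT NULL CHECK (total >= 0)
);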
Common misconceptions
- “Constraints slow everything” -> Constraints prevent corruption and often save more time than they cost.
- “Foreign keys are optional” -> Without them, data integrity is fragile.
Check-your-understanding questions
- Why do foreign keys require indexes to perform well?
- What anomaly does normalization help prevent?
Check-your-understanding answers
- Foreign key checks must look up parent rows efficiently.
- Update anomalies and inconsistent duplicates.
Real-world applications
- Designing ecommerce or billing schemas with strict integrity.
- Building analytics pipelines that rely on consistent keys.
Where you will apply it
- See Section 3.7 Real World Outcome and Section 5.10 Implementation Phases in this file.
- Also used in: P02-schema-design-real-domain, P05-transaction-isolation-lab, P11-rls-multi-tenant
References
- PostgreSQL Chapter 8 Data Types (postgresql.org/docs/current/datatype.html)
- SQL standard overview (ISO/IEC 9075)
Key insight Constraints are not optional; they are your database-level safety net.
Summary The relational model gives you structure and invariants; PostgreSQL enforces them with schemas and constraints.
Homework/exercises to practice the concept
- Sketch a schema for a library system with books, authors, and loans.
- Identify at least three constraints that prevent invalid data.
Solutions to the homework/exercises
- Tables: books, authors, book_authors, loans, members.
- Examples: unique ISBN, loan must reference existing member, due_date after checkout_date.
3. Project Specification
3.1 What You Will Build
Stream changes from a publisher to a subscriber and validate lag. You will produce a repeatable transcript and a short evidence report for the workflow.
3.2 Functional Requirements
- Provide a deterministic scenario with a known outcome.
- Capture at least two key pieces of evidence (metrics, plans, or outputs).
- Produce a short report that another engineer could verify.
3.3 Non-Functional Requirements
- Performance: Queries should complete within a few seconds on a small dataset.
- Reliability: The workflow should be repeatable across multiple runs.
- Usability: Outputs should be readable and structured.
3.4 Example Usage / Output
$ psql -d <publisher>
<publisher>=# SELECT count(*) FROM <table>;
<count>
$ psql -d <subscriber>
<subscriber>=# SELECT count(*) FROM <table>;
<count>
3.5 Data Formats / Schemas / Protocols
Use a simple evidence log format (a filled-in example follows below):
SESSION_START
command: <psql or SQL-like command>
result: <summary line>
KEY_EVIDENCE: name=value, name=value
SESSION_END
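A hypothetical filled-in entry, to show the intended shape (all values are illustrative):
SESSION_START
command: SELECT count(*) FROM events;  (run on the subscriber)
result: count matches the publisher
KEY_EVIDENCE: publisher_count=1000, subscriber_count=1000, lag_bytes=0
SESSION_END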
3.6 Edge Cases
- Missing permissions for a required command.
- Stale statistics leading to unexpected results.
- Misconfiguration that hides expected data.
3.7 Real World Outcome
This section is the gold standard for verification.
3.7.1 How to Run (Copy/Paste)
- Use a local PostgreSQL instance or container.
- Load a minimal dataset with predictable values.
- Execute the command sequence exactly as described (one possible sequence is sketched below).
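One possible command sequence, shown as a sketch rather than a definitive setup; database, table, role, and connection details are assumptions:
-- Publisher database (requires wal_level = logical and a role with REPLICATION):
CREATE TABLE events (id bigint PRIMARY KEY, payload text);
CREATE PUBLICATION events_pub FOR TABLE events;
-- Subscriber database (the table definition must already exist there):
CREATE TABLE events (id bigint PRIMARY KEY, payload text);
CREATE SUBSCRIPTION events_sub
  CONNECTION 'host=localhost dbname=publisher user=repl_user'
  PUBLICATION events_pub;
-- Back on the publisher, generate deterministic changes:
INSERT INTO events SELECT g, 'row ' || g FROM generate_series(1, 1000) AS g;
-- Then compare row counts on both sides, as in the transcript below.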
3.7.2 Golden Path Demo (Deterministic)
Publisher changes appear on subscriber within defined lag. Exit code: 0
3.7.3 If CLI: exact terminal transcript
$ psql -d <publisher>
<publisher>=# SELECT count(*) FROM <table>;
<count>
$ psql -d <subscriber>
<subscriber>=# SELECT count(*) FROM <table>;
<count>
3.7.4 Failure Demo (Deterministic)
Replication fails due to missing primary key. Exit code: 1
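A sketch of how this failure can be reproduced; the table and publication names are hypothetical, and the error wording is approximate:
-- Publisher: a table with no primary key and default replica identity.
CREATE TABLE events_nopk (id bigint, payload text);
CREATE PUBLICATION nopk_pub FOR TABLE events_nopk;
INSERT INTO events_nopk VALUES (1, 'ok');               -- inserts still replicate
UPDATE events_nopk SET payload = 'fails' WHERE id = 1;
-- ERROR: cannot update table "events_nopk" because it does not have a replica
-- identity and publishes updates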
4. Solution Architecture
4.1 High-Level Design
+--------------------+ +----------------------+ +-------------------+
| Dataset + Schema | --> | PostgreSQL Workflow | --> | Evidence Report |
+--------------------+ +----------------------+ +-------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Dataset | Provide deterministic inputs | Keep small and fixed |
| Workflow | Execute queries or commands | Use explicit steps |
| Evidence Report | Capture outcomes | Use consistent format |
4.3 Data Structures (No Full Code)
EvidenceRecord:
- step
- command
- output_summary
- key_metrics
Report:
- summary
- evidence
- conclusions
4.4 Algorithm Overview
Key Algorithm: Evidence Capture Loop
- Prepare dataset and baseline metrics.
- Execute the workflow step by step.
- Capture outputs and compare with expected outcomes.
- Write a short report with conclusions.
Complexity Analysis:
- Time: O(Q), where Q is the number of queries.
- Space: O(R), where R is the size of the report.
5. Implementation Guide
5.1 Development Environment Setup
psql --version
5.2 Project Structure
project-root/
|-- notes/
| `-- report.md
|-- transcripts/
| `-- session.txt
`-- README.md
5.3 The Core Question You’re Answering
“How do I stream database changes safely between systems?”
Write down the evidence you will need before you start.
5.4 Concepts You Must Understand First
Stop and research these before debugging:
- Core PostgreSQL mechanics
- What does PostgreSQL guarantee at this layer?
- Book Reference: “PostgreSQL: The Definitive Guide” - Ch. 3
- Planning and observation
- Which signals confirm success?
- Book Reference: “SQL Performance Explained” - Ch. 1
- Data integrity rules
- Which invariants must always hold?
- Book Reference: “Database System Concepts” - Ch. 2
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Evidence capture
- Which outputs prove correctness?
- What is the smallest set of metrics that matters?
- Repeatability
- How will you ensure the same results each run?
- Which inputs must be fixed?
5.6 Thinking Exercise
Sketch a timeline of actions and identify the key evidence at each step.
5.7 The Interview Questions They’ll Ask
- “How do you verify correctness in PostgreSQL workflows?”
- “What evidence shows a query plan change?”
- “How do you ensure isolation and consistency?”
- “Why do you need VACUUM?”
- “How do you validate a backup?”
5.8 Hints in Layers
Hint 1: Start with a small dataset. Keep the dataset minimal and predictable.
Hint 2: Record every step. Use a transcript so you can compare runs.
Hint 3: Compare against expected outputs. Use a golden path to validate correctness.
Hint 4: Use system catalogs for verification. Confirm results with pg_stat views and catalog queries (a few examples are sketched below).
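A few hedged verification queries against standard catalog and statistics views; which ones apply depends on your setup:
-- Publisher: what is published, and does the subscription have an active slot?
SELECT pubname FROM pg_publication;
SELECT slot_name, active FROM pg_replication_slots;
-- Subscriber: is the subscription present and enabled?
SELECT subname, subenabled FROM pg_subscription;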
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| PostgreSQL fundamentals | “PostgreSQL: The Definitive Guide” | Ch. 3 |
| SQL performance | “SQL Performance Explained” | Ch. 1-4 |
| Data modeling | “Database System Concepts” | Ch. 2 |
5.10 Implementation Phases
Phase 1: Foundation (2-4 hours)
Goals:
- Set up the environment.
- Define the minimal dataset.
Tasks:
- Install and connect to PostgreSQL.
- Load a deterministic dataset.
Checkpoint: You can run a basic query and see expected output.
Phase 2: Core Workflow (4-8 hours)
Goals:
- Execute the core steps.
- Capture evidence.
Tasks:
- Run the main workflow commands.
- Capture outputs and metrics.
Checkpoint: Evidence matches the golden path.
Phase 3: Polish and Edge Cases (2-6 hours)
Goals:
- Validate edge cases.
- Document failures and fixes.
Tasks:
- Trigger a failure scenario.
- Document how you detected and fixed it.
Checkpoint: Failure case and fix are documented.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Dataset size | Small or large | Small | Easier to verify outputs |
| Evidence format | Transcript or report | Both | Transcript for detail, report for summary |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Verify single steps | One query or command |
| Integration Tests | End-to-end workflow | Full golden path |
| Edge Case Tests | Validate failures | Missing permissions |
6.2 Critical Test Cases
- Primary path: expected output is produced.
- Failure path: error is detected and documented.
- Repeatability: results match across two runs.
6.3 Test Data
Input: deterministic dataset with fixed values
Expected: same outputs each run
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Wrong database | Output references unexpected objects | Verify connection and search_path |
| Missing privileges | Permission denied errors | Grant least-privilege access for the task |
| Stale stats | Planner chooses unexpected plan | Run ANALYZE and re-check |
7.2 Debugging Strategies
- Capture before changing: record outputs first.
- Verify connection context: ensure correct database and schema.
7.3 Performance Traps
If the dataset grows, some commands may become noticeably slower. Keep tests small and repeatable.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one extra table or metric and document it.
- Create a short checklist for future runs.
8.2 Intermediate Extensions
- Add a second scenario and compare results.
- Automate part of the workflow with a script.
8.3 Advanced Extensions
- Integrate the workflow into a CI or scheduled job.
- Create a dashboard or long-term trend report.
9. Real-World Connections
9.1 Industry Applications
- Operational runbooks for production PostgreSQL systems.
- Data platform reliability and incident response.
9.2 Related Open Source Projects
- PostgreSQL core project - open source database engine.
- pg_stat_statements - query statistics extension.
9.3 Interview Relevance
- Transaction semantics and MVCC.
- Index selection and query planning.
10. Resources
10.1 Essential Reading
- PostgreSQL documentation (current) - reference for commands and behavior.
- SQL Performance Explained - reasoning about indexes and plans.
- Designing Data-Intensive Applications - reliability patterns.
10.2 Video Resources
- PostgreSQL conference talks on operations and performance
- Database reliability talks from systems conferences
10.3 Tools & Documentation
- psql - interactive CLI for PostgreSQL.
- pg_stat_* views - built-in monitoring tables.
- pg_dump and pg_basebackup - backup tooling.
10.4 Related Projects in This Series
- Previous Project: P09-backup-pitr-drill.md
- Next Project: P11-rls-multi-tenant.md
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the key concept without notes.
- I can justify the evidence I collected.
- I understand the tradeoffs in this workflow.
11.2 Implementation
- All functional requirements are met.
- The golden path transcript matches expected output.
- Failure modes are documented with evidence.
11.3 Growth
- I documented lessons learned.
- I can explain this project in a job interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- A working transcript that matches the golden path.
- A short report with evidence and conclusions.
- A failure case documented with explanation.
Full Completion:
- All minimum criteria plus:
- Comparison across at least two runs.
- Notes on tradeoffs and possible improvements.
Excellence (Going Above & Beyond):
- Automated capture of transcripts.
- A reusable checklist for future runs.