Project 2: Schema Design for a Real Domain

Design a normalized schema with constraints for a real business domain.

Quick Reference

Attribute                          Value
Difficulty                         Level 2
Time Estimate                      8-12 hours
Main Programming Language          SQL
Alternative Programming Languages  Any language with migration tooling
Coolness Level                     See REFERENCE.md
Business Potential                 See REFERENCE.md
Prerequisites                      PostgreSQL installed, psql available
Key Topics                         Relational Model, Data Types

1. Learning Objectives

  1. Explain the core ideas behind Schema Design for a Real Domain and why they matter.
  2. Demonstrate the workflow with a repeatable PostgreSQL session transcript.
  3. Identify and explain key metrics or evidence from the run.
  4. Document findings in a short operational report.

2. All Theory Needed (Per-Concept Breakdown)

Relational Model, Schemas, and Constraints

Fundamentals: PostgreSQL is a relational database, which means data is stored in tables with rows and columns and related through keys. A schema is a namespace for organizing tables, views, and functions. Constraints enforce data integrity: primary keys ensure uniqueness, foreign keys enforce relationships, unique constraints prevent duplicates, and check constraints enforce domain rules. This model is about correctness first. If you internalize the relational model, you will design schemas that enforce business rules at the database level rather than only in application code. This is critical for preventing inconsistent data, especially in multi-user systems. It also supports auditing and reliable reporting.

Deep Dive: The relational model treats data as sets of tuples, where each tuple conforms to a defined structure. In PostgreSQL, tables implement this model. A schema groups related objects and helps with organization and permission boundaries. The database enforces constraints at write time, which prevents invalid states from ever entering the database. For example, a foreign key constraint means you cannot insert a row that references a missing parent. This ensures referential integrity across the system.

Designing schemas is about expressing invariants explicitly. Primary keys identify rows; surrogate keys provide stable identifiers; natural keys reflect domain identity. You must decide which constraints belong in the database and which belong in application logic. PostgreSQL is strong at enforcing constraints, so the default posture should be to enforce as much as possible at the database level. Check constraints provide a powerful way to enforce domain logic, such as non-negative quantities or allowed status transitions.

Schema design also affects performance and maintainability. Normalization reduces redundancy and avoids update anomalies, but too much normalization can lead to complex joins. Denormalization can improve read performance but risks inconsistency. PostgreSQL supports both approaches, but you must choose based on workload patterns. It is common to keep core entities normalized while allowing denormalized read models for specific reporting use cases.

Constraints also interact with concurrency. For example, uniqueness constraints require index checks that can create contention under high write load. Foreign keys require lookups in referenced tables, which can be expensive if indexes are missing. Therefore, relational design cannot be separated from indexing strategy.
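For example, PostgreSQL does not automatically index the referencing column of a foreign key, so a minimal companion step (assuming the illustrative orders/customers tables used later in this section) is to create that index yourself:

CREATE INDEX idx_orders_customer_id ON orders (customer_id);

This keeps foreign key checks, cascading updates, and joins on the relationship cheap.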

Understanding schemas and constraints prepares you to interpret error messages correctly, design safe migrations, and reason about how data integrity is enforced. It also lets you reason about why an insert fails or why a cascade update behaves a certain way. The relational model is not just theory; it is how PostgreSQL ensures data correctness in production systems.

Schema evolution is part of real systems. Adding a column is easy, but changing constraints or data types often requires careful backfills and validation steps. PostgreSQL supports adding constraints as NOT VALID and later validating them, which allows you to introduce integrity rules without blocking traffic. This is a key operational pattern: you evolve the schema while maintaining uptime. If you ignore this, you end up with migrations that lock tables or fail under load.
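A minimal sketch of that pattern, assuming an existing orders table that is gaining a new rule (the constraint name is illustrative):

-- accepted immediately; existing rows are not checked yet
ALTER TABLE orders
  ADD CONSTRAINT orders_total_non_negative CHECK (total >= 0) NOT VALID;

-- later, after backfilling bad rows; scans existing data without blocking writes
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_non_negative;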

Naming and dependency management also matter. Schema-qualified names prevent accidental collisions, and consistent naming helps with clarity and tooling. Dependencies between tables and views are tracked by PostgreSQL, which means dropping objects can have cascading effects. A good schema design includes explicit dependency awareness, and a migration plan that preserves compatibility. This is not just about SQL correctness; it is about long-term maintenance and developer sanity.

Another subtlety is search_path. PostgreSQL resolves unqualified names based on search_path, which can lead to confusing behavior if multiple schemas contain similarly named objects. Good practice is to either set search_path explicitly per application role or to always use schema-qualified names in critical paths. This reduces ambiguity and prevents accidental writes to the wrong schema.
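Both options can be made explicit; a sketch, assuming an application role named app_rw and a schema named app:

-- pin the resolution order for this role
ALTER ROLE app_rw SET search_path = app, public;

-- or schema-qualify names in critical paths
INSERT INTO app.orders (customer_id, total) VALUES (1, 10.00);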

Definitions and key terms

  • Schema: namespace for database objects.
  • Primary key: unique identifier for rows.
  • Foreign key: relationship constraint between tables.
  • Check constraint: rule enforced on column values.

Mental model diagram

Schema
  |
  +-- table A (primary key)
  |
  +-- table B (foreign key -> A)
  |
  +-- constraints (unique, check)

How it works (step-by-step, include invariants and failure modes)

  1. Define tables and keys (invariant: primary key uniqueness).
  2. Insert rows (failure: constraint violation raises error).
  3. Update rows (failure: check constraint fails).
  4. Delete rows (failure: foreign key references prevent deletion unless cascaded).

Minimal concrete example (pseudocode, config, protocol transcript, or CLI output)

Minimal SQL example:

CREATE TABLE customers (customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY);

CREATE TABLE orders (
  order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL REFERENCES customers (customer_id),
  total       numeric(12,2) NOT NULL CHECK (total >= 0)
);
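Against this schema, the failure modes listed above can be reproduced directly (error messages abridged; exact wording varies by version):

INSERT INTO orders (customer_id, total) VALUES (999999, 10.00);
-- ERROR: insert or update on table "orders" violates foreign key constraint

INSERT INTO orders (customer_id, total) VALUES (1, -5.00);
-- ERROR: new row for relation "orders" violates check constraint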

Common misconceptions

  • “Constraints slow everything” -> Constraints prevent corruption and often save more time than they cost.
  • “Foreign keys are optional” -> Without them, data integrity is fragile.

Check-your-understanding questions

  1. Why do foreign keys require indexes to perform well?
  2. What anomaly does normalization help prevent?

Check-your-understanding answers

  1. Foreign key checks must look up parent rows efficiently.
  2. Update anomalies and inconsistent duplicates.

Real-world applications

  • Designing ecommerce or billing schemas with strict integrity.
  • Building analytics pipelines that rely on consistent keys.

Where you will apply it

You will apply this in Section 3, where the project's business rules become primary keys, foreign keys, and check constraints in a normalized schema.

References

  • PostgreSQL Chapter 5 Data Definition (postgresql.org/docs/current/ddl.html)
  • SQL standard overview (ISO/IEC 9075)

Key insight: Constraints are not optional; they are your database-level safety net.

Summary: The relational model gives you structure and invariants; PostgreSQL enforces them with schemas and constraints.

Homework/exercises to practice the concept

  1. Sketch a schema for a library system with books, authors, and loans.
  2. Identify at least three constraints that prevent invalid data.

Solutions to the homework/exercises

  1. Tables: books, authors, book_authors, loans, members.
  2. Examples: unique ISBN, loan must reference existing member, due_date after checkout_date (see the sketch below).
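A minimal sketch of those constraints in SQL (table and column names are one possible choice):

CREATE TABLE members (member_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY);

CREATE TABLE books (
  book_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  isbn    text NOT NULL UNIQUE
);

CREATE TABLE loans (
  loan_id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  book_id       bigint NOT NULL REFERENCES books (book_id),
  member_id     bigint NOT NULL REFERENCES members (member_id),
  checkout_date date   NOT NULL,
  due_date      date   NOT NULL,
  CHECK (due_date > checkout_date)
);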

Data Types and Domain Modeling

Fundamentals: PostgreSQL offers a rich set of built-in data types including numeric, text, boolean, date/time, arrays, JSON, ranges, and enums. Choosing the right type is a form of data modeling. It affects storage size, indexing behavior, and query semantics. Types like JSONB allow semi-structured data, while range types model intervals cleanly. Enums enforce a fixed set of states. Arrays allow multi-valued columns but trade off relational normalization. Understanding the available types and their tradeoffs lets you model domains more accurately and avoid misusing generic text fields. This discipline reduces bugs and query complexity later. It also improves validation clarity.

Deep Dive: Data types define the shape of your data and the operations that are valid on it. PostgreSQL includes standard SQL types and many PostgreSQL-specific ones. Numeric types include integer, big integer, and exact numeric types for financial values. Text types include fixed and variable length strings. Date/time types support timestamps with and without time zones. PostgreSQL also supports network types, geometric types, and binary types.

More advanced types make PostgreSQL multi-model. JSON and JSONB allow you to store nested data without fully normalizing it. JSONB is generally preferred because it stores a binary representation that supports indexing and faster operations. Arrays allow columns to contain multiple values of any base type. This can simplify some models but can complicate indexing and querying. Range types represent intervals, such as time ranges or numeric ranges, and support overlap queries and constraints. Enums represent ordered sets of values and provide safety when modeling states.

Choosing a type is a design decision. If you use text for everything, you lose the database’s ability to enforce meaning. If you use JSONB everywhere, you may lose relational integrity and make queries harder to optimize. The correct choice depends on access patterns and invariants. For example, storing a list of tags as an array can be efficient for simple membership queries, but a separate table may be better if you need rich relationships.
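A sketch of the two options (names are illustrative):

-- multi-valued column: fine for simple membership queries
CREATE TABLE articles (
  article_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  tags       text[] NOT NULL DEFAULT '{}'
);

-- separate table: better when tags need their own identity or relationships
CREATE TABLE article_tags (
  article_id bigint NOT NULL REFERENCES articles (article_id),
  tag        text   NOT NULL,
  PRIMARY KEY (article_id, tag)
);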

Types also affect indexing. JSONB can be indexed with GIN for containment queries. Ranges can be indexed with GiST for overlap queries. Arrays can be indexed with GIN. The type you choose determines which index types are useful. This is why modeling and indexing cannot be separated.
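A sketch of matching index classes to types (table and column names are assumptions):

CREATE INDEX idx_events_metadata ON events USING gin (metadata);                        -- jsonb containment
CREATE INDEX idx_articles_tags ON articles USING gin (tags);                            -- array membership
CREATE INDEX idx_reservations_window ON reservations USING gist (reservation_window);   -- range overlap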

Finally, PostgreSQL allows you to define custom types and domains. Domains are especially powerful: they let you create a constrained type that can be reused across columns. This makes invariants reusable, such as “positive_money” or “email_address”. Domain types also centralize validation logic, which reduces duplication and errors.
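A minimal sketch of reusable domains along these lines (the email check is deliberately simplistic):

CREATE DOMAIN positive_money AS numeric(12,2) CHECK (VALUE >= 0);
CREATE DOMAIN email_address  AS text CHECK (VALUE ~* '^[^@\s]+@[^@\s]+$');

CREATE TABLE invoices (
  invoice_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  amount        positive_money NOT NULL,
  billing_email email_address  NOT NULL
);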

Type choice also impacts application integration. ORMs and drivers map PostgreSQL types into application types, and those mappings are not always perfect. For example, JSONB maps naturally to document types, but numeric precision can be lost if the driver uses floating point. Date/time types require consistent timezone handling. If you treat these as afterthoughts, you introduce subtle bugs. A good data model includes explicit decisions about how each PostgreSQL type maps to application logic and serialization formats.

Another factor is storage layout and compression. Large JSONB values can be toasted, meaning they are stored out-of-line and compressed. This affects performance when querying or updating those columns. Arrays also have internal overhead, which can be significant for very large arrays. Range and enum types are compact and efficient, but they require clear domain definitions. By understanding these storage implications, you can choose types that balance readability, correctness, and performance in production workloads.

Collation and locale also influence type behavior, especially for text. Sorting and comparison depend on collation rules, which can vary by locale and affect index usage. For text search, PostgreSQL provides dictionaries and configurations that determine how tokens are generated. These decisions may look like application details, but they directly affect database correctness and query results.

Definitions and key terms

  • JSONB: binary JSON type optimized for indexing and operations.
  • Array: multi-valued column type.
  • Range: interval type with operators for overlap and containment.
  • Enum: ordered set of allowed values.

Mental model diagram

Domain -> Type -> Column -> Index
   |        |        |        |
   v        v        v        v
Rules   Storage   Data    Query plan

How it works (step-by-step, include invariants and failure modes)

  1. Choose a type that matches domain semantics (failure: using text for numeric values).
  2. Define constraints or domains (failure: inconsistent validation across columns).
  3. Index the type with the right index class (failure: wrong index type -> slow queries).

Minimal concrete example (pseudocode, config, protocol transcript, or CLI output)

Minimal SQL example:
CREATE TYPE reservation_status AS ENUM ('pending', 'paid', 'failed');
CREATE TABLE reservations (
  status             reservation_status NOT NULL DEFAULT 'pending',
  reservation_window tstzrange NOT NULL,
  metadata           jsonb
);

Common misconceptions

  • “JSONB replaces relational modeling” -> It complements it; it does not replace constraints and keys.
  • “Arrays are always bad” -> They are useful in narrow cases with stable size and access patterns.

Check-your-understanding questions

  1. Why is JSONB usually preferred over JSON in PostgreSQL?
  2. When would a range type be better than two timestamp columns?

Check-your-understanding answers

  1. JSONB is indexed and has efficient operations; JSON preserves input formatting but is slower.
  2. Range types allow overlap queries and exclusion constraints directly, avoiding fragile logic (see the sketch below).
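A minimal sketch of that kind of constraint, assuming reservations keyed by room (the btree_gist extension supplies the equality operator class used for room_id):

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE room_reservations (
  room_id            bigint    NOT NULL,
  reservation_window tstzrange NOT NULL,
  EXCLUDE USING gist (room_id WITH =, reservation_window WITH &&)
);

The exclusion constraint rejects any insert whose window overlaps an existing reservation for the same room.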

Real-world applications

  • Storing event metadata in JSONB while keeping core data normalized.
  • Scheduling systems using range types to prevent overlaps.

Where you will apply it

You will apply this in Section 3, where you choose concrete PostgreSQL types for each column of the project schema and justify each choice against access patterns and invariants.

References

  • PostgreSQL Chapter 8 Data Types (postgresql.org/docs/current/datatype.html)
  • PostgreSQL JSON Types (postgresql.org/docs/current/datatype-json.html)
  • PostgreSQL Arrays (postgresql.org/docs/current/arrays.html)
  • PostgreSQL Range Types (postgresql.org/docs/current/rangetypes.html)
  • PostgreSQL Enumerated Types (postgresql.org/docs/current/datatype-enum.html)

Key insight: The right data type is a modeling decision that unlocks correctness and performance.

Summary: PostgreSQL data types go far beyond numbers and strings; they encode domain meaning directly.

Homework/exercises to practice the concept

  1. Identify three fields in a system you know that should not be text.
  2. Propose a type strategy for a reservation system.

Solutions to the homework/exercises

  1. Examples: money as numeric, timestamps as timestamptz, status as enum.
  2. Use range types for reservation windows, jsonb for optional metadata.

3. Project Specification

3.1 What You Will Build

Design a normalized schema with constraints for a real business domain. You will produce a repeatable transcript and a short evidence report for the workflow.

3.2 Functional Requirements

  1. Provide a deterministic scenario with a known outcome.
  2. Capture at least two key pieces of evidence (metrics, plans, or outputs).
  3. Produce a short report that another engineer could verify.

3.3 Non-Functional Requirements

  • Performance: Queries should complete within a few seconds on a small dataset.
  • Reliability: The workflow should be repeatable across multiple runs.
  • Usability: Outputs should be readable and structured.

3.4 Example Usage / Output

$ psql -d <database>
<database>=# \d <table>
              Table "public.<table>"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 id     | uuid |           | not null |

3.5 Data Formats / Schemas / Protocols

Use a simple evidence log format:

SESSION_START
command: <psql or SQL-like command>
result: <summary line>
KEY_EVIDENCE: name=value, name=value
SESSION_END

3.6 Edge Cases

  • Missing permissions for a required command.
  • Stale statistics leading to unexpected results.
  • Misconfiguration that hides expected data.

3.7 Real World Outcome

This section is the gold standard for verification.

3.7.1 How to Run (Copy/Paste)

  • Use a local PostgreSQL instance or container.
  • Load a minimal dataset with predictable values.
  • Execute the command sequence exactly as described.

3.7.2 Golden Path Demo (Deterministic)

A schema with enforced primary keys, foreign keys, and check constraints. Exit code: 0
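One way to make the exit code deterministic is to apply the schema from a file with error stopping enabled (schema.sql is a placeholder name):

$ psql -d <database> -v ON_ERROR_STOP=1 -f schema.sql
$ echo $?
0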

3.7.3 Terminal Transcript (CLI)

$ psql -d <database>
<database>=# \d <table>
              Table "public.<table>"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 id     | uuid |           | not null |

3.7.4 Failure Demo (Deterministic)

A constraint violation example that demonstrates rejection of invalid data. Exit code: 1
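A minimal sketch, assuming the orders table from Section 2 (error wording varies by version):

$ psql -d <database> -c "INSERT INTO orders (customer_id, total) VALUES (1, -5.00);"
ERROR:  new row for relation "orders" violates check constraint "orders_total_check"
$ echo $?
1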


4. Solution Architecture

4.1 High-Level Design

+--------------------+     +----------------------+     +-------------------+
| Dataset + Schema   | --> | PostgreSQL Workflow  | --> | Evidence Report   |
+--------------------+     +----------------------+     +-------------------+

4.2 Key Components

Component        Responsibility                Key Decisions
Dataset          Provide deterministic inputs  Keep small and fixed
Workflow         Execute queries or commands   Use explicit steps
Evidence Report  Capture outcomes              Use consistent format

4.3 Data Structures (No Full Code)

EvidenceRecord:
  - step
  - command
  - output_summary
  - key_metrics

Report:
  - summary
  - evidence
  - conclusions

4.4 Algorithm Overview

Key Algorithm: Evidence Capture Loop

  1. Prepare dataset and baseline metrics.
  2. Execute the workflow step by step.
  3. Capture outputs and compare with expected outcomes.
  4. Write a short report with conclusions.

Complexity Analysis:

  • Time: O(Q), where Q is the number of queries.
  • Space: O(R), where R is the size of the report.

5. Implementation Guide

5.1 Development Environment Setup

psql --version

5.2 Project Structure

project-root/
|-- notes/
|   `-- report.md
|-- transcripts/
|   `-- session.txt
`-- README.md

5.3 The Core Question You’re Answering

“Can my database prevent invalid business states on its own?”

Write down what evidence you will need before you start.

5.4 Concepts You Must Understand First

Stop and research these before debugging:

  1. Core PostgreSQL mechanics
    • What does PostgreSQL guarantee at this layer?
    • Book Reference: “PostgreSQL: The Definitive Guide” - Ch. 3
  2. Planning and observation
    • Which signals confirm success?
    • Book Reference: “SQL Performance Explained” - Ch. 1
  3. Data integrity rules
    • Which invariants must always hold?
    • Book Reference: “Database System Concepts” - Ch. 2

5.5 Questions to Guide Your Design

Before implementing, think through these:

  1. Evidence capture
    • Which outputs prove correctness?
    • What is the smallest set of metrics that matters?
  2. Repeatability
    • How will you ensure the same results each run?
    • Which inputs must be fixed?

5.6 Thinking Exercise

Sketch a timeline of actions and identify the key evidence at each step.

5.7 The Interview Questions They’ll Ask

  1. “How do you verify correctness in PostgreSQL workflows?”
  2. “What evidence shows a query plan change?”
  3. “How do you ensure isolation and consistency?”
  4. “Why do you need VACUUM?”
  5. “How do you validate a backup?”

5.8 Hints in Layers

Hint 1: Start with a small dataset. Keep the dataset minimal and predictable.

Hint 2: Record every step. Use a transcript so you can compare runs.

Hint 3: Compare against expected outputs. Use a golden path to validate correctness.

Hint 4: Use system catalogs for verification. Confirm results with pg_stat and catalog views.
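For schema work, pg_constraint shows exactly what the database is enforcing; a sketch (the table name is a placeholder):

SELECT conname, contype, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'public.orders'::regclass
ORDER BY conname;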

5.9 Books That Will Help

Topic                    Book                                 Chapter
PostgreSQL fundamentals  "PostgreSQL: The Definitive Guide"   Ch. 3
SQL performance          "SQL Performance Explained"          Ch. 1-4
Data modeling            "Database System Concepts"           Ch. 2

5.10 Implementation Phases

Phase 1: Foundation (2-4 hours)

Goals:

  • Set up the environment.
  • Define the minimal dataset.

Tasks:

  1. Install and connect to PostgreSQL.
  2. Load a deterministic dataset.

Checkpoint: You can run a basic query and see expected output.

Phase 2: Core Workflow (4-8 hours)

Goals:

  • Execute the core steps.
  • Capture evidence.

Tasks:

  1. Run the main workflow commands.
  2. Capture outputs and metrics.

Checkpoint: Evidence matches the golden path.

Phase 3: Polish and Edge Cases (2-6 hours)

Goals:

  • Validate edge cases.
  • Document failures and fixes.

Tasks:

  1. Trigger a failure scenario.
  2. Document how you detected and fixed it.

Checkpoint: Failure case and fix are documented.

5.11 Key Implementation Decisions

Decision         Options               Recommendation  Rationale
Dataset size     Small or large        Small           Easier to verify outputs
Evidence format  Transcript or report  Both            Transcript for detail, report for summary

6. Testing Strategy

6.1 Test Categories

Category           Purpose              Examples
Unit Tests         Verify single steps  One query or command
Integration Tests  End-to-end workflow  Full golden path
Edge Case Tests    Validate failures    Missing permissions

6.2 Critical Test Cases

  1. Primary path: expected output is produced.
  2. Failure path: error is detected and documented.
  3. Repeatability: results match across two runs.

6.3 Test Data

Input: deterministic dataset with fixed values
Expected: same outputs each run

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall             Symptom                               Solution
Wrong database      Output references unexpected objects  Verify connection and search_path
Missing privileges  Permission denied errors              Grant least-privilege access for the task
Stale stats         Planner chooses unexpected plan       Run ANALYZE and re-check

7.2 Debugging Strategies

  • Capture before changing: record outputs first.
  • Verify connection context: ensure correct database and schema.

7.3 Performance Traps

If dataset size grows, some commands may appear slow. Keep tests small and repeatable.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add one extra table or metric and document it.
  • Create a short checklist for future runs.

8.2 Intermediate Extensions

  • Add a second scenario and compare results.
  • Automate part of the workflow with a script.

8.3 Advanced Extensions

  • Integrate the workflow into a CI or scheduled job.
  • Create a dashboard or long-term trend report.

9. Real-World Connections

9.1 Industry Applications

  • Operational runbooks for production PostgreSQL systems.
  • Data platform reliability and incident response.
  • PostgreSQL core project - open source database engine.
  • pg_stat_statements - query statistics extension.

9.2 Interview Relevance

  • Transaction semantics and MVCC.
  • Index selection and query planning.

10. Resources

10.1 Essential Reading

  • PostgreSQL documentation (current) - reference for commands and behavior.
  • SQL Performance Explained - reasoning about indexes and plans.
  • Designing Data-Intensive Applications - reliability patterns.

10.2 Video Resources

  • PostgreSQL conference talks on operations and performance
  • Database reliability talks from systems conferences

10.3 Tools & Documentation

  • psql - interactive CLI for PostgreSQL.
  • pg_stat_* views - built-in monitoring tables.
  • pg_dump and pg_basebackup - backup tooling.
  • Previous Project: P01-environment-psql-mastery.md
  • Next Project: P03-data-types-lab.md

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the key concept without notes.
  • I can justify the evidence I collected.
  • I understand the tradeoffs in this workflow.

11.2 Implementation

  • All functional requirements are met.
  • The golden path transcript matches expected output.
  • Failure modes are documented with evidence.

11.3 Growth

  • I documented lessons learned.
  • I can explain this project in a job interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • A working transcript that matches the golden path.
  • A short report with evidence and conclusions.
  • A failure case documented with explanation.

Full Completion:

  • All minimum criteria plus:
  • Comparison across at least two runs.
  • Notes on tradeoffs and possible improvements.

Excellence (Going Above & Beyond):

  • Automated capture of transcripts.
  • A reusable checklist for future runs.