Project Expansion Automation: Real World Projects

Goal: Build a deep mental model for turning a loose project list into a deterministic, validated, searchable, and extensible collection of high-quality project mini-books. You will understand how Markdown parsing works at the AST level, how JSON Schema creates a contract between content and tooling, how concept graphs expose prerequisite structure, and how deterministic generation makes outputs reproducible across machines and time. By the end, you’ll be able to build a full CLI pipeline that ingests a messy Markdown file, normalizes it into canonical JSON, generates a directory of expanded project files, and enforces quality with linting and provenance checks. This guide is both a systems design exercise and a documentation-engineering toolkit you can reuse for any knowledge product.


Introduction: What This Guide Covers

Project Expansion Automation is the practice of converting a high-level project list into a structured, deeply detailed set of project files using parsers, schemas, dependency graphs, and deterministic generators. It solves the real-world problem of inconsistent content quality and unscalable manual writing by creating a repeatable pipeline that produces project guides you can trust and build on.

What you will build (by the end of this guide):

  • A Markdown parser that extracts project metadata into schema-validated JSON
  • A concept dependency graph that reveals prerequisite structure and cycles
  • A template-driven CLI that generates deterministic project files + index
  • A QA and research enrichment pipeline that lints, scores, and audits outputs

Scope (what’s included):

  • Markdown parsing, AST extraction, and metadata normalization
  • JSON Schema validation and canonical output
  • Graph modeling of concepts and dependencies
  • Template expansion with deterministic file output
  • Quality assurance, lint rules, and citation provenance

Out of scope (for this guide):

  • Fully automated natural language generation of project content
  • Large-scale distributed processing or cloud orchestration
  • Proprietary content management systems

The Big Picture (Mental Model)

                 ┌───────────────────────────────┐
Input Markdown → │ Parser + AST + Extractor      │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Schema Validation + Canonical │
                 │ JSON Normalization            │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Concept Graph + Dependencies  │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Template Expansion Engine     │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ QA + Lint + Provenance Audit  │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Expanded Project Files + Index│
                 └───────────────────────────────┘

Key Terms You’ll See Everywhere

  • AST (Abstract Syntax Tree): A tree representation of Markdown structure (headings, lists, code blocks).
  • Schema Validation: Automated rules that ensure JSON data has expected shape and required fields.
  • Canonicalization: A deterministic serialization of JSON so outputs are identical across runs.
  • Concept Graph: A directed graph where nodes are concepts and edges represent prerequisites.
  • Deterministic Generation: Running the pipeline multiple times yields byte-identical output.

How to Use This Guide

  1. Read the Theory Primer first: Treat it like a mini-book. It will save you hours during implementation.
  2. Build the projects in order: Each project builds a critical subsystem for the next.
  3. Validate as you go: Every project includes a “Definition of Done” checklist and golden CLI output.
  4. Keep outputs deterministic: Always run with fixed inputs and deterministic ordering.
  5. Track your assumptions: Write down parsing assumptions and schema constraints explicitly.

Prerequisites & Background Knowledge

Before starting these projects, you should have foundational understanding in these areas:

Essential Prerequisites (Must Have)

Programming Skills:

  • Comfort reading and writing CLI tools (Python, Go, or Rust)
  • Experience manipulating JSON data
  • Familiarity with file system operations and directories

Text Processing Fundamentals:

  • Tokenization vs parsing
  • Regular expressions (as a last resort, not a primary strategy)

Data Modeling Basics:

  • JSON data structures
  • Schema vs instance data
  • Error reporting and validation concepts

Recommended Reading: “Clean Code” by Robert C. Martin — Ch. 7 (Error Handling), Ch. 3 (Functions)

Helpful But Not Required

Graph Algorithms:

  • Topological sort
  • Cycle detection in directed graphs
  • Can learn during: Project 2

Templating Systems:

  • Template rendering vs code generation
  • Can learn during: Project 3

Quality Engineering:

  • Lint rules and rubrics
  • Can learn during: Project 4

Self-Assessment Questions

Before starting, ask yourself:

  1. ✅ Can I parse a file and extract structured data from it without regex hacks?
  2. ✅ Do I understand the difference between schema validation and type checking?
  3. ✅ Can I explain what makes output deterministic and why it matters?
  4. ✅ Have I built a CLI that returns non-zero exit codes on failure?
  5. ✅ Can I reason about graph cycles and dependency ordering?

If you answered “no” to questions 1-3: Spend a weekend on a small CLI parsing exercise before starting. If you answered “yes” to all 5: You’re ready.

Development Environment Setup

Required Tools:

  • A Unix-like environment (macOS or Linux)
  • Python 3.11+ or Go 1.21+ or Rust 1.75+
  • A JSON Schema validator library
  • A Markdown parser library with AST support

Recommended Tools:

  • jq for JSON inspection
  • graphviz (dot) for graph rendering
  • ripgrep for source scanning

Testing Your Setup:

$ python3 --version
Python 3.11.5

$ which jq
/usr/bin/jq

$ dot -V
dot - graphviz version 9.0.0

Time Investment

  • Project 1: 8-12 hours
  • Project 2: 6-10 hours
  • Project 3: 10-15 hours
  • Project 4: 8-12 hours
  • Total sprint: 4-6 weeks (part-time)

Important Reality Check

You are building infrastructure for content, which means:

  1. Your output quality equals your schema quality.
  2. Debugging is about edge cases, not the happy path.
  3. Determinism is a feature, not an optimization.
  4. The generator is only as good as the parser.

Big Picture / Mental Model (Diagram-First)

         Human-written Markdown (messy, inconsistent)
                          |
                          v
┌──────────────────────────────────────────────────────┐
│ 1. Parse into AST (structural truth)                 │
│ 2. Extract project blocks (boundaries)               │
│ 3. Normalize metadata into JSON                      │
│ 4. Validate JSON against schema                      │
│ 5. Build concept dependency graph                    │
│ 6. Expand templates into full project files          │
│ 7. Lint + rubric + citation checks                   │
└──────────────────────────────────────────────────────┘
                          |
                          v
              Deterministic project library

Theory Primer (Read This Before Coding)

Chapter 1: Markdown Parsing and Structural Extraction

1.1 CommonMark, GFM, and Why ASTs Matter

Definitions & Key Terms
  • CommonMark: A formal Markdown spec that defines block/inline parsing rules.
  • GFM: GitHub Flavored Markdown, a strict superset of CommonMark.
  • AST: A tree where each node is a Markdown construct (heading, list, code block).
Mental Model
Raw Markdown
   |
   v
[Tokenizer] -> [Block Parser] -> [Inline Parser]
   |                  |
   |                  v
   |            Document AST
   v
Tokens
How It Works (Step-by-Step)
  1. Block parsing identifies structural elements like headings, lists, and code blocks.
  2. Inline parsing runs later to resolve links, emphasis, and code spans.
  3. The AST preserves structure and order, letting you define reliable boundaries.
  4. Headings like ## Project N: become boundary markers for extraction.
Minimal Example
## Project 1: Parser
- **Main Programming Language**: Python
- **Difficulty**: Level 3
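A minimal extraction sketch, assuming the markdown-it-py library (any CommonMark parser that exposes token types and source line positions will do):

from markdown_it import MarkdownIt

source = "## Project 1: Parser\n- **Main Programming Language**: Python\n- **Difficulty**: Level 3\n"

md = MarkdownIt("commonmark")
for tok in md.parse(source):
    # heading_open tokens carry the tag (h2) and a [start, end) source line range in tok.map
    if tok.type == "heading_open":
        print("heading", tok.tag, "starts at line", tok.map[0])
    elif tok.type == "inline":
        print("inline text:", tok.content)

Note that the inline tokens still contain the raw emphasis markers; metadata extraction happens on top of this structure, not on raw lines.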
Common Misconceptions
  • Misconception: Regex is enough for Markdown parsing.
  • Correction: Regex fails on nested lists, code fences, and inline formatting.
Check-Your-Understanding
  1. Why must block parsing occur before inline parsing?
  2. What goes wrong if you treat Markdown headings as plain strings?
  3. How do code fences change list parsing behavior?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)

1.2 Boundary Detection and Source Positions

Definitions & Key Terms
  • Boundary: A reliable point where a new project starts (e.g., heading node).
  • Source Position: Line/column info that helps you report errors precisely.
Mental Model
[Heading Node] -> [Boundary Marker]
[Paragraph]    -> [Metadata Candidate]
[List]         -> [Key/Value Pairs]
How It Works
  1. Walk the AST in order.
  2. When you see a Heading with “Project” prefix, start a new block.
  3. Collect nodes until the next boundary.
  4. Record line/column for every metadata token.
Minimal Example (Pseudo-Code)
for node in ast:
    if is_project_heading(node):
        start_new_block(node)
    else:
        append_to_current_block(node)
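A slightly fuller sketch of the same walk, assuming each node exposes a type, heading level, text, and source line (these attribute names are illustrative, not tied to any specific parser):

def split_into_blocks(nodes):
    blocks, current = [], None
    for node in nodes:
        # A level-2 heading whose text starts with "Project" opens a new block.
        if node.type == "heading" and node.level == 2 and node.text.startswith("Project"):
            current = {"heading": node.text, "line": node.line, "children": []}
            blocks.append(current)
        elif current is not None:
            # Everything between two project headings belongs to the current block;
            # nodes before the first heading (preamble) are skipped.
            current["children"].append(node)
    return blocks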
Common Misconceptions
  • Misconception: Boundaries always equal blank lines.
  • Correction: Headings are the only stable boundary marker in Markdown.
Check-Your-Understanding
  1. Why are source positions crucial for schema errors?
  2. How would you handle a project heading inside a block quote?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)

Chapter 2: Schema-Driven Normalization and Canonical JSON

2.1 JSON Schema as a Contract

Definitions & Key Terms
  • JSON Schema: A formal grammar for validating JSON documents.
  • Meta-schema: The schema that validates schemas.
  • Validation keywords: Constraints like type, required, enum, pattern.
Mental Model
Raw JSON -> [Schema Validator] -> OK | Error List
How It Works
  1. Parse extracted metadata into JSON.
  2. Validate against a JSON Schema.
  3. Return structured errors with locations and suggestions.
Minimal Example
{
  "title": "Project 1: Parser",
  "difficulty": "Level 3",
  "language": "Python"
}
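A minimal validation sketch, assuming the Python jsonschema package; the three-field schema below is illustrative:

import jsonschema

schema = {
    "type": "object",
    "required": ["title", "difficulty", "language"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "difficulty": {"type": "string", "pattern": "^Level [1-5]$"},
        "language": {"type": "string"},
    },
    "additionalProperties": False,
}

project = {"title": "Project 1: Parser", "language": "Python"}  # "difficulty" is missing

validator = jsonschema.Draft202012Validator(schema)
for err in sorted(validator.iter_errors(project), key=lambda e: list(e.path)):
    # err.path points at the offending field (empty for object-level errors); err.message explains it
    print(list(err.path), err.message)

Structured errors like these are what you later map back to source line/column positions.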
Common Misconceptions
  • Misconception: Schema is optional if the parser works.
  • Correction: Schema turns implicit assumptions into enforceable rules.
Check-Your-Understanding
  1. Why should schemas be versioned?
  2. What happens if required fields are missing?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)
  • Project 4 (QA + Enrichment)

2.2 Canonicalization and Determinism

Definitions & Key Terms
  • Canonical JSON: A deterministic serialization with stable key ordering.
  • JCS: JSON Canonicalization Scheme (RFC 8785).
  • Determinism: Same inputs produce byte-identical outputs.
Mental Model
Unordered JSON -> [Canonicalizer] -> Stable JSON
How It Works
  1. Sort object keys deterministically.
  2. Normalize numbers and string encodings.
  3. Serialize in a consistent format.
Minimal Example
{"b":2,"a":1}  ->  {"a":1,"b":2}
Common Misconceptions
  • Misconception: Pretty-printing is enough for determinism.
  • Correction: Pretty-printing doesn’t fix key ordering or numeric form.
Check-Your-Understanding
  1. Why does canonicalization matter for caching and diffs?
  2. How do floats break deterministic output?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)
  • Project 3 (Template Expander CLI)

Chapter 3: Concept Graphs and Dependency Modeling

3.1 Building a Concept Graph

Definitions & Key Terms
  • Concept Node: A topic like “JSON Schema” or “AST Parsing”.
  • Edge: A directed relationship representing prerequisites.
  • DAG: Directed Acyclic Graph.
Mental Model
Concept A -> Concept B -> Concept C
How It Works
  1. Extract concepts from project metadata.
  2. Build edges from prerequisite lists.
  3. Detect cycles and report them.
Minimal Example (DOT)
digraph {
  "Markdown" -> "AST";
  "AST" -> "Schema";
}
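A small sketch that emits DOT from an edge list; sorting the edges keeps the output deterministic:

def to_dot(edges) -> str:
    # edges: list of (prerequisite, dependent) pairs
    lines = ["digraph concepts {"]
    for src, dst in sorted(edges):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot([("AST", "Schema"), ("Markdown", "AST")]))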
Common Misconceptions
  • Misconception: Any graph is fine; cycles don’t matter.
  • Correction: Cycles break linear learning paths and must be handled.
Check-Your-Understanding
  1. Why does a cycle make scheduling impossible?
  2. What would a self-loop imply in this model?
Where You’ll Use This
  • Project 2 (Concept Graph Builder)

3.2 Topological Sort and Cycle Detection

Definitions & Key Terms
  • Topological Sort: An ordering where prerequisites come first.
  • Cycle Detection: Finding loops that block ordering.
Mental Model
A -> B -> C
^         |
└---------┘  (the edge C -> A closes the cycle)
How It Works
  1. Compute in-degrees for each node.
  2. Repeatedly remove nodes with zero in-degree.
  3. If nodes remain, you have a cycle.
Minimal Example (Pseudo-Code)
order = []
queue = [n for n in nodes if indeg[n] == 0]
while queue:
    n = queue.pop()
    order.append(n)
    for m in adj[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)
# any node left with indeg > 0 is part of a cycle
Common Misconceptions
  • Misconception: DFS alone gives a valid ordering.
  • Correction: Reverse DFS post-order is a valid topological order only on a DAG; you must detect back edges (cycles) explicitly.
Check-Your-Understanding
  1. What does it mean if the queue becomes empty early?
  2. How would you report a cycle path to the user?
Where You’ll Use This
  • Project 2 (Concept Graph Builder)

Chapter 4: Template-Driven Expansion

4.1 Templates as Deterministic Blueprints

Definitions & Key Terms
  • Template: A structured blueprint with placeholders.
  • Partial: Reusable template fragments.
  • Slug: A filesystem-safe identifier derived from a title.
Mental Model
JSON Project -> [Template Engine] -> Markdown File
How It Works
  1. Load a template file (Markdown + placeholders).
  2. Render with project metadata.
  3. Write output with deterministic ordering.
Minimal Example (Jinja-style placeholders; the names are illustrative)
# Project {{ number }}: {{ title }}

> {{ summary }}
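A rendering sketch, assuming Jinja2; StrictUndefined makes a missing field a hard error instead of a silent blank:

from jinja2 import Template, StrictUndefined

tpl = Template(
    "# Project {{ number }}: {{ title }}\n\n> {{ summary }}\n",
    undefined=StrictUndefined,
)
print(tpl.render(number=1, title="Parser", summary="Markdown in, canonical JSON out"))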
Common Misconceptions
  • Misconception: Template rendering is just string replacement.
  • Correction: Proper templating handles conditionals, loops, and escaping.
Check-Your-Understanding
  1. How do you ensure template output is deterministic?
  2. What happens if a field is missing?
Where You’ll Use This
  • Project 3 (Project Expander CLI)

4.2 Stable IDs, Slugs, and File Layouts

Definitions & Key Terms
  • Stable ID: A persistent identifier that does not change between runs.
  • Slugify: Converting titles to lowercase, hyphenated file names.
Mental Model
"Project 3: CLI" -> "P03-project-expander-cli.md"
How It Works
  1. Normalize title: lowercase, remove punctuation.
  2. Prefix with stable project number.
  3. Enforce max length and uniqueness.
Minimal Example
"Project 12: API Validator" -> P12-api-validator.md
Common Misconceptions
  • Misconception: File names can be derived from titles directly.
  • Correction: Titles are not stable; IDs must be.
Check-Your-Understanding
  1. What happens if two projects have the same title?
  2. How do you preserve links when titles change?
Where You’ll Use This
  • Project 3 (Project Expander CLI)

Chapter 5: QA, Linting, and Provenance

5.1 Lint Rules and Rubric Scoring

Definitions & Key Terms
  • Lint Rule: A deterministic check against a structural rule.
  • Rubric: A weighted scoring system for quality evaluation.
Mental Model
Document -> [Lint Rules] -> Errors/Warnings
Document -> [Rubric] -> Score
How It Works
  1. Parse output files.
  2. Validate required sections exist.
  3. Score for depth, clarity, and completeness.
Minimal Example
{"rule": "has_definition_of_done", "status": "fail"}
Common Misconceptions
  • Misconception: Linting only checks syntax.
  • Correction: Linting can enforce semantic completeness.
Check-Your-Understanding
  1. What makes a good lint rule?
  2. How do you avoid false positives?
Where You’ll Use This
  • Project 4 (QA + Enrichment)

5.2 Provenance and Citations

Definitions & Key Terms
  • Provenance: A record of where facts and claims come from.
  • Citation Map: A structured link between claims and sources.
Mental Model
Claim -> [Source Link] -> Verified Reference
How It Works
  1. Extract URLs or references from content.
  2. Verify each reference resolves and matches a claim.
  3. Store provenance in a structured JSON map.
Minimal Example
{"claim": "GFM is a superset of CommonMark", "source": "https://github.github.io/gfm/"}
Common Misconceptions
  • Misconception: Citations are optional in technical content.
  • Correction: Provenance is how you scale trust and maintainability.
Check-Your-Understanding
  1. How do you store citations without polluting content?
  2. What should happen if a source disappears?
Where You’ll Use This
  • Project 4 (QA + Enrichment)

Glossary (High-Signal)

  • AST: Structured tree representation of parsed Markdown.
  • Canonicalization: Deterministic serialization of JSON data.
  • Concept Graph: Directed graph of concepts and prerequisites.
  • DAG: Directed graph without cycles, enabling linear ordering.
  • Determinism: Same inputs always produce byte-identical outputs.
  • Lint: Automated checks that enforce structural or semantic rules.
  • Slug: URL- and filesystem-safe identifier derived from a title.

Why Project Expansion Automation Matters

The Modern Problem It Solves

Engineers spend significant time searching for answers and context instead of building. Technical documentation remains a primary learning resource for developers, but most content is inconsistent, incomplete, or unstructured. A project expansion pipeline turns scattered ideas into reliable, auditable learning assets, reducing knowledge friction and making technical learning reproducible at scale.

Real-world impact (recent studies):

  • The 2023 Stack Overflow Developer Survey reports that 63% of respondents spend more than 30 minutes per day searching for answers or solutions at work.
  • Stack Overflow’s 2025 developer survey press release reports that technical documentation remains a top learning resource for developers.

The Shift: From Manual Docs to Deterministic Pipelines

Manual Expansion                          Automated Expansion
┌────────────────────────────┐            ┌────────────────────────────┐
│ One-off writing            │            │ Structured input + schema  │
│ Inconsistent sections      │            │ Deterministic output       │
│ Hard to update             │            │ Regeneratable files        │
│ Unclear dependencies       │            │ Explicit concept graph     │
└────────────────────────────┘            └────────────────────────────┘

Context & Evolution (Brief)

Markdown became the lingua franca for developer documentation, but without a schema, it created a new problem: human-friendly inputs with machine-hostile ambiguity. The modern solution is to keep Markdown for authors while extracting a schema-validated intermediate representation for tooling.


Concept Summary Table

  • Markdown Parsing: AST-based parsing is the only reliable way to extract structured metadata.
  • Schema Validation: JSON Schema formalizes assumptions and makes errors actionable.
  • Canonicalization: Deterministic output requires stable ordering and serialization rules.
  • Concept Graphs: Dependency edges reveal prerequisites and learning order.
  • Template Expansion: Templates turn structured data into repeatable content output.
  • QA + Provenance: Linting and citations keep content trustworthy at scale.

Project-to-Concept Map

  • Project 1 (Markdown Project Parser & Schema Normalizer): Parser + JSON normalization. Uses Ch. 1, Ch. 2.
  • Project 2 (Concept Map & Dependency Graph Builder): Concept DAG + DOT output. Uses Ch. 3.
  • Project 3 (Template-Driven Project Expander CLI): Template engine + deterministic files. Uses Ch. 4.
  • Project 4 (QA + Research Enrichment Pipeline): Linting + rubric + provenance. Uses Ch. 5.

Deep Dive Reading by Concept

Fundamentals & Parsing

  • Parsing and data extraction: “Clean Code” by Robert C. Martin — Ch. 3 (Functions), Ch. 7 (Error Handling). Reliable parsers require clean error handling and small, testable functions.
  • Data modeling and correctness: “Code Complete” by Steve McConnell — Ch. 10 (General Issues in Using Variables). Helps you design data structures that prevent invalid states.

Graphs & Algorithms

  • Directed graphs: “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 4 (Graphs). Concept graphs and cycle detection are core to dependency reasoning.
  • Dependency ordering: “Algorithms in C” by Sedgewick — Part 5 (Graphs). Implementation-level understanding of topological sort.

Architecture & QA

  • Pipeline architecture: “Fundamentals of Software Architecture” by Richards & Ford — Ch. 9 (Architecture Styles). Helps you reason about pipeline design and trade-offs.
  • Code quality systems: “Refactoring” by Martin Fowler — Ch. 2 (Principles in Refactoring). Lint and rubric tooling are about maintainability and safe change.

Quick Start: Your First 48 Hours

Day 1 (4 hours):

  1. Read Chapters 1 and 2 only.
  2. Install a CommonMark parser in your language of choice.
  3. Start Project 1 and extract just the project titles.
  4. Write a JSON Schema with only 3 required fields.

Day 2 (4 hours):

  1. Add metadata extraction for difficulty and language.
  2. Validate JSON output and print errors with line/column.
  3. Write a tiny canonicalizer that sorts keys.
  4. Run the golden output example from Project 1.

End of Weekend: You can parse a project list into schema-validated, deterministic JSON. That is the backbone of everything else.


Path 1: The Documentation Tooling Builder

Best for: People building documentation tooling or content pipelines.

  1. Project 1 → Project 3 → Project 4
  2. Optional: Project 2 if you need dependency graphs

Path 2: The Algorithm-Focused Learner

Best for: Learners interested in graph algorithms and dependency reasoning.

  1. Project 2 → Project 1 → Project 3 → Project 4

Path 3: The Systems Builder

Best for: Learners who want the full pipeline end-to-end.

  1. Project 1 → Project 2 → Project 3 → Project 4 (full sequence)

Success Metrics

By the end, you should be able to:

  • Parse a Markdown project list into canonical JSON with deterministic output
  • Detect and report schema errors with exact line/column references
  • Generate a dependency graph and detect cycles
  • Expand projects into deterministic Markdown files with stable IDs
  • Enforce quality rules with linting and citation checks

Optional Appendices

Appendix A: Markdown + JSON Schema Cheatsheet

Markdown boundaries to treat as project markers:

  • ## Project N: Title
  • #### Project N: Title

JSON Schema essentials:

  • type, required, enum, pattern, additionalProperties

Appendix B: Determinism Checklist

  • Key ordering is stable
  • Timestamps are fixed or removed
  • Output sorting is deterministic
  • Slugs use a stable algorithm
  • Canonical JSON serialization is used

Appendix C: CLI Exit Code Conventions

  • 0: Success
  • 1: Invalid input
  • 2: Schema validation failure
  • 3: Internal error
  • 4: Partial output or non-fatal warnings

Appendix D: Primary Specs & References

  • CommonMark Spec: https://spec.commonmark.org/
  • GitHub Flavored Markdown (GFM) Spec: https://github.github.io/gfm/
  • JSON Schema Specification (2020-12): https://json-schema.org/specification
  • JSON Canonicalization Scheme (RFC 8785): https://www.rfc-editor.org/rfc/rfc8785
  • Graphviz DOT Language: https://graphviz.org/doc/info/lang.html
  • Stack Overflow Developer Survey 2023: https://survey.stackoverflow.co/2023
  • Stack Overflow Developer Survey 2025 (learning resources): https://survey.stackoverflow.co/2025/developers/

Projects

Project 1: Markdown Project Parser & Schema Normalizer

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4 — Infrastructure builder
  • Business Potential: 3 — Internal tooling and content operations
  • Difficulty: Level 3 — Intermediate
  • Knowledge Area: Documentation Engineering
  • Software or Tool: CommonMark parser + JSON Schema validator
  • Main Book: “Clean Code” by Robert C. Martin

What you’ll build: A CLI that parses a Markdown project list into deterministic, schema-validated JSON with actionable errors and line/column diagnostics.

Why it teaches this topic: It forces you to separate syntax from structure, validate assumptions with schemas, and produce repeatable outputs—core to any scalable content pipeline.

Core challenges you’ll face:

  • Reliable boundary detection → avoiding false positives in headings
  • Schema design → balancing flexibility and strictness
  • Deterministic output → canonicalization and ordering

Real World Outcome

What you will see:

  1. A build/projects.json file containing a normalized list of projects
  2. A build/errors.json file with line/column errors (if any)
  3. A deterministic output that is byte-identical across runs

Command Line Outcome Example:

# 1. Parse and normalize
$ expander parse projects.md --out build/
[ok] Parsed 12 project blocks
[ok] Schema validation: 12/12 valid
[ok] Wrote build/projects.json (deterministic)

# 2. Validate only
$ expander validate projects.md
[ok] Schema validation passed (0 errors)

# 3. Trigger a failure
$ expander validate projects_missing.md
[error] Project 4 missing required field: "difficulty"
        at line 214, column 5
exit code: 2

The Core Question You’re Answering

“How do I turn ambiguous Markdown into a reliable, machine-validated data model?”

This question forces you to confront the gap between human writing and machine expectations. Solving it teaches you to build trustable pipelines rather than one-off scripts.


Concepts You Must Understand First

  1. CommonMark Parsing
    • Why is CommonMark a two-phase parser (block then inline)?
    • What goes wrong if you parse inline formatting too early?
    • Book Reference: “Clean Code” Ch. 3 — Robert C. Martin
  2. JSON Schema Validation
    • What does required actually enforce?
    • How do you express enums and patterns?
    • Book Reference: “Code Complete” Ch. 10 — Steve McConnell
  3. Deterministic Serialization
    • Why is key ordering important for diff-based workflows?
    • How does canonicalization reduce noise?
    • Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin

Questions to Guide Your Design

  1. Parsing Strategy
    • How will you detect a new project boundary?
    • How will you handle malformed or nested headings?
  2. Schema Design
    • Which fields are required vs optional?
    • How do you enforce enum values for difficulty?
  3. Diagnostics
    • How will you map schema errors back to source lines?
    • What should exit codes represent?

Thinking Exercise

Trace a broken project entry

Imagine this input:

## Project 7: Foo
- **Language**: Python
- **Difficulty**: 

Questions while tracing:

  • Where does the missing difficulty value become a schema error?
  • What line/column should the error report?
  • Should this be a warning or a hard failure?

The Interview Questions They’ll Ask

  1. “Why is Markdown parsing hard to do with regex?”
  2. “What is JSON Schema and why is it useful?”
  3. “How would you make JSON output deterministic?”
  4. “How do you surface validation errors to users?”
  5. “Why should CLI tools return explicit exit codes?”

Hints in Layers

Hint 1: Start with AST extraction. Use a CommonMark parser and walk heading nodes first.

Hint 2: Design the schema early. Write the JSON Schema before you parse; it defines your extraction targets.

Hint 3: Canonicalize output. Sort object keys and use a stable serializer.

json.dumps(data, sort_keys=True, separators=(",", ":"))

Hint 4: Error positions. Store line/column metadata alongside extracted tokens.


Books That Will Help

  • Error handling: “Clean Code” by Robert C. Martin, Ch. 7
  • Data modeling: “Code Complete” by Steve McConnell, Ch. 10
  • CLI design: “The Pragmatic Programmer” by Thomas & Hunt, Ch. 5

Common Pitfalls & Debugging

Problem 1: “Projects split incorrectly”

  • Why: Headings are parsed as plain text instead of AST nodes.
  • Fix: Use a CommonMark parser and detect heading nodes explicitly.
  • Quick test: Count the number of parsed headings vs expected projects.

Problem 2: “Schema validation passes but output is wrong”

  • Why: Schema is too permissive.
  • Fix: Add enum, minLength, and pattern constraints.
  • Quick test: Inject an invalid value and ensure schema rejects it.

Problem 3: “Diffs change every run”

  • Why: Output ordering is not deterministic.
  • Fix: Sort keys and normalize arrays.
  • Quick test: Run the pipeline twice and compare file hashes.
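A sketch of that quick test, reusing the expander command and paths from the examples above (adapt them to your own CLI):

import hashlib
import subprocess

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

subprocess.run(["expander", "parse", "projects.md", "--out", "build/"], check=True)
first = sha256_of("build/projects.json")
subprocess.run(["expander", "parse", "projects.md", "--out", "build/"], check=True)
assert sha256_of("build/projects.json") == first, "output is not deterministic"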

Definition of Done

  • Parser extracts all project blocks correctly
  • JSON output validates against schema
  • Output is deterministic across runs
  • Errors include line/column references
  • CLI exits with correct exit codes

Project 2: Concept Map & Dependency Graph Builder

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4 — Graph wizardry
  • Business Potential: 2 — Internal analytics and planning
  • Difficulty: Level 3 — Intermediate
  • Knowledge Area: Graph Algorithms
  • Software or Tool: Graphviz DOT + cycle detection
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: A tool that reads normalized project JSON, extracts concept dependencies, and outputs a DAG with cycle detection and DOT/SVG visualizations.

Why it teaches this topic: It forces you to model learning dependencies explicitly, then verify they are acyclic and navigable.

Core challenges you’ll face:

  • Graph modeling → designing node IDs and edge semantics
  • Cycle detection → detecting and reporting cycles clearly
  • Visualization → producing DOT output for graph rendering

Real World Outcome

What you will see:

  1. build/concepts.json with nodes and edges
  2. build/concepts.dot for Graphviz rendering
  3. build/concepts.svg visualizing dependencies

Command Line Outcome Example:

# 1. Build concept graph
$ expander concepts build/projects.json --out build/
[ok] Loaded 12 projects
[ok] Extracted 34 concepts
[ok] Graph is acyclic
[ok] Wrote build/concepts.dot
[ok] Rendered build/concepts.svg

# 2. Detect a cycle
$ expander concepts build/projects_with_cycle.json --out build/
[error] Cycle detected: "Schema" -> "Parser" -> "Schema"
exit code: 2

The Core Question You’re Answering

“How do I make hidden learning dependencies explicit and verifiable?”

Dependency graphs turn assumptions into visual and testable structure.


Concepts You Must Understand First

  1. Directed Graphs
    • What is a node vs an edge?
    • How do you represent prerequisites?
    • Book Reference: “Algorithms, Fourth Edition” Ch. 4 — Sedgewick & Wayne
  2. Topological Sort
    • Why does a topological ordering only exist for DAGs?
    • How do you compute one efficiently?
    • Book Reference: “Algorithms in C” Part 5 — Sedgewick
  3. DOT Language
    • What does digraph mean in DOT?
    • How do you label nodes and edges?
    • Book Reference: “Clean Code” Ch. 2 — Robert C. Martin (for naming clarity)

Questions to Guide Your Design

  1. Graph Model
    • Will concepts be unique globally or per project?
    • How will you handle synonyms?
  2. Cycle Reporting
    • How will you output the actual cycle path?
    • Should cycles be errors or warnings?
  3. Visualization
    • What layout makes dependencies readable?
    • How will you avoid node clutter?

Thinking Exercise

Sketch a concept graph with 5 nodes and one cycle. How would your tool report it?


The Interview Questions They’ll Ask

  1. “What is a DAG and why is it useful?”
  2. “How does topological sorting work?”
  3. “What is a cycle and how do you detect it?”
  4. “Why is visualization important for dependency graphs?”
  5. “How would you scale graph generation for 1,000 concepts?”

Hints in Layers

Hint 1: Start with adjacency lists. Build a dict[str, list[str]] of edges.

Hint 2: Use Kahn’s algorithm. It gives both ordering and cycle detection (see the sketch after these hints).

Hint 3: Output DOT. Use digraph { "A" -> "B"; } as your minimal format.

Hint 4: Cycle detail. Track parent pointers to reconstruct the cycle path.
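A minimal Kahn’s-algorithm sketch that returns both a topological ordering and any leftover (cyclic) nodes; the edge-list input format is an assumption:

from collections import deque

def topo_sort(edges):
    # edges: list of (prerequisite, dependent) pairs
    nodes = {n for edge in edges for n in edge}
    adj = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))  # sorted for determinism
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(adj[n]):
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    cyclic = sorted(n for n in nodes if indeg[n] > 0)
    return order, cyclic

print(topo_sort([("Markdown", "AST"), ("AST", "Schema"), ("Schema", "Parser"), ("Parser", "Schema")]))

Any nodes returned in the second list are part of (or downstream of) a cycle and can be reported to the user.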


Books That Will Help

  • Graphs: “Algorithms, Fourth Edition”, Ch. 4
  • Dependency ordering: “Algorithms in C”, Part 5
  • Naming & clarity: “Clean Code”, Ch. 2

Common Pitfalls & Debugging

Problem 1: “Cycle detection never triggers”

  • Why: In-degree counts are not updated correctly.
  • Fix: Decrement in-degree for each outgoing edge.
  • Quick test: Create a 2-node cycle and confirm detection.

Problem 2: “DOT output renders blank”

  • Why: Missing digraph header or semicolons.
  • Fix: Validate DOT syntax with dot -Tsvg.
  • Quick test: Run dot -Tsvg build/concepts.dot -o /tmp/test.svg.

Definition of Done

  • Graph nodes and edges generated correctly
  • Cycle detection works with clear error output
  • DOT output renders to SVG
  • CLI reports meaningful exit codes

Project 3: Template-Driven Project Expander CLI

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5 — Content pipeline architect
  • Business Potential: 4 — Productizable documentation tooling
  • Difficulty: Level 4 — Advanced
  • Knowledge Area: Code Generation
  • Software or Tool: Template engine + filesystem writer
  • Main Book: “Clean Architecture” by Robert C. Martin

What you’ll build: A CLI that reads normalized project JSON and expands each project into its own Markdown file using templates, producing deterministic output and a README index.

Why it teaches this topic: It teaches deterministic generation, template modeling, and reproducible outputs at scale.

Core challenges you’ll face:

  • Template design → preventing missing sections
  • File layout → deterministic naming and ordering
  • Regeneration → idempotent output

Real World Outcome

What you will see:

  1. expander.codex/ directory with per-project files
  2. expander.codex/README.md index
  3. Stable filenames like P03-project-expander-cli.md

Command Line Outcome Example:

# 1. Expand projects into files
$ expander expand build/projects.json --out expander.codex
[ok] Expanded 12 projects
[ok] Wrote expander.codex/README.md
[ok] Deterministic output confirmed

# 2. Re-run to verify determinism
$ expander expand build/projects.json --out expander.codex
[ok] No changes detected (hash match)

The Core Question You’re Answering

“How do I turn structured data into deterministic, publishable content?”

This is the core of content automation: stable, repeatable generation.


Concepts You Must Understand First

  1. Template Engines
    • How do you safely interpolate data?
    • How do you prevent missing sections?
    • Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin
  2. Deterministic Output
    • What changes between runs?
    • How do you guarantee ordering?
    • Book Reference: “Code Complete” Ch. 12 — Steve McConnell
  3. Slugging and IDs
    • Why are titles not stable identifiers?
    • How do you enforce uniqueness?
    • Book Reference: “Refactoring” Ch. 2 — Martin Fowler

Questions to Guide Your Design

  1. Template Layout
    • What sections are mandatory?
    • How will optional fields render?
  2. File Naming
    • How do you prevent collisions?
    • How do you handle renames?
  3. Regeneration
    • Should old files be deleted?
    • How do you detect changes?

Thinking Exercise

Design a template that ensures every project includes a “Definition of Done” section, even if the input data is missing. How would you implement a default?


The Interview Questions They’ll Ask

  1. “What is the difference between code generation and templating?”
  2. “How do you make template output deterministic?”
  3. “How do you handle missing data fields gracefully?”
  4. “What is idempotence in file generation?”
  5. “How do you test generated files?”

Hints in Layers

Hint 1: Start with a single template. Render one project file before building the loop.

Hint 2: Sort projects by ID. Stable ordering prevents diff noise.

Hint 3: Use hash comparisons. Compute file hashes to detect changes (a sketch of idempotent writes follows these hints).

Hint 4: Keep a README index. Generate a summary file to verify all outputs.
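A sketch of idempotent file output, which is what lets the second run report no changes:

from pathlib import Path

def write_if_changed(path: Path, content: str) -> bool:
    # Only touch the file when the rendered content actually differs.
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True

Counting how many calls return True gives you the "No changes detected" summary for free.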


Books That Will Help

  • Architecture: “Clean Architecture”, Ch. 7
  • Maintainability: “Refactoring”, Ch. 2
  • Quality design: “Code Complete”, Ch. 12

Common Pitfalls & Debugging

Problem 1: “Files change on every run”

  • Why: Non-deterministic ordering or timestamps.
  • Fix: Sort inputs and remove timestamps.
  • Quick test: Run twice and compare checksums.

Problem 2: “Templates break on missing fields”

  • Why: No defaults or missing conditional logic.
  • Fix: Define defaults and fallbacks.
  • Quick test: Remove optional data and render.

Definition of Done

  • Per-project files generated with correct names
  • README index generated and sorted
  • Re-running produces no diff
  • Missing fields handled gracefully

Project 4: QA + Research Enrichment Pipeline

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5 — Quality enforcer
  • Business Potential: 4 — Content QA automation
  • Difficulty: Level 4 — Advanced
  • Knowledge Area: Quality Engineering
  • Software or Tool: Lint engine + rubric scorer
  • Main Book: “Refactoring” by Martin Fowler

What you’ll build: A pipeline that lints project files for completeness, scores them against a rubric, and verifies citation provenance.

Why it teaches this topic: It forces you to encode quality expectations into machine-checkable rules.

Core challenges you’ll face:

  • Rule design → avoiding false positives
  • Rubric scoring → balancing depth vs breadth
  • Provenance checks → ensuring sources are valid

Real World Outcome

What you will see:

  1. build/lint.json with structured lint results
  2. build/rubric.json with quality scores
  3. build/sources.json mapping claims to sources

Command Line Outcome Example:

# 1. Run lint checks
$ expander lint expander.codex
[error] P02 missing "Definition of Done" section
[error] P03 missing "Common Pitfalls" section
exit code: 1

# 2. Score quality
$ expander rubric expander.codex
[ok] Overall score: 86/100
[warn] Project 1 lacks explicit CLI failure example

# 3. Verify sources
$ expander sources expander.codex
[ok] 14 sources verified
[warn] 2 sources missing or unreachable

The Core Question You’re Answering

“How do I make content quality measurable and enforceable?”

This project turns subjective quality into objective checks.


Concepts You Must Understand First

  1. Lint Rule Design
    • What is a deterministic rule?
    • How do you avoid false positives?
    • Book Reference: “Refactoring” Ch. 2 — Martin Fowler
  2. Rubric Scoring
    • How do you weight sections?
    • What is an acceptable score threshold?
    • Book Reference: “Clean Code” Ch. 4 — Robert C. Martin
  3. Provenance Tracking
    • How do you store citations?
    • How do you handle dead links?
    • Book Reference: “Code Complete” Ch. 20 — Steve McConnell

Questions to Guide Your Design

  1. Lint Rules
    • Which sections are mandatory?
    • Which failures should be warnings vs errors?
  2. Rubric
    • How do you score depth of explanation?
    • How do you prevent gaming?
  3. Sources
    • Where do you store source metadata?
    • How do you validate source availability?

Thinking Exercise

Design a rubric with 5 weighted criteria. How would you score a project missing “Real World Outcome”?


The Interview Questions They’ll Ask

  1. “How do you define content quality in measurable terms?”
  2. “What makes a lint rule effective?”
  3. “How do you handle false positives in linting?”
  4. “Why is provenance important in technical documentation?”
  5. “How would you scale QA to 1,000 projects?”

Hints in Layers

Hint 1: Start with structural checks. Verify sections exist before scoring depth.

Hint 2: Use a ruleset file. Store lint rules in JSON for easy updates.

Hint 3: Score by section length + keywords. Combine structural checks with heuristic scoring (see the sketch after these hints).

Hint 4: Build a sources map. Extract links and store them in sources.json.
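A weighted-rubric sketch; the section names, weights, and minimum length are all illustrative:

def rubric_score(section_lengths: dict, weights: dict) -> float:
    # A section earns its full weight once it exceeds a minimum character count.
    min_chars = 200
    earned = sum(w for name, w in weights.items() if section_lengths.get(name, 0) >= min_chars)
    return round(100 * earned / sum(weights.values()), 1)

weights = {
    "Real World Outcome": 0.3,
    "Definition of Done": 0.2,
    "Hints in Layers": 0.2,
    "Common Pitfalls": 0.2,
    "Interview Questions": 0.1,
}
print(rubric_score({"Real World Outcome": 950, "Definition of Done": 400}, weights))  # 50.0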


Books That Will Help

  • Quality mindset: “Refactoring”, Ch. 2
  • Documentation clarity: “Clean Code”, Ch. 4
  • Large-scale projects: “Code Complete”, Ch. 20

Common Pitfalls & Debugging

Problem 1: “Lint rule flags everything”

  • Why: Rule is too strict or mismatched to input.
  • Fix: Add thresholds and allow warnings.
  • Quick test: Run on a known good file and confirm zero errors.

Problem 2: “Rubric scores are meaningless”

  • Why: Weights are arbitrary.
  • Fix: Calibrate scores against real examples.
  • Quick test: Score three known files and compare.

Definition of Done

  • Lint engine flags missing sections correctly
  • Rubric produces stable scores
  • Provenance map is generated
  • CLI returns exit codes for failures