Project Expansion Automation: Real World Projects

Goal: Build a deep mental model for turning a loose project list into a deterministic, validated, searchable, and extensible collection of high-quality project mini-books. You will understand how Markdown parsing works at the AST level, how JSON Schema creates a contract between content and tooling, how concept graphs expose prerequisite structure, and how deterministic generation makes outputs reproducible across machines and time. By the end, you’ll be able to build a full CLI pipeline that ingests a messy Markdown file, normalizes it into canonical JSON, generates a directory of expanded project files, and enforces quality with linting and provenance checks. This guide is both a systems design exercise and a documentation-engineering toolkit you can reuse for any knowledge product.


Introduction: What This Guide Covers

Project Expansion Automation is the practice of converting a high-level project list into a structured, deeply detailed set of project files using parsers, schemas, dependency graphs, and deterministic generators. It solves the real-world problem of inconsistent content quality and unscalable manual writing by creating a repeatable pipeline that produces project guides you can trust and build on.

What you will build (by the end of this guide):

  • A Markdown parser that extracts project metadata into schema-validated JSON
  • A concept dependency graph that reveals prerequisite structure and cycles
  • A template-driven CLI that generates deterministic project files + index
  • A QA and research enrichment pipeline that lints, scores, and audits outputs

Scope (what’s included):

  • Markdown parsing, AST extraction, and metadata normalization
  • JSON Schema validation and canonical output
  • Graph modeling of concepts and dependencies
  • Template expansion with deterministic file output
  • Quality assurance, lint rules, and citation provenance

Out of scope (for this guide):

  • Fully automated natural language generation of project content
  • Large-scale distributed processing or cloud orchestration
  • Proprietary content management systems

The Big Picture (Mental Model)

                 ┌───────────────────────────────┐
Input Markdown → │ Parser + AST + Extractor      │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Schema Validation + Canonical │
                 │ JSON Normalization            │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Concept Graph + Dependencies  │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Template Expansion Engine     │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ QA + Lint + Provenance Audit  │
                 └───────────────┬───────────────┘
                                 v
                 ┌───────────────────────────────┐
                 │ Expanded Project Files + Index│
                 └───────────────────────────────┘

Key Terms You’ll See Everywhere

  • AST (Abstract Syntax Tree): A tree representation of Markdown structure (headings, lists, code blocks).
  • Schema Validation: Automated rules that ensure JSON data has expected shape and required fields.
  • Canonicalization: A deterministic serialization of JSON so outputs are identical across runs.
  • Concept Graph: A directed graph where nodes are concepts and edges represent prerequisites.
  • Deterministic Generation: Running the pipeline multiple times yields byte-identical output.

How to Use This Guide

  1. Read the Theory Primer first: Treat it like a mini-book. It will save you hours during implementation.
  2. Build the projects in order: Each project builds a critical subsystem for the next.
  3. Validate as you go: Every project includes a “Definition of Done” checklist and golden CLI output.
  4. Keep outputs deterministic: Always run with fixed inputs and deterministic ordering.
  5. Track your assumptions: Write down parsing assumptions and schema constraints explicitly.

Prerequisites & Background Knowledge

Before starting these projects, you should have foundational understanding in these areas:

Essential Prerequisites (Must Have)

Programming Skills:

  • Comfort reading and writing CLI tools (Python, Go, or Rust)
  • Experience manipulating JSON data
  • Familiarity with file system operations and directories

Text Processing Fundamentals:

  • Tokenization vs parsing
  • Regular expressions (as a last resort, not a primary strategy)

Data Modeling Basics:

  • JSON data structures
  • Schema vs instance data
  • Error reporting and validation concepts

Recommended Reading: “Clean Code” by Robert C. Martin — Ch. 7 (Error Handling), Ch. 3 (Functions)

Helpful But Not Required

Graph Algorithms:

  • Topological sort
  • Cycle detection in directed graphs
  • Can learn during: Project 2

Templating Systems:

  • Template rendering vs code generation
  • Can learn during: Project 3

Quality Engineering:

  • Lint rules and rubrics
  • Can learn during: Project 4

Self-Assessment Questions

Before starting, ask yourself:

  1. ✅ Can I parse a file and extract structured data from it without regex hacks?
  2. ✅ Do I understand the difference between schema validation and type checking?
  3. ✅ Can I explain what makes output deterministic and why it matters?
  4. ✅ Have I built a CLI that returns non-zero exit codes on failure?
  5. ✅ Can I reason about graph cycles and dependency ordering?

If you answered “no” to questions 1-3: Spend a weekend on a small CLI parsing exercise before starting. If you answered “yes” to all 5: You’re ready.

Development Environment Setup

Required Tools:

  • A Unix-like environment (macOS or Linux)
  • Python 3.11+ or Go 1.21+ or Rust 1.75+
  • A JSON Schema validator library
  • A Markdown parser library with AST support

Recommended Tools:

  • jq for JSON inspection
  • graphviz (dot) for graph rendering
  • ripgrep for source scanning

Testing Your Setup:

$ python3 --version
Python 3.11.5

$ which jq
/usr/bin/jq

$ dot -V
dot - graphviz version 9.0.0

Time Investment

  • Project 1: 8-12 hours
  • Project 2: 6-10 hours
  • Project 3: 10-15 hours
  • Project 4: 8-12 hours
  • Total sprint: 4-6 weeks (part-time)

Important Reality Check

You are building infrastructure for content, which means:

  1. Your output quality equals your schema quality.
  2. Debugging is about edge cases, not the happy path.
  3. Determinism is a feature, not an optimization.
  4. The generator is only as good as the parser.

Big Picture / Mental Model (Diagram-First)

         Human-written Markdown (messy, inconsistent)
                          |
                          v
┌──────────────────────────────────────────────────────┐
│ 1. Parse into AST (structural truth)                 │
│ 2. Extract project blocks (boundaries)               │
│ 3. Normalize metadata into JSON                      │
│ 4. Validate JSON against schema                      │
│ 5. Build concept dependency graph                    │
│ 6. Expand templates into full project files          │
│ 7. Lint + rubric + citation checks                   │
└──────────────────────────────────────────────────────┘
                          |
                          v
              Deterministic project library

Theory Primer (Read This Before Coding)

Chapter 1: Markdown Parsing and Structural Extraction

1.1 CommonMark, GFM, and Why ASTs Matter

Definitions & Key Terms
  • CommonMark: A formal Markdown spec that defines block/inline parsing rules.
  • GFM: GitHub Flavored Markdown, a strict superset of CommonMark.
  • AST: A tree where each node is a Markdown construct (heading, list, code block).
Mental Model
Raw Markdown
   |
   v
[Tokenizer] -> [Block Parser] -> [Inline Parser]
   |                  |
   |                  v
   |            Document AST
   v
Tokens
How It Works (Step-by-Step)
  1. Block parsing identifies structural elements like headings, lists, and code blocks.
  2. Inline parsing runs later to resolve links, emphasis, and code spans.
  3. The AST preserves structure and order, letting you define reliable boundaries.
  4. Headings like ## Project N: become boundary markers for extraction.
Minimal Example
## Project 1: Parser
- **Main Programming Language**: Python
- **Difficulty**: Level 3
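A minimal extraction sketch, assuming the markdown-it-py library (any CommonMark parser that exposes token types and source line positions will do):

from markdown_it import MarkdownIt

source = "## Project 1: Parser\n- **Main Programming Language**: Python\n- **Difficulty**: Level 3\n"

md = MarkdownIt("commonmark")
for tok in md.parse(source):
    # heading_open tokens carry the tag (h2) and a [start, end) source line range in tok.map
    if tok.type == "heading_open":
        print("heading", tok.tag, "starts at line", tok.map[0])
    elif tok.type == "inline":
        print("inline text:", tok.content)

Note that the inline tokens still contain the raw emphasis markers; metadata extraction happens on top of this structure, not on raw lines.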
Common Misconceptions
  • Misconception: Regex is enough for Markdown parsing.
  • Correction: Regex fails on nested lists, code fences, and inline formatting.
Check-Your-Understanding
  1. Why must block parsing occur before inline parsing?
  2. What goes wrong if you treat Markdown headings as plain strings?
  3. How do code fences change list parsing behavior?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)

1.2 Boundary Detection and Source Positions

Definitions & Key Terms
  • Boundary: A reliable point where a new project starts (e.g., heading node).
  • Source Position: Line/column info that helps you report errors precisely.
Mental Model
[Heading Node] -> [Boundary Marker]
[Paragraph]    -> [Metadata Candidate]
[List]         -> [Key/Value Pairs]
How It Works
  1. Walk the AST in order.
  2. When you see a Heading with “Project” prefix, start a new block.
  3. Collect nodes until the next boundary.
  4. Record line/column for every metadata token.
Minimal Example (Pseudo-Code)
for node in ast:
    if is_project_heading(node):
        start_new_block(node)
    else:
        append_to_current_block(node)
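A slightly fuller sketch of the same walk, assuming each node exposes a type, heading level, text, and source line (these attribute names are illustrative, not tied to any specific parser):

def split_into_blocks(nodes):
    blocks, current = [], None
    for node in nodes:
        # A level-2 heading whose text starts with "Project" opens a new block.
        if node.type == "heading" and node.level == 2 and node.text.startswith("Project"):
            current = {"heading": node.text, "line": node.line, "children": []}
            blocks.append(current)
        elif current is not None:
            # Everything between two project headings belongs to the current block;
            # nodes before the first heading (preamble) are skipped.
            current["children"].append(node)
    return blocks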
Common Misconceptions
  • Misconception: Boundaries always equal blank lines.
  • Correction: Headings are the only stable boundary marker in Markdown.
Check-Your-Understanding
  1. Why are source positions crucial for schema errors?
  2. How would you handle a project heading inside a block quote?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)

Chapter 2: Schema-Driven Normalization and Canonical JSON

2.1 JSON Schema as a Contract

Definitions & Key Terms
  • JSON Schema: A formal grammar for validating JSON documents.
  • Meta-schema: The schema that validates schemas.
  • Validation keywords: Constraints like type, required, enum, pattern.
Mental Model
Raw JSON -> [Schema Validator] -> OK | Error List
How It Works
  1. Parse extracted metadata into JSON.
  2. Validate against a JSON Schema.
  3. Return structured errors with locations and suggestions.
Minimal Example
{
  "title": "Project 1: Parser",
  "difficulty": "Level 3",
  "language": "Python"
}
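A minimal validation sketch, assuming the Python jsonschema package; the three-field schema below is illustrative:

import jsonschema

schema = {
    "type": "object",
    "required": ["title", "difficulty", "language"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "difficulty": {"type": "string", "pattern": "^Level [1-5]$"},
        "language": {"type": "string"},
    },
    "additionalProperties": False,
}

project = {"title": "Project 1: Parser", "language": "Python"}  # "difficulty" is missing

validator = jsonschema.Draft202012Validator(schema)
for err in sorted(validator.iter_errors(project), key=lambda e: list(e.path)):
    # err.path points at the offending field (empty for object-level errors); err.message explains it
    print(list(err.path), err.message)

Structured errors like these are what you later map back to source line/column positions.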
Common Misconceptions
  • Misconception: Schema is optional if the parser works.
  • Correction: Schema turns implicit assumptions into enforceable rules.
Check-Your-Understanding
  1. Why should schemas be versioned?
  2. What happens if required fields are missing?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)
  • Project 4 (QA + Enrichment)

2.2 Canonicalization and Determinism

Definitions & Key Terms
  • Canonical JSON: A deterministic serialization with stable key ordering.
  • JCS: JSON Canonicalization Scheme (RFC 8785).
  • Determinism: Same inputs produce byte-identical outputs.
Mental Model
Unordered JSON -> [Canonicalizer] -> Stable JSON
How It Works
  1. Sort object keys deterministically.
  2. Normalize numbers and string encodings.
  3. Serialize in a consistent format.
Minimal Example
{"b":2,"a":1}  ->  {"a":1,"b":2}
Common Misconceptions
  • Misconception: Pretty-printing is enough for determinism.
  • Correction: Pretty-printing doesn’t fix key ordering or numeric form.
Check-Your-Understanding
  1. Why does canonicalization matter for caching and diffs?
  2. How do floats break deterministic output?
Where You’ll Use This
  • Project 1 (Parser & Schema Normalizer)
  • Project 3 (Template Expander CLI)

Chapter 3: Concept Graphs and Dependency Modeling

3.1 Building a Concept Graph

Definitions & Key Terms
  • Concept Node: A topic like “JSON Schema” or “AST Parsing”.
  • Edge: A directed relationship representing prerequisites.
  • DAG: Directed Acyclic Graph.
Mental Model
Concept A -> Concept B -> Concept C
How It Works
  1. Extract concepts from project metadata.
  2. Build edges from prerequisite lists.
  3. Detect cycles and report them.
Minimal Example (DOT)
digraph {
  "Markdown" -> "AST";
  "AST" -> "Schema";
}
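A small sketch that emits DOT from an edge list; sorting the edges keeps the output deterministic:

def to_dot(edges) -> str:
    # edges: list of (prerequisite, dependent) pairs
    lines = ["digraph concepts {"]
    for src, dst in sorted(edges):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot([("AST", "Schema"), ("Markdown", "AST")]))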
Common Misconceptions
  • Misconception: Any graph is fine; cycles don’t matter.
  • Correction: Cycles break linear learning paths and must be handled.
Check-Your-Understanding
  1. Why does a cycle make scheduling impossible?
  2. What would a self-loop imply in this model?
Where You’ll Use This
  • Project 2 (Concept Graph Builder)

3.2 Topological Sort and Cycle Detection

Definitions & Key Terms
  • Topological Sort: An ordering where prerequisites come first.
  • Cycle Detection: Finding loops that block ordering.
Mental Model
A -> B -> C
^         |
└---------┘  (the edge C -> A closes the cycle)
How It Works
  1. Compute in-degrees for each node.
  2. Repeatedly remove nodes with zero in-degree.
  3. If nodes remain, you have a cycle.
Minimal Example (Pseudo-Code)
order = []
queue = [n for n in nodes if indeg[n] == 0]
while queue:
    n = queue.pop()
    order.append(n)
    for m in adj[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)
# any node left with indeg > 0 is part of a cycle
Common Misconceptions
  • Misconception: DFS alone gives a valid ordering.
  • Correction: Reverse DFS post-order is a valid topological order only on a DAG; you must detect back edges (cycles) explicitly.
Check-Your-Understanding
  1. What does it mean if the queue becomes empty early?
  2. How would you report a cycle path to the user?
Where You’ll Use This
  • Project 2 (Concept Graph Builder)

Chapter 4: Template-Driven Expansion

4.1 Templates as Deterministic Blueprints

Definitions & Key Terms
  • Template: A structured blueprint with placeholders.
  • Partial: Reusable template fragments.
  • Slug: A filesystem-safe identifier derived from a title.
Mental Model
JSON Project -> [Template Engine] -> Markdown File
How It Works
  1. Load a template file (Markdown + placeholders).
  2. Render with project metadata.
  3. Write output with deterministic ordering.
Minimal Example (Jinja-style placeholders; the names are illustrative)
# Project {{ number }}: {{ title }}

> {{ summary }}
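A rendering sketch, assuming Jinja2; StrictUndefined makes a missing field a hard error instead of a silent blank:

from jinja2 import Template, StrictUndefined

tpl = Template(
    "# Project {{ number }}: {{ title }}\n\n> {{ summary }}\n",
    undefined=StrictUndefined,
)
print(tpl.render(number=1, title="Parser", summary="Markdown in, canonical JSON out"))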
Common Misconceptions
  • Misconception: Template rendering is just string replacement.
  • Correction: Proper templating handles conditionals, loops, and escaping.
Check-Your-Understanding
  1. How do you ensure template output is deterministic?
  2. What happens if a field is missing?
Where You’ll Use This
  • Project 3 (Project Expander CLI)

4.2 Stable IDs, Slugs, and File Layouts

Definitions & Key Terms
  • Stable ID: A persistent identifier that does not change between runs.
  • Slugify: Converting titles to lowercase, hyphenated file names.
Mental Model
"Project 3: CLI" -> "P03-project-expander-cli.md"
How It Works
  1. Normalize title: lowercase, remove punctuation.
  2. Prefix with stable project number.
  3. Enforce max length and uniqueness.
Minimal Example
"Project 12: API Validator" -> P12-api-validator.md
Common Misconceptions
  • Misconception: File names can be derived from titles directly.
  • Correction: Titles are not stable; IDs must be.
Check-Your-Understanding
  1. What happens if two projects have the same title?
  2. How do you preserve links when titles change?
Where You’ll Use This
  • Project 3 (Project Expander CLI)

Chapter 5: QA, Linting, and Provenance

5.1 Lint Rules and Rubric Scoring

Definitions & Key Terms
  • Lint Rule: A deterministic check against a structural rule.
  • Rubric: A weighted scoring system for quality evaluation.
Mental Model
Document -> [Lint Rules] -> Errors/Warnings
Document -> [Rubric] -> Score
How It Works
  1. Parse output files.
  2. Validate required sections exist.
  3. Score for depth, clarity, and completeness.
Minimal Example
{"rule": "has_definition_of_done", "status": "fail"}
Common Misconceptions
  • Misconception: Linting only checks syntax.
  • Correction: Linting can enforce semantic completeness.
Check-Your-Understanding
  1. What makes a good lint rule?
  2. How do you avoid false positives?
Where You’ll Use This
  • Project 4 (QA + Enrichment)

5.2 Provenance and Citations

Definitions & Key Terms
  • Provenance: A record of where facts and claims come from.
  • Citation Map: A structured link between claims and sources.
Mental Model
Claim -> [Source Link] -> Verified Reference
How It Works
  1. Extract URLs or references from content.
  2. Verify each reference resolves and matches a claim.
  3. Store provenance in a structured JSON map.
Minimal Example
{"claim": "GFM is a superset of CommonMark", "source": "https://github.github.io/gfm/"}
Common Misconceptions
  • Misconception: Citations are optional in technical content.
  • Correction: Provenance is how you scale trust and maintainability.
Check-Your-Understanding
  1. How do you store citations without polluting content?
  2. What should happen if a source disappears?
Where You’ll Use This
  • Project 4 (QA + Enrichment)

Glossary (High-Signal)

  • AST: Structured tree representation of parsed Markdown.
  • Canonicalization: Deterministic serialization of JSON data.
  • Concept Graph: Directed graph of concepts and prerequisites.
  • DAG: Directed graph without cycles, enabling linear ordering.
  • Determinism: Same inputs always produce byte-identical outputs.
  • Lint: Automated checks that enforce structural or semantic rules.
  • Slug: URL- and filesystem-safe identifier derived from a title.

Why Project Expansion Automation Matters

The Modern Problem It Solves

Engineers spend significant time searching for answers and context instead of building. Technical documentation remains a primary learning resource for developers, but most content is inconsistent, incomplete, or unstructured. A project expansion pipeline turns scattered ideas into reliable, auditable learning assets, reducing knowledge friction and making technical learning reproducible at scale.

Real-world impact (recent studies):

  • The 2023 Stack Overflow Developer Survey reports that 63% of respondents spend more than 30 minutes per day searching for answers or solutions at work.
  • Stack Overflow’s 2025 developer survey press release reports that technical documentation remains a top learning resource for developers.

The Shift: From Manual Docs to Deterministic Pipelines

Manual Expansion                          Automated Expansion
┌────────────────────────────┐            ┌────────────────────────────┐
│ One-off writing            │            │ Structured input + schema  │
│ Inconsistent sections      │            │ Deterministic output       │
│ Hard to update             │            │ Regeneratable files        │
│ Unclear dependencies       │            │ Explicit concept graph     │
└────────────────────────────┘            └────────────────────────────┘

Context & Evolution (Brief)

Markdown became the lingua franca for developer documentation, but without a schema, it created a new problem: human-friendly inputs with machine-hostile ambiguity. The modern solution is to keep Markdown for authors while extracting a schema-validated intermediate representation for tooling.


Concept Summary Table

  • Markdown Parsing: AST-based parsing is the only reliable way to extract structured metadata.
  • Schema Validation: JSON Schema formalizes assumptions and makes errors actionable.
  • Canonicalization: Deterministic output requires stable ordering and serialization rules.
  • Concept Graphs: Dependency edges reveal prerequisites and learning order.
  • Template Expansion: Templates turn structured data into repeatable content output.
  • QA + Provenance: Linting and citations keep content trustworthy at scale.

Project-to-Concept Map

  • Project 1 (Markdown Project Parser & Schema Normalizer): Parser + JSON normalization. Uses Ch. 1, Ch. 2.
  • Project 2 (Concept Map & Dependency Graph Builder): Concept DAG + DOT output. Uses Ch. 3.
  • Project 3 (Template-Driven Project Expander CLI): Template engine + deterministic files. Uses Ch. 4.
  • Project 4 (QA + Research Enrichment Pipeline): Linting + rubric + provenance. Uses Ch. 5.

Deep Dive Reading by Concept

Fundamentals & Parsing

  • Parsing and data extraction: “Clean Code” by Robert C. Martin — Ch. 3 (Functions), Ch. 7 (Error Handling). Reliable parsers require clean error handling and small, testable functions.
  • Data modeling and correctness: “Code Complete” by Steve McConnell — Ch. 10 (General Issues in Using Variables). Helps you design data structures that prevent invalid states.

Graphs & Algorithms

  • Directed graphs: “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 4 (Graphs). Concept graphs and cycle detection are core to dependency reasoning.
  • Dependency ordering: “Algorithms in C” by Sedgewick — Part 5 (Graphs). Implementation-level understanding of topological sort.

Architecture & QA

  • Pipeline architecture: “Fundamentals of Software Architecture” by Richards & Ford — Ch. 9 (Architecture Styles). Helps you reason about pipeline design and trade-offs.
  • Code quality systems: “Refactoring” by Martin Fowler — Ch. 2 (Principles in Refactoring). Lint and rubric tooling are about maintainability and safe change.

Quick Start: Your First 48 Hours

Day 1 (4 hours):

  1. Read Chapters 1 and 2 only.
  2. Install a CommonMark parser in your language of choice.
  3. Start Project 1 and extract just the project titles.
  4. Write a JSON Schema with only 3 required fields.

Day 2 (4 hours):

  1. Add metadata extraction for difficulty and language.
  2. Validate JSON output and print errors with line/column.
  3. Write a tiny canonicalizer that sorts keys.
  4. Run the golden output example from Project 1.

End of Weekend: You can parse a project list into schema-validated, deterministic JSON. That is the backbone of everything else.


Path 1: The Documentation Tooling Builder

Best for: People building documentation tooling or content pipelines.

  1. Project 1 → Project 3 → Project 4
  2. Optional: Project 2 if you need dependency graphs

Path 2: The Algorithm-Focused Learner

Best for: Learners interested in graph algorithms and dependency reasoning.

  1. Project 2 → Project 1 → Project 3 → Project 4

Path 3: The Systems Builder

Best for: Learners who want the full pipeline end-to-end.

  1. Project 1 → Project 2 → Project 3 → Project 4 (full sequence)

Success Metrics

By the end, you should be able to:

  • Parse a Markdown project list into canonical JSON with deterministic output
  • Detect and report schema errors with exact line/column references
  • Generate a dependency graph and detect cycles
  • Expand projects into deterministic Markdown files with stable IDs
  • Enforce quality rules with linting and citation checks

Optional Appendices

Appendix A: Markdown + JSON Schema Cheatsheet

Markdown boundaries to treat as project markers:

  • ## Project N: Title
  • #### Project N: Title

JSON Schema essentials:

  • type, required, enum, pattern, additionalProperties

Appendix B: Determinism Checklist

  • Key ordering is stable
  • Timestamps are fixed or removed
  • Output sorting is deterministic
  • Slugs use a stable algorithm
  • Canonical JSON serialization is used

Appendix C: CLI Exit Code Conventions

  • 0: Success
  • 1: Invalid input
  • 2: Schema validation failure
  • 3: Internal error
  • 4: Partial output or non-fatal warnings

Appendix D: Primary Specs & References

  • CommonMark Spec: https://spec.commonmark.org/
  • GitHub Flavored Markdown (GFM) Spec: https://github.github.io/gfm/
  • JSON Schema Specification (2020-12): https://json-schema.org/specification
  • JSON Canonicalization Scheme (RFC 8785): https://www.rfc-editor.org/rfc/rfc8785
  • Graphviz DOT Language: https://graphviz.org/doc/info/lang.html
  • Stack Overflow Developer Survey 2023: https://survey.stackoverflow.co/2023
  • Stack Overflow Developer Survey 2025 (learning resources): https://survey.stackoverflow.co/2025/developers/

Projects

Project 1: Markdown Project Parser & Schema Normalizer

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4 — Infrastructure builder
  • Business Potential: 3 — Internal tooling and content operations
  • Difficulty: Level 3 — Intermediate
  • Knowledge Area: Documentation Engineering
  • Software or Tool: CommonMark parser + JSON Schema validator
  • Main Book: “Clean Code” by Robert C. Martin

What you’ll build: A CLI that parses a Markdown project list into deterministic, schema-validated JSON with actionable errors and line/column diagnostics.

Why it teaches this topic: It forces you to separate syntax from structure, validate assumptions with schemas, and produce repeatable outputs—core to any scalable content pipeline.

Core challenges you’ll face:

  • Reliable boundary detection → avoiding false positives in headings
  • Schema design → balancing flexibility and strictness
  • Deterministic output → canonicalization and ordering

Real World Outcome

What you will see:

  1. A build/projects.json file containing a normalized list of projects
  2. A build/errors.json file with line/column errors (if any)
  3. A deterministic output that is byte-identical across runs

Command Line Outcome Example:

# 1. Parse and normalize
$ expander parse projects.md --out build/
[ok] Parsed 12 project blocks
[ok] Schema validation: 12/12 valid
[ok] Wrote build/projects.json (deterministic)

# 2. Validate only
$ expander validate projects.md
[ok] Schema validation passed (0 errors)

# 3. Trigger a failure
$ expander validate projects_missing.md
[error] Project 4 missing required field: "difficulty"
        at line 214, column 5
exit code: 2

The Core Question You’re Answering

“How do I turn ambiguous Markdown into a reliable, machine-validated data model?”

This question forces you to confront the gap between human writing and machine expectations. Solving it teaches you to build trustable pipelines rather than one-off scripts.


Concepts You Must Understand First

  1. CommonMark Parsing
    • Why is CommonMark a two-phase parser (block then inline)?
    • What goes wrong if you parse inline formatting too early?
    • Book Reference: “Clean Code” Ch. 3 — Robert C. Martin
  2. JSON Schema Validation
    • What does required actually enforce?
    • How do you express enums and patterns?
    • Book Reference: “Code Complete” Ch. 10 — Steve McConnell
  3. Deterministic Serialization
    • Why is key ordering important for diff-based workflows?
    • How does canonicalization reduce noise?
    • Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin

Questions to Guide Your Design

  1. Parsing Strategy
    • How will you detect a new project boundary?
    • How will you handle malformed or nested headings?
  2. Schema Design
    • Which fields are required vs optional?
    • How do you enforce enum values for difficulty?
  3. Diagnostics
    • How will you map schema errors back to source lines?
    • What should exit codes represent?

Thinking Exercise

Trace a broken project entry

Imagine this input:

## Project 7: Foo
- **Language**: Python
- **Difficulty**: 

Questions while tracing:

  • Where does the missing difficulty value become a schema error?
  • What line/column should the error report?
  • Should this be a warning or a hard failure?

The Interview Questions They’ll Ask

  1. “Why is Markdown parsing hard to do with regex?”
  2. “What is JSON Schema and why is it useful?”
  3. “How would you make JSON output deterministic?”
  4. “How do you surface validation errors to users?”
  5. “Why should CLI tools return explicit exit codes?”

Hints in Layers

Hint 1: Start with AST extraction. Use a CommonMark parser and walk heading nodes first.

Hint 2: Design the schema early. Write the JSON Schema before you parse; it defines your extraction targets.

Hint 3: Canonicalize output. Sort object keys and use a stable serializer.

json.dumps(data, sort_keys=True, separators=(",", ":"))

Hint 4: Error positions. Store line/column metadata alongside extracted tokens.


Books That Will Help

  • Error handling: “Clean Code” by Robert C. Martin, Ch. 7
  • Data modeling: “Code Complete” by Steve McConnell, Ch. 10
  • CLI design: “The Pragmatic Programmer” by Thomas & Hunt, Ch. 5

Common Pitfalls & Debugging

Problem 1: “Projects split incorrectly”

  • Why: Headings are parsed as plain text instead of AST nodes.
  • Fix: Use a CommonMark parser and detect heading nodes explicitly.
  • Quick test: Count the number of parsed headings vs expected projects.

Problem 2: “Schema validation passes but output is wrong”

  • Why: Schema is too permissive.
  • Fix: Add enum, minLength, and pattern constraints.
  • Quick test: Inject an invalid value and ensure schema rejects it.

Problem 3: “Diffs change every run”

  • Why: Output ordering is not deterministic.
  • Fix: Sort keys and normalize arrays.
  • Quick test: Run the pipeline twice and compare file hashes.
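A sketch of that quick test, reusing the expander command and paths from the examples above (adapt them to your own CLI):

import hashlib
import subprocess

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

subprocess.run(["expander", "parse", "projects.md", "--out", "build/"], check=True)
first = sha256_of("build/projects.json")
subprocess.run(["expander", "parse", "projects.md", "--out", "build/"], check=True)
assert sha256_of("build/projects.json") == first, "output is not deterministic"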

Definition of Done

  • Parser extracts all project blocks correctly
  • JSON output validates against schema
  • Output is deterministic across runs
  • Errors include line/column references
  • CLI exits with correct exit codes

Project 2: Concept Map & Dependency Graph Builder

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4 — Graph wizardry
  • Business Potential: 2 — Internal analytics and planning
  • Difficulty: Level 3 — Intermediate
  • Knowledge Area: Graph Algorithms
  • Software or Tool: Graphviz DOT + cycle detection
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: A tool that reads normalized project JSON, extracts concept dependencies, and outputs a DAG with cycle detection and DOT/SVG visualizations.

Why it teaches this topic: It forces you to model learning dependencies explicitly, then verify they are acyclic and navigable.

Core challenges you’ll face:

  • Graph modeling → designing node IDs and edge semantics
  • Cycle detection → detecting and reporting cycles clearly
  • Visualization → producing DOT output for graph rendering

Real World Outcome

What you will see:

  1. build/concepts.json with nodes and edges
  2. build/concepts.dot for Graphviz rendering
  3. build/concepts.svg visualizing dependencies

Command Line Outcome Example:

# 1. Build concept graph
$ expander concepts build/projects.json --out build/
[ok] Loaded 12 projects
[ok] Extracted 34 concepts
[ok] Graph is acyclic
[ok] Wrote build/concepts.dot
[ok] Rendered build/concepts.svg

# 2. Detect a cycle
$ expander concepts build/projects_with_cycle.json --out build/
[error] Cycle detected: "Schema" -> "Parser" -> "Schema"
exit code: 2

The Core Question You’re Answering

“How do I make hidden learning dependencies explicit and verifiable?”

Dependency graphs turn assumptions into visual and testable structure.


Concepts You Must Understand First

  1. Directed Graphs
    • What is a node vs an edge?
    • How do you represent prerequisites?
    • Book Reference: “Algorithms, Fourth Edition” Ch. 4 — Sedgewick & Wayne
  2. Topological Sort
    • Why does a topological ordering only exist for DAGs?
    • How do you compute one efficiently?
    • Book Reference: “Algorithms in C” Part 5 — Sedgewick
  3. DOT Language
    • What does digraph mean in DOT?
    • How do you label nodes and edges?
    • Book Reference: “Clean Code” Ch. 2 — Robert C. Martin (for naming clarity)

Questions to Guide Your Design

  1. Graph Model
    • Will concepts be unique globally or per project?
    • How will you handle synonyms?
  2. Cycle Reporting
    • How will you output the actual cycle path?
    • Should cycles be errors or warnings?
  3. Visualization
    • What layout makes dependencies readable?
    • How will you avoid node clutter?

Thinking Exercise

Sketch a concept graph with 5 nodes and one cycle. How would your tool report it?


The Interview Questions They’ll Ask

  1. “What is a DAG and why is it useful?”
  2. “How does topological sorting work?”
  3. “What is a cycle and how do you detect it?”
  4. “Why is visualization important for dependency graphs?”
  5. “How would you scale graph generation for 1,000 concepts?”

Hints in Layers

Hint 1: Start with adjacency lists. Build a dict[str, list[str]] of edges.

Hint 2: Use Kahn’s algorithm. It gives both ordering and cycle detection (see the sketch after these hints).

Hint 3: Output DOT. Use digraph { "A" -> "B"; } as your minimal format.

Hint 4: Cycle detail. Track parent pointers to reconstruct the cycle path.
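A minimal Kahn’s-algorithm sketch that returns both a topological ordering and any leftover (cyclic) nodes; the edge-list input format is an assumption:

from collections import deque

def topo_sort(edges):
    # edges: list of (prerequisite, dependent) pairs
    nodes = {n for edge in edges for n in edge}
    adj = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))  # sorted for determinism
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(adj[n]):
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    cyclic = sorted(n for n in nodes if indeg[n] > 0)
    return order, cyclic

print(topo_sort([("Markdown", "AST"), ("AST", "Schema"), ("Schema", "Parser"), ("Parser", "Schema")]))

Any nodes returned in the second list are part of (or downstream of) a cycle and can be reported to the user.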


Books That Will Help

  • Graphs: “Algorithms, Fourth Edition”, Ch. 4
  • Dependency ordering: “Algorithms in C”, Part 5
  • Naming & clarity: “Clean Code”, Ch. 2

Common Pitfalls & Debugging

Problem 1: “Cycle detection never triggers”

  • Why: In-degree counts are not updated correctly.
  • Fix: Decrement in-degree for each outgoing edge.
  • Quick test: Create a 2-node cycle and confirm detection.

Problem 2: “DOT output renders blank”

  • Why: Missing digraph header or semicolons.
  • Fix: Validate DOT syntax with dot -Tsvg.
  • Quick test: Run dot -Tsvg build/concepts.dot -o /tmp/test.svg.

Definition of Done

  • Graph nodes and edges generated correctly
  • Cycle detection works with clear error output
  • DOT output renders to SVG
  • CLI reports meaningful exit codes

Project 3: Template-Driven Project Expander CLI

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5 — Content pipeline architect
  • Business Potential: 4 — Productizable documentation tooling
  • Difficulty: Level 4 — Advanced
  • Knowledge Area: Code Generation
  • Software or Tool: Template engine + filesystem writer
  • Main Book: “Clean Architecture” by Robert C. Martin

What you’ll build: A CLI that reads normalized project JSON and expands each project into its own Markdown file using templates, producing deterministic output and a README index.

Why it teaches this topic: It teaches deterministic generation, template modeling, and reproducible outputs at scale.

Core challenges you’ll face:

  • Template design → preventing missing sections
  • File layout → deterministic naming and ordering
  • Regeneration → idempotent output

Real World Outcome

What you will see:

  1. expander.codex/ directory with per-project files
  2. expander.codex/README.md index
  3. Stable filenames like P03-project-expander-cli.md

Command Line Outcome Example:

# 1. Expand projects into files
$ expander expand build/projects.json --out expander.codex
[ok] Expanded 12 projects
[ok] Wrote expander.codex/README.md
[ok] Deterministic output confirmed

# 2. Re-run to verify determinism
$ expander expand build/projects.json --out expander.codex
[ok] No changes detected (hash match)

The Core Question You’re Answering

“How do I turn structured data into deterministic, publishable content?”

This is the core of content automation: stable, repeatable generation.


Concepts You Must Understand First

  1. Template Engines
    • How do you safely interpolate data?
    • How do you prevent missing sections?
    • Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin
  2. Deterministic Output
    • What changes between runs?
    • How do you guarantee ordering?
    • Book Reference: “Code Complete” Ch. 12 — Steve McConnell
  3. Slugging and IDs
    • Why are titles not stable identifiers?
    • How do you enforce uniqueness?
    • Book Reference: “Refactoring” Ch. 2 — Martin Fowler

Questions to Guide Your Design

  1. Template Layout
    • What sections are mandatory?
    • How will optional fields render?
  2. File Naming
    • How do you prevent collisions?
    • How do you handle renames?
  3. Regeneration
    • Should old files be deleted?
    • How do you detect changes?

Thinking Exercise

Design a template that ensures every project includes a “Definition of Done” section, even if the input data is missing. How would you implement a default?


The Interview Questions They’ll Ask

  1. “What is the difference between code generation and templating?”
  2. “How do you make template output deterministic?”
  3. “How do you handle missing data fields gracefully?”
  4. “What is idempotence in file generation?”
  5. “How do you test generated files?”

Hints in Layers

Hint 1: Start with a single template. Render one project file before building the loop.

Hint 2: Sort projects by ID. Stable ordering prevents diff noise.

Hint 3: Use hash comparisons. Compute file hashes to detect changes (a sketch of idempotent writes follows these hints).

Hint 4: Keep a README index. Generate a summary file to verify all outputs.
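A sketch of idempotent file output, which is what lets the second run report no changes:

from pathlib import Path

def write_if_changed(path: Path, content: str) -> bool:
    # Only touch the file when the rendered content actually differs.
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True

Counting how many calls return True gives you the "No changes detected" summary for free.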


Books That Will Help

  • Architecture: “Clean Architecture”, Ch. 7
  • Maintainability: “Refactoring”, Ch. 2
  • Quality design: “Code Complete”, Ch. 12

Common Pitfalls & Debugging

Problem 1: “Files change on every run”

  • Why: Non-deterministic ordering or timestamps.
  • Fix: Sort inputs and remove timestamps.
  • Quick test: Run twice and compare checksums.

Problem 2: “Templates break on missing fields”

  • Why: No defaults or missing conditional logic.
  • Fix: Define defaults and fallbacks.
  • Quick test: Remove optional data and render.

Definition of Done

  • Per-project files generated with correct names
  • README index generated and sorted
  • Re-running produces no diff
  • Missing fields handled gracefully

Project 4: QA + Research Enrichment Pipeline

  • Main Programming Language: Python 3.11
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5 — Quality enforcer
  • Business Potential: 4 — Content QA automation
  • Difficulty: Level 4 — Advanced
  • Knowledge Area: Quality Engineering
  • Software or Tool: Lint engine + rubric scorer
  • Main Book: “Refactoring” by Martin Fowler

What you’ll build: A pipeline that lints project files for completeness, scores them against a rubric, and verifies citation provenance.

Why it teaches this topic: It forces you to encode quality expectations into machine-checkable rules.

Core challenges you’ll face:

  • Rule design → avoiding false positives
  • Rubric scoring → balancing depth vs breadth
  • Provenance checks → ensuring sources are valid

Real World Outcome

What you will see:

  1. build/lint.json with structured lint results
  2. build/rubric.json with quality scores
  3. build/sources.json mapping claims to sources

Command Line Outcome Example:

# 1. Run lint checks
$ expander lint expander.codex
[error] P02 missing "Definition of Done" section
[error] P03 missing "Common Pitfalls" section
exit code: 1

# 2. Score quality
$ expander rubric expander.codex
[ok] Overall score: 86/100
[warn] Project 1 lacks explicit CLI failure example

# 3. Verify sources
$ expander sources expander.codex
[ok] 14 sources verified
[warn] 2 sources missing or unreachable

The Core Question You’re Answering

“How do I make content quality measurable and enforceable?”

This project turns subjective quality into objective checks.


Concepts You Must Understand First

  1. Lint Rule Design
    • What is a deterministic rule?
    • How do you avoid false positives?
    • Book Reference: “Refactoring” Ch. 2 — Martin Fowler
  2. Rubric Scoring
    • How do you weight sections?
    • What is an acceptable score threshold?
    • Book Reference: “Clean Code” Ch. 4 — Robert C. Martin
  3. Provenance Tracking
    • How do you store citations?
    • How do you handle dead links?
    • Book Reference: “Code Complete” Ch. 20 — Steve McConnell

Questions to Guide Your Design

  1. Lint Rules
    • Which sections are mandatory?
    • Which failures should be warnings vs errors?
  2. Rubric
    • How do you score depth of explanation?
    • How do you prevent gaming?
  3. Sources
    • Where do you store source metadata?
    • How do you validate source availability?

Thinking Exercise

Design a rubric with 5 weighted criteria. How would you score a project missing “Real World Outcome”?


The Interview Questions They’ll Ask

  1. “How do you define content quality in measurable terms?”
  2. “What makes a lint rule effective?”
  3. “How do you handle false positives in linting?”
  4. “Why is provenance important in technical documentation?”
  5. “How would you scale QA to 1,000 projects?”

Hints in Layers

Hint 1: Start with structural checks. Verify sections exist before scoring depth.

Hint 2: Use a ruleset file. Store lint rules in JSON for easy updates.

Hint 3: Score by section length + keywords. Combine structural checks with heuristic scoring (see the sketch after these hints).

Hint 4: Build a sources map. Extract links and store them in sources.json.
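A weighted-rubric sketch; the section names, weights, and minimum length are all illustrative:

def rubric_score(section_lengths: dict, weights: dict) -> float:
    # A section earns its full weight once it exceeds a minimum character count.
    min_chars = 200
    earned = sum(w for name, w in weights.items() if section_lengths.get(name, 0) >= min_chars)
    return round(100 * earned / sum(weights.values()), 1)

weights = {
    "Real World Outcome": 0.3,
    "Definition of Done": 0.2,
    "Hints in Layers": 0.2,
    "Common Pitfalls": 0.2,
    "Interview Questions": 0.1,
}
print(rubric_score({"Real World Outcome": 950, "Definition of Done": 400}, weights))  # 50.0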


Books That Will Help

  • Quality mindset: “Refactoring”, Ch. 2
  • Documentation clarity: “Clean Code”, Ch. 4
  • Large-scale projects: “Code Complete”, Ch. 20

Common Pitfalls & Debugging

Problem 1: “Lint rule flags everything”

  • Why: Rule is too strict or mismatched to input.
  • Fix: Add thresholds and allow warnings.
  • Quick test: Run on a known good file and confirm zero errors.

Problem 2: “Rubric scores are meaningless”

  • Why: Weights are arbitrary.
  • Fix: Calibrate scores against real examples.
  • Quick test: Score three known files and compare.

Definition of Done

  • Lint engine flags missing sections correctly
  • Rubric produces stable scores
  • Provenance map is generated
  • CLI returns exit codes for failures