Project Expansion Automation: Real World Projects
Goal: Build a deep mental model for turning a loose project list into a deterministic, validated, searchable, and extensible collection of high-quality project mini-books. You will understand how Markdown parsing works at the AST level, how JSON Schema creates a contract between content and tooling, how concept graphs expose prerequisite structure, and how deterministic generation makes outputs reproducible across machines and time. By the end, you’ll be able to build a full CLI pipeline that ingests a messy Markdown file, normalizes it into canonical JSON, generates a directory of expanded project files, and enforces quality with linting and provenance checks. This guide is both a systems design exercise and a documentation-engineering toolkit you can reuse for any knowledge product.
Introduction: What This Guide Covers
Project Expansion Automation is the practice of converting a high-level project list into a structured, deeply detailed set of project files using parsers, schemas, dependency graphs, and deterministic generators. It solves the real-world problem of inconsistent content quality and unscalable manual writing by creating a repeatable pipeline that produces project guides you can trust and build on.
What you will build (by the end of this guide):
- A Markdown parser that extracts project metadata into schema-validated JSON
- A concept dependency graph that reveals prerequisite structure and cycles
- A template-driven CLI that generates deterministic project files + index
- A QA and research enrichment pipeline that lints, scores, and audits outputs
Scope (what’s included):
- Markdown parsing, AST extraction, and metadata normalization
- JSON Schema validation and canonical output
- Graph modeling of concepts and dependencies
- Template expansion with deterministic file output
- Quality assurance, lint rules, and citation provenance
Out of scope (for this guide):
- Fully automated natural language generation of project content
- Large-scale distributed processing or cloud orchestration
- Proprietary content management systems
The Big Picture (Mental Model)
┌───────────────────────────────┐
Input Markdown → │ Parser + AST + Extractor │
└───────────────┬───────────────┘
v
┌───────────────────────────────┐
│ Schema Validation + Canonical │
│ JSON Normalization │
└───────────────┬───────────────┘
v
┌───────────────────────────────┐
│ Concept Graph + Dependencies │
└───────────────┬───────────────┘
v
┌───────────────────────────────┐
│ Template Expansion Engine │
└───────────────┬───────────────┘
v
┌───────────────────────────────┐
│ QA + Lint + Provenance Audit │
└───────────────┬───────────────┘
v
┌───────────────────────────────┐
│ Expanded Project Files + Index│
└───────────────────────────────┘
Key Terms You’ll See Everywhere
- AST (Abstract Syntax Tree): A tree representation of Markdown structure (headings, lists, code blocks).
- Schema Validation: Automated rules that ensure JSON data has expected shape and required fields.
- Canonicalization: A deterministic serialization of JSON so outputs are identical across runs.
- Concept Graph: A directed graph where nodes are concepts and edges represent prerequisites.
- Deterministic Generation: Running the pipeline multiple times yields byte-identical output.
How to Use This Guide
- Read the Theory Primer first: Treat it like a mini-book. It will save you hours during implementation.
- Build the projects in order: Each project builds a critical subsystem for the next.
- Validate as you go: Every project includes a “Definition of Done” checklist and golden CLI output.
- Keep outputs deterministic: Always run with fixed inputs and deterministic ordering.
- Track your assumptions: Write down parsing assumptions and schema constraints explicitly.
Prerequisites & Background Knowledge
Before starting these projects, you should have foundational understanding in these areas:
Essential Prerequisites (Must Have)
Programming Skills:
- Comfort reading and writing CLI tools (Python, Go, or Rust)
- Experience manipulating JSON data
- Familiarity with file system operations and directories
Text Processing Fundamentals:
- Tokenization vs parsing
- Regular expressions (as a last resort, not a primary strategy)
Data Modeling Basics:
- JSON data structures
- Schema vs instance data
- Error reporting and validation concepts
Recommended Reading: “Clean Code” by Robert C. Martin — Ch. 7 (Error Handling), Ch. 3 (Functions)
Helpful But Not Required
Graph Algorithms:
- Topological sort
- Cycle detection in directed graphs
- Can learn during: Project 2
Templating Systems:
- Template rendering vs code generation
- Can learn during: Project 3
Quality Engineering:
- Lint rules and rubrics
- Can learn during: Project 4
Self-Assessment Questions
Before starting, ask yourself:
- ✅ Can I parse a file and extract structured data from it without regex hacks?
- ✅ Do I understand the difference between schema validation and type checking?
- ✅ Can I explain what makes output deterministic and why it matters?
- ✅ Have I built a CLI that returns non-zero exit codes on failure?
- ✅ Can I reason about graph cycles and dependency ordering?
If you answered “no” to questions 1-3: Spend a weekend on a small CLI parsing exercise before starting. If you answered “yes” to all 5: You’re ready.
Development Environment Setup
Required Tools:
- A Unix-like environment (macOS or Linux)
- Python 3.11+ or Go 1.21+ or Rust 1.75+
- A JSON Schema validator library
- A Markdown parser library with AST support
Recommended Tools:
- `jq` for JSON inspection
- `graphviz` (`dot`) for graph rendering
- `ripgrep` for source scanning
Testing Your Setup:
$ python3 --version
Python 3.11.5
$ which jq
/usr/bin/jq
$ dot -V
dot - graphviz version 9.0.0
Time Investment
- Project 1: 8-12 hours
- Project 2: 6-10 hours
- Project 3: 10-15 hours
- Project 4: 8-12 hours
- Total sprint: 4-6 weeks (part-time)
Important Reality Check
You are building infrastructure for content, which means:
- Your output quality equals your schema quality.
- Debugging is about edge cases, not the happy path.
- Determinism is a feature, not an optimization.
- The generator is only as good as the parser.
Big Picture / Mental Model (Diagram-First)
Human-written Markdown (messy, inconsistent)
|
v
┌──────────────────────────────────────────────────────┐
│ 1. Parse into AST (structural truth) │
│ 2. Extract project blocks (boundaries) │
│ 3. Normalize metadata into JSON │
│ 4. Validate JSON against schema │
│ 5. Build concept dependency graph │
│ 6. Expand templates into full project files │
│ 7. Lint + rubric + citation checks │
└──────────────────────────────────────────────────────┘
|
v
Deterministic project library
Theory Primer (Read This Before Coding)
Chapter 1: Markdown Parsing and Structural Extraction
1.1 CommonMark, GFM, and Why ASTs Matter
Definitions & Key Terms
- CommonMark: A formal Markdown spec that defines block/inline parsing rules.
- GFM: GitHub Flavored Markdown, a strict superset of CommonMark.
- AST: A tree where each node is a Markdown construct (heading, list, code block).
Mental Model
Raw Markdown
|
v
[Tokenizer] -> [Block Parser] -> [Inline Parser]
| |
| v
| Document AST
v
Tokens
How It Works (Step-by-Step)
- Block parsing identifies structural elements like headings, lists, and code blocks.
- Inline parsing runs later to resolve links, emphasis, and code spans.
- The AST preserves structure and order, letting you define reliable boundaries.
- Headings like `## Project N:` become boundary markers for extraction.
Minimal Example
## Project 1: Parser
- **Main Programming Language**: Python
- **Difficulty**: Level 3
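A minimal sketch of inspecting that structure with an AST-based parser. This assumes the markdown-it-py library; any CommonMark parser that exposes tokens or an AST works the same way.

from markdown_it import MarkdownIt

source = "## Project 1: Parser\n- **Main Programming Language**: Python\n- **Difficulty**: Level 3\n"
tokens = MarkdownIt().parse(source)

for tok in tokens:
    # heading_open tokens carry the tag (h2) and a 0-based source line range in tok.map
    if tok.type == "heading_open" and tok.tag == "h2":
        print("heading starts at line", tok.map[0])
    # the inline token that follows a heading_open holds the heading text
    if tok.type == "inline":
        print("text:", tok.content)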
Common Misconceptions
- Misconception: Regex is enough for Markdown parsing.
- Correction: Regex fails on nested lists, code fences, and inline formatting.
Check-Your-Understanding
- Why must block parsing occur before inline parsing?
- What goes wrong if you treat Markdown headings as plain strings?
- How do code fences change list parsing behavior?
Where You’ll Use This
- Project 1 (Parser & Schema Normalizer)
1.2 Boundary Detection and Source Positions
Definitions & Key Terms
- Boundary: A reliable point where a new project starts (e.g., heading node).
- Source Position: Line/column info that helps you report errors precisely.
Mental Model
[Heading Node] -> [Boundary Marker]
[Paragraph] -> [Metadata Candidate]
[List] -> [Key/Value Pairs]
How It Works
- Walk the AST in order.
- When you see a `Heading` node with a "Project" prefix, start a new block.
- Collect nodes until the next boundary.
- Record line/column for every metadata token.
Minimal Example (Pseudo-Code)
for node in ast:
if is_project_heading(node):
start_new_block(node)
else:
append_to_current_block(node)
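A concrete version of that walk, again assuming markdown-it-py tokens; `split_into_blocks` and the block dictionary shape are illustrative, not part of any library.

import re
from markdown_it import MarkdownIt

PROJECT_RE = re.compile(r"^Project \d+:")

def split_into_blocks(source: str) -> list[dict]:
    tokens = MarkdownIt().parse(source)
    blocks, current = [], None
    for i, tok in enumerate(tokens):
        if tok.type == "heading_open" and tok.tag in ("h2", "h3", "h4"):
            title = tokens[i + 1].content  # the inline token after heading_open holds the text
            if PROJECT_RE.match(title):
                # tok.map is the 0-based [start, end) source line range of the heading
                current = {"title": title, "start_line": tok.map[0], "tokens": []}
                blocks.append(current)
                continue
        if current is not None:
            current["tokens"].append(tok)
    return blocks

Note the regex here only matches the already-extracted heading text, not the raw Markdown, which keeps it in the "last resort" role the prerequisites describe.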
Common Misconceptions
- Misconception: Boundaries always equal blank lines.
- Correction: Headings are the only stable boundary marker in Markdown.
Check-Your-Understanding
- Why are source positions crucial for schema errors?
- How would you handle a project heading inside a block quote?
Where You’ll Use This
- Project 1 (Parser & Schema Normalizer)
Chapter 2: Schema-Driven Normalization and Canonical JSON
2.1 JSON Schema as a Contract
Definitions & Key Terms
- JSON Schema: A formal grammar for validating JSON documents.
- Meta-schema: The schema that validates schemas.
- Validation keywords: Constraints like `type`, `required`, `enum`, `pattern`.
Mental Model
Raw JSON -> [Schema Validator] -> OK | Error List
How It Works
- Parse extracted metadata into JSON.
- Validate against a JSON Schema.
- Return structured errors with locations and suggestions.
Minimal Example
{
"title": "Project 1: Parser",
"difficulty": "Level 3",
"language": "Python"
}
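A minimal validation sketch, assuming the `jsonschema` Python package; the schema shown is an illustrative fragment, not the full project schema.

from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "required": ["title", "difficulty", "language"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "difficulty": {"type": "string", "pattern": "^Level [1-5]$"},
        "language": {"type": "string"},
    },
    "additionalProperties": False,
}

instance = {"title": "Project 1: Parser", "difficulty": "Level 3", "language": "Python"}

# iter_errors collects every violation instead of stopping at the first one
for err in Draft202012Validator(schema).iter_errors(instance):
    print(list(err.path), err.message)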
Common Misconceptions
- Misconception: Schema is optional if the parser works.
- Correction: Schema turns implicit assumptions into enforceable rules.
Check-Your-Understanding
- Why should schemas be versioned?
- What happens if required fields are missing?
Where You’ll Use This
- Project 1 (Parser & Schema Normalizer)
- Project 4 (QA + Enrichment)
2.2 Canonicalization and Determinism
Definitions & Key Terms
- Canonical JSON: A deterministic serialization with stable key ordering.
- JCS: JSON Canonicalization Scheme (RFC 8785).
- Determinism: Same inputs produce byte-identical outputs.
Mental Model
Unordered JSON -> [Canonicalizer] -> Stable JSON
How It Works
- Sort object keys deterministically.
- Normalize numbers and string encodings.
- Serialize in a consistent format.
Minimal Example
{"b":2,"a":1} -> {"a":1,"b":2}
Common Misconceptions
- Misconception: Pretty-printing is enough for determinism.
- Correction: Pretty-printing doesn’t fix key ordering or numeric form.
Check-Your-Understanding
- Why does canonicalization matter for caching and diffs?
- How do floats break deterministic output?
Where You’ll Use This
- Project 1 (Parser & Schema Normalizer)
- Project 3 (Template Expander CLI)
Chapter 3: Concept Graphs and Dependency Modeling
3.1 Building a Concept Graph
Definitions & Key Terms
- Concept Node: A topic like “JSON Schema” or “AST Parsing”.
- Edge: A directed relationship representing prerequisites.
- DAG: Directed Acyclic Graph.
Mental Model
Concept A -> Concept B -> Concept C
How It Works
- Extract concepts from project metadata.
- Build edges from prerequisite lists.
- Detect cycles and report them.
Minimal Example (DOT)
digraph {
"Markdown" -> "AST";
"AST" -> "Schema";
}
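A sketch of emitting that DOT file from an edge list; the node names and output path are illustrative.

from pathlib import Path

edges = [("Markdown", "AST"), ("AST", "Schema")]

lines = ["digraph {"]
for src, dst in sorted(edges):  # sorted edges keep the DOT output deterministic
    lines.append(f'  "{src}" -> "{dst}";')
lines.append("}")

out = Path("build/concepts.dot")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("\n".join(lines) + "\n", encoding="utf-8")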
Common Misconceptions
- Misconception: Any graph is fine; cycles don’t matter.
- Correction: Cycles break linear learning paths and must be handled.
Check-Your-Understanding
- Why does a cycle make scheduling impossible?
- What would a self-loop imply in this model?
Where You’ll Use This
- Project 2 (Concept Graph Builder)
3.2 Topological Sort and Cycle Detection
Definitions & Key Terms
- Topological Sort: An ordering where prerequisites come first.
- Cycle Detection: Finding loops that block ordering.
Mental Model
A -> B -> C
| ^
└---------┘ (cycle)
How It Works
- Compute in-degrees for each node.
- Repeatedly remove nodes with zero in-degree.
- If nodes remain, you have a cycle.
Minimal Example (Pseudo-Code)
order = []
queue = [n for n in nodes if indeg[n] == 0]
while queue:
    n = queue.pop()
    order.append(n)
    for m in adj[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            queue.append(m)
# if len(order) < len(nodes), the remaining nodes form at least one cycle
Common Misconceptions
- Misconception: DFS alone gives a valid ordering.
- Correction: DFS ordering only works if you explicitly handle back edges.
Check-Your-Understanding
- What does it mean if the queue becomes empty early?
- How would you report a cycle path to the user?
Where You’ll Use This
- Project 2 (Concept Graph Builder)
Chapter 4: Template-Driven Expansion
4.1 Templates as Deterministic Blueprints
Definitions & Key Terms
- Template: A structured blueprint with placeholders.
- Partial: Reusable template fragments.
- Slug: A filesystem-safe identifier derived from a title.
Mental Model
JSON Project -> [Template Engine] -> Markdown File
How It Works
- Load a template file (Markdown + placeholders).
- Render with project metadata.
- Write output with deterministic ordering.
Minimal Example
# Project {{ number }}: {{ title }}
> {{ summary }}
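A rendering sketch assuming the Jinja2 library; the field names mirror the placeholder names above and are illustrative.

from jinja2 import StrictUndefined, Template

TEMPLATE = "# Project {{ number }}: {{ title }}\n> {{ summary }}\n"

project = {"number": 1, "title": "Parser", "summary": "Parse Markdown into validated JSON."}

# StrictUndefined makes a missing field raise an error instead of rendering silently as blank
print(Template(TEMPLATE, undefined=StrictUndefined).render(**project))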
Common Misconceptions
- Misconception: Template rendering is just string replacement.
- Correction: Proper templating handles conditionals, loops, and escaping.
Check-Your-Understanding
- How do you ensure template output is deterministic?
- What happens if a field is missing?
Where You’ll Use This
- Project 3 (Project Expander CLI)
4.2 Stable IDs, Slugs, and File Layouts
Definitions & Key Terms
- Stable ID: A persistent identifier that does not change between runs.
- Slugify: Converting titles to lowercase, hyphenated file names.
Mental Model
"Project 3: CLI" -> "P03-project-expander-cli.md"
How It Works
- Normalize title: lowercase, remove punctuation.
- Prefix with stable project number.
- Enforce max length and uniqueness.
Minimal Example
"Project 12: API Validator" -> P12-api-validator.md
Common Misconceptions
- Misconception: File names can be derived from titles directly.
- Correction: Titles are not stable; IDs must be.
Check-Your-Understanding
- What happens if two projects have the same title?
- How do you preserve links when titles change?
Where You’ll Use This
- Project 3 (Project Expander CLI)
Chapter 5: QA, Linting, and Provenance
5.1 Lint Rules and Rubric Scoring
Definitions & Key Terms
- Lint Rule: A deterministic check against a structural rule.
- Rubric: A weighted scoring system for quality evaluation.
Mental Model
Document -> [Lint Rules] -> Errors/Warnings
Document -> [Rubric] -> Score
How It Works
- Parse output files.
- Validate required sections exist.
- Score for depth, clarity, and completeness.
Minimal Example
{"rule": "has_definition_of_done", "status": "fail"}
Common Misconceptions
- Misconception: Linting only checks syntax.
- Correction: Linting can enforce semantic completeness.
Check-Your-Understanding
- What makes a good lint rule?
- How do you avoid false positives?
Where You’ll Use This
- Project 4 (QA + Enrichment)
5.2 Provenance and Citations
Definitions & Key Terms
- Provenance: A record of where facts and claims come from.
- Citation Map: A structured link between claims and sources.
Mental Model
Claim -> [Source Link] -> Verified Reference
How It Works
- Extract URLs or references from content.
- Verify each reference resolves and matches a claim.
- Store provenance in a structured JSON map.
Minimal Example
{"claim": "GFM is a superset of CommonMark", "source": "https://github.github.io/gfm/"}
Common Misconceptions
- Misconception: Citations are optional in technical content.
- Correction: Provenance is how you scale trust and maintainability.
Check-Your-Understanding
- How do you store citations without polluting content?
- What should happen if a source disappears?
Where You’ll Use This
- Project 4 (QA + Enrichment)
Glossary (High-Signal)
- AST: Structured tree representation of parsed Markdown.
- Canonicalization: Deterministic serialization of JSON data.
- Concept Graph: Directed graph of concepts and prerequisites.
- DAG: Directed graph without cycles, enabling linear ordering.
- Determinism: Same inputs always produce byte-identical outputs.
- Lint: Automated checks that enforce structural or semantic rules.
- Slug: URL- and filesystem-safe identifier derived from a title.
Why Project Expansion Automation Matters
The Modern Problem It Solves
Engineers spend significant time searching for answers and context instead of building. Technical documentation remains a primary learning resource for developers, but most content is inconsistent, incomplete, or unstructured. A project expansion pipeline turns scattered ideas into reliable, auditable learning assets, reducing knowledge friction and making technical learning reproducible at scale.
Real-world impact (recent studies):
- The 2023 Stack Overflow Developer Survey reports that 63% of respondents spend more than 30 minutes per day searching for answers or solutions at work.
- Stack Overflow’s 2025 developer survey press release reports that technical documentation remains a top learning resource for developers.
The Shift: From Manual Docs to Deterministic Pipelines
Manual Expansion Automated Expansion
┌────────────────────────────┐ ┌────────────────────────────┐
│ One-off writing │ │ Structured input + schema │
│ Inconsistent sections │ │ Deterministic output │
│ Hard to update │ │ Regeneratable files │
│ Unclear dependencies │ │ Explicit concept graph │
└────────────────────────────┘ └────────────────────────────┘
Context & Evolution (Brief)
Markdown became the lingua franca for developer documentation, but without a schema, it created a new problem: human-friendly inputs with machine-hostile ambiguity. The modern solution is to keep Markdown for authors while extracting a schema-validated intermediate representation for tooling.
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Markdown Parsing | AST-based parsing is the only reliable way to extract structured metadata. |
| Schema Validation | JSON Schema formalizes assumptions and makes errors actionable. |
| Canonicalization | Deterministic output requires stable ordering and serialization rules. |
| Concept Graphs | Dependency edges reveal prerequisites and learning order. |
| Template Expansion | Templates turn structured data into repeatable content output. |
| QA + Provenance | Linting and citations keep content trustworthy at scale. |
Project-to-Concept Map
| Project | What It Builds | Primer Chapters It Uses |
|---|---|---|
| Project 1: Markdown Project Parser & Schema Normalizer | Parser + JSON normalization | Ch. 1, Ch. 2 |
| Project 2: Concept Map & Dependency Graph Builder | Concept DAG + DOT output | Ch. 3 |
| Project 3: Template-Driven Project Expander CLI | Template engine + deterministic files | Ch. 4 |
| Project 4: QA + Research Enrichment Pipeline | Linting + rubric + provenance | Ch. 5 |
Deep Dive Reading by Concept
Fundamentals & Parsing
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Parsing and data extraction | “Clean Code” by Robert C. Martin — Ch. 3 (Functions), Ch. 7 (Error Handling) | Reliable parsers require clean error handling and small, testable functions. |
| Data modeling and correctness | “Code Complete” by Steve McConnell — Ch. 10 (General Issues in Using Variables) | Helps you design data structures that prevent invalid states. |
Graphs & Algorithms
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Directed graphs | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 4 (Graphs) | Concept graphs and cycle detection are core to dependency reasoning. |
| Dependency ordering | “Algorithms in C” by Sedgewick — Part 5 (Graphs) | Implementation-level understanding of topological sort. |
Architecture & QA
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Pipeline architecture | “Fundamentals of Software Architecture” by Richards & Ford — Ch. 9 (Architecture Styles) | Helps you reason about pipeline design and trade-offs. |
| Code quality systems | “Refactoring” by Martin Fowler — Ch. 2 (Principles in Refactoring) | Lint and rubric tooling are about maintainability and safe change. |
Quick Start: Your First 48 Hours
Day 1 (4 hours):
- Read Chapters 1 and 2 only.
- Install a CommonMark parser in your language of choice.
- Start Project 1 and extract just the project titles.
- Write a JSON Schema with only 3 required fields.
Day 2 (4 hours):
- Add metadata extraction for difficulty and language.
- Validate JSON output and print errors with line/column.
- Write a tiny canonicalizer that sorts keys.
- Run the golden output example from Project 1.
End of Weekend: You can parse a project list into schema-validated, deterministic JSON. That is the backbone of everything else.
Recommended Learning Paths
Path 1: The Documentation Engineer (Recommended Start)
Best for: People building documentation tooling or content pipelines.
- Project 1 → Project 3 → Project 4
- Optional: Project 2 if you need dependency graphs
Path 2: The Algorithm-Focused Learner
Best for: Learners interested in graph algorithms and dependency reasoning.
- Project 2 → Project 1 → Project 3 → Project 4
Path 3: The Systems Builder
Best for: Learners who want the full pipeline end-to-end.
- Project 1 → Project 2 → Project 3 → Project 4 (full sequence)
Success Metrics
By the end, you should be able to:
- Parse a Markdown project list into canonical JSON with deterministic output
- Detect and report schema errors with exact line/column references
- Generate a dependency graph and detect cycles
- Expand projects into deterministic Markdown files with stable IDs
- Enforce quality rules with linting and citation checks
Optional Appendices
Appendix A: Markdown + JSON Schema Cheatsheet
Markdown boundaries to treat as project markers:
- `## Project N: Title`
- `#### Project N: Title`
JSON Schema essentials:
`type`, `required`, `enum`, `pattern`, `additionalProperties`
Appendix B: Determinism Checklist
- Key ordering is stable
- Timestamps are fixed or removed
- Output sorting is deterministic
- Slugs use a stable algorithm
- Canonical JSON serialization is used
Appendix C: CLI Exit Code Conventions
- `0`: Success
- `1`: Invalid input
- `2`: Schema validation failure
- `3`: Internal error
- `4`: Partial output or non-fatal warnings
Appendix D: Primary Specs & References
- CommonMark Spec: https://spec.commonmark.org/
- GitHub Flavored Markdown (GFM) Spec: https://github.github.io/gfm/
- JSON Schema Specification (2020-12): https://json-schema.org/specification
- JSON Canonicalization Scheme (RFC 8785): https://www.rfc-editor.org/rfc/rfc8785
- Graphviz DOT Language: https://graphviz.org/doc/info/lang.html
- Stack Overflow Developer Survey 2023: https://survey.stackoverflow.co/2023
- Stack Overflow Developer Survey 2025 (learning resources): https://survey.stackoverflow.co/2025/developers/
Projects
Project 1: Markdown Project Parser & Schema Normalizer
- Main Programming Language: Python 3.11
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 — Infrastructure builder
- Business Potential: 3 — Internal tooling and content operations
- Difficulty: Level 3 — Intermediate
- Knowledge Area: Documentation Engineering
- Software or Tool: CommonMark parser + JSON Schema validator
- Main Book: “Clean Code” by Robert C. Martin
What you’ll build: A CLI that parses a Markdown project list into deterministic, schema-validated JSON with actionable errors and line/column diagnostics.
Why it teaches this topic: It forces you to separate syntax from structure, validate assumptions with schemas, and produce repeatable outputs—core to any scalable content pipeline.
Core challenges you’ll face:
- Reliable boundary detection → avoiding false positives in headings
- Schema design → balancing flexibility and strictness
- Deterministic output → canonicalization and ordering
Real World Outcome
What you will see:
- A `build/projects.json` file containing a normalized list of projects
- A `build/errors.json` file with line/column errors (if any)
- A deterministic output that is byte-identical across runs
Command Line Outcome Example:
# 1. Parse and normalize
$ expander parse projects.md --out build/
[ok] Parsed 12 project blocks
[ok] Schema validation: 12/12 valid
[ok] Wrote build/projects.json (deterministic)
# 2. Validate only
$ expander validate projects.md
[ok] Schema validation passed (0 errors)
# 3. Trigger a failure
$ expander validate projects_missing.md
[error] Project 4 missing required field: "difficulty"
at line 214, column 5
exit code: 2
The Core Question You’re Answering
“How do I turn ambiguous Markdown into a reliable, machine-validated data model?”
This question forces you to confront the gap between human writing and machine expectations. Solving it teaches you to build trustable pipelines rather than one-off scripts.
Concepts You Must Understand First
- CommonMark Parsing
- Why is CommonMark a two-phase parser (block then inline)?
- What goes wrong if you parse inline formatting too early?
- Book Reference: “Clean Code” Ch. 3 — Robert C. Martin
- JSON Schema Validation
  - What does `required` actually enforce?
  - How do you express enums and patterns?
  - Book Reference: “Code Complete” Ch. 10 — Steve McConnell
- Deterministic Serialization
- Why is key ordering important for diff-based workflows?
- How does canonicalization reduce noise?
- Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin
Questions to Guide Your Design
- Parsing Strategy
- How will you detect a new project boundary?
- How will you handle malformed or nested headings?
- Schema Design
- Which fields are required vs optional?
- How do you enforce enum values for difficulty?
- Diagnostics
- How will you map schema errors back to source lines?
- What should exit codes represent?
Thinking Exercise
Trace a broken project entry
Imagine this input:
## Project 7: Foo
- **Language**: Python
- **Difficulty**:
Questions while tracing:
- Where does the missing difficulty value become a schema error?
- What line/column should the error report?
- Should this be a warning or a hard failure?
The Interview Questions They’ll Ask
- “Why is Markdown parsing hard to do with regex?”
- “What is JSON Schema and why is it useful?”
- “How would you make JSON output deterministic?”
- “How do you surface validation errors to users?”
- “Why should CLI tools return explicit exit codes?”
Hints in Layers
Hint 1: Start with AST extraction. Use a CommonMark parser and walk heading nodes first.
Hint 2: Design the schema early. Write the JSON Schema before you parse; it defines your extraction targets.
Hint 3: Canonicalize output. Sort object keys and use a stable serializer.
json.dumps(data, sort_keys=True, separators=(",", ":"))
Hint 4: Error positions. Store line/column metadata alongside extracted tokens (see the sketch after these hints).
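One way to carry the position data from Hint 4, sketched with a dataclass; the field and function names are illustrative.

from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str          # e.g. "difficulty"
    value: str | None
    line: int          # 1-based line in the source Markdown
    column: int

def report_missing(field: ExtractedField) -> str:
    return f'missing required field: "{field.name}" at line {field.line}, column {field.column}'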
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Error handling | “Clean Code” by Robert C. Martin | Ch. 7 |
| Data modeling | “Code Complete” by Steve McConnell | Ch. 10 |
| CLI design | “The Pragmatic Programmer” by Thomas & Hunt | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “Projects split incorrectly”
- Why: Headings are parsed as plain text instead of AST nodes.
- Fix: Use a CommonMark parser and detect heading nodes explicitly.
- Quick test: Count the number of parsed headings vs expected projects.
Problem 2: “Schema validation passes but output is wrong”
- Why: Schema is too permissive.
- Fix: Add `enum`, `minLength`, and `pattern` constraints.
- Quick test: Inject an invalid value and ensure schema rejects it.
Problem 3: “Diffs change every run”
- Why: Output ordering is not deterministic.
- Fix: Sort keys and normalize arrays.
- Quick test: Run the pipeline twice and compare file hashes.
Definition of Done
- Parser extracts all project blocks correctly
- JSON output validates against schema
- Output is deterministic across runs
- Errors include line/column references
- CLI exits with correct exit codes
Project 2: Concept Map & Dependency Graph Builder
- Main Programming Language: Python 3.11
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 4 — Graph wizardry
- Business Potential: 2 — Internal analytics and planning
- Difficulty: Level 3 — Intermediate
- Knowledge Area: Graph Algorithms
- Software or Tool: Graphviz DOT + cycle detection
- Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne
What you’ll build: A tool that reads normalized project JSON, extracts concept dependencies, and outputs a DAG with cycle detection and DOT/SVG visualizations.
Why it teaches this topic: It forces you to model learning dependencies explicitly, then verify they are acyclic and navigable.
Core challenges you’ll face:
- Graph modeling → designing node IDs and edge semantics
- Cycle detection → detecting and reporting cycles clearly
- Visualization → producing DOT output for graph rendering
Real World Outcome
What you will see:
- `build/concepts.json` with nodes and edges
- `build/concepts.dot` for Graphviz rendering
- `build/concepts.svg` visualizing dependencies
Command Line Outcome Example:
# 1. Build concept graph
$ expander concepts build/projects.json --out build/
[ok] Loaded 12 projects
[ok] Extracted 34 concepts
[ok] Graph is acyclic
[ok] Wrote build/concepts.dot
[ok] Rendered build/concepts.svg
# 2. Detect a cycle
$ expander concepts build/projects_with_cycle.json --out build/
[error] Cycle detected: "Schema" -> "Parser" -> "Schema"
exit code: 2
The Core Question You’re Answering
“How do I make hidden learning dependencies explicit and verifiable?”
Dependency graphs turn assumptions into visual and testable structure.
Concepts You Must Understand First
- Directed Graphs
- What is a node vs an edge?
- How do you represent prerequisites?
- Book Reference: “Algorithms, Fourth Edition” Ch. 4 — Sedgewick & Wayne
- Topological Sort
- Why does a topological ordering only exist for DAGs?
- How do you compute one efficiently?
- Book Reference: “Algorithms in C” Part 5 — Sedgewick
- DOT Language
  - What does `digraph` mean in DOT?
  - How do you label nodes and edges?
  - Book Reference: “Clean Code” Ch. 2 — Robert C. Martin (for naming clarity)
Questions to Guide Your Design
- Graph Model
- Will concepts be unique globally or per project?
- How will you handle synonyms?
- Cycle Reporting
- How will you output the actual cycle path?
- Should cycles be errors or warnings?
- Visualization
- What layout makes dependencies readable?
- How will you avoid node clutter?
Thinking Exercise
Sketch a concept graph with 5 nodes and one cycle. How would your tool report it?
The Interview Questions They’ll Ask
- “What is a DAG and why is it useful?”
- “How does topological sorting work?”
- “What is a cycle and how do you detect it?”
- “Why is visualization important for dependency graphs?”
- “How would you scale graph generation for 1,000 concepts?”
Hints in Layers
Hint 1: Start with adjacency lists. Build a `dict[str, list[str]]` of edges.
Hint 2: Use Kahn’s algorithm. It gives both ordering and cycle detection.
Hint 3: Output DOT. Use `digraph { "A" -> "B"; }` as your minimal format.
Hint 4: Cycle detail. Track parent pointers to reconstruct the cycle path (see the sketch after these hints).
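A sketch of the cycle reconstruction from Hint 4 using parent pointers during a DFS; this is one workable approach among several.

def find_cycle(adj: dict[str, list[str]]) -> list[str] | None:
    state = {n: 0 for n in adj}  # 0 = unvisited, 1 = in progress, 2 = done
    parent: dict[str, str] = {}

    def dfs(node: str) -> list[str] | None:
        state[node] = 1
        for nxt in adj.get(node, []):
            if state.get(nxt, 0) == 0:
                parent[nxt] = node
                found = dfs(nxt)
                if found:
                    return found
            elif state.get(nxt, 0) == 1:
                # back edge: walk parent pointers from node back to nxt to rebuild the loop
                cycle, cur = [nxt], node
                while cur != nxt:
                    cycle.append(cur)
                    cur = parent[cur]
                cycle.append(nxt)
                return list(reversed(cycle))
        state[node] = 2
        return None

    for start in adj:
        if state[start] == 0 and (cycle := dfs(start)):
            return cycle
    return None

print(find_cycle({"Schema": ["Parser"], "Parser": ["Schema"]}))
# ['Schema', 'Parser', 'Schema']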
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Graphs | “Algorithms, Fourth Edition” | Ch. 4 |
| Dependency ordering | “Algorithms in C” | Part 5 |
| Naming & clarity | “Clean Code” | Ch. 2 |
Common Pitfalls & Debugging
Problem 1: “Cycle detection never triggers”
- Why: In-degree counts are not updated correctly.
- Fix: Decrement in-degree for each outgoing edge.
- Quick test: Create a 2-node cycle and confirm detection.
Problem 2: “DOT output renders blank”
- Why: Missing `digraph` header or semicolons.
- Fix: Validate DOT syntax with `dot -Tsvg`.
- Quick test: Run `dot -Tsvg build/concepts.dot -o /tmp/test.svg`.
Definition of Done
- Graph nodes and edges generated correctly
- Cycle detection works with clear error output
- DOT output renders to SVG
- CLI reports meaningful exit codes
Project 3: Template-Driven Project Expander CLI
- Main Programming Language: Python 3.11
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5 — Content pipeline architect
- Business Potential: 4 — Productizable documentation tooling
- Difficulty: Level 4 — Advanced
- Knowledge Area: Code Generation
- Software or Tool: Template engine + filesystem writer
- Main Book: “Clean Architecture” by Robert C. Martin
What you’ll build: A CLI that reads normalized project JSON and expands each project into its own Markdown file using templates, producing deterministic output and a README index.
Why it teaches this topic: It teaches deterministic generation, template modeling, and reproducible outputs at scale.
Core challenges you’ll face:
- Template design → preventing missing sections
- File layout → deterministic naming and ordering
- Regeneration → idempotent output
Real World Outcome
What you will see:
- `expander.codex/` directory with per-project files
- `expander.codex/README.md` index
- Stable filenames like `P03-project-expander-cli.md`
Command Line Outcome Example:
# 1. Expand projects into files
$ expander expand build/projects.json --out expander.codex
[ok] Expanded 12 projects
[ok] Wrote expander.codex/README.md
[ok] Deterministic output confirmed
# 2. Re-run to verify determinism
$ expander expand build/projects.json --out expander.codex
[ok] No changes detected (hash match)
The Core Question You’re Answering
“How do I turn structured data into deterministic, publishable content?”
This is the core of content automation: stable, repeatable generation.
Concepts You Must Understand First
- Template Engines
- How do you safely interpolate data?
- How do you prevent missing sections?
- Book Reference: “Clean Architecture” Ch. 7 — Robert C. Martin
- Deterministic Output
- What changes between runs?
- How do you guarantee ordering?
- Book Reference: “Code Complete” Ch. 12 — Steve McConnell
- Slugging and IDs
- Why are titles not stable identifiers?
- How do you enforce uniqueness?
- Book Reference: “Refactoring” Ch. 2 — Martin Fowler
Questions to Guide Your Design
- Template Layout
- What sections are mandatory?
- How will optional fields render?
- File Naming
- How do you prevent collisions?
- How do you handle renames?
- Regeneration
- Should old files be deleted?
- How do you detect changes?
Thinking Exercise
Design a template that ensures every project includes a “Definition of Done” section, even if the input data is missing. How would you implement a default?
The Interview Questions They’ll Ask
- “What is the difference between code generation and templating?”
- “How do you make template output deterministic?”
- “How do you handle missing data fields gracefully?”
- “What is idempotence in file generation?”
- “How do you test generated files?”
Hints in Layers
Hint 1: Start with a single template. Render one project file before building the loop.
Hint 2: Sort projects by ID. Stable ordering prevents diff noise.
Hint 3: Use hash comparisons. Compute file hashes to detect changes (see the sketch after these hints).
Hint 4: Keep a README index. Generate a summary file to verify all outputs.
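A sketch of the hash comparison from Hint 3, using only the standard library; the helper names are illustrative.

import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_if_changed(path: Path, content: str) -> bool:
    """Write only when content differs; returns True if the file changed."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if path.exists() and file_hash(path) == new_hash:
        return False
    path.write_text(content, encoding="utf-8")
    return True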
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Architecture | “Clean Architecture” | Ch. 7 |
| Maintainability | “Refactoring” | Ch. 2 |
| Quality design | “Code Complete” | Ch. 12 |
Common Pitfalls & Debugging
Problem 1: “Files change on every run”
- Why: Non-deterministic ordering or timestamps.
- Fix: Sort inputs and remove timestamps.
- Quick test: Run twice and compare checksums.
Problem 2: “Templates break on missing fields”
- Why: No defaults or missing conditional logic.
- Fix: Define defaults and fallbacks.
- Quick test: Remove optional data and render.
Definition of Done
- Per-project files generated with correct names
- README index generated and sorted
- Re-running produces no diff
- Missing fields handled gracefully
Project 4: QA + Research Enrichment Pipeline
- Main Programming Language: Python 3.11
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5 — Quality enforcer
- Business Potential: 4 — Content QA automation
- Difficulty: Level 4 — Advanced
- Knowledge Area: Quality Engineering
- Software or Tool: Lint engine + rubric scorer
- Main Book: “Refactoring” by Martin Fowler
What you’ll build: A pipeline that lints project files for completeness, scores them against a rubric, and verifies citation provenance.
Why it teaches this topic: It forces you to encode quality expectations into machine-checkable rules.
Core challenges you’ll face:
- Rule design → avoiding false positives
- Rubric scoring → balancing depth vs breadth
- Provenance checks → ensuring sources are valid
Real World Outcome
What you will see:
- `build/lint.json` with structured lint results
- `build/rubric.json` with quality scores
- `build/sources.json` mapping claims to sources
Command Line Outcome Example:
# 1. Run lint checks
$ expander lint expander.codex
[error] P02 missing "Definition of Done" section
[error] P03 missing "Common Pitfalls" section
exit code: 1
# 2. Score quality
$ expander rubric expander.codex
[ok] Overall score: 86/100
[warn] Project 1 lacks explicit CLI failure example
# 3. Verify sources
$ expander sources expander.codex
[ok] 14 sources verified
[warn] 2 sources missing or unreachable
The Core Question You’re Answering
“How do I make content quality measurable and enforceable?”
This project turns subjective quality into objective checks.
Concepts You Must Understand First
- Lint Rule Design
- What is a deterministic rule?
- How do you avoid false positives?
- Book Reference: “Refactoring” Ch. 2 — Martin Fowler
- Rubric Scoring
- How do you weight sections?
- What is an acceptable score threshold?
- Book Reference: “Clean Code” Ch. 4 — Robert C. Martin
- Provenance Tracking
- How do you store citations?
- How do you handle dead links?
- Book Reference: “Code Complete” Ch. 20 — Steve McConnell
Questions to Guide Your Design
- Lint Rules
- Which sections are mandatory?
- Which failures should be warnings vs errors?
- Rubric
- How do you score depth of explanation?
- How do you prevent gaming?
- Sources
- Where do you store source metadata?
- How do you validate source availability?
Thinking Exercise
Design a rubric with 5 weighted criteria. How would you score a project missing “Real World Outcome”?
The Interview Questions They’ll Ask
- “How do you define content quality in measurable terms?”
- “What makes a lint rule effective?”
- “How do you handle false positives in linting?”
- “Why is provenance important in technical documentation?”
- “How would you scale QA to 1,000 projects?”
Hints in Layers
Hint 1: Start with structural checks. Verify sections exist before scoring depth.
Hint 2: Use a ruleset file. Store lint rules in JSON for easy updates (see the sketch after these hints).
Hint 3: Score by section length + keywords. Combine structural checks with heuristic scoring.
Hint 4: Build a sources map. Extract links and store them in `sources.json`.
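A sketch of Hint 2’s ruleset-file approach: the rules live in JSON and the code only interprets them. The file name and rule fields shown are illustrative.

import json
import re
from pathlib import Path

# lint_rules.json might contain entries like:
# [{"id": "has_definition_of_done", "heading": "Definition of Done", "severity": "error"}]
rules = json.loads(Path("lint_rules.json").read_text(encoding="utf-8"))

def run_rules(markdown_text: str) -> list[dict]:
    results = []
    for rule in rules:
        found = re.search(rf"^#+\s*{re.escape(rule['heading'])}", markdown_text, re.MULTILINE)
        results.append({
            "rule": rule["id"],
            "severity": rule["severity"],
            "status": "pass" if found else "fail",
        })
    return results

Keeping the rules in data rather than code means content editors can tighten or relax checks without touching the lint engine.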
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Quality mindset | “Refactoring” | Ch. 2 |
| Documentation clarity | “Clean Code” | Ch. 4 |
| Large-scale projects | “Code Complete” | Ch. 20 |
Common Pitfalls & Debugging
Problem 1: “Lint rule flags everything”
- Why: Rule is too strict or mismatched to input.
- Fix: Add thresholds and allow warnings.
- Quick test: Run on a known good file and confirm zero errors.
Problem 2: “Rubric scores are meaningless”
- Why: Weights are arbitrary.
- Fix: Calibrate scores against real examples.
- Quick test: Score three known files and compare.
Definition of Done
- Lint engine flags missing sections correctly
- Rubric produces stable scores
- Provenance map is generated
- CLI returns exit codes for failures