Project 13: Cognitive Orchestrator Lab (Reasoning Beyond Prompting)
Build a planning-first assistant runtime that explicitly decomposes goals, scores plans, estimates uncertainty, and self-corrects before executing costly actions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 25-40 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | ReAct basics, graph search fundamentals, observability mindset |
| Key Topics | Tree-of-Thought, task decomposition, utility scoring, uncertainty calibration |
1. Learning Objectives
By completing this project, you will:
- Convert free-form user goals into explicit goal graphs with constraints.
- Evaluate multiple plans using utility, confidence, and policy constraints.
- Add reflection loops that repair weak intermediate decisions.
- Prevent long-horizon planning loops from becoming unbounded.
- Produce auditable traces explaining why one plan was selected.
2. Theoretical Foundation
2.1 Planning Patterns for Agents
A single LLM call can solve narrow tasks, but complex tasks require structured cognition. In this project, you formalize three planning strategies: ReAct (interleave thought and action), Plan-Execute (plan first, execute second), and Tree-of-Thought (branch and compare candidate reasoning paths). The core practical insight is that each strategy has a different cost profile and failure mode. ReAct can react quickly but may drift. Plan-Execute can satisfy constraints better but can waste effort when inputs change. Tree-of-Thought improves solution quality but can explode in compute cost if branching is unconstrained.
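To make the cost profile concrete, the branching behavior of Tree-of-Thought can be sketched as a beam search with explicit branch and depth caps, the two knobs that keep compute bounded. This is an illustrative sketch only: `expand` and `score` stand in for LLM calls and here operate on toy strings.

```python
import heapq

def tree_of_thought(root, expand, score, branch_cap=3, depth_cap=4):
    """Return the best-scoring (score, node) pair reachable within the caps."""
    frontier = [root]
    best = (score(root), root)
    for _ in range(depth_cap):
        candidates = []
        for node in frontier:
            for child in expand(node):
                candidates.append((score(child), child))
        if not candidates:
            break
        # Keep only the top `branch_cap` candidates: this pruning step is
        # what prevents the exponential blow-up mentioned above.
        candidates = heapq.nlargest(branch_cap, candidates, key=lambda c: c[0])
        best = max([best] + candidates, key=lambda c: c[0])
        frontier = [node for _, node in candidates]
    return best

# Toy domain: grow a binary string; score counts the '1' bits.
expand = lambda s: [s + "0", s + "1"]
score = lambda s: s.count("1")
print(tree_of_thought("", expand, score))
```

Without the `branch_cap` pruning line, the frontier doubles every level, which is exactly the unconstrained-branching failure mode described above.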
2.2 Utility + Uncertainty as Decision Layer
Agent decisions are better treated as policy evaluation than text generation. You define a utility function for plan quality and combine it with confidence estimates and hard constraints. Hard constraints are non-negotiable rules. Soft constraints are scored penalties. This creates a predictable decision process and avoids brittle “best sounding” answers.
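A minimal sketch of this decision layer, assuming illustrative field names (`cost`, `risk`, `confidence`): hard constraints gate a plan outright, soft constraints subtract weighted penalties, and the calibrated confidence scales the final score.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    name: str
    cost: float          # normalized to [0, 1]
    risk: float          # estimated failure probability
    confidence: float    # calibrated confidence in the estimates
    violations: list = field(default_factory=list)

def score_plan(plan, hard_constraints, soft_penalties):
    """Return (accepted, score). A hard violation rejects immediately."""
    for name, check in hard_constraints.items():
        if not check(plan):
            plan.violations.append(name)   # record why, for the trace
            return False, 0.0
    utility = 1.0 - 0.5 * plan.cost - 0.5 * plan.risk
    # Soft constraints are scored penalties, not gates.
    penalty = sum(w for name, (w, check) in soft_penalties.items() if not check(plan))
    return True, max(0.0, utility - penalty) * plan.confidence

hard = {"budget": lambda p: p.cost <= 1.0}
soft = {"low_risk": (0.1, lambda p: p.risk < 0.3)}
ok, s = score_plan(Plan("plan-B", cost=0.4, risk=0.2, confidence=0.8), hard, soft)
```

Separating the hard gate from the weighted sum is the design choice that prevents a "best sounding" plan from buying its way past a non-negotiable rule.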
3. Project Specification
3.1 What You Will Build
A CLI application with these modules:
- goal parser
- decomposition engine
- plan generator
- plan scorer
- critique loop
- executor simulator
- trace logger
3.2 Functional Requirements
- Convert one user request into a dependency graph of subgoals.
- Generate at least three candidate plans for non-trivial requests.
- Score plans using utility, confidence, and constraint checks.
- Trigger critique loop when confidence drops below threshold.
- Emit final decision trace with selected and rejected plans.
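The dependency graph behind the first requirement can be sketched as follows, assuming subgoals are nodes and edges point from prerequisite to dependent. A topological sort (Kahn's algorithm) both yields an execution order and rejects cyclic, unsatisfiable decompositions; the subgoal names are made up for illustration.

```python
from collections import defaultdict, deque

def topo_order(edges):
    """Kahn's algorithm: return an execution order, or raise on a cycle."""
    indegree = defaultdict(int)
    adjacent = defaultdict(list)
    nodes = set()
    for pre, post in edges:
        adjacent[pre].append(post)
        indegree[post] += 1
        nodes.update((pre, post))
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacent[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("cycle in goal graph: decomposition is unsatisfiable")
    return order

edges = [("pick-dates", "book-flights"), ("set-budget", "book-flights"),
         ("book-flights", "book-hotels")]
print(topo_order(edges))
```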
3.3 Non-Functional Requirements
- Determinism: fixed-seed mode for repeatable eval runs.
- Observability: full per-step trace with timestamps.
- Safety: max planning depth and max critique iterations.
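The safety requirement can be enforced with a plain iteration budget around the critique loop, so an ambiguous task can never spin forever. This is a sketch with placeholder `critique` and `repair` callables; the toy critic below simply gains confidence with each repair.

```python
def bounded_critique(plan, critique, repair, max_repairs=3, threshold=0.7):
    """Repair `plan` until confidence clears `threshold` or the budget runs out."""
    for attempt in range(max_repairs):
        confidence = critique(plan)
        if confidence >= threshold:
            return plan, confidence, attempt
        plan = repair(plan)
    # Budget exhausted: surface the best effort instead of looping forever.
    return plan, critique(plan), max_repairs

# Toy critic: confidence starts weak and rises 0.2 per repair.
plan = {"steps": 5, "repairs": 0}
critique = lambda p: 0.4 + 0.2 * p["repairs"]
repair = lambda p: {**p, "repairs": p["repairs"] + 1}
final, conf, used = bounded_critique(plan, critique, repair)
```

Returning the best effort on budget exhaustion, rather than raising, keeps the pipeline deterministic under the fixed-seed requirement above.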
3.4 Real World Outcome
When running the tool, you see visible planning rather than hidden chain-of-thought.
$ assistant plan "Design a 10-day trip under $2200 with no overnight layovers"
[GoalGraph] nodes=9 edges=14
[PlanGen] candidates=4
[Score] plan-A utility=0.72 confidence=0.68 violations=0
[Score] plan-B utility=0.83 confidence=0.71 violations=0
[Score] plan-C utility=0.79 confidence=0.82 violations=1
[Critique] plan-B step-5 evidence weak -> refresh data
[Select] plan-B accepted
4. Solution Architecture
4.1 High-Level Design
User Goal
    |
    v
Goal Parser --> Goal Graph --> Plan Generator --> Plan Scorer --> Critic Loop --> Executor
                                     ^                                 |
                                     +---------- re-plan -------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Goal parser | extract objectives + constraints | schema-first normalized representation |
| Plan generator | create candidate execution paths | branch cap and depth cap |
| Plan scorer | compute utility/confidence | weighted metrics with hard-gate constraints |
| Critic | detect weak steps | bounded repair loop |
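One way to see how the components in the table compose: if each stage is a plain function, the whole pipeline is testable without a model. All stage implementations below are illustrative stubs with made-up names.

```python
def run_pipeline(goal, parse, generate, score, critique, execute, branch_cap=4):
    graph = parse(goal)                              # goal parser
    plans = generate(graph)[:branch_cap]             # plan generator, branch cap applied
    scored = sorted(plans, key=score, reverse=True)  # plan scorer
    best, rejected = scored[0], scored[1:]           # keep rejects for the trace
    return execute(critique(best)), rejected         # bounded critic, then executor

# Toy stages: plans are strings, and the scorer favors the shortest plan.
result, rejected = run_pipeline(
    "trip",
    parse=lambda g: [g],
    generate=lambda graph: ["fly-direct", "fly-with-stop", "train"],
    score=lambda p: -len(p),
    critique=lambda p: p,
    execute=lambda p: f"executing {p}",
)
```

Keeping every stage injectable like this is what makes the fixed-seed determinism requirement cheap to satisfy in evaluation runs.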
5. Implementation Guide
5.1 The Core Question You’re Answering
“How can I make agent decisions explicit, testable, and robust across long-horizon tasks?”
5.2 Concepts You Must Understand First
- ReAct vs Plan-Execute
- Constraint satisfaction (hard vs soft)
- Utility function design
- Confidence calibration basics
5.3 Questions to Guide Your Design
- Which constraints can never be violated?
- What minimum confidence should trigger re-plan?
- How do you stop loops while preserving quality?
5.4 Thinking Exercise
Draw a planning tree for one travel task and one hiring task. Compare where greedy planning fails and where branching helps.
5.5 The Interview Questions They’ll Ask
- Why is explicit planning valuable for agents?
- How do you score competing plans?
- What is the risk of over-reflection loops?
- How do you calibrate confidence for tool outputs?
- When should the system ask for human input?
5.6 Hints in Layers
Hint 1: Start with one hard constraint and one utility metric.
Hint 2: Add confidence gates before adding more metrics.
Hint 3: Keep critique isolated from execution.
Hint 4: Record rejection reasons for every discarded plan.
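Hint 4 can be as simple as appending a structured record per discarded plan, so the final trace explains the decision. The record fields below are an assumption for illustration, not a fixed schema.

```python
import json
import time

def record_rejection(trace, plan_name, reason, score):
    """Append one auditable rejection record to the decision trace."""
    trace.append({
        "ts": time.time(),          # satisfies the timestamped-trace requirement
        "plan": plan_name,
        "decision": "rejected",
        "reason": reason,
        "score": score,
    })

trace = []
record_rejection(trace, "plan-C", "hard constraint: overnight layover", 0.79)
record_rejection(trace, "plan-A", "lower utility than plan-B", 0.72)
print(json.dumps(trace, indent=2))
```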
5.7 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Planning frameworks | “Building AI Agents” | Ch. 2-3 |
| Utility and trade-offs | “AI Engineering” | Ch. 6 |
| Search strategies | “Algorithms, Fourth Edition” | Graph search |
5.8 Common Pitfalls and Debugging
Problem 1: planner never settles
- Why: no iteration budget.
- Fix: enforce max-depth and max-repair.
- Quick test: ambiguous task terminates in bounded steps.
Problem 2: same plan always wins
- Why: scoring features too narrow.
- Fix: add diversity penalty and evidence quality factor.
- Quick test: task variants yield varied winners.
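One way to sketch the diversity-penalty fix, assuming plans are represented as step lists: penalize a candidate in proportion to its Jaccard overlap with already-selected plans, so near-duplicates stop winning every comparison.

```python
def jaccard(a, b):
    """Set overlap in [0, 1] between two step lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def diversity_adjusted(score, plan_steps, selected, weight=0.3):
    """Subtract `weight` times the max overlap with any already-selected plan."""
    if not selected:
        return score
    overlap = max(jaccard(plan_steps, s) for s in selected)
    return score - weight * overlap

selected = [["fly", "hotel", "tour"]]
near_dup = ["fly", "hotel", "museum"]   # shares 2 of 4 distinct steps
print(diversity_adjusted(0.8, near_dup, selected))
```

The `weight` knob trades raw utility against diversity; set it too high and the quick test flips the other way, with good plans losing to novel but weak ones.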
5.9 Definition of Done
- Multi-plan generation and scoring works
- Constraint violations are explicit and traceable
- Critique loop is bounded and useful
- Final decision trace is reproducible