Project 13: Cognitive Orchestrator Lab (Reasoning Beyond Prompting)

Build a planning-first assistant runtime that explicitly decomposes goals, scores plans, estimates uncertainty, and self-corrects before executing costly actions.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 4: Expert |
| Time Estimate | 25-40 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | ReAct basics, graph search fundamentals, observability mindset |
| Key Topics | Tree-of-Thought, task decomposition, utility scoring, uncertainty calibration |

1. Learning Objectives

By completing this project, you will:

  1. Convert free-form user goals into explicit goal graphs with constraints.
  2. Evaluate multiple plans using utility, confidence, and policy constraints.
  3. Add reflection loops that repair weak intermediate decisions.
  4. Prevent long-horizon planning loops from becoming unbounded.
  5. Produce auditable traces explaining why one plan was selected.

2. Theoretical Foundation

2.1 Planning Patterns for Agents

A single LLM call can solve narrow tasks, but complex tasks require structured cognition. In this project, you formalize three planning strategies; the core practical insight is that each has a different cost profile and failure mode:

  • ReAct interleaves thought and action. It reacts quickly but may drift.
  • Plan-Execute plans first and executes second. It satisfies constraints better but can waste effort when inputs change.
  • Tree-of-Thought branches and compares candidate reasoning paths. It improves solution quality but can explode in compute cost if branching is unconstrained.
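
The Tree-of-Thought cost trade-off can be made concrete with a bounded beam search. This is an illustrative sketch, not the project's required implementation: `propose` and `evaluate` are deterministic stubs standing in for LLM calls, and the cap values are arbitrary assumptions.

```python
import heapq

BRANCH_CAP = 3   # candidate next steps expanded per node
BEAM_WIDTH = 2   # partial reasoning paths kept per depth level
MAX_DEPTH = 4    # hard bound on reasoning depth

def propose(path):
    """Stub: return up to BRANCH_CAP candidate continuations of a path."""
    return [path + [f"step{len(path)}-{i}"] for i in range(BRANCH_CAP)]

def evaluate(path):
    """Stub: score a partial path; a real system would ask the model."""
    return -len(path[-1])  # arbitrary deterministic heuristic

def tot_search(root):
    beam = [root]
    for _ in range(MAX_DEPTH):
        candidates = [p for path in beam for p in propose(path)]
        # Keeping only the best BEAM_WIDTH paths is what prevents the
        # compute cost from exploding exponentially with depth.
        beam = heapq.nlargest(BEAM_WIDTH, candidates, key=evaluate)
    return beam[0]
```

With the caps in place, each depth level expands at most `BEAM_WIDTH * BRANCH_CAP` candidates, so cost grows linearly with depth instead of exponentially.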

2.2 Utility + Uncertainty as Decision Layer

Agent decisions are better treated as policy evaluation than as text generation. You define a utility function for plan quality and combine it with confidence estimates and hard constraints. Hard constraints are non-negotiable rules; soft constraints are scored penalties. This creates a predictable decision process and avoids brittle “best-sounding” answers.
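
As a sketch of this decision layer, the snippet below scores hypothetical travel plans against one hard constraint (a budget cap) and one soft constraint (trip length). The `Plan` fields, weights, and constraint values are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    cost_usd: float        # estimated total cost
    days: int              # trip length
    confidence: float      # calibrated estimate in [0, 1]
    overnight_layovers: int = 0

# Hypothetical constraint set.
HARD_BUDGET = 2200.0       # hard constraint: never exceed
SOFT_TARGET_DAYS = 10      # soft constraint: penalize deviation

def score(plan: Plan):
    """Return a utility score, or None if a hard constraint is violated."""
    # Hard gate: violating a non-negotiable rule disqualifies the plan
    # outright instead of merely lowering its score.
    if plan.cost_usd > HARD_BUDGET or plan.overnight_layovers > 0:
        return None
    # Soft constraints become weighted penalties inside the utility.
    day_penalty = 0.05 * abs(plan.days - SOFT_TARGET_DAYS)
    budget_headroom = 1.0 - plan.cost_usd / HARD_BUDGET
    return 0.6 * budget_headroom + 0.4 * plan.confidence - day_penalty

plans = [
    Plan("A", cost_usd=2100, days=10, confidence=0.68),
    Plan("B", cost_usd=1900, days=10, confidence=0.71),
    Plan("C", cost_usd=2400, days=9, confidence=0.82),  # violates hard budget
]
scored = {p.name: score(p) for p in plans}
best = max((n for n, s in scored.items() if s is not None),
           key=lambda n: scored[n])
```

Note the asymmetry: plan C has the highest confidence but is hard-gated out, which is exactly the behavior that distinguishes policy evaluation from picking the best-sounding answer.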


3. Project Specification

3.1 What You Will Build

A CLI application with these modules:

  • goal parser
  • decomposition engine
  • plan generator
  • plan scorer
  • critique loop
  • executor simulator
  • trace logger

3.2 Functional Requirements

  1. Convert one user request into a dependency graph of subgoals.
  2. Generate at least three candidate plans for non-trivial requests.
  3. Score plans using utility, confidence, and constraint checks.
  4. Trigger critique loop when confidence drops below threshold.
  5. Emit final decision trace with selected and rejected plans.
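
Requirement 1 can be sketched with the standard library's `graphlib`: each subgoal maps to the set of subgoals it depends on, and a topological order gives a valid execution schedule. The subgoal names are illustrative.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Dependency graph for a hypothetical travel request: each key maps to
# the subgoals that must complete before it can start.
subgoals = {
    "book_flights":   {"pick_dates", "set_budget"},
    "book_hotels":    {"book_flights"},
    "plan_itinerary": {"book_hotels", "pick_dates"},
    "pick_dates":     set(),
    "set_budget":     set(),
}

# A topological order is a schedule where every subgoal appears after
# all of its dependencies; graphlib also raises CycleError on cycles,
# which doubles as a validity check for the parsed goal graph.
order = list(TopologicalSorter(subgoals).static_order())
```

Using a real topological sort (rather than ad hoc ordering) means cyclic or contradictory decompositions fail loudly at parse time instead of mid-execution.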

3.3 Non-Functional Requirements

  • Determinism: fixed-seed mode for repeatable eval runs.
  • Observability: full per-step trace with timestamps.
  • Safety: max planning depth and max critique iterations.
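
The safety and determinism requirements can be combined in one small loop. This is a hedged sketch: `repair` and `confidence` are caller-supplied stand-ins for the real critique machinery, and the threshold and budget values are assumptions.

```python
import random

MAX_REPAIRS = 3          # safety: critique iterations are hard-bounded
CONF_THRESHOLD = 0.75    # confidence gate that triggers a repair

def critique_loop(plan, repair, confidence, seed=0):
    """Repair a plan until confident, or until the repair budget is spent."""
    rng = random.Random(seed)   # fixed seed -> repeatable eval runs
    for attempt in range(MAX_REPAIRS):
        if confidence(plan) >= CONF_THRESHOLD:
            return plan, attempt
        plan = repair(plan, rng)
    return plan, MAX_REPAIRS    # give up: bounded, never loops forever
```

Returning the attempt count alongside the plan makes the bound observable in traces, which helps when tuning `MAX_REPAIRS` against real workloads.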

3.4 Real World Outcome

When running the tool, you see visible planning rather than hidden chain-of-thought.

$ assistant plan "Design a 10-day trip under $2200 with no overnight layovers"
[GoalGraph] nodes=9 edges=14
[PlanGen] candidates=4
[Score] plan-A utility=0.72 confidence=0.68 violations=0
[Score] plan-B utility=0.83 confidence=0.71 violations=0
[Score] plan-C utility=0.79 confidence=0.82 violations=1
[Critique] plan-B step-5 evidence weak -> refresh data
[Select] plan-B accepted

4. Solution Architecture

4.1 High-Level Design

User Goal
   |
   v
Goal Parser --> Goal Graph --> Plan Generator --> Plan Scorer --> Critic Loop --> Executor
                                  ^                                  |
                                  |----------------------------------|

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Goal parser | extract objectives + constraints | schema-first normalized representation |
| Plan generator | create candidate execution paths | branch cap and depth cap |
| Plan scorer | compute utility/confidence | weighted metrics with hard-gate constraints |
| Critic | detect weak steps | bounded repair loop |

5. Implementation Guide

5.1 The Core Question You’re Answering

“How can I make agent decisions explicit, testable, and robust across long-horizon tasks?”

5.2 Concepts You Must Understand First

  1. ReAct vs Plan-Execute
  2. Constraint satisfaction (hard vs soft)
  3. Utility function design
  4. Confidence calibration basics

5.3 Questions to Guide Your Design

  1. Which constraints can never be violated?
  2. What minimum confidence should trigger re-plan?
  3. How do you stop loops while preserving quality?

5.4 Thinking Exercise

Draw a planning tree for one travel task and one hiring task. Compare where greedy planning fails and where branching helps.

5.5 The Interview Questions They’ll Ask

  1. Why is explicit planning valuable for agents?
  2. How do you score competing plans?
  3. What is the risk of over-reflection loops?
  4. How do you calibrate confidence for tool outputs?
  5. When should the system ask for human input?

5.6 Hints in Layers

Hint 1: Start with one hard constraint and one utility metric.

Hint 2: Add confidence gates before adding more metrics.

Hint 3: Keep critique isolated from execution.

Hint 4: Record rejection reasons for every discarded plan.
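
Hint 4 can be sketched as a minimal trace logger that records why each plan was rejected, making the final decision auditable. The event fields are illustrative, not a fixed schema.

```python
import json
import time

class TraceLogger:
    """Append-only event log; every rejection carries an explicit reason."""

    def __init__(self):
        self.events = []

    def log(self, kind, **fields):
        self.events.append({"ts": time.time(), "kind": kind, **fields})

    def reject(self, plan_name, reason):
        self.log("reject", plan=plan_name, reason=reason)

    def dump(self):
        # JSON keeps the trace machine-readable for later eval runs.
        return json.dumps(self.events, indent=2)

trace = TraceLogger()
trace.reject("plan-C", "hard constraint: 1 overnight layover")
trace.log("select", plan="plan-B", utility=0.83)
```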

5.7 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Planning frameworks | “Building AI Agents” | Ch. 2-3 |
| Utility and trade-offs | “AI Engineering” | Ch. 6 |
| Search strategies | “Algorithms, Fourth Edition” | Graph search |

5.8 Common Pitfalls and Debugging

Problem 1: planner never settles

  • Why: no iteration budget.
  • Fix: enforce max-depth and max-repair.
  • Quick test: ambiguous task terminates in bounded steps.

Problem 2: same plan always wins

  • Why: scoring features too narrow.
  • Fix: add diversity penalty and evidence quality factor.
  • Quick test: task variants yield varied winners.
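
One possible shape for the diversity-penalty fix, assuming plans are represented as lists of step identifiers: greedy re-ranking that discounts a plan's base score by its maximum step overlap (Jaccard similarity) with plans already picked. The weight and the similarity measure are assumptions, not the only choices.

```python
def jaccard(a, b):
    """Step-overlap similarity between two plans, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_with_diversity(plans, base_score, weight=0.3):
    """plans: {name: step_list}. Greedily rank, penalizing near-duplicates."""
    ranked, remaining = [], dict(plans)
    while remaining:
        def adjusted(name):
            # Penalty grows with similarity to any already-ranked plan,
            # so a clone of the current winner no longer always wins.
            penalty = max((jaccard(remaining[name], plans[p]) for p in ranked),
                          default=0.0)
            return base_score[name] - weight * penalty
        pick = max(remaining, key=adjusted)
        ranked.append(pick)
        del remaining[pick]
    return ranked
```

For example, with plan B an exact copy of plan A, a distinct plan C can outrank B in second place even if C's base score is lower.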

5.9 Definition of Done

  • Multi-plan generation and scoring works
  • Constraint violations are explicit and traceable
  • Critique loop is bounded and useful
  • Final decision trace is reproducible