Project 13: Cognitive Orchestrator Lab (Reasoning Beyond Prompting)

Build a planning-first assistant runtime that explicitly decomposes goals, scores plans, estimates uncertainty, and self-corrects before executing costly actions.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 4: Expert |
| Time Estimate | 25-40 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | ReAct basics, graph search fundamentals, observability mindset |
| Key Topics | Tree-of-Thought, task decomposition, utility scoring, uncertainty calibration |

1. Learning Objectives

By completing this project, you will:

  1. Convert free-form user goals into explicit goal graphs with constraints.
  2. Evaluate multiple plans using utility, confidence, and policy constraints.
  3. Add reflection loops that repair weak intermediate decisions.
  4. Prevent long-horizon planning loops from becoming unbounded.
  5. Produce auditable traces explaining why one plan was selected.

2. Theoretical Foundation

2.1 Planning Patterns for Agents

A single LLM call can solve narrow tasks, but complex tasks require structured cognition. In this project, you formalize three planning strategies; the core practical insight is that each has a different cost profile and failure mode:

  • ReAct interleaves thought and action. It reacts quickly but may drift.
  • Plan-Execute plans first and executes second. It satisfies constraints better but can waste effort when inputs change.
  • Tree-of-Thought branches and compares candidate reasoning paths. It improves solution quality but can explode in compute cost if branching is unconstrained.
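
The Tree-of-Thought cost trade-off can be made concrete with a bounded beam search. This is an illustrative sketch, not the project's required implementation: `propose` and `evaluate` are deterministic stubs standing in for LLM calls, and the cap values are arbitrary assumptions.

```python
import heapq

BRANCH_CAP = 3   # candidate next steps expanded per node
BEAM_WIDTH = 2   # partial reasoning paths kept per depth level
MAX_DEPTH = 4    # hard bound on reasoning depth

def propose(path):
    """Stub: return up to BRANCH_CAP candidate continuations of a path."""
    return [path + [f"step{len(path)}-{i}"] for i in range(BRANCH_CAP)]

def evaluate(path):
    """Stub: score a partial path; a real system would ask the model."""
    return -len(path[-1])  # arbitrary deterministic heuristic

def tot_search(root):
    beam = [root]
    for _ in range(MAX_DEPTH):
        candidates = [p for path in beam for p in propose(path)]
        # Keeping only the best BEAM_WIDTH paths is what prevents the
        # compute cost from exploding exponentially with depth.
        beam = heapq.nlargest(BEAM_WIDTH, candidates, key=evaluate)
    return beam[0]
```

With the caps in place, each depth level expands at most `BEAM_WIDTH * BRANCH_CAP` candidates, so cost grows linearly with depth instead of exponentially.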

2.2 Utility + Uncertainty as Decision Layer

Agent decisions are better treated as policy evaluation than as text generation. You define a utility function for plan quality and combine it with confidence estimates and hard constraints. Hard constraints are non-negotiable rules; soft constraints are scored penalties. This creates a predictable decision process and avoids brittle “best-sounding” answers.
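
As a sketch of this decision layer, the snippet below scores hypothetical travel plans against one hard constraint (a budget cap) and one soft constraint (trip length). The `Plan` fields, weights, and constraint values are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    cost_usd: float        # estimated total cost
    days: int              # trip length
    confidence: float      # calibrated estimate in [0, 1]
    overnight_layovers: int = 0

# Hypothetical constraint set.
HARD_BUDGET = 2200.0       # hard constraint: never exceed
SOFT_TARGET_DAYS = 10      # soft constraint: penalize deviation

def score(plan: Plan):
    """Return a utility score, or None if a hard constraint is violated."""
    # Hard gate: violating a non-negotiable rule disqualifies the plan
    # outright instead of merely lowering its score.
    if plan.cost_usd > HARD_BUDGET or plan.overnight_layovers > 0:
        return None
    # Soft constraints become weighted penalties inside the utility.
    day_penalty = 0.05 * abs(plan.days - SOFT_TARGET_DAYS)
    budget_headroom = 1.0 - plan.cost_usd / HARD_BUDGET
    return 0.6 * budget_headroom + 0.4 * plan.confidence - day_penalty

plans = [
    Plan("A", cost_usd=2100, days=10, confidence=0.68),
    Plan("B", cost_usd=1900, days=10, confidence=0.71),
    Plan("C", cost_usd=2400, days=9, confidence=0.82),  # violates hard budget
]
scored = {p.name: score(p) for p in plans}
best = max((n for n, s in scored.items() if s is not None),
           key=lambda n: scored[n])
```

Note the asymmetry: plan C has the highest confidence but is hard-gated out, which is exactly the behavior that distinguishes policy evaluation from picking the best-sounding answer.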


3. Project Specification

3.1 What You Will Build

A CLI application with these modules:

  • goal parser
  • decomposition engine
  • plan generator
  • plan scorer
  • critique loop
  • executor simulator
  • trace logger

3.2 Functional Requirements

  1. Convert one user request into a dependency graph of subgoals.
  2. Generate at least three candidate plans for non-trivial requests.
  3. Score plans using utility, confidence, and constraint checks.
  4. Trigger critique loop when confidence drops below threshold.
  5. Emit final decision trace with selected and rejected plans.
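
Requirement 1 can be sketched with the standard library's `graphlib`: each subgoal maps to the set of subgoals it depends on, and a topological order gives a valid execution schedule. The subgoal names are illustrative.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Dependency graph for a hypothetical travel request: each key maps to
# the subgoals that must complete before it can start.
subgoals = {
    "book_flights":   {"pick_dates", "set_budget"},
    "book_hotels":    {"book_flights"},
    "plan_itinerary": {"book_hotels", "pick_dates"},
    "pick_dates":     set(),
    "set_budget":     set(),
}

# A topological order is a schedule where every subgoal appears after
# all of its dependencies; graphlib also raises CycleError on cycles,
# which doubles as a validity check for the parsed goal graph.
order = list(TopologicalSorter(subgoals).static_order())
```

Using a real topological sort (rather than ad hoc ordering) means cyclic or contradictory decompositions fail loudly at parse time instead of mid-execution.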

3.3 Non-Functional Requirements

  • Determinism: fixed-seed mode for repeatable eval runs.
  • Observability: full per-step trace with timestamps.
  • Safety: max planning depth and max critique iterations.
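
The safety and determinism requirements can be combined in one small loop. This is a hedged sketch: `repair` and `confidence` are caller-supplied stand-ins for the real critique machinery, and the threshold and budget values are assumptions.

```python
import random

MAX_REPAIRS = 3          # safety: critique iterations are hard-bounded
CONF_THRESHOLD = 0.75    # confidence gate that triggers a repair

def critique_loop(plan, repair, confidence, seed=0):
    """Repair a plan until confident, or until the repair budget is spent."""
    rng = random.Random(seed)   # fixed seed -> repeatable eval runs
    for attempt in range(MAX_REPAIRS):
        if confidence(plan) >= CONF_THRESHOLD:
            return plan, attempt
        plan = repair(plan, rng)
    return plan, MAX_REPAIRS    # give up: bounded, never loops forever
```

Returning the attempt count alongside the plan makes the bound observable in traces, which helps when tuning `MAX_REPAIRS` against real workloads.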

3.4 Real World Outcome

When running the tool, you see visible planning rather than hidden chain-of-thought.

$ assistant plan "Design a 10-day trip under $2200 with no overnight layovers"
[GoalGraph] nodes=9 edges=14
[PlanGen] candidates=4
[Score] plan-A utility=0.72 confidence=0.68 violations=0
[Score] plan-B utility=0.83 confidence=0.71 violations=0
[Score] plan-C utility=0.79 confidence=0.82 violations=1
[Critique] plan-B step-5 evidence weak -> refresh data
[Select] plan-B accepted

4. Solution Architecture

4.1 High-Level Design

User Goal
   |
   v
Goal Parser --> Goal Graph --> Plan Generator --> Plan Scorer --> Critic Loop --> Executor
                                  ^                                  |
                                  |----------------------------------|

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Goal parser | extract objectives + constraints | schema-first normalized representation |
| Plan generator | create candidate execution paths | branch cap and depth cap |
| Plan scorer | compute utility/confidence | weighted metrics with hard-gate constraints |
| Critic | detect weak steps | bounded repair loop |

5. Implementation Guide

5.1 The Core Question You’re Answering

“How can I make agent decisions explicit, testable, and robust across long-horizon tasks?”

5.2 Concepts You Must Understand First

  1. ReAct vs Plan-Execute
  2. Constraint satisfaction (hard vs soft)
  3. Utility function design
  4. Confidence calibration basics

5.3 Questions to Guide Your Design

  1. Which constraints can never be violated?
  2. What minimum confidence should trigger re-plan?
  3. How do you stop loops while preserving quality?

5.4 Thinking Exercise

Draw a planning tree for one travel task and one hiring task. Compare where greedy planning fails and where branching helps.

5.5 The Interview Questions They’ll Ask

  1. Why is explicit planning valuable for agents?
  2. How do you score competing plans?
  3. What is the risk of over-reflection loops?
  4. How do you calibrate confidence for tool outputs?
  5. When should the system ask for human input?

5.6 Hints in Layers

Hint 1: Start with one hard constraint and one utility metric.

Hint 2: Add confidence gates before adding more metrics.

Hint 3: Keep critique isolated from execution.

Hint 4: Record rejection reasons for every discarded plan.
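
Hint 4 can be sketched as a minimal trace logger that records why each plan was rejected, making the final decision auditable. The event fields are illustrative, not a fixed schema.

```python
import json
import time

class TraceLogger:
    """Append-only event log; every rejection carries an explicit reason."""

    def __init__(self):
        self.events = []

    def log(self, kind, **fields):
        self.events.append({"ts": time.time(), "kind": kind, **fields})

    def reject(self, plan_name, reason):
        self.log("reject", plan=plan_name, reason=reason)

    def dump(self):
        # JSON keeps the trace machine-readable for later eval runs.
        return json.dumps(self.events, indent=2)

trace = TraceLogger()
trace.reject("plan-C", "hard constraint: 1 overnight layover")
trace.log("select", plan="plan-B", utility=0.83)
```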

5.7 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Planning frameworks | “Building AI Agents” | Ch. 2-3 |
| Utility and trade-offs | “AI Engineering” | Ch. 6 |
| Search strategies | “Algorithms, Fourth Edition” | Graph search |

5.8 Common Pitfalls and Debugging

Problem 1: planner never settles

  • Why: no iteration budget.
  • Fix: enforce max-depth and max-repair.
  • Quick test: ambiguous task terminates in bounded steps.

Problem 2: same plan always wins

  • Why: scoring features too narrow.
  • Fix: add diversity penalty and evidence quality factor.
  • Quick test: task variants yield varied winners.
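
One possible shape for the diversity-penalty fix, assuming plans are represented as lists of step identifiers: greedy re-ranking that discounts a plan's base score by its maximum step overlap (Jaccard similarity) with plans already picked. The weight and the similarity measure are assumptions, not the only choices.

```python
def jaccard(a, b):
    """Step-overlap similarity between two plans, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_with_diversity(plans, base_score, weight=0.3):
    """plans: {name: step_list}. Greedily rank, penalizing near-duplicates."""
    ranked, remaining = [], dict(plans)
    while remaining:
        def adjusted(name):
            # Penalty grows with similarity to any already-ranked plan,
            # so a clone of the current winner no longer always wins.
            penalty = max((jaccard(remaining[name], plans[p]) for p in ranked),
                          default=0.0)
            return base_score[name] - weight * penalty
        pick = max(remaining, key=adjusted)
        ranked.append(pick)
        del remaining[pick]
    return ranked
```

For example, with plan B an exact copy of plan A, a distinct plan C can outrank B in second place even if C's base score is lower.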

5.9 Definition of Done

  • Multi-plan generation and scoring works
  • Constraint violations are explicit and traceable
  • Critique loop is bounded and useful
  • Final decision trace is reproducible