Project 13: Durable Workflow Agent Runtime
Build a workflow-backed agent runtime that survives restarts, supports pauses, and records auditable transitions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 16-28 hours |
| Language | TypeScript (alt: Python, Java) |
| Prerequisites | Projects 5, 6, 9 |
| Key Topics | durable execution, retries, compensation, human gates |
Learning Objectives
- Model agent execution as explicit workflow states.
- Separate deterministic state transitions from side effects.
- Resume blocked workflows safely after outages/restarts.
- Implement compensation for partial failure paths.
The Core Question You’re Answering
“How do you make long-running agents reliable when they can be interrupted at any point?”
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Durable execution | Survive process/runtime failures | LangGraph docs |
| Compensation patterns | Recover from partial side effects | Workflow architecture references |
| Human pause/resume | Required for high-risk actions | Production approval workflow patterns |
Theoretical Foundation
Goal -> Workflow Start -> Step N -> (Blocked?) -> Resume -> Complete
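A durable workflow is a state machine whose state is persisted after every transition, so an interrupted run can be rebuilt by deterministically replaying its recorded transitions.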
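
A minimal sketch of that idea, assuming hypothetical state and event names (`WorkflowState`, `WorkflowEvent`, `transition`, and `replay` are illustrative and not taken from any particular engine): the transition logic is a pure function, and replay is a fold over the persisted event log.

```typescript
// Hypothetical state and event names, chosen only for illustration.
type WorkflowState =
  | { status: "running"; step: number }
  | { status: "blocked"; step: number; reason: string }
  | { status: "complete"; step: number };

type WorkflowEvent =
  | { type: "step_succeeded" }
  | { type: "step_blocked"; reason: string }
  | { type: "resumed" }
  | { type: "finished" };

const initialState: WorkflowState = { status: "running", step: 0 };

// Pure function: no clocks, randomness, or I/O, so the same inputs
// always produce the same output. Events that do not apply are ignored.
function transition(state: WorkflowState, event: WorkflowEvent): WorkflowState {
  switch (event.type) {
    case "step_succeeded":
      return state.status === "running"
        ? { status: "running", step: state.step + 1 }
        : state;
    case "step_blocked":
      return state.status === "running"
        ? { status: "blocked", step: state.step, reason: event.reason }
        : state;
    case "resumed":
      return state.status === "blocked"
        ? { status: "running", step: state.step }
        : state;
    case "finished":
      return state.status === "running"
        ? { status: "complete", step: state.step }
        : state;
  }
}

// Replay rebuilds the current state by folding the persisted event log.
function replay(log: WorkflowEvent[]): WorkflowState {
  return log.reduce(transition, initialState);
}

const log: WorkflowEvent[] = [
  { type: "step_succeeded" },
  { type: "step_blocked", reason: "manual approval" },
  { type: "resumed" },
  { type: "step_succeeded" },
  { type: "finished" },
];
const finalState = replay(log);
```

Because `transition` never touches the outside world, replaying the same log after a restart yields the same state, and replaying a prefix of the log shows the state at any earlier point.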
Project Specification
What You’ll Build
A workflow engine integration that:
- Encodes plan/act/observe as workflow nodes
- Saves checkpoints after each state transition
- Handles human approval waits
- Replays failed runs for debugging
Functional Requirements
- State transitions with persisted checkpoints
- Retry policies per step type
- Human approval wait/resume endpoint
- Failure classification and compensation
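
The retry and failure-classification requirements above could be sketched as follows. The step types, attempt limits, and the `StepError` shape are assumptions made for illustration; a real runtime would classify errors according to what its own adapters actually report.

```typescript
// Hypothetical step types and policy values; tune them for your own tools.
type StepType = "llm_call" | "tool_call" | "human_gate";
type FailureClass = "transient" | "permanent";

interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first one
}

const retryPolicies: Record<StepType, RetryPolicy> = {
  llm_call: { maxAttempts: 4 },
  tool_call: { maxAttempts: 3 },
  human_gate: { maxAttempts: 1 }, // never auto-retry a human decision
};

// Assumed error shape: adapters attach an HTTP-like status code when they have one.
interface StepError {
  message: string;
  statusCode?: number;
}

function classifyFailure(err: StepError): FailureClass {
  const code = err.statusCode;
  if (code === undefined) return "transient"; // e.g. dropped connection, no response
  if (code === 408 || code === 429 || code >= 500) return "transient";
  return "permanent"; // other 4xx: repeating the same request will not help
}

type NextAction = "retry" | "compensate";

// `attempt` is the 1-based number of the attempt that just failed.
function decide(stepType: StepType, attempt: number, err: StepError): NextAction {
  const transient = classifyFailure(err) === "transient";
  const attemptsLeft = attempt < retryPolicies[stepType].maxAttempts;
  return transient && attemptsLeft ? "retry" : "compensate";
}
```

Keeping the policy table separate from the classifier means a step type can be made more or less aggressive without touching the error-handling logic.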
Non-Functional Requirements
- Deterministic, replayable transitions
- Idempotent side-effect adapters
- Full transition history export
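
One possible shape for the history export, sketched under the assumption that each persisted transition is stored as a small record. The `TransitionRecord` fields and the choice of JSON Lines as the export format are illustrative, not prescribed by the project.

```typescript
// Hypothetical record shape for one persisted transition.
interface TransitionRecord {
  workflowId: string;
  seq: number;  // position in the workflow's history, starting at 1
  from: string; // state before the transition
  to: string;   // state after the transition
  at: string;   // ISO 8601 timestamp captured when the transition was persisted
}

// JSON Lines: one JSON object per line, ordered by seq.
// The input array is copied before sorting so the caller's data is untouched.
function exportHistory(records: TransitionRecord[]): string {
  return [...records]
    .sort((a, b) => a.seq - b.seq)
    .map((r) => JSON.stringify(r))
    .join("\n");
}

// Inverse of exportHistory; blank lines are skipped.
function importHistory(jsonl: string): TransitionRecord[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TransitionRecord);
}
```

A line-per-transition format is easy to diff between two runs, which helps when comparing a failed run against its replay.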
Real World Outcome
$ pnpm run p13 --goal "vendor onboarding review"
[workflow] id=wf_001 started
[step] docs_collect=ok
[step] policy_screen=blocked(manual approval)
[resume] approval_received
[step] final_decision=approved_with_controls
[artifact] onboarding_decision.json
Architecture Overview
Workflow Engine
|- Planner Node
|- Executor Node
|- Policy Gate Node
|- Approval Wait Node
|- Finalizer Node
Implementation Guide
Phase 1: State Model
- Define states, transitions, and persisted payload shapes.
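
A minimal sketch of a state model with persisted checkpoints and an approval wait. All names here (`Checkpoint`, `CheckpointStore`, `WorkflowRuntime`) are hypothetical, the step names are borrowed from the sample run above, and an in-memory `Map` stands in for a real database so the example stays self-contained.

```typescript
type WorkflowStatus = "running" | "waiting_approval" | "completed";

// Persisted payload shape: everything needed to resume after a restart.
interface Checkpoint {
  workflowId: string;
  status: WorkflowStatus;
  nextStep: number;  // index of the next step to execute
  history: string[]; // ordered transition log for audit export
}

class CheckpointStore {
  private rows = new Map<string, string>();
  save(cp: Checkpoint): void {
    this.rows.set(cp.workflowId, JSON.stringify(cp)); // serialize as a database would
  }
  load(workflowId: string): Checkpoint | undefined {
    const raw = this.rows.get(workflowId);
    return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
  }
}

const steps = ["docs_collect", "policy_screen", "final_decision"];
const needsApproval = new Set(["policy_screen"]);

class WorkflowRuntime {
  private store: CheckpointStore;
  constructor(store: CheckpointStore) {
    this.store = store;
  }

  start(workflowId: string): Checkpoint {
    const cp: Checkpoint = { workflowId, status: "running", nextStep: 0, history: ["started"] };
    this.store.save(cp);
    return this.run(cp);
  }

  // Called by the approval endpoint. State comes from the store, not from memory.
  resume(workflowId: string): Checkpoint {
    const cp = this.store.load(workflowId);
    if (!cp) throw new Error(`unknown workflow ${workflowId}`);
    if (cp.status !== "waiting_approval") return cp; // duplicate resume is a no-op
    cp.status = "running";
    cp.history.push("approval_received");
    cp.nextStep += 1; // the gated step is treated as passed once approved
    this.store.save(cp);
    return this.run(cp);
  }

  private run(cp: Checkpoint): Checkpoint {
    for (const step of steps.slice(cp.nextStep)) {
      if (needsApproval.has(step)) {
        cp.status = "waiting_approval";
        cp.history.push(`${step}=blocked`);
        this.store.save(cp); // checkpoint before entering the wait state
        return cp;
      }
      cp.history.push(`${step}=ok`);
      cp.nextStep += 1;
      this.store.save(cp); // checkpoint after every transition
    }
    cp.status = "completed";
    cp.history.push("completed");
    this.store.save(cp);
    return cp;
  }
}

// Usage: a second runtime instance simulates a process restart.
const store = new CheckpointStore();
const blocked = new WorkflowRuntime(store).start("wf_001");
const done = new WorkflowRuntime(store).resume("wf_001");
```

The second `WorkflowRuntime` shares nothing with the first except the store, which is the property a restart test needs to verify.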
Phase 2: Side-Effect Boundaries
- Move external calls into idempotent activity wrappers.
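
An idempotent activity wrapper might look like the sketch below. `runOnce`, `idempotencyKey`, and the `completedEffects` map are hypothetical helpers; the map stands in for a durable table, and the wrapper is synchronous only to keep the example short (real adapters are usually asynchronous). Results are assumed to be JSON-serializable.

```typescript
// Stand-in for a durable "completed effects" table keyed by idempotency key.
const completedEffects = new Map<string, string>();

// The key must be stable across retries and replays, so it is derived from
// the workflow id and step name, never from a timestamp or attempt counter.
function idempotencyKey(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`;
}

function runOnce<T>(key: string, effect: () => T): T {
  const cached = completedEffects.get(key);
  if (cached !== undefined) {
    return JSON.parse(cached) as T; // replay path: reuse the recorded result
  }
  const result = effect(); // first run: perform the external call
  completedEffects.set(key, JSON.stringify(result)); // record only after success
  return result;
}

// Usage: a fake external call that counts how many times it really ran.
let emailsSent = 0;
function sendEmail(to: string): { delivered: boolean; to: string } {
  emailsSent += 1;
  return { delivered: true, to };
}

const key = idempotencyKey("wf_001", "notify_vendor");
const first = runOnce(key, () => sendEmail("vendor@example.com"));
const second = runOnce(key, () => sendEmail("vendor@example.com")); // retry or replay
```

This sketch still has a gap: a crash after `effect()` returns but before the result is recorded would cause a second call on retry. Passing the same key to the external service, where the service supports idempotency keys, is a common way to close that gap.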
Phase 3: Recovery + Replay
- Implement replay mode and compensation actions.
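
Compensation is often implemented in the style of the saga pattern: each completed step registers an undo action, and on failure the undo actions run in reverse order. The sketch below assumes synchronous steps and compensations that cannot fail; `SagaStep` and `runSaga` are illustrative names, and a production version would also persist progress and retry failed compensations.

```typescript
// Hypothetical step shape: each action is paired with an undo.
interface SagaStep {
  name: string;
  action: () => void;
  compensate: () => void;
}

interface SagaResult {
  ok: boolean;
  completed: string[];   // steps whose action succeeded
  compensated: string[]; // steps undone, in the order they were undone
  error?: string;
}

function runSaga(sagaSteps: SagaStep[]): SagaResult {
  const succeeded: SagaStep[] = [];
  try {
    for (const step of sagaSteps) {
      step.action();
      succeeded.push(step); // only completed steps need compensation
    }
    return { ok: true, completed: succeeded.map((s) => s.name), compensated: [] };
  } catch (err) {
    const compensated: string[] = [];
    // Undo in reverse order so later effects are rolled back first.
    for (const step of [...succeeded].reverse()) {
      step.compensate();
      compensated.push(step.name);
    }
    return {
      ok: false,
      completed: succeeded.map((s) => s.name),
      compensated,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}

// Usage: the third step fails, so the first two are undone in reverse order.
const ledger: string[] = [];
const outcome = runSaga([
  { name: "reserve_account", action: () => { ledger.push("reserve"); }, compensate: () => { ledger.push("release"); } },
  { name: "send_contract", action: () => { ledger.push("send"); }, compensate: () => { ledger.push("retract"); } },
  { name: "activate_vendor", action: () => { throw new Error("policy check failed"); }, compensate: () => { ledger.push("deactivate"); } },
]);
```

The failed step itself is not compensated because its action never completed; whether that assumption holds depends on whether the action can fail after a partial side effect.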
Testing Strategy
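- Restart the process while a workflow is blocked, then verify it resumes from the last checkpoint
- Deliver duplicate messages and retries, then verify each side effect runs only once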
- Compensation correctness tests
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Replay repeats side effects | duplicate external actions | side-effect isolation + idempotency keys |
| Lost approval context | cannot resume | checkpoint before wait state |
| Retry storms | cascading failures | bounded retries + exponential backoff |
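
For the retry-storm fix in the table above, a bounded exponential backoff schedule can be computed as in this sketch. The option names and the numbers in the usage line are assumptions, and the jitter helper follows the commonly used "full jitter" approach of picking a random wait below the computed delay.

```typescript
// Hypothetical options; choose limits that match your downstream rate limits.
interface BackoffOptions {
  baseMs: number;      // delay before the first retry
  maxMs: number;       // cap so delays stop growing
  maxAttempts: number; // total attempts, including the first
}

// Delay before retry number `retry` (1-based): base * 2^(retry - 1), capped at maxMs.
function backoffDelayMs(retry: number, opts: BackoffOptions): number {
  return Math.min(opts.maxMs, opts.baseMs * Math.pow(2, retry - 1));
}

// Full schedule of waits for one step; it holds maxAttempts - 1 entries,
// so the number of retries is bounded.
function retrySchedule(opts: BackoffOptions): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < opts.maxAttempts; retry++) {
    delays.push(backoffDelayMs(retry, opts));
  }
  return delays;
}

// Full jitter: pick a random wait in [0, delayMs) so workflows that failed
// together do not all retry at the same instant. `random` is injectable for tests.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

const schedule = retrySchedule({ baseMs: 200, maxMs: 2000, maxAttempts: 6 });
// schedule is [200, 400, 800, 1600, 2000]
```

Once the schedule is exhausted, the step should be handed to failure handling rather than retried again.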
Interview Questions They’ll Ask
- Why use a workflow engine for agents?
- What makes replay deterministic?
- How do you design compensation safely?
- What metrics indicate stuck workflows?
Hints in Layers
- Hint 1: Encode each step outcome as typed status.
- Hint 2: Persist workflow state before performing any external action.
- Hint 3: Add manual intervention hooks early.
- Hint 4: Build replay UI/logs for failed runs.
Submission / Completion Criteria
Minimum Completion
- Workflow persists and resumes across restart
Full Completion
- Side-effect isolation + compensation + approval pause/resume
Excellence
- One-command replay for root-cause analysis