Project 13: Durable Workflow Agent Runtime

Build a workflow-backed agent runtime that survives restarts, supports pauses, and records auditable transitions.

Quick Reference

Attribute	Value
Difficulty	Level 4: Expert
Time Estimate	16-28 hours
Language	TypeScript (alt: Python, Java)
Prerequisites	Projects 5, 6, 9
Key Topics	durable execution, retries, compensation, human gates

Learning Objectives

Model agent execution as explicit workflow states.
Separate deterministic state transitions from side effects.
Resume blocked workflows safely after outages/restarts.
Implement compensation for partial failure paths.

The Core Question You’re Answering

“How do you make long-running agents reliable when they can be interrupted at any point?”

Concepts You Must Understand First

Concept	Why It Matters	Where to Learn
Durable execution	Survive process/runtime failures	LangGraph docs
Compensation patterns	Recover from partial side effects	Workflow architecture references
Human pause/resume	Required for high-risk actions	Production approval workflow patterns

Theoretical Foundation

Goal -> Workflow Start -> Step N -> (Blocked?) -> Resume -> Complete

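Durable workflows are state machines with persistence and deterministic replay: every transition is persisted, and re-applying the recorded history to the same transition logic reproduces the same state.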
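To make the state-machine framing concrete, here is a minimal sketch of a pure transition function and a replay helper. The state and event names are hypothetical and chosen only for illustration; they are not prescribed by this project.

```typescript
// Hypothetical state and event names, chosen for illustration only.
type State = "planning" | "acting" | "blocked" | "completed";
type WorkflowEvent =
  | { type: "plan_ready" }
  | { type: "approval_required" }
  | { type: "approval_received" }
  | { type: "step_succeeded" };

// Pure transition function: no I/O, no clock, no randomness.
// Events that do not apply to the current state leave it unchanged.
function transition(state: State, event: WorkflowEvent): State {
  switch (state) {
    case "planning":
      return event.type === "plan_ready" ? "acting" : state;
    case "acting":
      if (event.type === "approval_required") return "blocked";
      if (event.type === "step_succeeded") return "completed";
      return state;
    case "blocked":
      return event.type === "approval_received" ? "acting" : state;
    case "completed":
      return state;
  }
}

// Replay folds a persisted event log through the same pure function.
function replay(events: WorkflowEvent[], initial: State = "planning"): State {
  return events.reduce(transition, initial);
}
```

Because `transition` depends only on its arguments, replaying the same log always ends in the same state, which is the property that makes resuming after a restart safe.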

Project Specification

What You’ll Build

A workflow engine integration that:

Encodes plan/act/observe as workflow nodes
Saves checkpoints after each state transition
Handles human approval waits
Replays failed runs for debugging
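The checkpoint and approval-wait items above can be sketched as follows. This is an assumption-laden illustration: the `Checkpoint` shape, the function names, and the in-memory `Map` (a stand-in for a durable store) are all hypothetical.

```typescript
// Hypothetical checkpoint shape; field names are illustrative assumptions.
interface Checkpoint {
  workflowId: string;
  node: "plan" | "act" | "observe" | "approval_wait" | "done";
  pendingApproval?: { requestedFor: string };
}

// In-memory stand-in for a durable store (a real runtime would use a database).
const checkpoints = new Map<string, Checkpoint>();

function saveCheckpoint(cp: Checkpoint): void {
  // Round-trip through JSON so the stored record is a detached copy.
  checkpoints.set(cp.workflowId, JSON.parse(JSON.stringify(cp)) as Checkpoint);
}

// Persist the wait state before returning control to the caller.
function pauseForApproval(workflowId: string, action: string): Checkpoint {
  const cp: Checkpoint = {
    workflowId,
    node: "approval_wait",
    pendingApproval: { requestedFor: action },
  };
  saveCheckpoint(cp);
  return cp;
}

// Resume only if the persisted checkpoint is actually waiting for approval.
function resumeAfterApproval(workflowId: string, approved: boolean): Checkpoint {
  const cp = checkpoints.get(workflowId);
  if (!cp || cp.node !== "approval_wait") {
    throw new Error(`workflow ${workflowId} is not waiting for approval`);
  }
  const next: Checkpoint = { workflowId, node: approved ? "act" : "done" };
  saveCheckpoint(next);
  return next;
}
```

Rejecting a resume when no wait is pending also guards against a duplicate approval message advancing the workflow twice.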

Functional Requirements

State transitions with persisted checkpoints
Retry policies per step type
Human approval wait/resume endpoint
Failure classification and compensation
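One possible way to combine per-step retry policies with failure classification is sketched below. The step types, attempt limits, and the `StepFailure` fields are assumptions made for the example, and the classification rule is deliberately simplistic.

```typescript
// Hypothetical step types and limits; tune these for a real system.
type StepType = "llm_call" | "external_api" | "human_gate";
type FailureClass = "transient" | "permanent" | "needs_compensation";
type Decision = "retry" | "fail" | "compensate";

const retryPolicies: Record<StepType, { maxAttempts: number }> = {
  llm_call: { maxAttempts: 3 },
  external_api: { maxAttempts: 5 },
  human_gate: { maxAttempts: 1 },
};

interface StepFailure {
  stepType: StepType;
  attempt: number; // 1-based attempt that just failed
  retryable: boolean; // e.g. a timeout rather than a validation error
  sideEffectsCommitted: boolean; // some external change already happened
}

// Simplified rule: committed side effects always require compensation.
function classifyFailure(f: StepFailure): FailureClass {
  if (f.sideEffectsCommitted) return "needs_compensation";
  return f.retryable ? "transient" : "permanent";
}

// Retry transient failures only while the step's attempt budget remains.
function decide(f: StepFailure): Decision {
  const cls = classifyFailure(f);
  if (cls === "needs_compensation") return "compensate";
  if (cls === "transient" && f.attempt < retryPolicies[f.stepType].maxAttempts) {
    return "retry";
  }
  return "fail";
}
```

Keeping `decide` free of I/O means the retry decision itself can be replayed and unit-tested like any other transition.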

Non-Functional Requirements

Deterministic replayable transitions
Idempotent side-effect adapters
Full transition history export
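For the history export requirement, a minimal sketch is an append-only log serialized as JSON Lines. The record fields and the choice of JSON Lines are assumptions for illustration; the timestamp is supplied by the caller so the log code stays deterministic.

```typescript
// Hypothetical record shape for one persisted transition.
interface TransitionRecord {
  seq: number;
  workflowId: string;
  from: string;
  to: string;
  at: string; // ISO timestamp supplied by the caller
}

class TransitionHistory {
  private records: TransitionRecord[] = [];

  // Append-only: records are never edited or removed.
  append(workflowId: string, from: string, to: string, at: string): TransitionRecord {
    const record: TransitionRecord = {
      seq: this.records.length + 1,
      workflowId,
      from,
      to,
      at,
    };
    this.records.push(record);
    return record;
  }

  // One JSON object per line, in the order the transitions occurred.
  exportJsonl(workflowId: string): string {
    return this.records
      .filter((r) => r.workflowId === workflowId)
      .map((r) => JSON.stringify(r))
      .join("\n");
  }
}
```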

Real World Outcome

$ pnpm run p13 --goal "vendor onboarding review"
[workflow] id=wf_001 started
[step] docs_collect=ok
[step] policy_screen=blocked(manual approval)
[resume] approval_received
[step] final_decision=approved_with_controls
[artifact] onboarding_decision.json

Architecture Overview

Workflow Engine
  |- Planner Node
  |- Executor Node
  |- Policy Gate Node
  |- Approval Wait Node
  |- Finalizer Node

Implementation Guide

Phase 1: State Model

Define states, transitions, and persisted payload shapes.
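A possible starting point for Phase 1 is a discriminated union for states, an explicit table of allowed transitions, and a versioned envelope for persistence. All names below, including `schemaVersion`, are hypothetical.

```typescript
// Hypothetical state model; each status carries only the data it needs.
type WorkflowState =
  | { status: "running"; step: string }
  | { status: "blocked"; step: string; reason: string }
  | { status: "completed"; artifact: string }
  | { status: "failed"; step: string; error: string };

type Status = WorkflowState["status"];

// Explicit transition table: terminal states allow nothing further.
const allowedTransitions: Record<Status, Status[]> = {
  running: ["running", "blocked", "completed", "failed"],
  blocked: ["running", "failed"],
  completed: [],
  failed: [],
};

function canTransition(from: Status, to: Status): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

// Versioned envelope so persisted payloads can evolve safely.
const CURRENT_SCHEMA_VERSION = 1;

interface PersistedEnvelope {
  schemaVersion: number;
  workflowId: string;
  state: WorkflowState;
}

function serializeState(workflowId: string, state: WorkflowState): string {
  const envelope: PersistedEnvelope = {
    schemaVersion: CURRENT_SCHEMA_VERSION,
    workflowId,
    state,
  };
  return JSON.stringify(envelope);
}

// Only the version is checked here; a real loader would validate the full shape.
function deserializeState(raw: string): PersistedEnvelope {
  const parsed = JSON.parse(raw) as PersistedEnvelope;
  if (parsed.schemaVersion !== CURRENT_SCHEMA_VERSION) {
    throw new Error(`unsupported schemaVersion: ${String(parsed.schemaVersion)}`);
  }
  return parsed;
}
```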

Phase 2: Side-Effect Boundaries

Move external calls into idempotent activity wrappers.
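An idempotent activity wrapper might look like the sketch below. It is synchronous and keeps results in memory purely for illustration; a real adapter would likely be async and store keys durably. The class name and key format are assumptions.

```typescript
// Wraps a side effect so repeated runs with the same key execute it only once.
class IdempotentActivity<I, O> {
  private results = new Map<string, O>();
  private effect: (input: I) => O;

  constructor(effect: (input: I) => O) {
    this.effect = effect;
  }

  run(idempotencyKey: string, input: I): O {
    if (this.results.has(idempotencyKey)) {
      // Retry or replay: return the recorded result without re-running the effect.
      return this.results.get(idempotencyKey) as O;
    }
    const output = this.effect(input);
    this.results.set(idempotencyKey, output);
    return output;
  }
}

// Usage: the key is derived from workflow id and step, so it is stable across retries.
let notificationsSent = 0;
const notifyVendor = new IdempotentActivity((vendor: string) => {
  notificationsSent += 1; // stand-in for an external call
  return `notified:${vendor}`;
});

const firstResult = notifyVendor.run("wf_001:notify_vendor", "acme");
const retriedResult = notifyVendor.run("wf_001:notify_vendor", "acme");
```

After both calls the effect has run once, and the retry receives the recorded result.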

Phase 3: Recovery + Replay

Implement replay mode and compensation actions.
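Compensation is commonly structured as a saga: each step pairs an action with an undo, and a failure triggers the undos of completed steps in reverse order. The sketch below assumes synchronous steps and compensations that do not themselves fail; `SagaStep` and `runSaga` are hypothetical names.

```typescript
interface SagaStep {
  name: string;
  action: () => void;
  compensate: () => void;
}

// Runs steps in order; on failure, compensates completed steps newest-first.
function runSaga(steps: SagaStep[]): { ok: boolean; log: string[] } {
  const completed: SagaStep[] = [];
  const log: string[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
      log.push(`done:${step.name}`);
    } catch {
      log.push(`failed:${step.name}`);
      // The failed step is not compensated here: it never completed.
      for (const prev of [...completed].reverse()) {
        prev.compensate();
        log.push(`compensated:${prev.name}`);
      }
      return { ok: false, log };
    }
  }
  return { ok: true, log };
}
```

In a durable runtime the `completed` list would come from the persisted history rather than a local array, so compensation can still run after a restart.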

Testing Strategy

Restart tests during blocked states
Duplicate message/retry tests
Compensation correctness tests
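A restart test can be simulated without killing a process by serializing the runtime's state and restoring it into a fresh instance. The `Runtime` class below is a hypothetical miniature used only to show the shape of such a test.

```typescript
interface Snapshot {
  workflowId: string;
  status: "running" | "blocked" | "completed";
  step: string;
}

class Runtime {
  private snapshots = new Map<string, Snapshot>();

  // Simulates process start: rebuild state purely from persisted data.
  static restore(serialized: string): Runtime {
    const rt = new Runtime();
    for (const s of JSON.parse(serialized) as Snapshot[]) {
      rt.snapshots.set(s.workflowId, s);
    }
    return rt;
  }

  block(workflowId: string, step: string): void {
    this.snapshots.set(workflowId, { workflowId, status: "blocked", step });
  }

  approve(workflowId: string): Snapshot {
    const current = this.snapshots.get(workflowId);
    if (!current || current.status !== "blocked") {
      throw new Error(`workflow ${workflowId} is not blocked`);
    }
    const next: Snapshot = { ...current, status: "running" };
    this.snapshots.set(workflowId, next);
    return next;
  }

  get(workflowId: string): Snapshot | undefined {
    return this.snapshots.get(workflowId);
  }

  // Stand-in for what a durable store would hold at crash time.
  dump(): string {
    return JSON.stringify(Array.from(this.snapshots.values()));
  }
}

// Test shape: block, "restart", then resume on the restored instance.
const beforeRestart = new Runtime();
beforeRestart.block("wf_001", "policy_screen");
const afterRestart = Runtime.restore(beforeRestart.dump());
const resumedSnapshot = afterRestart.approve("wf_001");
```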

Common Pitfalls & Debugging

Pitfall	Symptom	Fix
Replay repeats side effects	duplicate external actions	side-effect isolation + idempotency keys
Lost approval context	cannot resume	checkpoint before wait state
Retry storms	cascading failures	bounded retries + exponential backoff
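The "bounded retries + exponential backoff" fix in the last row can be sketched as a pure delay schedule. The function name and parameters are assumptions; jitter, which is often added in practice to spread out retries, is omitted to keep the output deterministic.

```typescript
// Returns the wait (in ms) before each retry. With maxAttempts total attempts
// there are maxAttempts - 1 retries, so the array has that many entries.
function backoffDelays(maxAttempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    // Double the delay after each failed attempt, but never exceed the cap.
    delays.push(Math.min(capMs, baseMs * 2 ** (attempt - 1)));
  }
  return delays;
}

// Example: 5 attempts, 100 ms base, 500 ms cap -> [100, 200, 400, 500]
const exampleDelays = backoffDelays(5, 100, 500);
```

Bounding both the attempt count and the maximum delay keeps a failing dependency from being hammered indefinitely while still giving transient faults time to clear.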

Interview Questions They’ll Ask

Why use workflow engines for agents?
What makes replay deterministic?
How do you design compensation safely?
What metrics indicate stuck workflows?

Hints in Layers

Hint 1: Encode each step outcome as typed status.
Hint 2: Persist before external actions, so a crash mid-call leaves a record of what was attempted.
Hint 3: Add manual intervention hooks early.
Hint 4: Build replay UI/logs for failed runs.

Submission / Completion Criteria

Minimum Completion

Workflow persists and resumes across a restart

Full Completion

Side-effect isolation + compensation + approval pause/resume

Excellence

One-command replay for root-cause analysis