Project 13: Durable Workflow Agent Runtime

Build a workflow-backed agent runtime that survives restarts, supports pauses, and records auditable transitions.


Quick Reference

Attribute | Value
Difficulty | Level 4: Expert
Time Estimate | 16-28 hours
Language | TypeScript (alt: Python, Java)
Prerequisites | Projects 5, 6, 9
Key Topics | durable execution, retries, compensation, human gates

Learning Objectives

  1. Model agent execution as explicit workflow states.
  2. Separate deterministic state transitions from side effects.
  3. Resume blocked workflows safely after outages/restarts.
  4. Implement compensation for partial failure paths.

The Core Question You’re Answering

“How do you make long-running agents reliable when they can be interrupted at any point?”


Concepts You Must Understand First

Concept | Why It Matters | Where to Learn
Durable execution | Survive process/runtime failures | LangGraph docs
Compensation patterns | Recover from partial side effects | Workflow architecture references
Human pause/resume | Required for high-risk actions | Production approval workflow patterns

Theoretical Foundation

Goal -> Workflow Start -> Step N -> (Blocked?) -> Resume -> Complete

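Durable workflows are state machines with persistence and deterministic replay: every transition is recorded, and after a crash the runtime rebuilds the current state by replaying that record. This only works if each transition is a pure function of the prior state and the recorded input, with no clock reads, randomness, or external calls inside it.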
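To make the replay idea concrete, here is a minimal sketch that rebuilds workflow state by folding a persisted event log through a pure transition function. The event kinds, the WfState shape, and the function names are assumptions made for illustration; they do not come from any particular workflow engine.

```typescript
// Hypothetical event and state shapes for illustration only.
type WfEvent =
  | { kind: "started"; goal: string }
  | { kind: "step_done"; step: string; outcome: string }
  | { kind: "blocked"; step: string; reason: string }
  | { kind: "resumed"; approvedBy: string }
  | { kind: "completed"; decision: string };

interface WfState {
  phase: "idle" | "running" | "blocked" | "done";
  completedSteps: string[];
  decision?: string;
}

const initialState: WfState = { phase: "idle", completedSteps: [] };

// Pure transition function: no I/O, no clock, no randomness.
function applyEvent(s: WfState, e: WfEvent): WfState {
  switch (e.kind) {
    case "started":
      return { ...s, phase: "running" };
    case "step_done":
      return { ...s, completedSteps: [...s.completedSteps, e.step] };
    case "blocked":
      return { ...s, phase: "blocked" };
    case "resumed":
      return { ...s, phase: "running" };
    case "completed":
      return { ...s, phase: "done", decision: e.decision };
  }
}

// Replay = fold the persisted log through the pure transition function.
function replay(log: readonly WfEvent[]): WfState {
  return log.reduce(applyEvent, initialState);
}

// Log as it might look on disk when the process died during an approval wait.
const persistedLog: WfEvent[] = [
  { kind: "started", goal: "vendor onboarding review" },
  { kind: "step_done", step: "docs_collect", outcome: "ok" },
  { kind: "blocked", step: "policy_screen", reason: "manual approval" },
];

const recovered = replay(persistedLog);
console.log(recovered.phase); // "blocked"
```

Because applyEvent never touches the outside world, replaying the same log always yields the same state, which is what lets a restarted process pick up exactly where the previous one stopped.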


Project Specification

What You’ll Build

A workflow engine integration that:

  • Encodes plan/act/observe as workflow nodes
  • Saves checkpoints after each state transition
  • Handles human approval waits
  • Replays failed runs for debugging

Functional Requirements

  1. State transitions with persisted checkpoints
  2. Retry policies per step type
  3. Human approval wait/resume endpoint
  4. Failure classification and compensation
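Requirements 2 and 4 can be sketched together as a lookup table of retry policies plus a small failure classifier. The step type names, the policy fields, and the status-code rule below are assumptions for illustration; a real project would choose its own categories and limits.

```typescript
// Hypothetical step types and policy fields, chosen for illustration only.
type StepType = "llm_call" | "http_tool" | "human_gate";
type FailureClass = "transient" | "permanent";

interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number;  // cap on exponential growth
}

const retryPolicies: Record<StepType, RetryPolicy> = {
  llm_call: { maxAttempts: 6, baseDelayMs: 500, maxDelayMs: 4000 },
  http_tool: { maxAttempts: 3, baseDelayMs: 200, maxDelayMs: 2000 },
  human_gate: { maxAttempts: 1, baseDelayMs: 0, maxDelayMs: 0 }, // never auto-retried
};

// Assumed rule: rate limits and server errors are worth retrying, other codes are not.
function classifyFailure(statusCode: number): FailureClass {
  return statusCode === 429 || statusCode >= 500 ? "transient" : "permanent";
}

// Returns the delay before the next attempt, or null when the step should
// stop retrying and hand over to compensation.
function nextRetryDelayMs(
  stepType: StepType,
  attemptsSoFar: number,
  failure: FailureClass,
): number | null {
  const policy = retryPolicies[stepType];
  if (failure === "permanent" || attemptsSoFar >= policy.maxAttempts) {
    return null;
  }
  // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
  const delay = policy.baseDelayMs * 2 ** (attemptsSoFar - 1);
  return Math.min(delay, policy.maxDelayMs);
}

console.log(nextRetryDelayMs("http_tool", 2, classifyFailure(503))); // 400
```

Keeping the policy in data rather than in each step makes the retry bound easy to audit, and returning null gives the workflow one explicit place to switch from retrying to compensating.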

Non-Functional Requirements

  • Deterministic replayable transitions
  • Idempotent side-effect adapters
  • Full transition history export

Real World Outcome

$ pnpm run p13 --goal "vendor onboarding review"
[workflow] id=wf_001 started
[step] docs_collect=ok
[step] policy_screen=blocked(manual approval)
[resume] approval_received
[step] final_decision=approved_with_controls
[artifact] onboarding_decision.json

Architecture Overview

Workflow Engine
  |- Planner Node
  |- Executor Node
  |- Policy Gate Node
  |- Approval Wait Node
  |- Finalizer Node

Implementation Guide

Phase 1: State Model

  • Define states, transitions, and persisted payload shapes.
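One possible starting point for Phase 1 is a typed set of states, a table of allowed transitions, and a checkpoint payload that is plain JSON. The state names below loosely follow the nodes in the architecture overview, but they, the Checkpoint fields, and the transition function are all assumptions for illustration.

```typescript
// Hypothetical state names; adjust to match your own workflow graph.
type WfNode =
  | "planning"
  | "executing"
  | "policy_gate"
  | "awaiting_approval"
  | "finalizing"
  | "completed"
  | "failed";

// Persisted payload shape: plain JSON so it survives serialization unchanged.
interface Checkpoint {
  workflowId: string;
  node: WfNode;
  seq: number; // monotonically increasing transition counter
  payload: Record<string, unknown>;
}

// Explicit transition table: anything not listed here is rejected.
const allowedTransitions: Record<WfNode, readonly WfNode[]> = {
  planning: ["executing", "failed"],
  executing: ["policy_gate", "failed"],
  policy_gate: ["awaiting_approval", "finalizing", "failed"],
  awaiting_approval: ["finalizing", "failed"],
  finalizing: ["completed", "failed"],
  completed: [],
  failed: [],
};

// Returns a new checkpoint instead of mutating, so earlier history stays intact.
function transition(
  cp: Checkpoint,
  to: WfNode,
  payload: Record<string, unknown> = {},
): Checkpoint {
  if (!allowedTransitions[cp.node].includes(to)) {
    throw new Error(`illegal transition ${cp.node} -> ${to}`);
  }
  return { ...cp, node: to, seq: cp.seq + 1, payload: { ...cp.payload, ...payload } };
}

const startCp: Checkpoint = { workflowId: "wf_001", node: "planning", seq: 0, payload: {} };
const nextCp = transition(startCp, "executing", { plan: "collect docs" });
console.log(nextCp.node, nextCp.seq); // executing 1
```

Rejecting unlisted transitions at this layer means a bug elsewhere surfaces as an immediate error rather than as a corrupted checkpoint discovered during a later replay.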

Phase 2: Side-Effect Boundaries

  • Move external calls into idempotent activity wrappers.
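A minimal sketch of such a wrapper is shown below: it records each activity result under an idempotency key and returns the recorded result on any repeat call. The in-memory map, the key format, and the function names are assumptions for illustration, and the effect is synchronous only to keep the sketch short; real adapters would usually be async and would store results durably.

```typescript
// Hypothetical in-memory result store; a real runtime would persist this
// alongside the workflow checkpoints.
const completedActivities = new Map<string, unknown>();

// Build the key from stable identifiers only, never from time or random values,
// so a retry or replay of the same step produces the same key.
function idempotencyKey(workflowId: string, stepSeq: number, activity: string): string {
  return `${workflowId}:${stepSeq}:${activity}`;
}

// Runs the effect at most once per key; repeats reuse the recorded result.
function runActivityOnce<T>(key: string, effect: () => T): T {
  if (completedActivities.has(key)) {
    return completedActivities.get(key) as T;
  }
  const result = effect();
  completedActivities.set(key, result);
  return result;
}

// Stand-in for an external call with a visible side effect.
let emailsSent = 0;
const sendEmail = (): string => {
  emailsSent += 1;
  return `msg_${emailsSent}`;
};

const notifyKey = idempotencyKey("wf_001", 3, "notify_vendor");
const firstResult = runActivityOnce(notifyKey, sendEmail);
const secondResult = runActivityOnce(notifyKey, sendEmail); // simulated replay
console.log(firstResult === secondResult, emailsSent); // true 1
```

This sketch does not cover a crash between the effect and the write to the store. Closing that gap typically also requires passing the same key to the external service, if it supports deduplication.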

Phase 3: Recovery + Replay

  • Implement replay mode and compensation actions.
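For the compensation half of this phase, one common shape is to pair every forward action with an undo action and, on failure, run the undo actions of completed steps in reverse order. The sketch below illustrates that pattern; the CompensableStep interface, the step labels, and the trace format are hypothetical, and the steps are synchronous for brevity.

```typescript
// Hypothetical step shape: each forward action carries its own undo.
interface CompensableStep {
  label: string;
  run: () => void;        // forward action; throws on failure
  compensate: () => void; // undo action; should itself be idempotent
}

// Runs steps in order. On the first failure, compensates the steps that
// already completed, most recent first, and reports the outcome.
function runWithCompensation(
  steps: CompensableStep[],
  trace: string[],
): "completed" | "compensated" {
  const done: CompensableStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      trace.push(`run:${step.label}`);
      done.push(step);
    } catch (err) {
      trace.push(`fail:${step.label}`);
      for (const prior of done.reverse()) {
        prior.compensate();
        trace.push(`undo:${prior.label}`);
      }
      return "compensated";
    }
  }
  return "completed";
}

const trace: string[] = [];
const outcome = runWithCompensation(
  [
    { label: "create_vendor_record", run: () => {}, compensate: () => {} },
    { label: "grant_portal_access", run: () => {}, compensate: () => {} },
    {
      label: "schedule_payment",
      run: () => { throw new Error("simulated outage"); },
      compensate: () => {},
    },
  ],
  trace,
);
console.log(outcome); // "compensated"
console.log(trace.join(" "));
```

The failed step itself is not compensated here, on the assumption that a step which throws left nothing behind. If a step can fail halfway, its compensate function must tolerate being called against partial state.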

Testing Strategy

  • Restart tests during blocked states
  • Duplicate message/retry tests
  • Compensation correctness tests
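The first two test ideas can be prototyped without a real database by simulating a restart: one runtime instance writes a checkpoint and is discarded, and a second instance that shares only the serialized store resumes the workflow. The FakeDisk and AgentRuntime classes, the checkpoint fields, and the "rejected" decision value are all hypothetical stand-ins for the project's real components.

```typescript
// Hypothetical checkpoint shape for a workflow that can wait on approval.
interface BlockedCheckpoint {
  workflowId: string;
  phase: "running" | "awaiting_approval" | "done";
  pendingStep: string | null;
  decision: string | null;
}

// Stores checkpoints as JSON strings so nothing survives except serialized data.
class FakeDisk {
  private files = new Map<string, string>();
  write(id: string, cp: BlockedCheckpoint): void {
    this.files.set(id, JSON.stringify(cp));
  }
  read(id: string): BlockedCheckpoint {
    const raw = this.files.get(id);
    if (raw === undefined) throw new Error(`no checkpoint for ${id}`);
    return JSON.parse(raw) as BlockedCheckpoint;
  }
}

class AgentRuntime {
  private disk: FakeDisk;
  constructor(disk: FakeDisk) {
    this.disk = disk;
  }
  // Checkpoint before entering the wait state, so approval context is never lost.
  startAndBlock(id: string): void {
    this.disk.write(id, {
      workflowId: id,
      phase: "awaiting_approval",
      pendingStep: "policy_screen",
      decision: null,
    });
  }
  // Resume is a no-op unless the workflow is actually waiting, which makes
  // duplicate approval messages harmless.
  resume(id: string, approved: boolean): BlockedCheckpoint {
    const cp = this.disk.read(id);
    if (cp.phase !== "awaiting_approval") return cp;
    const next: BlockedCheckpoint = {
      ...cp,
      phase: "done",
      pendingStep: null,
      decision: approved ? "approved_with_controls" : "rejected",
    };
    this.disk.write(id, next);
    return next;
  }
}

const disk = new FakeDisk();
new AgentRuntime(disk).startAndBlock("wf_001");          // process A blocks, then is dropped
const afterRestart = new AgentRuntime(disk);              // process B shares only the disk
const resumed = afterRestart.resume("wf_001", true);
const duplicate = afterRestart.resume("wf_001", false);   // redelivered message
console.log(resumed.decision, duplicate.decision);
```

Both values should be the same decision: the second resume call sees a checkpoint that is no longer waiting and returns it unchanged.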

Common Pitfalls & Debugging

Pitfall | Symptom | Fix
Replay repeats side effects | duplicate external actions | side-effect isolation + idempotency keys
Lost approval context | cannot resume | checkpoint before wait state
Retry storms | cascading failures | bounded retries + exponential backoff

Interview Questions They’ll Ask

  1. Why use workflow engines for agents?
  2. What makes replay deterministic?
  3. How do you design compensation safely?
  4. What metrics indicate stuck workflows?

Hints in Layers

  • Hint 1: Encode each step outcome as typed status.
  • Hint 2: Persist the checkpoint before each external action, so a crash mid-call can be detected and retried safely.
  • Hint 3: Add manual intervention hooks early.
  • Hint 4: Build replay UI/logs for failed runs.

Submission / Completion Criteria

Minimum Completion

  • Workflow persists and resumes across restart

Full Completion

  • Side-effect isolation + compensation + approval pause/resume

Excellence

  • One-command replay for root-cause analysis