The first generation of “failure handling” in our agent stack was one function: retryWithBackoff(call, maxAttempts: 3). It caught timeouts. It also caught idempotency conflicts on a refund call and retried until the customer got refunded three times. The fix was not better backoff. The fix was typing the verdict.
Different failures need different next actions. A timeout might warrant a retry; an idempotency conflict means the call already succeeded somewhere upstream; a stale-evidence error means rebuild the context, not try again; a policy denial means escalate to a human. A single retry loop treats them as the same category, and that is the source of every surprising compensation bug I have debugged.
This post is the typed verdict map and the dispatcher in code. The full taxonomy lives in Failure Playbooks; this post is the smallest version that covers the failures you will see in the first quarter of production.
The shape of the contract
ToolResult.error.kind → FailureVerdict → CompensationAction
(with idempotency + reversal-token guard)
Two layers. The first names what kind of failure occurred — the verdict. The second decides what to do about it — the compensation. The dispatcher is the boundary between them, and it is the part the team gets to be opinionated about.
The verdict enum
Six verdicts cover ~90% of the failure patterns I have seen in production. The shape is the contract; you can extend it later.
import type { ToolCall, ToolResult } from "@/types"
export type FailureVerdict =
| "transient_timeout" // network blip, server too slow; safe to retry
| "server_error_5xx" // upstream emitted an error response
| "idempotency_conflict" // we sent this Idempotency-Key already
| "evidence_stale" // evidence snapshot disagrees with current world
| "policy_denied" // gateway / external policy refused execution
| "schema_mismatch" // response shape did not match declared schema
export function classify(call: ToolCall, result: ToolResult): FailureVerdict | null {
if (result.status === "ok") return null
const err = result.error
if (!err) return "server_error_5xx"
if (err.kind === "timeout") return "transient_timeout"
if (err.kind === "5xx") return "server_error_5xx"
if (err.kind === "409_conflict" && err.message.includes("idempotency")) {
return "idempotency_conflict"
}
if (err.kind === "412_precondition" && err.message.includes("evidence")) {
return "evidence_stale"
}
if (err.kind === "403_forbidden") return "policy_denied"
if (err.kind === "schema_validation") return "schema_mismatch"
return "server_error_5xx" // unknown error → conservative bucket
}
The classifier is intentionally narrow. Each branch maps a specific upstream signal — HTTP status plus a token in the message — to a typed verdict. The unknown-error fallback is server_error_5xx, the most conservative bucket: it gets at most two bounded retries, and when those are exhausted the dispatcher reports the failure rather than improvising a riskier compensation.
The classifier reads the error.kind string the adapter set, not the raw HTTP response. Adapters earn their keep by translating their upstream’s vocabulary into this six-verdict language. The dispatcher does not care that one adapter calls it 409 and another calls it IdempotencyConflict; it cares about idempotency_conflict.
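A minimal sketch of what that adapter-side translation might look like. UpstreamHttpError, AdapterError, and the specific mappings are illustrative assumptions, not part of the harness; the only real contract is the error.kind vocabulary that classify() reads.
// Hypothetical upstream error shape for one adapter; AdapterError mirrors the
// fields of ToolResult.error that classify() actually inspects.
type UpstreamHttpError = { status: number; code?: string; body?: string }
type AdapterError = { kind: string; message: string }
export function translateUpstreamError(e: UpstreamHttpError): AdapterError {
  if (e.code === "ETIMEDOUT") return { kind: "timeout", message: "upstream timed out" }
  // classify() matches on the "idempotency" / "evidence" tokens, so the adapter
  // guarantees they appear in the message for those statuses.
  if (e.status === 409) return { kind: "409_conflict", message: `idempotency conflict: ${e.body ?? ""}` }
  if (e.status === 412) return { kind: "412_precondition", message: `evidence check failed: ${e.body ?? ""}` }
  if (e.status === 403) return { kind: "403_forbidden", message: e.body ?? "policy denied by upstream" }
  if (e.status >= 500) return { kind: "5xx", message: `upstream returned ${e.status}` }
  return { kind: "5xx", message: `unmapped upstream error (${e.status})` }
}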
The compensation actions
Each verdict maps to one named compensation. There is no “fallthrough” path — every verdict has a declared next action.
import type { FailureVerdict } from "./verdict"
export type CompensationAction =
| { kind: "retry_with_backoff"; max_attempts: number; backoff_ms: number[] }
| { kind: "refresh_evidence"; then: "retry" | "abort" }
| { kind: "escalate_to_human"; queue: string }
| { kind: "issue_reversal"; reason: string }
| { kind: "deprecate_tool_call"; reason: string; replan: true }
export const PLAYBOOK: Record<FailureVerdict, CompensationAction> = {
transient_timeout: {
kind: "retry_with_backoff",
max_attempts: 3,
backoff_ms: [200, 600, 1800],
},
server_error_5xx: {
kind: "retry_with_backoff",
max_attempts: 2,
backoff_ms: [500, 2000],
},
idempotency_conflict: {
// do NOT retry. The call already happened on the upstream side; treat
// it as success and reconcile by reading the upstream state.
kind: "deprecate_tool_call",
reason: "upstream already processed this idempotency_key",
replan: true,
},
evidence_stale: {
kind: "refresh_evidence",
then: "retry",
},
policy_denied: {
kind: "escalate_to_human",
queue: "policy_review",
},
schema_mismatch: {
// adapter contract violation; do not retry, file a ticket
kind: "deprecate_tool_call",
reason: "adapter response failed schema validation",
replan: true,
},
}
Three design decisions in this map are worth their own bullet points.
idempotency_conflict is not a retry. This is the bug that bit us — retrying an idempotency conflict is the same as retrying a 200 OK that already drained your wallet. The right next action is to assume the call already succeeded, deprecate the local attempt, and let the planner re-plan based on the upstream’s current state.
evidence_stale is a refresh, not a retry. If the gateway refused because the evidence hash drifted, the right action is to re-retrieve the evidence (which may now have changed), recompile the context, and replan. A retry on stale evidence is a guarantee of another stale-evidence error.
policy_denied always escalates. A policy denial is the system saying “this is not for you to decide.” The harness must not work around it. The escalate-to-human path is the only correct compensation; the model must not get a chance to rephrase.
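A cheap way to keep those three decisions from regressing is to pin them in a test against the playbook map. A sketch, assuming a vitest-style runner:
import { expect, test } from "vitest" // any test runner works; vitest is assumed here
import { PLAYBOOK } from "./compensation"
test("idempotency conflicts are never retried", () => {
  expect(PLAYBOOK.idempotency_conflict.kind).toBe("deprecate_tool_call")
})
test("stale evidence refreshes instead of blind-retrying", () => {
  expect(PLAYBOOK.evidence_stale.kind).toBe("refresh_evidence")
})
test("policy denials always escalate to a human", () => {
  expect(PLAYBOOK.policy_denied.kind).toBe("escalate_to_human")
})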
The dispatcher
The dispatcher is the function the runtime calls when a ToolResult returns with status: "error". It is one switch:
import type { ToolCall, ToolResult, RunContext } from "@/types"
import { classify } from "./verdict"
import { PLAYBOOK } from "./compensation"
import {
redeemReversalToken,
refreshEvidenceFor,
enqueueForHuman,
retryDispatch,
emitDecisionEvent,
} from "@/runtime"
export type DispatchOutcome =
| { kind: "succeeded_after_compensation"; result: ToolResult }
| { kind: "deprecated"; reason: string; replan: true }
| { kind: "escalated"; queue: string }
| { kind: "reversed"; reason: string }
| { kind: "exhausted"; final_error: ToolResult["error"] }
export async function dispatchFailure(
ctx: RunContext,
call: ToolCall,
result: ToolResult,
): Promise<DispatchOutcome> {
const verdict = classify(call, result)
if (!verdict) {
return { kind: "succeeded_after_compensation", result } // no failure
}
const action = PLAYBOOK[verdict]
await emitDecisionEvent(ctx, { kind: "failure_classified", call_id: call.call_id, verdict })
switch (action.kind) {
case "retry_with_backoff": {
for (let attempt = 0; attempt < action.max_attempts; attempt++) {
await sleep(action.backoff_ms[attempt] ?? 1000)
const r = await retryDispatch(call, attempt + 1)
if (r.status === "ok") return { kind: "succeeded_after_compensation", result: r }
}
return { kind: "exhausted", final_error: result.error }
}
case "refresh_evidence": {
await refreshEvidenceFor(ctx, call.evidence_refs)
if (action.then === "retry") {
const r = await retryDispatch(call, /* attempt */ 1)
return r.status === "ok"
? { kind: "succeeded_after_compensation", result: r }
: { kind: "deprecated", reason: "post-refresh retry still failing", replan: true }
}
return { kind: "deprecated", reason: "refresh_evidence requested abort", replan: true }
}
case "escalate_to_human":
await enqueueForHuman({ queue: action.queue, ctx, call, result })
return { kind: "escalated", queue: action.queue }
case "issue_reversal": {
// critical guard: only attempt reversal if a token was issued
if (!call.reversal_token) {
await emitDecisionEvent(ctx, {
kind: "reversal_refused",
call_id: call.call_id,
reason: "no reversal_token on the call envelope",
})
return { kind: "exhausted", final_error: result.error }
}
await redeemReversalToken(call.reversal_token, action.reason)
return { kind: "reversed", reason: action.reason }
}
case "deprecate_tool_call":
return { kind: "deprecated", reason: action.reason, replan: action.replan }
}
}
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))
The dispatcher emits a failure_classified decision event before any compensation runs. That event lands in the trace store, which means replay reproduces the verdict and the compensation it triggered. Failure handling stops being a black box.
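The replay claim is checkable mechanically: re-run classify() over the recorded call/result pairs and confirm the stored verdict still matches. A sketch, with loadTraceEvents and loadRecordedCall as assumed trace-store helpers rather than real runtime APIs:
import { classify } from "./verdict"
// Hypothetical trace-store accessors; the real names live wherever the trace store does.
import { loadTraceEvents, loadRecordedCall } from "@/trace"
export async function replayVerdicts(runId: string): Promise<void> {
  for (const event of await loadTraceEvents(runId, "failure_classified")) {
    const { call, result } = await loadRecordedCall(runId, event.call_id)
    const verdict = classify(call, result)
    if (verdict !== event.verdict) {
      throw new Error(`verdict drift on ${event.call_id}: trace says ${event.verdict}, classify says ${verdict}`)
    }
  }
}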
The reversal-token guard
The issue_reversal compensation is the most dangerous one — it makes a write to the upstream system. The dispatcher’s guard prevents the most-common rollback bug: a code path that issues a reversal without a paired token from the original call.
// excerpt from dispatch.ts above
case "issue_reversal": {
if (!call.reversal_token) {
// no token = no rollback. Emit the event, refuse the action.
await emitDecisionEvent(ctx, {
kind: "reversal_refused",
call_id: call.call_id,
reason: "no reversal_token on the call envelope",
})
return { kind: "exhausted", final_error: result.error }
}
await redeemReversalToken(call.reversal_token, action.reason)
return { kind: "reversed", reason: action.reason }
}Two properties this guard enforces.
A reversal never fires without a paired token from the original call. The token was issued by the gateway when the destructive call was authorized; redeeming it is the only way to invoke the reversal endpoint. No token, no reversal, period.
A reversal_refused event lands in the trace. If the on-call engineer needs to manually compensate, the event tells them why the automatic path declined. Visible refusal beats silent refusal every time.
The pairing logic lives in Build the Tool Gateway — the gateway is the only thing that issues reversal tokens, and it only issues them on destructive calls.
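For orientation, a hypothetical sketch of the gateway side of that pairing; the real issuance logic is the gateway post's subject, not this one.
import type { ToolCall } from "@/types"
import { randomUUID } from "node:crypto"
// Only destructive calls leave the gateway with a reversal_token, which is why
// the dispatcher can treat "no token" as "no reversal, period".
export function attachReversalToken(call: ToolCall, destructive: boolean): ToolCall {
  if (!destructive) return call
  return { ...call, reversal_token: `rev_${randomUUID()}` }
}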
A worked refund-with-conflict example
Walking through a real failure end-to-end:
import { dispatchFailure } from "@/harness/failure/dispatch"
import type { ToolCall, ToolResult } from "@/types"
// Fixtures for the adp_payments.issue_refund walkthrough. Only the fields the
// dispatcher reads are spelled out here; call_id is a placeholder value.
const call = { call_id: "call_refund_example", reversal_token: "rev_x7y" } as unknown as ToolCall
const result = {
  status: "error",
  error: { kind: "409_conflict", message: "idempotency_key already processed" },
} as unknown as ToolResult
const outcome = await dispatchFailure(ctx, call, result)
// outcome.kind === "deprecated"
// outcome.reason === "upstream already processed this idempotency_key"
// outcome.replan === true
// Planner re-plans: instead of issuing the refund again, it now reads the
// upstream's payment state, confirms the refund is recorded, and finalizes
// the DecisionRecord with status=COMPLETED rather than RETRIED.
What did not happen: the dispatcher did not retry the call, did not double-refund, and did not blindly report success. It typed the failure, looked up the playbook, deprecated the local attempt, and signaled the planner to reconcile against the upstream truth.
That difference is the difference between “we have failure handling” and “our failure handling is right.”
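One level up, a sketch of how the run loop might route each DispatchOutcome; closeRun, requestReplan, and pageOnCall are hypothetical stand-ins for whatever the surrounding runtime actually provides.
import type { ToolResult } from "@/types"
import type { DispatchOutcome } from "@/harness/failure/dispatch"
declare function closeRun(result: ToolResult): Promise<void>   // hypothetical
declare function requestReplan(reason: string): Promise<void>  // hypothetical
declare function pageOnCall(queue: string): Promise<void>      // hypothetical
// Note what is absent: no branch re-issues the original call.
export async function routeOutcome(outcome: DispatchOutcome): Promise<void> {
  switch (outcome.kind) {
    case "succeeded_after_compensation": return closeRun(outcome.result)
    case "deprecated": return requestReplan(outcome.reason)
    case "reversed": return requestReplan(outcome.reason)
    case "escalated": return pageOnCall(outcome.queue)
    case "exhausted": return pageOnCall("dispatch_exhausted")
  }
}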
Extending the playbook
Three rules for adding to the verdict and playbook.
Verdicts go through change control. Adding a new FailureVerdict is a code change with an ADR explaining what failure pattern it captures, what an example looks like, and why the existing six did not cover it. Random additions cause the dispatcher to grow into the same retry-everything mess the typed approach was supposed to fix.
Compensations are reviewed against goldens. A new compensation needs a fixture run showing the dispatcher’s outcome on a representative failure. Without that, you do not know whether refresh_evidence actually replans correctly on the intent it was added for.
Production failures generate proposals, not patches. A failure pattern the existing dispatcher handles wrong becomes a FeedbackEntry in the Improvement Loop. It runs through the full proposal/replay/release path. The dispatcher itself does not get a hand-edit on Tuesday afternoon.
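Rule one also gets a mechanical assist from the types. Because PLAYBOOK is a Record keyed by the verdict union, a new verdict cannot ship without a declared compensation; the sketch below uses a hypothetical rate_limited verdict to show the compiler doing the enforcement.
import type { FailureVerdict } from "./verdict"
import { PLAYBOOK, type CompensationAction } from "./compensation"
// Hypothetical extension: the widened map refuses to typecheck until
// rate_limited has an entry, which is exactly when the ADR gets written.
type ExtendedVerdict = FailureVerdict | "rate_limited"
export const EXTENDED_PLAYBOOK: Record<ExtendedVerdict, CompensationAction> = {
  ...PLAYBOOK,
  rate_limited: { kind: "retry_with_backoff", max_attempts: 2, backoff_ms: [1000, 5000] },
}
The same trick works in reverse: a default arm in the dispatcher's switch that assigns action to a never-typed binding catches any new CompensationAction kind that gains a variant without gaining a case.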
What this changes
Three things on day one of running a typed dispatcher in production.
The double-refund class of bug becomes structurally impossible. Idempotency conflicts deprecate the local call and replan against the upstream truth. The model never gets a chance to retry a write that already succeeded.
Compensation becomes auditable. Every failure produces a typed failure_classified event in the trace, every compensation produces a typed outcome, replay reproduces both. When the regulator asks why the agent retried, you point at the verdict.
Adapters get a contract. New adapters know exactly what error.kind strings the dispatcher recognizes. There is one taxonomy across the harness; the adapter’s job is to translate upstream errors into it.
Six verdicts. Five compensations. One switch. Less than 200 lines of code, including types. That is the whole shape — pick a typed verdict map over a single retry loop and the surprising-compensation class of bug stops shipping.
Wire this up after the gateway. Pick the destructive capability with the highest failure rate; that is your highest-leverage starting point. The first dispatch run that catches an idempotency conflict and refuses to retry pays for the whole thing.