End-to-End Refund: How 12 Primitives Compose in One Production Run

Twelve posts have built individual pieces of the harness. This one is the run that ties them together.

The hard part falls to the reader: holding a mental model of how the Context Pack Compiler, the Tool Gateway, the Critic, the evaluators, the failure dispatcher, and the replay harness actually compose. The spec lives in Harness Engineering; this post is the operator’s compressed map.

I am going to walk a single ₹24,500 refund from the invokeAgent envelope to a byte-equal replay five minutes later. Every transition is a code snippet. Every snippet links back to the post that owns the primitive, so you can drill in where the detail matters.

Setup

const ctx: RunContext = {
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  run_id: "run_2026_05_08_a17",
  tenant_id: "tenant_acme_prod",
  user_id: "agent_support_77",
  on_behalf_of: "cust_8861",
  intent: "support.refund.execute",
  intent_class: "support_high_value",
  safety_mode: "destructive",
  pack_version: "ctxpack.support@5.2.0",
  kg_snapshot_id: "snapshot_kg_2026_05_08_T0930",
  run_budget: { bucket_tokens: 8000, max_steps: 10 },
  permissions: [
    { adapter_id: "adp_orders",   capability: "lookup",       allow: true },
    { adapter_id: "adp_payments", capability: "issue_refund", allow: true },
    { adapter_id: "adp_policy",   capability: "eval",         allow: true },
  ],
}
 
const req: InvokeRequest = {
  input: {
    intent: "support.refund.execute",
    message: "Refund ₹24,500 on order ord_881 — customer says wrong size delivered.",
    context_query: { order_id: "ord_881" },
  },
}

The RunContext is the typed envelope that travels through every plane. safety_mode: "destructive" is the ceiling — nothing the gateway surfaces or executes will exceed it. pack_version and kg_snapshot_id are the pins that make replay deterministic.
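One way to read the ceiling rule is as an ordering over approval modes: a capability is in bounds only if its mode ranks at or below the run's safety_mode. The sketch below assumes just the two modes that appear in this post; the names MODE_RANK and withinCeiling are hypothetical, not the harness's API.

```typescript
// Assumption: only the two modes that appear in this post; a real pack may define more.
type ApprovalMode = "read_only" | "destructive"

const MODE_RANK: Record<ApprovalMode, number> = { read_only: 0, destructive: 1 }

// Hypothetical helper: true when a capability's mode does not exceed the run's ceiling.
function withinCeiling(mode: ApprovalMode, ceiling: ApprovalMode): boolean {
  return MODE_RANK[mode] <= MODE_RANK[ceiling]
}

// This run's ceiling is "destructive", so issue_refund may be surfaced.
console.log(withinCeiling("destructive", "destructive")) // true
// Under a read_only ceiling the same capability would be out of bounds.
console.log(withinCeiling("destructive", "read_only")) // false
```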

Step 1 — Compile

The runtime calls the Context Pack Compiler. All eight stages run; the output is one CompiledContext:

const compiled = await compile(pack, ctx, req)
// compiled.manifests.tool_manifest:
//   [
//     { adapter_id: "adp_orders",   capability_id: "lookup",       approval_mode: "read_only" },
//     { adapter_id: "adp_policy",   capability_id: "eval",         approval_mode: "read_only" },
//     { adapter_id: "adp_payments", capability_id: "issue_refund", approval_mode: "destructive" },
//   ]
//
// compiled.runtime_controls.must_escalate: ["R_REFUND_OVER_LIMIT"]
// compiled.runtime_controls.approval_gates_active: ["GATE_HIGH_VALUE"]
// compiled.runtime_controls.redaction_rules_active: ["pan", "credit_card"]
// compiled.budget_report.bucket_truncations: {}    // nothing truncated

Every primitive that touches the model is in this envelope and nowhere else. The model never sees payments.bulk_refund because it is not in the tool surface. The compile is replay-deterministic: the same pack, snapshot, and request produce the same CompiledContext. See Build the Context Pack Compiler for the eight-stage pipeline.
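To make "not in surface" concrete, here is a minimal sketch of a default-deny surface filter: a capability reaches the tool manifest only if the RunContext carries an explicit allow for it. The catalog array, the bulk_refund entry's approval mode, and the helper name buildToolSurface are assumptions for illustration; the real compile stage may work differently.

```typescript
interface ToolPermission { adapter_id: string; capability: string; allow: boolean }
interface CatalogEntry {
  adapter_id: string
  capability_id: string
  approval_mode: "read_only" | "destructive"
}

// Hypothetical helper: default-deny filter from the full catalog to the run's surface.
function buildToolSurface(catalog: CatalogEntry[], permissions: ToolPermission[]): CatalogEntry[] {
  return catalog.filter(entry =>
    permissions.some(p =>
      p.allow && p.adapter_id === entry.adapter_id && p.capability === entry.capability_id))
}

// Assumed catalog: the three capabilities from the run plus bulk_refund.
const catalog: CatalogEntry[] = [
  { adapter_id: "adp_orders", capability_id: "lookup", approval_mode: "read_only" },
  { adapter_id: "adp_policy", capability_id: "eval", approval_mode: "read_only" },
  { adapter_id: "adp_payments", capability_id: "issue_refund", approval_mode: "destructive" },
  { adapter_id: "adp_payments", capability_id: "bulk_refund", approval_mode: "destructive" },
]

// Permissions copied from the RunContext in Setup.
const permissions: ToolPermission[] = [
  { adapter_id: "adp_orders", capability: "lookup", allow: true },
  { adapter_id: "adp_payments", capability: "issue_refund", allow: true },
  { adapter_id: "adp_policy", capability: "eval", allow: true },
]

const surface = buildToolSurface(catalog, permissions)
console.log(surface.map(e => e.capability_id).join(", ")) // lookup, eval, issue_refund
```

Because bulk_refund has no matching permission, it never enters the manifest, so the model cannot propose it.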

Step 2 — Plan

The Planner reads the compiled context and proposes a typed plan:

const plan: Plan = await runPlanner(compiled, req)
// {
//   steps: [
//     { kind: "tool", adapter_id: "adp_orders",   capability_id: "lookup",       args: {id:"ord_881"}, evidence_refs:[] },
//     { kind: "tool", adapter_id: "adp_policy",   capability_id: "eval",         args:{}, evidence_refs:["kg:order:ord_881#snapshot_kg_2026_05_08_T0930"] },
//     { kind: "tool", adapter_id: "adp_payments", capability_id: "issue_refund", args:{id:"pay_8861", amount_inr:24500}, evidence_refs:["kg:order:ord_881#snapshot_kg_2026_05_08_T0930"], requires_evidence:["refund_window_evidence"] },
//   ],
//   declared_outputs: ["refund_amount_inr", "refund_reason_class"],
//   estimated_steps: 3,
// }

Notice that the third step declares requires_evidence: ["refund_window_evidence"] but pins only one ref: the order, not the refund-window evidence. The Critic catches this next.

Step 3 — Verify (the Critic, before any side effect)

const verifyV = verify(ctx, compiled, decisionSpec, plan)
// {
//   ok: false,
//   kind: "missing_evidence",
//   reasons: ["step 2 requires evidence class refund_window_evidence, none pinned"],
//   offending_step: 2
// }

The Critic refuses before any tool runs. The plan never reaches the Executor. See The Critic for the five mechanical checks verify runs.
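The missing-evidence check is mechanical enough to sketch. The version below assumes that an evidence class maps to the kind segment of a kg ref (refund_window_evidence to kg:refund_window:...); that mapping, the table name CLASS_TO_REF_KIND, and the function name checkEvidence are assumptions, and the real verify runs four other checks that are not shown.

```typescript
interface PlanStep {
  capability_id: string
  evidence_refs: string[]
  requires_evidence?: string[]
}

type EvidenceVerdict =
  | { ok: true }
  | { ok: false; kind: "missing_evidence"; reasons: string[]; offending_step: number }

// Assumption: an evidence class corresponds to the kind segment of a "kg:<kind>:<id>" ref.
const CLASS_TO_REF_KIND: Record<string, string> = {
  refund_window_evidence: "refund_window",
}

// Hypothetical check: every required evidence class must have at least one pinned ref.
function checkEvidence(steps: PlanStep[]): EvidenceVerdict {
  for (let i = 0; i < steps.length; i++) {
    for (const cls of steps[i].requires_evidence || []) {
      const refKind = CLASS_TO_REF_KIND[cls]
      const pinned = refKind !== undefined &&
        steps[i].evidence_refs.some(ref => ref.split(":")[1] === refKind)
      if (!pinned) {
        return {
          ok: false,
          kind: "missing_evidence",
          reasons: [`step ${i} requires evidence class ${cls}, none pinned`],
          offending_step: i,
        }
      }
    }
  }
  return { ok: true }
}

const orderRef = "kg:order:ord_881#snapshot_kg_2026_05_08_T0930"
const plan: PlanStep[] = [
  { capability_id: "lookup", evidence_refs: [] },
  { capability_id: "eval", evidence_refs: [orderRef] },
  { capability_id: "issue_refund", evidence_refs: [orderRef], requires_evidence: ["refund_window_evidence"] },
]
console.log(checkEvidence(plan).ok) // false
```

The step index is zero-based, which is why the third step is reported as offending_step: 2.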

Step 4 — Re-plan

The Planner gets the typed verdict back, retrieves the missing evidence (using the same caller-supplied retriever that stage 4 of the compile uses), and pins it:

const replan: Plan = await runPlanner(compiled, req, { previous_verdict: verifyV })
// step 2 now carries:
//   evidence_refs: [
//     "kg:order:ord_881#snapshot_kg_2026_05_08_T0930",
//     "kg:refund_window:rw_881#snapshot_kg_2026_05_08_T0930",
//   ]
 
const verify2 = verify(ctx, compiled, decisionSpec, replan)
// { ok: true, reasons: ["3 steps, 2 required outputs covered"] }

Verify passes. The plan reaches the Executor.

Step 5 — Execute steps 0 and 1 (read-only)

for (const step of replan.steps.slice(0, 2)) {
  const verdict = await resolveToolCall(ctx, step)
  if (!verdict.ok) return finalizeWithDenial(ctx, verdict)
  const result = await dispatch(verdict.call)
  // step 0: orders.lookup → { id:"ord_881", amount:24500, shipped:true, ... }
  // step 1: policy.eval   → { allow:true, requires:["GATE_HIGH_VALUE"] }
}

Both calls are read_only, both pass through the Tool Gateway with typed envelopes. The transcripts land in the trace store. See Build the Tool Gateway for the resolver and dispatch code.

Step 6 — Approval gate

The third step is destructive. The gateway refuses to execute without a signed approval gate. The runtime emits a gate request:

const gateRequest = {
  gate_id: "GATE_HIGH_VALUE",
  trace_id: ctx.trace_id,
  proposed_call: {
    adapter_id: "adp_payments",
    capability_id: "issue_refund",
    args: { id: "pay_8861", amount_inr: 24500 },
  },
  evidence_snapshot_hash: await freezeEvidence(replan.steps[2].evidence_refs),
  // sha256:b2a1cf...
  recommended_by: ["compliance.v1:pass", "reliability.v1:pass"],
  expires_at: new Date(Date.now() + 15 * 60_000).toISOString(),
}

The operator UI renders this with the proposed call, the evidence the agent used, and the reviewer recommendations. A finance lead signs:

const signature = {
  gate_id: "GATE_HIGH_VALUE",
  approver: "user_finance_lead_77",
  evidence_snapshot_hash: "sha256:b2a1cf...",
  signed_at: "2026-05-08T09:31:30Z",
  signature: "ed25519:..." /* over the gateRequest hash */,
}

Reviewer agents like the compliance reviewer and the reliability reviewer emit recommendations into this surface but do not sign — only humans sign destructive gates.
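A sketch of how a runtime might validate a signature against its gate request before accepting it. It assumes the runtime can classify the approver as human or reviewer agent (the PrincipalKind type is hypothetical), checks expiry against signed_at, and deliberately leaves out the ed25519 verification itself.

```typescript
// Assumption: some directory lookup has already classified the approver.
type PrincipalKind = "human" | "reviewer_agent"

interface GateRequest {
  gate_id: string
  evidence_snapshot_hash: string
  expires_at: string // ISO 8601
}

interface GateSignature {
  gate_id: string
  approver: string
  evidence_snapshot_hash: string
  signed_at: string // ISO 8601
}

type SignatureCheck =
  | { ok: true }
  | { ok: false; reason: "gate_mismatch" | "non_human_signer" | "evidence_mismatch" | "expired" }

// Hypothetical validation; cryptographic signature verification is out of scope here.
function checkSignature(req: GateRequest, sig: GateSignature, approverKind: PrincipalKind): SignatureCheck {
  if (sig.gate_id !== req.gate_id) return { ok: false, reason: "gate_mismatch" }
  if (approverKind !== "human") return { ok: false, reason: "non_human_signer" }
  if (sig.evidence_snapshot_hash !== req.evidence_snapshot_hash) return { ok: false, reason: "evidence_mismatch" }
  if (Date.parse(sig.signed_at) > Date.parse(req.expires_at)) return { ok: false, reason: "expired" }
  return { ok: true }
}

const gateReq: GateRequest = {
  gate_id: "GATE_HIGH_VALUE",
  evidence_snapshot_hash: "sha256:b2a1cf...",
  expires_at: "2026-05-08T09:45:00Z", // illustrative expiry, not taken from the run
}
const gateSig: GateSignature = {
  gate_id: "GATE_HIGH_VALUE",
  approver: "user_finance_lead_77",
  evidence_snapshot_hash: "sha256:b2a1cf...",
  signed_at: "2026-05-08T09:31:30Z",
}
console.log(checkSignature(gateReq, gateSig, "human").ok) // true
console.log(checkSignature(gateReq, gateSig, "reviewer_agent").ok) // false
```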

Step 7 — Redeem and execute

The runtime redeems the signed gate against a fresh evidence-snapshot hash and dispatches:

const verdict = await resolveToolCall(ctx, {
  ...replan.steps[2],
  approval_gate: signature,
})
// resolver re-computes freezeEvidence(replan.steps[2].evidence_refs)
// and refuses if it does not match signature.evidence_snapshot_hash.
// Drift between sign-time and redeem-time = refusal.
 
const result = await dispatch(verdict.call)
// { status: "ok", data: { refund_id: "rfd_x7y3", reversal_token: "rev_x7y" } }

The reversal token is issued by the gateway, not by the adapter. If anything goes wrong post-execution, the failure playbook’s issue_reversal action redeems it against payments.reversal_op.
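The drift check that guards the redeem in this step can be sketched as follows. Several things are assumptions here: the evidence store is an in-memory map, the payload strings are invented, freezeEvidence is synchronous, and the hash is a 32-bit FNV-1a stand-in chosen only to keep the example dependency-free. The run above uses sha256, and a real harness should use a cryptographic hash.

```typescript
// Stand-in 32-bit FNV-1a over UTF-16 code units. NOT cryptographic; the post uses sha256.
function fnv1a(text: string): string {
  let h = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    // Multiply by the FNV prime 16777619 modulo 2^32, written as shifts.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0
  }
  return ("0000000" + h.toString(16)).slice(-8)
}

// Assumption: evidence payloads are retrievable by ref from some store.
type EvidenceStore = Record<string, string>

// Hypothetical freeze: hash the refs and their payloads in a canonical (sorted) order.
function freezeEvidence(refs: string[], store: EvidenceStore): string {
  const canonical = refs
    .slice()
    .sort()
    .map(ref => ref + "=" + (ref in store ? store[ref] : "<missing>"))
    .join("\n")
  return "fnv1a:" + fnv1a(canonical)
}

type RedeemVerdict = { ok: true } | { ok: false; kind: "evidence_drift" }

// Re-freeze at redeem time and refuse if the hash differs from what was signed.
function redeem(signedHash: string, refs: string[], store: EvidenceStore): RedeemVerdict {
  return freezeEvidence(refs, store) === signedHash
    ? { ok: true }
    : { ok: false, kind: "evidence_drift" }
}

const refs = [
  "kg:order:ord_881#snapshot_kg_2026_05_08_T0930",
  "kg:refund_window:rw_881#snapshot_kg_2026_05_08_T0930",
]
// Invented payloads, for illustration only.
const atSignTime: EvidenceStore = { [refs[0]]: "amount=24500", [refs[1]]: "window_days=30" }
const signedHash = freezeEvidence(refs, atSignTime)

console.log(redeem(signedHash, refs, atSignTime).ok) // true
const drifted: EvidenceStore = { ...atSignTime, [refs[1]]: "window_days=31" }
console.log(redeem(signedHash, refs, drifted).ok) // false
```

Sorting the refs before hashing means the order in which the Planner pinned them does not affect the hash; only the evidence content does.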

Step 8 — Score

The run is complete. The Critic runs the five evaluators:

const scoreV = score({ /* the run, with all transcripts and the compiled context */ })
// {
//   ok: true,
//   scorecard: {
//     scores: {
//       policy:  { status: "pass", score: 1.00, findings: [] },
//       safety:  { status: "pass", score: 1.00, findings: [] },
//       utility: { status: "pass", score: 0.94, findings: [] },
//       latency: { status: "pass", score: 0.96, findings: [] },     // 1830 ms vs 4000 ms target
//       cost:    { status: "pass", score: 0.98, findings: [] },     // $0.0091 vs $0.02 budget
//     },
//   },
// }

See Wiring the Five Evaluators for each evaluator’s code. Hard-fail evaluators (Policy, Safety) are at 1.00. Soft evaluators are above the warn thresholds.
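The split between hard-fail and soft evaluators can be sketched as a small gate over the scorecard. The warn thresholds below are illustrative placeholders (the post does not state the real values), and gateScores is a hypothetical name.

```typescript
type EvaluatorName = "policy" | "safety" | "utility" | "latency" | "cost"
interface EvalScore { status: "pass" | "fail"; score: number; findings: string[] }
type Scores = Record<EvaluatorName, EvalScore>

const HARD_FAIL: EvaluatorName[] = ["policy", "safety"]
// Assumption: illustrative warn thresholds; the post does not state the real values.
const WARN_BELOW: Record<string, number> = { utility: 0.8, latency: 0.8, cost: 0.8 }

// Hypothetical gate: hard-fail evaluators must score 1.00; soft ones only raise warnings.
function gateScores(scores: Scores): { ok: boolean; failures: EvaluatorName[]; warnings: EvaluatorName[] } {
  const names = Object.keys(scores) as EvaluatorName[]
  const failures = names.filter(n => HARD_FAIL.indexOf(n) >= 0 && scores[n].score < 1.0)
  const warnings = names.filter(n => HARD_FAIL.indexOf(n) < 0 && scores[n].score < WARN_BELOW[n])
  return { ok: failures.length === 0, failures, warnings }
}

// Scores copied from the scorecard in this step.
const runScores: Scores = {
  policy: { status: "pass", score: 1.0, findings: [] },
  safety: { status: "pass", score: 1.0, findings: [] },
  utility: { status: "pass", score: 0.94, findings: [] },
  latency: { status: "pass", score: 0.96, findings: [] },
  cost: { status: "pass", score: 0.98, findings: [] },
}
console.log(gateScores(runScores).ok) // true
```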

Step 9 — Consolidate

const report = consolidate({
  trace_id: ctx.trace_id,
  decision_key: decisionSpec.id,
  verify: verify2,
  score: scoreV,
})
// {
//   status: "completed",
//   rationale: "verify and score passed",
//   ...
// }

The CriticReport is the final input to the DecisionRecord.
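A minimal sketch of what consolidate might do with the two verdicts. Only the "completed" status and its rationale string come from the run above; the "refused" status and the other rationale strings are assumptions for illustration.

```typescript
interface VerifyVerdict { ok: boolean; reasons: string[] }
interface ScoreVerdict { ok: boolean }

interface CriticReport {
  trace_id: string
  decision_key: string
  status: "completed" | "refused" // "refused" is a hypothetical status name
  rationale: string
}

// Hypothetical consolidation: the run completes only if both phases passed.
function consolidate(input: {
  trace_id: string
  decision_key: string
  verify: VerifyVerdict
  score: ScoreVerdict
}): CriticReport {
  const base = { trace_id: input.trace_id, decision_key: input.decision_key }
  if (!input.verify.ok) {
    return { ...base, status: "refused", rationale: "verify failed: " + input.verify.reasons.join("; ") }
  }
  if (!input.score.ok) {
    return { ...base, status: "refused", rationale: "score failed" }
  }
  return { ...base, status: "completed", rationale: "verify and score passed" }
}

const criticReport = consolidate({
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  decision_key: "support.refund.execute",
  verify: { ok: true, reasons: ["3 steps, 2 required outputs covered"] },
  score: { ok: true },
})
console.log(criticReport.status) // completed
```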

Step 10 — DecisionRecord

const dr: DecisionRecord = {
  record_id: "dr_2026_05_08_a17",
  decision_key: "support.refund.execute",
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  status: "completed",
  evidence_refs: [
    "kg:order:ord_881#snapshot_kg_2026_05_08_T0930",
    "kg:refund_window:rw_881#snapshot_kg_2026_05_08_T0930",
    "tool:orders.lookup:tc_117",
    "tool:policy.eval:tc_119",
    "tool:payments.issue_refund:tc_121",
  ],
  policy_decisions: [
    { policy_decision_id: "pol_9900", rule_id: "R_REFUND_REQUIRES_IDV" },
    { policy_decision_id: "pol_9901", rule_id: "R_REFUND_OVER_LIMIT" },
  ],
  approvals: [{
    gate_id: "GATE_HIGH_VALUE",
    approver: "user_finance_lead_77",
    approval_mode_effective: "destructive",
    evidence_snapshot_hash: "sha256:b2a1cf...",
    decided_at: "2026-05-08T09:31:30Z",
  }],
  outputs: {
    refund_amount_inr: 24500,
    refund_reason_class: "wrong_size_delivered",
    refund_id: "rfd_x7y3",
  },
  reversal_token: "rev_x7y",
  scorecard: scoreV.scorecard,
  lineage: {
    pack_version: "ctxpack.support@5.2.0",
    snapshot_version: "snapshot_kg_2026_05_08_T0930",
    model: { id: "model_xyz", sha256: "..." },
  },
  audit: {
    chain_prev_hash: "sha256:7c4af1...",
    signed_by: "kid_runtime_2026Q2",
  },
}

This is the artifact every audit, every replay, every Improvement-Loop proposal reads. Prose is absent on purpose. Every field is typed; every field is replayable.

Step 11 — Trace and scorecard land

The DecisionRecord, the OTEL trace, the tool envelopes, and the scorecard are written to the experience store under the trace_id. The chain hash extends; the prior hash becomes the next entry’s pointer. Tail-based sampling retains this run because it executed a destructive call (the IR-relevant retention rule).
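The chain extension can be sketched as an append-only list in which each entry stores its predecessor's hash. As assumptions: the entry layout, the GENESIS sentinel, the payload strings, and the second record id are invented, and the hash is a 32-bit FNV-1a stand-in used only to avoid dependencies (the run's audit chain uses sha256).

```typescript
// Stand-in 32-bit FNV-1a over UTF-16 code units. NOT cryptographic; the post uses sha256.
function fnv1a(text: string): string {
  let h = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0
  }
  return ("0000000" + h.toString(16)).slice(-8)
}

interface ChainEntry {
  record_id: string
  payload: string
  chain_prev_hash: string
  entry_hash: string
}

const GENESIS = "fnv1a:00000000" // hypothetical sentinel for the first entry

function hashEntry(prev: string, record_id: string, payload: string): string {
  return "fnv1a:" + fnv1a(prev + "|" + record_id + "|" + payload)
}

// Append without mutating: the new entry points at the previous entry's hash.
function appendEntry(chain: ChainEntry[], record_id: string, payload: string): ChainEntry[] {
  const prev = chain.length > 0 ? chain[chain.length - 1].entry_hash : GENESIS
  return chain.concat([{ record_id, payload, chain_prev_hash: prev, entry_hash: hashEntry(prev, record_id, payload) }])
}

// Walk the chain and recompute every link; any edit to an earlier entry breaks it.
function verifyChain(chain: ChainEntry[]): boolean {
  let prev = GENESIS
  for (const entry of chain) {
    if (entry.chain_prev_hash !== prev) return false
    if (entry.entry_hash !== hashEntry(prev, entry.record_id, entry.payload)) return false
    prev = entry.entry_hash
  }
  return true
}

let auditChain: ChainEntry[] = []
auditChain = appendEntry(auditChain, "dr_2026_05_08_a16", "amount=1200") // invented prior record
auditChain = appendEntry(auditChain, "dr_2026_05_08_a17", "amount=24500")
console.log(verifyChain(auditChain)) // true
```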

Step 12 — Replay, five minutes later

> replay("4bf92f3577b34da6a3ce929d0e0e4736")
{
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  status: "byte_equal",
  verdict: {
    ok: true,
    record_hash: "sha256:7c4af1f9...",
  },
}

The replay takes eighteen seconds. It re-runs the same eight compile stages and the same Critic phases against the same recorded transcripts. The recomputed DecisionRecord matches the persisted one byte-for-byte. See Replay Harness in Code for the four-file harness that produced this verdict.
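Byte equality only means something if both records are serialized the same way. A minimal sketch, assuming the comparison runs over a canonical JSON form with sorted keys; the "diverged" status and the first_diff_at field are hypothetical, and a real harness would compare encoded bytes or hashes of them.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json }

// Serialize with object keys sorted so that key order cannot cause a false mismatch.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value)
  if (Array.isArray(value)) return "[" + value.map(v => canonicalize(v)).join(",") + "]"
  const keys = Object.keys(value).sort()
  return "{" + keys.map(k => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") + "}"
}

type ReplayStatus =
  | { status: "byte_equal" }
  | { status: "diverged"; first_diff_at: number } // hypothetical divergence report

function compareReplay(persisted: Json, recomputed: Json): ReplayStatus {
  const a = canonicalize(persisted)
  const b = canonicalize(recomputed)
  if (a === b) return { status: "byte_equal" }
  let i = 0
  while (i < a.length && i < b.length && a.charCodeAt(i) === b.charCodeAt(i)) i++
  return { status: "diverged", first_diff_at: i }
}

const persistedRecord: Json = { status: "completed", outputs: { refund_amount_inr: 24500 } }
const recomputedRecord: Json = { outputs: { refund_amount_inr: 24500 }, status: "completed" }
console.log(compareReplay(persistedRecord, recomputedRecord).status) // byte_equal
```

Note that array order is preserved on purpose: reordering evidence_refs would count as divergence, while reordering object keys would not.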

What just happened, summarized

invokeAgent envelope
  → CompiledContext           [build-the-context-pack-compiler]
    → Plan
      → Critic.verify          [the-critic-verify-score-consolidate]   refused
        → Plan'                                                          re-planned
          → ToolCall × 2       [build-the-tool-gateway]                  read_only
            → ApprovalGate     [approval-gates-in-code]                  signed
              → ToolCall       [build-the-tool-gateway]                  destructive + reversal
                → Critic.score  [wiring-the-five-evaluators]              passed
                  → CriticReport
                    → DecisionRecord
                      → trace + scorecard      [pack-rollout-in-five-stages]
                        → replay byte-equal     [replay-harness-in-code]

Twelve primitives. One run. Every transition typed. Every artifact replayable. Every refusal auditable.

What this composition gives you

Three properties come only from the whole pipeline, not from any single primitive.

Refusal at the right boundary. The missing evidence was caught by the Critic before the gateway even saw a destructive call. The destructive call was caught by the gateway before any HTTP hit payments. Each boundary refuses in its own typed vocabulary, and refusals never silently retry.

Audit by construction. The DecisionRecord is not assembled at the end; it is built up from typed envelopes that every primitive emits along the way. The auditor reads what the runtime produced, not what the team thinks happened.

Improvement compounds. The same refund a week later runs faster: pack 5.2.0 had its budget bumped after the last rollout cycle, and the Improvement Loop released a tighter evidence rubric, so the refund clears the same gate with lower latency. Nothing about the run code changed; the harness around it did.

That is the whole shape. Twelve posts of pieces; one post that shows them composing. The first time you trace a real run through it on your own stack — with your own pack, your own gateway, your own evaluators — the diagram above stops being a diagram and starts being the runbook your on-call team uses at 2 a.m. when the regulator emails.

Build it piece by piece. The composition comes for free once the typed envelopes are in place.