Trust, audit, governance
May 9, 2026
by Piyush · 6 min read

Replay Harness in Code: Reproducing a DecisionRecord Byte-for-Byte

ContextOS
Harness Engineering
Replay
Audit
Hash Chain
Build-along

Replay sounds expensive. It is not. The first replay harness I shipped ran in 18 seconds against an 8-minute production trace, and that was on a laptop. The cost people imagine is re-executing tools; the actual work is re-running the canonical loop against recorded transcripts, which is cheap. The gap between those two is what most teams miss.

I wrote Replay Is the Real Audit Log about why this matters. This post is the build-along — the four files that turn the contract into running code, plus the worked example of a clean replay and a tampered one.

The contract, restated

Given a trace_id, the runtime reconstructs the DecisionRecord byte-for-byte from pinned inputs.

Four pinned inputs:

| Input | Where it lives | Refused if |
| --- | --- | --- |
| Context Pack version | harness/packs/{name}@{version}/ | unsigned or unpinned |
| Knowledge Graph snapshot | content-addressed snapshot store | snapshot has been GC'd |
| invokeAgent envelope | trace store, keyed by trace_id | envelope hash mismatches |
| Tool transcripts | trace store, append-only | chain hash breaks |

Given those four, the canonical loop runs offline against the recorded transcripts and produces a DecisionRecord. The replay harness compares it to the persisted record. Equal: the run is reproducible. Not equal: tamper detected, and the harness names which step diverged.

File 1 — the input loader

The first job is to assemble the four pinned inputs into a typed bundle. This is a read-only walk over the trace store and the pack registry:

harness/replay/load.ts
import type {
  ContextPack, RunContext, InvokeRequest,
  ToolCall, ToolResult, DecisionRecord,
} from "@/types"
import { loadPack, loadKgSnapshot, readTrace } from "@/store"
 
export type ReplayInput = {
  trace_id: string
  pack: ContextPack
  pack_version: string
  kg_snapshot_id: string
  invoke: { ctx: RunContext; req: InvokeRequest; envelope_hash: string }
  transcripts: Array<{ call: ToolCall; result: ToolResult }>
  persisted_record: DecisionRecord
  // hashes the live runtime emitted alongside each artifact
  chain_root_hash: string
  chain_audit: { signed_by: string; chain_prev_hash: string }
}
 
export async function loadReplayInput(trace_id: string): Promise<ReplayInput> {
  const trace = await readTrace(trace_id)
  if (!trace) throw new Error(`trace ${trace_id} not found`)
 
  const pack = await loadPack(trace.pack_version)        // refuses if GC'd / unsigned
  const snap = await loadKgSnapshot(trace.kg_snapshot_id) // refuses if GC'd
  if (!pack || !snap) {
    throw new Error(`refusing to replay: pack or snapshot not retained`)
  }
 
  return {
    trace_id,
    pack,
    pack_version: trace.pack_version,
    kg_snapshot_id: trace.kg_snapshot_id,
    invoke: {
      ctx: trace.invoke.run_context,
      req: trace.invoke.request,
      envelope_hash: trace.invoke.envelope_hash,
    },
    transcripts: trace.tool_envelopes.map((te) => ({
      call: te.call, result: te.result,
    })),
    persisted_record: trace.decision_record,
    chain_root_hash: trace.audit.chain_root_hash,
    chain_audit: trace.audit,
  }
}

The loader refuses early when retention has eaten one of the inputs. The right behavior is to throw a typed “cannot replay” rather than to produce a non-deterministic record from partial inputs. A failed replay is information; a fake replay is a liability.
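A typed refusal can be as small as a dedicated error class. The following is a sketch, not part of the harness above; the `ReplayRefusedError` name and its fields are assumptions:

```typescript
// Hypothetical typed refusal: callers can distinguish "cannot replay"
// (retention ate an input) from "replayed and diverged" (a real finding).
export class ReplayRefusedError extends Error {
  readonly trace_id: string
  readonly missing: "trace" | "pack" | "kg_snapshot"

  constructor(trace_id: string, missing: "trace" | "pack" | "kg_snapshot") {
    super(`refusing to replay ${trace_id}: ${missing} not retained`)
    this.name = "ReplayRefusedError"
    this.trace_id = trace_id
    this.missing = missing
  }
}
```

With this in place, the loader throws `ReplayRefusedError` instead of a bare `Error`, and the driver can surface a distinct status such as `cannot_replay` rather than mistaking a retention gap for a divergence.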

File 2 — the hash-chain verifier

Before any re-execution, verify that the persisted chain has not been tampered with. The chain is per-trace_id, append-only, and each entry’s hash includes the previous entry’s hash.

harness/replay/verify-chain.ts
import { createHash } from "node:crypto"
import type { ReplayInput } from "./load"
 
export type ChainVerdict =
  | { ok: true }
  | { ok: false; broke_at: number; expected: string; observed: string }
 
const sha256 = (b: Buffer | string) =>
  "sha256:" + createHash("sha256").update(b).digest("hex")
 
// stable JSON: sort keys recursively via a replacer function (an array
// replacer would sort, and silently drop, only top-level keys)
const canonical = (obj: unknown): string =>
  JSON.stringify(obj, (_key, value) =>
    value && typeof value === "object" && !Array.isArray(value)
      ? Object.fromEntries(
          Object.entries(value).sort(([a], [b]) => (a < b ? -1 : 1)),
        )
      : value,
  )
 
export function verifyChain(input: ReplayInput): ChainVerdict {
  // re-derive the chain hash from the recorded artifacts
  // Order: invokeAgent envelope, then each tool envelope in order, then DR
  let prev = ""
  const entries: Array<{ idx: number; payload: unknown; recorded: string }> = []
 
  entries.push({
    idx: 0,
    payload: { kind: "invoke", ctx: input.invoke.ctx, req: input.invoke.req },
    recorded: input.invoke.envelope_hash,
  })
  input.transcripts.forEach((t, i) => {
    entries.push({
      idx: i + 1,
      payload: { kind: "tool_call", call: t.call },
      recorded: t.call.payload_hash ?? "",
    })
    entries.push({
      idx: i + 1.5,
      payload: { kind: "tool_result", result: t.result },
      recorded: t.result.payload_hash ?? "",
    })
  })
  entries.push({
    idx: entries.length,
    payload: { kind: "decision_record", dr: input.persisted_record },
    recorded: input.chain_root_hash, // the final entry's hash is the chain root
  })
 
  for (const e of entries) {
    const computed = sha256(prev + canonical(e.payload))
    if (computed !== e.recorded) {
      return { ok: false, broke_at: e.idx, expected: computed, observed: e.recorded }
    }
    prev = computed
  }
  return { ok: true }
}

Two design notes that earn their keep.

The serialization is stable JSON — keys sorted, no whitespace variation. Without that, two semantically identical payloads produce different hashes, and the chain becomes a flaky test. Pick one canonicalization and never change it.

The verifier returns the index where the chain broke and both the expected and observed hashes. Diagnostics matter; “tamper detected” is not enough information for the on-call engineer to act on.
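The canonicalization point is worth seeing concretely. A minimal sketch of a recursive key-sorting stringify; note that the array-replacer form of `JSON.stringify` sorts (and filters) only top-level keys, so nested payloads need a replacer function:

```typescript
import { createHash } from "node:crypto"

// Recursive stable JSON: rebuild every (non-array) object with sorted keys
// via a replacer function, then hash the canonical string.
const canonical = (o: unknown): string =>
  JSON.stringify(o, (_key, value) =>
    value && typeof value === "object" && !Array.isArray(value)
      ? Object.fromEntries(
          Object.entries(value).sort(([a], [b]) => (a < b ? -1 : 1)),
        )
      : value,
  )

const sha256 = (s: string) =>
  "sha256:" + createHash("sha256").update(s).digest("hex")

// Two semantically identical payloads with different key order:
const x = { kind: "tool_call", call: { capability: "refund", amount: 100 } }
const y = { call: { amount: 100, capability: "refund" }, kind: "tool_call" }
// canonical(x) === canonical(y), so sha256 agrees and the chain stays stable
```

A plain `JSON.stringify` would hash those two payloads differently, which is exactly the flaky-test failure mode described above.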

File 3 — the canonical loop runner

This is the part most people get wrong. Replay does not re-execute tools; it replays the recorded tool_result envelopes through the same Planner / Critic / Executor logic the live runtime ran. The runner’s tool-call function is a lookup, not a network call:

harness/replay/run.ts
import type { ReplayInput } from "./load"
import type { CompiledContext, DecisionRecord, ToolResult } from "@/types"
import { compilePack } from "@/runtime/compile"
import { runCanonicalLoop } from "@/runtime/loop"
 
export type ReplayedRecord = {
  record: DecisionRecord
  compiled_at: string
  ran_at: string
}
 
export async function runReplay(input: ReplayInput): Promise<ReplayedRecord> {
  // 1. Re-compile the CompiledContext from the pinned pack + recorded inputs
  const compiled: CompiledContext = await compilePack({
    pack: input.pack,
    run_context: input.invoke.ctx,
    request: input.invoke.req,
    // evidence retrieval reads from the pinned KG snapshot, not live KG
    kg: { snapshot_id: input.kg_snapshot_id },
    // memory recall uses what was actually recalled at run time
    promoted_memory: input.invoke.ctx.promoted_memory ?? [],
  })
 
  // 2. Provide a tool-call function that looks up recorded results
  const transcriptIndex = new Map<string, ToolResult>()
  for (const t of input.transcripts) {
    // live callers key by capability + canonical args (call_id-suffixed for
    // multi-attempt); replay keys by call_id and trusts the recorded ordering
    transcriptIndex.set(t.call.call_id, t.result)
  }
 
  const replayToolFn = async (call: { call_id: string }) => {
    const recorded = transcriptIndex.get(call.call_id)
    if (!recorded) {
      throw new Error(`replay diverged: no recorded result for ${call.call_id}`)
    }
    return recorded
  }
 
  // 3. Run the same canonical loop the live runtime used, with the recorded
  //    tool-call function. Determinism guarantees: no clock, no random, no IO.
  const record = await runCanonicalLoop({
    compiled,
    invoke: input.invoke,
    toolCallFn: replayToolFn,
    deterministic: true,
  })
 
  return {
    record,
    compiled_at: new Date().toISOString(),
    ran_at: new Date().toISOString(),
  }
}

Three properties of the loop runner are non-negotiable.

It cannot make I/O calls. No HTTP, no DB writes, no external clocks. Every needed value comes from ReplayInput. The runCanonicalLoop core is pure — same inputs, same output, every time. If the live loop has a Date.now() call, replay either pins the run start time from the trace or fails loudly.
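One way to handle the clock, as a sketch; the names (`Clock`, `run_started_at`) are assumptions, not the harness's API:

```typescript
// Sketch: the loop takes `now` as a dependency. Live runs inject the real
// clock; replay injects a clock frozen at the trace's recorded start time.
type Clock = { now: () => number }

const liveClock: Clock = { now: () => Date.now() }

function pinnedClock(run_started_at: string): Clock {
  const t0 = Date.parse(run_started_at)
  if (Number.isNaN(t0)) {
    // fail loudly rather than silently drifting from the live run
    throw new Error("replay refused: trace has no pinned start time")
  }
  return { now: () => t0 } // every call returns the recorded instant
}
```

Any timestamp the live loop wrote into the DecisionRecord came through the injected clock, so replay reproduces it exactly instead of stamping today's date.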

It cannot read live KG state. The pinned snapshot is the only authority. Reading current KG would mean replay drifts as the data drifts; the snapshot is what makes the verdict stable across time.

It fails on missing transcripts, never substitutes. If the live run made a tool call that is not in the trace, that is a recording bug or a tamper, and replay must not paper over it. The replay diverges; the diff reporter names which call was missing.

File 4 — the differ and the verdict

The output of the replay is a ReplayedRecord. The persisted record is in the trace. The differ compares them and produces a typed verdict:

harness/replay/diff.ts
import { createHash } from "node:crypto"
import type { ReplayInput } from "./load"
import type { ReplayedRecord } from "./run"
import type { DecisionRecord } from "@/types"
 
export type ReplayVerdict =
  | { ok: true; trace_id: string; record_hash: string }
  | { ok: false; trace_id: string; diverged_at: string; diff: Record<string, [unknown, unknown]> }
 
// same canonicalization as verify-chain: recursive key sort via a replacer
const stable = (o: unknown): string =>
  JSON.stringify(o, (_k, v) =>
    v && typeof v === "object" && !Array.isArray(v)
      ? Object.fromEntries(Object.entries(v).sort(([a], [b]) => (a < b ? -1 : 1)))
      : v,
  )
const hash = (o: unknown) =>
  "sha256:" + createHash("sha256").update(stable(o)).digest("hex")
 
export function diffRecords(input: ReplayInput, replayed: ReplayedRecord): ReplayVerdict {
  const a = input.persisted_record
  const b = replayed.record
  const record_hash = hash(a)
  if (record_hash === hash(b)) {
    return { ok: true, trace_id: input.trace_id, record_hash }
  }
 
  // walk the record fields and report the first divergence
  const fields: Array<keyof DecisionRecord> = [
    "decision_key", "evidence_refs", "policy_decisions", "approvals",
    "lineage", "outputs", "status",
  ]
  const diff: Record<string, [unknown, unknown]> = {}
  let firstDiverged: string | undefined
  for (const f of fields) {
    if (stable(a[f]) !== stable(b[f])) {
      if (!firstDiverged) firstDiverged = String(f)
      diff[String(f)] = [a[f], b[f]]
    }
  }
  return {
    ok: false,
    trace_id: input.trace_id,
    diverged_at: firstDiverged ?? "unknown",
    diff,
  }
}

The diff is structured, not stringified. The on-call engineer wants to know whether the verdict diverged on outputs versus policy_decisions versus lineage; those are very different incidents. A single Boolean “matched / did not match” is the same level of information as “we believe nothing changed” — useless when it matters.

The end-to-end driver

The four files compose into a 12-line replay driver:

harness/replay/index.ts
import { loadReplayInput } from "./load"
import { verifyChain } from "./verify-chain"
import { runReplay } from "./run"
import { diffRecords } from "./diff"
 
export async function replay(trace_id: string) {
  const input = await loadReplayInput(trace_id)
 
  const chain = verifyChain(input)
  if (!chain.ok) {
    return { trace_id, status: "tamper_detected" as const, chain }
  }
 
  const replayed = await runReplay(input)
  const verdict = diffRecords(input, replayed)
  const status = verdict.ok ? ("byte_equal" as const) : ("diverged" as const)
  return { trace_id, status, verdict }
}

replay(trace_id) is the function the on-call engineer runs. It returns one of three statuses: byte_equal (the run reproduced), diverged (the persisted record and the recomputed record differ — names which field), or tamper_detected (the chain hash is broken — names the index).
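The three statuses form a discriminated union, which is worth writing down. A sketch whose field names mirror the driver but should be treated as illustrative:

```typescript
// Illustrative result type for replay(trace_id); exhaustive handling falls
// out of the `status` discriminant for free.
type ReplayOutcome =
  | { status: "byte_equal"; record_hash: string }
  | { status: "diverged"; diverged_at: string }
  | { status: "tamper_detected"; broke_at: number }

function summarize(r: ReplayOutcome): string {
  switch (r.status) {
    case "byte_equal":
      return `reproduced (${r.record_hash})`
    case "diverged":
      return `record diverged at field ${r.diverged_at}`
    case "tamper_detected":
      return `chain broke at entry ${r.broke_at}`
  }
}
```

The compiler enforces that every status is handled, so a future fourth status cannot be silently ignored by tooling built on top of the driver.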

A clean replay

A successful replay against a real refund:

> replay("4bf92f3577b34da6a3ce929d0e0e4736")
{
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  status:   "byte_equal",
  verdict: {
    ok: true,
    trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
    record_hash: "sha256:7c4af1...",
  },
}

That return value is what the regulator sees. Not a paragraph, not a screenshot — a typed object signed by the runtime. If you can produce that for any historical trace_id, you have audit.

A tampered replay

For testing, edit one byte in a persisted evidence_refs array. Run replay:

> replay("4bf92f3577b34da6a3ce929d0e0e4736")
{
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  status:   "tamper_detected",
  chain: {
    ok: false,
    broke_at: 1,
    expected: "sha256:9b21...",
    observed: "sha256:c0fe...",
  },
}

The chain verifier caught it before re-execution even started. The break is at index 1: the first tool envelope. The on-call engineer pulls that envelope, compares it against its recorded hash, and the incident response is on a typed footing 30 seconds in.

What this discipline costs

Three operational commitments make replay real, not theoretical.

Tail-based sampling retains every IR-relevant run. Any run that crossed destructive, hit a loop guard, or failed scorecard thresholds gets retained in full. If sampling drops them, replay drops with them. The retention rule is part of the contract.
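As a sketch, that retention rule is one predicate; the `RunSummary` field names are assumptions:

```typescript
// Hypothetical per-run summary; any IR-relevant condition forces full retention.
type RunSummary = {
  crossed_destructive: boolean
  hit_loop_guard: boolean
  scorecard_passed: boolean
}

const mustRetainInFull = (run: RunSummary): boolean =>
  run.crossed_destructive || run.hit_loop_guard || !run.scorecard_passed
// everything else remains eligible for tail-based sampling
```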

Time-to-replay is an SLO. If the team cannot run replay(trace_id) against a one-month-old trace in under five minutes, the contract is theoretical. Most failures here are KG snapshot retention; budget for it.

Signing keys rotate, replay queries historical keys. Replay reads chain audit signatures whose keys may have rotated since the run. The signing-key registry needs effective windows and a “revoked-keys remain queryable for replay” rule. Unglamorous, missed every time until the first key rotation.
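A sketch of what "revoked keys remain queryable for replay" means in code; the `KeyRecord` shape is an assumption:

```typescript
// Hypothetical signing-key registry with effective windows. Revocation stops
// a key from signing *new* chains; it never hides the key from replay.
type KeyRecord = {
  key_id: string
  valid_from: string   // ISO 8601
  valid_until?: string // absent = still effective
  revoked: boolean
}

function keyForReplay(registry: KeyRecord[], key_id: string, signed_at: string) {
  // deliberately no `!k.revoked` filter: historical signatures stay verifiable
  return registry.find(
    (k) =>
      k.key_id === key_id &&
      k.valid_from <= signed_at &&
      (!k.valid_until || signed_at < k.valid_until),
  )
}

function keyForSigning(registry: KeyRecord[], now: string) {
  // new signatures must come from a non-revoked, currently effective key
  return registry.find(
    (k) => !k.revoked && k.valid_from <= now && (!k.valid_until || now < k.valid_until),
  )
}
```

The two lookups differ by exactly one clause, which is the whole point: revocation is a signing-time concern, never a replay-time one.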

A quarterly drill on a real production trace keeps the contract honest. If the team has not done it cold in six months, they probably cannot.

What this changes

Three things on day one.

Audit becomes a query. The regulator says “explain trace_id=abc.” The team runs replay(abc) and produces the recomputed record. Time to answer drops from days to seconds.

Drift detection becomes free. Sample 1% of historical traces every night, replay them on today’s runtime, and any non-byte_equal is signal. A model upgrade silently changed a verdict? The replay catches it. A pack edit changed which tools surface? The replay catches it. This is the substrate the Improvement Loop consumes for free.
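The nightly drift job is a few lines around the driver. A sketch, with `listTraceIds` and the sample rate as assumptions:

```typescript
// Sketch: replay a sample of historical traces on today's runtime; any
// non-byte_equal status is a drift signal worth a look.
async function nightlyDriftCheck(
  listTraceIds: () => Promise<string[]>,
  replay: (trace_id: string) => Promise<{ status: string }>,
  sampleRate = 0.01,
): Promise<string[]> {
  const ids = await listTraceIds()
  const sample = ids.filter(() => Math.random() < sampleRate)
  const drifted: string[] = []
  for (const trace_id of sample) {
    const { status } = await replay(trace_id)
    if (status !== "byte_equal") drifted.push(trace_id)
  }
  return drifted
}
```

Anything in the returned list goes straight to a human; the job itself never guesses at why a trace stopped reproducing.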

Incident response stops being narrative. The on-call engineer asks “did this run reproduce?” and gets a typed answer. The runbook for “the agent did something weird” starts with replay(trace_id), not with grep.

Four files. Twelve-line driver. Eighteen seconds against an eight-minute trace. That is the whole thing. Wire it up this sprint; the leverage shows up the first time you need it.
