April 8, 2026 · by Piyush · 8 min read

Replay Is the Real Audit Log

ContextOS
Audit
Replay
Decision Records
Incident Response
Compliance

The first time a regulator asked us to explain a decision the agent had made, the team spent a day and a half producing the answer. The trace was in three different systems. The model output was in a chat log. The retrieval was in an Elasticsearch index that had been re-indexed twice since the run. The policy version had been edited. We pieced together a narrative that was probably correct, and the regulator probably believed us, but I would not bet money on either claim.

The lesson I took away was not that we needed better logs. We had plenty of logs. The lesson was that logs are a feeling of audit, not the thing itself.

This post is about what the thing itself looks like.

2026 update: traces are the carrier, records are the contract

Open standards have already solved part of the problem. W3C Trace Context defines a common way to propagate trace identity between services, and OpenTelemetry gives teams a mature vocabulary for traces, metrics, and logs. Use them.

But do not mistake propagation for audit. A trace lets you correlate the run. A Decision Record tells you what was decided, which evidence and policy it relied on, which approvals were collected, and whether replay can reproduce it. ContextOS uses the trace as the spine and the Decision Record as the audit contract.

The shape of the problem

There is a moment in any post-incident review where someone looks at the on-call channel and asks: “what did the agent actually do, and why?” The default tools answer that question with prose — model output, paraphrased reasoning, a stack trace, a screenshot of the user’s chat. It feels like an answer. It is not.

Three failure modes show up reliably.

The first is that the logs are unstructured. Some lines are JSON, some are stack traces, some are the model’s recollection of its own behavior. Reconstructing the run requires interpretation, which means the answer depends on who reads the file. Two engineers given the same logs will reach different conclusions in non-trivial cases. That is bad enough internally; it is a disaster in front of a regulator.

The second is that the logs are not tamper-evident. Anyone with shell access to the log store could edit the JSON. The auditor knows this. So do you. The polite thing to do at this point is to stop calling them an audit trail.

The third is that the run is not reproducible. The model has been swapped. The retrieval index has been rebuilt. The policy has been edited. Even if the logs were perfect, the world they describe no longer exists. The team has to assume the logs match what happened; they cannot prove it.

These are different problems with the same root cause: logs were never a contract. They were a hope.

The contract

The contract that fixes this is short enough to fit on a notecard:

Given a trace_id, the runtime reconstructs the DecisionRecord byte-for-byte from pinned inputs.

If you can do that, you have audit. If you cannot, you do not. Everything else in this post is the consequence of that one sentence.

Replay rests on four pinned inputs:

| Input | What it pins | Refused if |
| --- | --- | --- |
| Context Pack version | ctxpack.support@5.2.0 with content hash | unsigned or unpinned |
| Knowledge Graph snapshot | kg_2026_04_07_T0930 | snapshot has been garbage-collected |
| Recorded invokeAgent envelope | RunContext, claims, scopes, intent, budgets | envelope hash mismatches |
| Recorded tool transcripts | every toolCall / toolResult pair | chain hash breaks |

Given those four, the canonical loop runs offline against the recorded transcripts and produces a Decision Record. The replay harness compares it to the persisted record. Equal: the run is reproducible. Not equal: the chain has been altered or the runtime has drifted, and the harness returns tamper_detected rather than a recomputed verdict.
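
For concreteness, here is a minimal sketch of that comparison in Python. The run_canonical_loop function and the store interface are stand-ins for whatever your runtime exposes, not the ContextOS API; the part that matters is the canonical serialization and the byte-equal check.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class ReplayResult:
    status: str           # "reproduced" or "tamper_detected"
    recomputed_hash: str
    persisted_hash: str


def canonical_bytes(record: dict) -> bytes:
    # Deterministic serialization: sorted keys, no whitespace.
    # Equal records always produce equal bytes, so hashes are comparable.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()


def replay(trace_id: str, store) -> ReplayResult:
    # Load the four pinned inputs and the persisted Decision Record.
    pinned = store.load_pinned_inputs(trace_id)      # pack, snapshot, envelope, transcripts
    persisted = store.load_decision_record(trace_id)

    # Re-run the canonical loop offline. No live tools are called;
    # every toolResult comes from the recorded transcripts.
    recomputed = run_canonical_loop(pinned)

    recomputed_hash = hashlib.sha256(canonical_bytes(recomputed)).hexdigest()
    persisted_hash = hashlib.sha256(canonical_bytes(persisted)).hexdigest()

    status = "reproduced" if recomputed_hash == persisted_hash else "tamper_detected"
    return ReplayResult(status, recomputed_hash, persisted_hash)
```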

What a replayable Decision Record looks like

{
  "record_id": "dr_2026_04_08_a17",
  "decision_key": "support.refund.execute",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "evidence_refs": [
    "kg:order:ord_881#snapshot_kg_2026_04_07_T0930",
    "tool:orders.lookup:tc_117",
    "tool:policy.eval:tc_119",
    "tool:payments.refund:tc_121"
  ],
  "policy_decisions": [
    { "policy_decision_id": "pol_9900", "rule_ids": ["R_REFUND_REQUIRES_IDV"] },
    { "policy_decision_id": "pol_9901", "rule_ids": ["R_REFUND_LIMIT_BY_ROLE"] }
  ],
  "approvals": [
    {
      "gate_id": "GATE_HIGH_VALUE",
      "approver": "user_finance_lead_77",
      "approval_mode_effective": "destructive",
      "evidence_snapshot_hash": "sha256:b2a1...",
      "decided_at": "2026-04-08T09:31:30Z"
    }
  ],
  "lineage": {
    "pack_version": "ctxpack.support@5.2.0",
    "snapshot_version": "kg_2026_04_07_T0930",
    "model": { "id": "model_xyz", "sha256": "..." }
  },
  "audit": {
    "chain_prev_hash": "sha256:7c4a...",
    "signed_by": "kid_runtime_2026Q2"
  }
}

What is not in this record is prose. There is no “the model decided X because Y.” There is a verdict, the evidence, the policies, the approvals, and a hash chain. Those are facts. Prose is interpretation.

Hash chains, in plain language

Decision Records, tool transcripts, and policy decisions for a given run form an append-only chain keyed by trace_id. Each entry’s hash includes the previous entry’s hash. Tampering with any entry breaks the chain at that point, which the harness detects on replay.
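
In code, the chaining is a few lines. A sketch, assuming entries are canonically serialized before hashing; field names mirror the record above but are otherwise illustrative.

```python
import hashlib
import json


def canonical(entry: dict) -> bytes:
    # Deterministic serialization so the same entry always hashes the same way.
    return json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()


def append_entry(chain: list, body: dict) -> dict:
    # Each entry's hash covers its body plus the previous entry's hash,
    # so editing any earlier entry breaks every link after it.
    prev_hash = chain[-1]["entry_hash"] if chain else "sha256:genesis"
    entry = {"body": body, "chain_prev_hash": prev_hash}
    entry["entry_hash"] = "sha256:" + hashlib.sha256(canonical(entry)).hexdigest()
    chain.append(entry)
    return entry


def verify_chain(chain: list):
    # Returns the index of the first broken link, or None if the chain is intact.
    prev_hash = "sha256:genesis"
    for i, entry in enumerate(chain):
        unsigned = {"body": entry["body"], "chain_prev_hash": entry["chain_prev_hash"]}
        expected = "sha256:" + hashlib.sha256(canonical(unsigned)).hexdigest()
        if entry["chain_prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return i
        prev_hash = entry["entry_hash"]
    return None
```

Flip one byte in any body and verify_chain names the index where the chain breaks.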

The practical effect is the difference between “we believe nothing was changed” and “we can prove nothing was changed.” Auditors care about the second sentence in a way that is easy to underestimate until you sit through a real audit.

What changes when replay is the IR primitive

Incident response stops being narrative reconstruction and starts being a query.

The on-call engineer takes the trace_id and runs the replay. If the recomputed record matches, behavior is intact, and the issue is downstream — data drift, model swap, configuration change. If it does not match, the harness names the diff: which step diverged, which transcript hash broke, which lineage entry no longer matches today’s runtime. Either way, the engineer has a typed answer in minutes, not days.
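
The shape of that typed answer might look something like this; the field names are illustrative, not a schema ContextOS defines.

```python
# Illustrative non-match result from a replay run. A matching run would
# carry status "reproduced" and empty divergence fields.
replay_diff = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "status": "diverged",
    "first_divergent_step": "tool:policy.eval:tc_119",
    "transcript_hash_breaks": [],   # empty: the recorded chain itself is intact
    "lineage_mismatches": [
        {"field": "model.sha256", "recorded": "sha256:9d0e...", "current": "sha256:41aa..."}
    ],
}
```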

This is what I wished we had for the regulator the first time around. Not better logs — a clock that started at “give me the trace id” and stopped at “here is the recomputed Decision Record, signed, byte-equal.”

What changes for compliance

Compliance asks change shape entirely. A request like “pull every refund over 10,000 rupees in Q1, who approved them, and what evidence was on file” used to be a multi-week join across systems. With Decision Records, it is a projection:

SELECT trace_id, approvals, evidence_refs, lineage
FROM decision_records
WHERE decision_key = 'support.refund.execute'
  AND (outputs->>'refund_amount_inr')::numeric > 10000
  AND timestamp BETWEEN '2026-01-01' AND '2026-04-01';

The audit story was already written by the runtime. The compliance team is reading what the runtime emitted, not reconstructing what the team thinks happened.

Replay also catches drift before the regulator does

This is the part most teams under-use. Replay is not only an incident tool; it is a continuous test against the live runtime. Sample historical trace_ids, replay them on today’s runtime, and any non-match is signal:

- A model upgrade silently changed verdicts on inputs the team thought were canonical.
- A policy bundle change made a previously-allowed call a denial.
- A pack edit shifted which tools the model surfaces.
- A retrieval change altered which evidence the run grounded against.

Each of these has happened to every team I know that runs an agent in production. With replay as a continuous check, they become release-gate verdicts. Without it, they become Slack threads three weeks later, after a customer notices.
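
Wiring that in does not need much. A rough sketch, reusing the replay helper from the earlier sketch and the same stand-in store; treat any drift as a release-gate failure, not a dashboard curiosity.

```python
import random


def replay_drift_check(store, sample_size: int = 50) -> list:
    # Sample historical runs and replay them against today's runtime.
    # Any non-reproduced run is drift: model, policy, pack, or retrieval
    # changed behavior on an input that used to be settled.
    trace_ids = store.list_trace_ids(days=90)
    sample = random.sample(trace_ids, min(sample_size, len(trace_ids)))
    return [tid for tid in sample if replay(tid, store).status != "reproduced"]


# In CI/CD: fail the release if the list is non-empty, and attach the
# typed diffs for the offending trace_ids to the release record.
```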

For more on this loop, see evaluation and observability.

What it costs you operationally

A few commitments are non-negotiable.

Tail-based sampling has to retain every run that crossed destructive, hit a loop guard, or failed scorecard thresholds. These are the IR-relevant runs by construction. If your sampling pipeline drops them, your audit story drops with them.
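
As a sketch, the retention rule is a predicate the sampler consults before dropping anything; the field names below are assumptions about what your run summary carries, not a ContextOS schema.

```python
def must_retain(run: dict) -> bool:
    # Runs that are IR-relevant by construction are never dropped,
    # whatever the sampling rate says for everything else.
    return (
        run.get("approval_mode_effective") == "destructive"
        or run.get("loop_guard_tripped", False)
        or run.get("scorecard_passed", True) is False
        or run.get("policy_outcome") in {"denied", "escalated"}
    )
```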

Time-to-replay for a given trace_id becomes an SLO. If the team cannot stand up a replay within hours, the contract is theoretical.

Signing keys rotate, and replay has to remain valid against historical keys. The signing-key registry needs effective windows and a “revoked-keys remain queryable for replay” rule. This is unglamorous and the kind of thing that gets missed until the first key rotation.
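
A sketch of that registry rule, with illustrative key ids and windows: revocation stops new signatures, but a key stays queryable so records signed during its window still verify on replay.

```python
# Illustrative signing-key registry with effective windows. Revoked keys
# remain in the registry so historical signatures stay verifiable.
KEY_REGISTRY = {
    "kid_runtime_2026Q1": {"valid_from": "2026-01-01", "valid_to": "2026-04-01", "status": "revoked"},
    "kid_runtime_2026Q2": {"valid_from": "2026-04-01", "valid_to": "2026-07-01", "status": "active"},
}


def key_valid_for_replay(kid: str, signed_at: str) -> bool:
    # A signature verifies for replay if the record was signed inside the
    # key's effective window -- even if the key has since been revoked.
    entry = KEY_REGISTRY.get(kid)
    if entry is None:
        return False
    return entry["valid_from"] <= signed_at[:10] < entry["valid_to"]
```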

A quarterly drill end-to-end on a real production trace_id keeps the contract honest. If the team has not done it cold in six months, they probably cannot.

What people get wrong

The two arguments against this approach are both wrong, and they are wrong in the same way.

“OTEL spans are enough.” Spans are part of the audit. They are not the audit. Spans without a Decision Record give you observability without provenance — you can see what happened, but you cannot prove the verdict.

“Replay is too expensive.” Replay does not re-execute tools. It re-runs the canonical loop against recorded transcripts. The cost is small relative to live execution, and you amortize it across every audit and every incident you would otherwise spend a day reconstructing.

Both arguments treat replay as an optional addition to a logging-based audit. It is not an addition. It is a different kind of artifact that does the work logs cannot.

Audit-readiness checklist

| Check | Required evidence |
| --- | --- |
| Trace propagation | One trace_id connects compile, plan, policy, tool, approval, evaluator, and final record spans. |
| Pinned inputs | Pack version, policy bundle, KG snapshot, model profile, and request envelope hashes are recorded. |
| Tool transcripts | Every tool call/result pair is stored with schema version, idempotency key, and hash chain position. |
| Human approvals | Approver, gate id, effective mode, frozen evidence hash, and timestamp are first-class fields. |
| Replay result | Historical trace_ids replay byte-equal or return a typed diff/tamper verdict. |
| Retention | Destructive, denied, escalated, and failed-scorecard runs are never dropped by sampling. |

Closing

When the regulator, the auditor, or the customer asks “what happened on this run?”, the right answer is not a paragraph. It is a trace_id, a replay command, and a typed Decision Record that matches the persisted one byte-for-byte.

If you can produce that, you have audit. If you cannot, no quantity of logs will save you when it matters.

Found this useful? Share it.
