Trust, audit, governance
May 30, 2026
18 min read

Reversibility Is the Missing Safety Primitive for AI Agents


Guardrails reduce bad actions. Reversal contracts reduce the cost of being wrong.

TL;DR

The agentic-safety conversation is almost entirely about prevention: keep the bad action from happening. Surfaced tools, deterministic policy, argument constraints, approval gates, identity boundaries — every one of them answers the question “may this call happen?” before it does.

That question is necessary and it is not sufficient. Prevention is a probability-reduction mechanism, and its residual failure rate is never zero. Worse, an action can pass every gate, be fully authorized, and still turn out wrong — because the evidence it was authorized on was stale, the world moved, or the approver signed a snapshot that drifted. Prevention has nothing to say about that case. Neither does audit; replay tells you what happened, not how to take it back.

The missing primitive is reversibility: designing every consequential action so that a decision which was correct at commit time but wrong in hindsight can be undone or compensated, and bounding blast radius so the cost of being wrong stays small.

Prevention decides whether the action happens. Reversibility decides whether you survive being wrong.

The mechanisms are not new. Distributed systems have used compensating transactions, sagas, reservations, and idempotent recovery for decades. The contribution of this paper is narrower and more specific: making reversibility a declared, executable contract on the Action plane — checked at tool registration, carried in typed mutation refs, and enforced by a governed recovery loop — rather than leaving it to ad hoc cleanup code and human heroics. This is not “agents need an undo button.” It is: high-authority agents need executable reversal contracts, bounded blast radius, explicit assumption invalidation, and governed recovery.

[Infographic: visual summary of reversibility as the missing safety primitive for AI agents. It covers why prevention is not enough, the expected harm model, the reversibility ladder, high-authority agent requirements, and the key takeaway that agent safety means surviving authorized actions that become wrong later.]


1. Why prevention is not enough

A mature agent runtime already stops most bad actions. The Tool Gateway surfaces only permitted capabilities, re-evaluates every call against deterministic policy, enforces arg_constraints, and routes high-risk calls through an approval gate. Prompt injection becomes a boundary problem, not a prompt problem. This is the correct architecture and you should build it.

But look at what every one of those controls has in common: they all run before the side effect, and they all decide on the information available at decision time. That leaves three gaps that no amount of pre-execution rigor can close.

Gap 1 — Correct at commit, wrong in hindsight. An action can satisfy every policy and still be the wrong thing to do, because the evidence that justified it was already stale. The approval-gates post tells this story directly: the evidence said the order had not shipped; by the time the human clicked approve, the warehouse had loaded it onto a truck; the refund went through anyway. Nothing was violated. The decision was simply overtaken by reality. Prevention cannot catch this, because at the instant of the check, the action was admissible.

Gap 2 — Partial failure. A consequential outcome is rarely one tool call. It is reserve inventory → charge card → book courier → notify customer. When step three fails, steps one and two have already happened. The run did not do something forbidden; it did half of something permitted, and half-done is its own kind of damage.

Gap 3 — The residual failure rate of prevention itself. No boundary is perfect. A new capability ships without an arg_constraint. A policy rule has a gap. A model finds a phrasing that an approver waves through. If you deploy enough agents for enough requests, the probability that some unintended action executes approaches one. A safety model that assumes prevention is total is not a safety model; it is optimism.

The expected harm of an agent fleet is, roughly:

expected_harm  ≈  P(unintended action) × blast_radius × recovery_cost

Prevention attacks only the first term, and can never drive it to zero. Reversibility attacks the other two — and while neither term reaches zero (compensation has latency, and residual harm is real), both can be bounded, observed, and reduced enough to make autonomy operationally safe. That is the entire argument of this paper.
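To make the three terms concrete, here is a small arithmetic sketch. Every number is hypothetical, and treating recovery_cost as the unrecovered fraction of the exposed value is an assumption made for illustration; the text only gives the multiplicative shape.

```typescript
// Hypothetical inputs; only the product of three terms comes from the text.
interface HarmTerms {
  pUnintended: number;  // probability that a run executes an unintended action
  blastRadius: number;  // value exposed by one unintended action (e.g. rupees)
  recoveryCost: number; // assumed: fraction of exposed value NOT recovered (0..1)
}

function expectedHarm(t: HarmTerms): number {
  return t.pUnintended * t.blastRadius * t.recoveryCost;
}

// Same prevention quality in both cases; only the last two terms differ.
const preventionOnly: HarmTerms = { pUnintended: 0.001, blastRadius: 500000, recoveryCost: 1 };
const withReversibility: HarmTerms = { pUnintended: 0.001, blastRadius: 5000, recoveryCost: 0.2 };

const harmPreventionOnly = expectedHarm(preventionOnly);       // about 500 per run
const harmWithReversibility = expectedHarm(withReversibility); // about 1 per run
```

With the first term held fixed, capping the blast radius and recovering most of the exposed value account for the whole difference in this toy calculation.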


2. Definition

Reversibility is the property of a committed action whereby its effects can be undone, or compensated to an equivalent state, within a bounded time and cost.

Reversibility is not the same as approval mode, and conflating them is the common mistake. Approval mode answers “is this actor allowed to cause this effect?” Reversibility answers “if this effect turns out to be wrong, what does it cost to take it back?” They are orthogonal axes:

  • A read_only lookup is low-authority and has zero domain mutation to reverse — though a read is not zero risk; it can still carry audit, privacy, or exposure risk.
  • A local_write that overwrites a record without capturing the prior value is low-authority but hard to reverse — you cannot restore a state you did not retain.
  • A destructive data deletion is high-authority and irreversible — the worst quadrant.
  • A delegated payment is high-authority but can be compensable if the system registers a bounded refund or reversal capability, captures the required transaction handle, and verifies the compensation outcome. It is rarely fully reversible: settlement delay, fees, disputes, customer comms, and reconciliation state usually remain.

Two systems with identical approval-mode policy can have wildly different safety profiles depending on whether their high-authority actions are compensable. The Action plane has been graded on one axis. This paper adds the second.


3. The reversibility taxonomy

Every side-effecting capability falls into one of three classes. The class, not the approval mode, determines the recovery strategy.

| Class | Definition | Examples | Recovery strategy |
| --- | --- | --- | --- |
| Reversible | A true inverse exists and restores the prior state exactly. | Add/remove a label, set/unset a flag, soft-delete with tombstone, update with prior value retained. | Execute the inverse capability. |
| Compensable | No exact inverse, but a compensating action restores an equivalent business state. | Charge → refund, ship → recall, book → cancel-with-fee, grant access → revoke access. | Execute the registered compensation, then verify. |
| Irreversible | The effect cannot be undone or meaningfully compensated once committed. | Email/SMS sent, funds wired to an external bank, physical action taken, hard delete of the only copy. | None after commit. Must be made cancellable before commit (see §6) or gated to a human. |

A caution that the distributed-systems literature is emphatic about: compensation is not a rollback. As Azure’s compensating-transaction guidance notes, a compensating transaction is application-specific, may not restore the original state exactly, must be idempotent, can itself fail, and sometimes requires manual intervention. “Compensable” means recoverable to an equivalent state, not un-happened.

The design goal is explicit: drive every consequential capability as far up this table as it will go. Many actions classified as “irreversible” are only irreversible because of how they were built, not because of physics. The capability does not change; the envelope around it does.


4. The reversal contract

Reversibility cannot be inferred by the model at runtime, and it must not be guessed. It is declared once, at tool registration, as part of the adapter contract — the same place capability, schema, and approval_mode already live. A capability with side effects that does not declare its reversal class fails registration. Default-deny applies to reversibility exactly as it applies to authority.

A boolean is not enough. A serious reversal contract carries the compensation capability, its own approval mode, expiry semantics, a verification plan, and a blast-radius bound:

// Proposed extension to the AdapterContract reference in src/lib/contextos/types.ts.
// Reversal is declared alongside approval_mode, not discovered at runtime.
 
type ReversalClass = "reversible" | "compensable" | "irreversible";
 
interface ReversalSpec {
  class: ReversalClass;
 
  /** For "reversible": the capability that exactly inverts this one. */
  inverse_capability?: string;          // e.g. "adp_tags.remove_label"
  /** For "compensable": the capability that restores equivalent state. */
  compensation_capability?: string;     // e.g. "adp_payments.issue_refund"
 
  /**
   * Recovery is not privileged. A compensation can be MORE sensitive than the
   * action it reverses, so it declares its own governance rather than
   * inheriting the original's.
   */
  recovery_approval_mode?: "auto" | "human" | "dual_control";
  recovery_policy_ref?: string;
 
  /**
   * Pre-commit cancellability. A non-zero hold lets the recovery loop CANCEL
   * the effect before it is emitted. It does NOT make a committed effect
   * compensable — see §5.
   */
  commit_hold_ms?: number;
  cancel_capability?: string;
 
  /** Reversal handles expire: refund windows, vendor SLAs, settlement cutoffs. */
  reversal_window_ms?: number;
  expires_after_commit_ms?: number;
 
  requires_prior_value_capture?: boolean;
  requires_external_handle?: boolean;
 
  /** Compensation is verified, not assumed. */
  verification: {
    verify_capability: string;
    success_condition: string;
    max_attempts?: number;
  };
 
  /** Bounds the cost of being wrong even before recovery runs. */
  blast_radius: {
    max_calls_per_run?: number;
    max_value_per_run?: number;
    max_subjects_per_run?: number;
    scope?: "single_subject" | "tenant" | "cross_tenant";
  };
}
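The default-deny registration rule described before the interface could be checked along these lines. This is a minimal sketch: `ReversalSpecLite`, `CapabilityRegistration`, `has_side_effects`, `human_gated`, and `validateRegistration` are hypothetical names introduced for illustration, not part of the proposed contract.

```typescript
type ReversalClass = "reversible" | "compensable" | "irreversible";

// Trimmed stand-in for ReversalSpec: only the fields this check reads.
interface ReversalSpecLite {
  class: ReversalClass;
  inverse_capability?: string;
  compensation_capability?: string;
  commit_hold_ms?: number;
}

// Hypothetical registration record; `human_gated` is an illustrative field.
interface CapabilityRegistration {
  capability: string;
  has_side_effects: boolean;
  human_gated?: boolean;
  reversal?: ReversalSpecLite;
}

// Returns registration errors; an empty array means the capability may register.
function validateRegistration(reg: CapabilityRegistration): string[] {
  if (!reg.has_side_effects) return []; // nothing to reverse
  const spec = reg.reversal;
  if (spec === undefined) {
    return [`${reg.capability}: side effects declared without a reversal class`];
  }
  const errors: string[] = [];
  if (spec.class === "reversible" && !spec.inverse_capability) {
    errors.push(`${reg.capability}: reversible requires inverse_capability`);
  }
  if (spec.class === "compensable" && !spec.compensation_capability) {
    errors.push(`${reg.capability}: compensable requires compensation_capability`);
  }
  const held = (spec.commit_hold_ms ?? 0) > 0;
  if (spec.class === "irreversible" && !held && reg.human_gated !== true) {
    errors.push(`${reg.capability}: irreversible requires a commit hold or a human gate`);
  }
  return errors;
}
```

Returning a list of errors rather than throwing on the first one is a design choice for the sketch: a registration report can then show every missing field at once.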

When the Gateway executes a compensable or reversible call, the reversal handle must be captured in a typed mutation ref — not a bare string. The reference ToolResult carries mutation_refs; this proposal makes each ref an executable recovery object:

// Proposed: mutation_refs evolves from opaque strings into resolvable MutationRef pointers.
interface MutationRef {
  mutation_id: string;
  original_call_id: string;           // the action this ref can reverse
  capability: string;
  subject_ref: string;
 
  reversal_class: ReversalClass;
  reversal_capability?: string;
  reversal_args?: unknown;            // bound at execution, not reconstructed later
 
  /** Which decision-time assumptions, if invalidated, should trigger recovery. */
  assumption_refs: string[];
  idempotency_key: string;
 
  committed_at: string;
  reversal_expires_at?: string;       // after this, automated recovery is unsafe
 
  prior_value_hash?: string;          // for reversible overwrites
  external_transaction_ref?: string;  // e.g. PSP charge id, courier booking id
 
  verification_ref?: string;
}

ToolResult.mutation_refs may remain a list of stable mutation-ref IDs on the wire, but every ID must resolve to a typed MutationRef before recovery can run. The string is not the undo plan; it is the pointer to the undo plan. (Keeping it a pointer is also more production-realistic — you rarely want full reversal args or sensitive PSP handles embedded in every tool-result payload.)

The shift is small to state and large in consequence: mutation_refs stops being a breadcrumb for auditors and becomes a recovery control plane. After the fact, you do not reconstruct what to compensate from logs. You read the mutation refs and run their bound, verified, time-bounded reversals.
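The rule that every wire-level ID must resolve to a typed ref before recovery runs can be sketched as a resolver that refuses partial results. `MutationRefLite`, `Resolution`, and `resolveMutationRefs` are hypothetical names, and the in-memory map stands in for whatever store a real runtime would use.

```typescript
// Trimmed stand-in for MutationRef: only a few illustrative fields.
interface MutationRefLite {
  mutation_id: string;
  reversal_class: "reversible" | "compensable" | "irreversible";
  reversal_capability?: string;
  reversal_expires_at?: string; // ISO-8601 timestamp
}

type Resolution =
  | { ok: true; refs: MutationRefLite[] }
  | { ok: false; unresolved: string[] };

// Resolve every ID or report exactly which ones are missing.
function resolveMutationRefs(
  ids: string[],
  store: ReadonlyMap<string, MutationRefLite>,
): Resolution {
  const refs: MutationRefLite[] = [];
  const unresolved: string[] = [];
  for (const id of ids) {
    const ref = store.get(id);
    if (ref !== undefined) refs.push(ref);
    else unresolved.push(id);
  }
  return unresolved.length === 0 ? { ok: true, refs } : { ok: false, unresolved };
}
```

An unresolved ID is treated as a hard stop here: a pointer with no plan behind it is an orphaned mutation, not something to skip quietly.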


5. Reversal is a lifecycle, not a boolean

The biggest modeling error is treating reversibility as a static flag on a capability. In practice a mutation moves through states, and the recovery options at each state are different. A committed effect inside its refund window is recoverable; the same effect a week later is not.

| State | Meaning |
| --- | --- |
| planned | Tool call selected but not executed. |
| pending_commit | Effect queued, cancellable, or awaiting final verification (commit hold). |
| committed | External side effect has happened. |
| reversal_available | A valid, in-window reversal/compensation handle exists. |
| reversal_expired | The compensation window has closed; automated recovery is unsafe. |
| compensating | A recovery action is running. |
| compensated | Equivalent business state has been restored and verified. |
| compensation_failed | Recovery failed and must escalate. |
| manual_resolution_required | No safe automated recovery remains. |

This is why commit_hold_ms and reversal_window_ms are first-class in the contract. The window between committed and reversal_expired is the entire operating range of the recovery loop. Outside it, the only honest move is to escalate to a human, and the system should say so explicitly rather than silently failing to act.
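One way to keep the lifecycle honest is an explicit transition table. The state names come from the table above; the edges between them are an assumption made for this sketch, since the text lists states rather than transitions. Cancellation during a commit hold is not modeled because the table has no cancelled state.

```typescript
type MutationState =
  | "planned" | "pending_commit" | "committed" | "reversal_available"
  | "reversal_expired" | "compensating" | "compensated"
  | "compensation_failed" | "manual_resolution_required";

// Assumed edges; adjust to the real runtime's rules.
const ALLOWED: Record<MutationState, readonly MutationState[]> = {
  planned: ["pending_commit", "committed"],
  pending_commit: ["committed"],
  committed: ["reversal_available", "manual_resolution_required"],
  reversal_available: ["compensating", "reversal_expired"],
  reversal_expired: ["manual_resolution_required"],
  compensating: ["compensated", "compensation_failed"],
  compensated: [],
  compensation_failed: ["manual_resolution_required"],
  manual_resolution_required: [],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: MutationState, to: MutationState): MutationState {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Under these assumed edges, an expired handle can only move to manual_resolution_required, so an automated compensation after the window closes is rejected by construction rather than by convention.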


6. Moving actions up the ladder

Most of the engineering is converting hard-to-reverse actions into compensable — or at least pre-commit-cancellable — ones before they commit. None of these techniques is new; they are the distributed-systems canon, applied to model-driven actions.

| Technique | What it does | Effect on the ladder |
| --- | --- | --- |
| Commit hold / cancellable window | Queue the effect and expose a cancellation window before it truly fires (the “undo send” pattern). | Adds pre-commit cancellability, not post-commit compensation |
| Reservation / escrow / pre-commit verification | Split into a reversible reserve and a final commit; only commit after preconditions re-verify. | Hard commit → staged, cancellable |
| Dry-run / shadow execution | Compute and return the full effect without committing it, so the Critic and humans see exactly what would happen. | Risk → inspectable |
| Saga with compensation | Model a multi-step outcome as a sequence where each step registers its compensating action; on failure, run compensations in reverse. | Partial failure → recoverable |
| Soft delete / tombstone | Never hard-delete; mark deleted and retain for a TTL. | Irreversible → reversible |
| Prior-value capture | Before an overwrite, snapshot the prior value into the mutation ref. | Hard-to-reverse → reversible |
| Blast-radius cap | Bound count, value, subjects, and scope per run, so the cost of being wrong is small even if recovery is slow. | Unbounded → bounded |
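The commit-hold row can be illustrated with a small queue that defers the real side effect. This is a sketch under stated assumptions: `CommitHoldQueue` and its methods are hypothetical, and the clock is passed in as a number so the behavior is deterministic instead of depending on timers.

```typescript
interface HeldEffect {
  id: string;
  release_at_ms: number; // enqueue time + commit_hold_ms
  emit: () => void;      // the real side effect, deferred
}

class CommitHoldQueue {
  private held = new Map<string, HeldEffect>();

  enqueue(id: string, nowMs: number, holdMs: number, emit: () => void): void {
    this.held.set(id, { id, release_at_ms: nowMs + holdMs, emit });
  }

  // Cancel succeeds only while the effect is still held.
  cancel(id: string): boolean {
    return this.held.delete(id);
  }

  // Emit every effect whose hold has elapsed; returns the ids that fired.
  flush(nowMs: number): string[] {
    const due: HeldEffect[] = [];
    this.held.forEach((effect) => {
      if (effect.release_at_ms <= nowMs) due.push(effect);
    });
    for (const effect of due) {
      this.held.delete(effect.id);
      effect.emit();
    }
    return due.map((effect) => effect.id);
  }
}
```

Once `flush` has emitted an effect, `cancel` returns false for it. That is the point the table makes: the hold adds pre-commit cancellability and nothing after commit.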

A note on naming: “reservation / escrow” is deliberately not called two-phase commit. Two-phase commit is a specific distributed-transaction protocol that is rarely feasible across third-party systems an agent calls. What works in practice is the saga pattern: local actions paired with compensations — with the caveat AWS itself flags, that saga complexity grows with the number of services involved.
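The saga pattern named above can be sketched in a few lines. The step names follow the reserve, charge, book, notify example from §1; the `SagaStep` shape and `runSaga` are illustrative, the steps are synchronous for brevity, and the sketch assumes compensations themselves succeed, which a real system cannot.

```typescript
interface SagaStep {
  name: string;
  action: () => void;     // throws on failure
  compensate: () => void; // should be idempotent in a real system
}

interface SagaResult {
  completed: string[];
  compensated: string[];
  failed_step?: string;
}

// Run steps in order; on the first failure, compensate completed steps in reverse.
function runSaga(steps: SagaStep[]): SagaResult {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      for (const prior of done.slice().reverse()) {
        prior.compensate();
        compensated.push(prior.name);
      }
      return { completed: done.map((s) => s.name), compensated, failed_step: step.name };
    }
  }
  return { completed: done.map((s) => s.name), compensated: [] };
}

// Usage: the courier booking fails after inventory and the card were touched.
const sagaLog: string[] = [];
const sagaResult = runSaga([
  { name: "reserve_inventory", action: () => { sagaLog.push("reserve"); }, compensate: () => { sagaLog.push("release"); } },
  { name: "charge_card", action: () => { sagaLog.push("charge"); }, compensate: () => { sagaLog.push("refund"); } },
  { name: "book_courier", action: () => { throw new Error("courier unavailable"); }, compensate: () => { sagaLog.push("cancel_courier"); } },
  { name: "notify_customer", action: () => { sagaLog.push("notify"); }, compensate: () => { sagaLog.push("retract"); } },
]);
// sagaLog is now ["reserve", "charge", "refund", "release"]
```

The failed step is not compensated because it never completed, and the notification never runs.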

The blast-radius cap deserves emphasis because it is the cheapest and most-skipped. Even a perfectly compensable action has a recovery latency — a refund still takes days to clear, a recalled shipment still moved. Capping how much an agent can do wrong per run turns a catastrophe into an incident. An agent that can refund ₹5,000 per run and is wrong is a support ticket; an agent that can refund ₹5,00,000 per run and is wrong is an outage.
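A per-run cap check might look like the following sketch. The cap field names mirror the blast_radius block of the contract; `RunUsage` and `admit` are hypothetical, and returning "escalate" rather than "deny" is an assumption borrowed from how the budgets in this post behave.

```typescript
interface BlastRadius {
  max_calls_per_run?: number;
  max_value_per_run?: number;
  max_subjects_per_run?: number;
}

// Running totals for one run, mutated only when a call is admitted.
interface RunUsage {
  calls: number;
  value: number;
  subjects: Set<string>;
}

function admit(
  cap: BlastRadius,
  usage: RunUsage,
  call: { value: number; subject: string },
): "ok" | "escalate" {
  const calls = usage.calls + 1;
  const value = usage.value + call.value;
  const subjects = usage.subjects.has(call.subject)
    ? usage.subjects.size
    : usage.subjects.size + 1;

  if (cap.max_calls_per_run !== undefined && calls > cap.max_calls_per_run) return "escalate";
  if (cap.max_value_per_run !== undefined && value > cap.max_value_per_run) return "escalate";
  if (cap.max_subjects_per_run !== undefined && subjects > cap.max_subjects_per_run) return "escalate";

  usage.calls = calls;
  usage.value = value;
  usage.subjects.add(call.subject);
  return "ok";
}
```

A call that would breach any cap leaves the usage counters untouched, so a later, smaller call in the same run can still be admitted.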


7. The recovery loop

Prevention runs once, before execution. Reversibility adds a second loop that runs after execution, for the life of the run and beyond: when an observation, a Critic verdict, or a human invalidates the assumption that justified a committed action, the runtime does not merely alert — it executes the bound reversal, if one is still in window.

[Figure: recovery-loop flowchart. Nodes: execute or stage action; bind typed MutationRef at pending_commit or committed; later signal? (observation contradicts assumption_ref, Critic / human invalidates, or commit hold expires cleanly); in reversal window? (yes / no); finalize external effect; reversal call through Tool Gateway; execute compensation; verify success_condition (pass / fail); new DecisionRecord with reverses_mutation_ref; resolved; escalate: manual_resolution_required.]
The recovery loop. A staged or committed action's typed mutation refs carry executable, time-bounded reversals; when a later signal invalidates a recorded assumption, the runtime runs the compensation as its own governed call, verifies it, and links it back to the original mutation. Outside the reversal window it escalates to a human.

Three properties make this trustworthy rather than a second source of bugs.

Reversal is itself a governed action, on its own terms. A compensation is a tool call. It traverses the same Gateway and produces its own DecisionRecord. Crucially, it does not inherit the original’s approval mode — a refund reversal can be more sensitive than the refund — so it runs under the recovery_approval_mode declared in the contract. There is no privileged “undo” backdoor.

Reversal is idempotent, verified, and linked. The compensation carries an idempotency_key derived from the original mutation, so a retried or double-fired recovery compensates exactly once. Its success_condition is checked by observation, not assumed from a 200 OK. The new record points back via reverses_mutation_ref (and the record chain via prev_record_hash), so the audit story shows action and its reversal as one causal link.

const recovery: ToolCall = {
  call_id: `rev_${mutation.mutation_id}`,
  run_id: ctx.run_id,
  adapter_id: registry.adapterFor(mutation.reversal_capability), // resolved from the registry
  capability: mutation.reversal_capability,           // from the MutationRef
  args: mutation.reversal_args,                        // bound at original execution
  approval_mode: mapRecoveryApprovalMode(reversal.recovery_approval_mode), // human and dual_control both map to "delegated"; recovery_policy_ref keeps them distinct
  policy_ref: reversal.recovery_policy_ref,            // governed on its OWN terms
  idempotency_key: `rev_${mutation.idempotency_key}`,
  reverses_mutation_ref: mutation.mutation_id,         // causal link, not just prev hash
};
 
function mapRecoveryApprovalMode(
  mode: "auto" | "human" | "dual_control" | undefined,
): "network" | "delegated" {
  return mode === "auto" ? "network" : "delegated";
}

dual_control is not encoded only in approval_mode; it is enforced by the referenced recovery policy (recovery_policy_ref), so a two-person reversal never silently degrades into generic delegated approval.
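The "compensates exactly once" and "checked by observation" properties described earlier in this section can be sketched together. `CompensationExecutor` is a hypothetical name, and the in-memory map is an assumption: a real runtime would need a durable store and would also pass the idempotency key to the downstream system.

```typescript
type CompensationOutcome = "compensated" | "compensation_failed";

class CompensationExecutor {
  // Outcome per idempotency key; a repeat call returns the recorded outcome.
  private outcomes = new Map<string, CompensationOutcome>();

  run(
    idempotencyKey: string,
    compensate: () => void,
    verify: () => boolean, // observes the world; does not trust a status code
    maxAttempts = 3,
  ): CompensationOutcome {
    const prior = this.outcomes.get(idempotencyKey);
    if (prior !== undefined) return prior; // duplicate or retried recovery

    compensate();

    let outcome: CompensationOutcome = "compensation_failed";
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      if (verify()) {
        outcome = "compensated";
        break;
      }
    }
    this.outcomes.set(idempotencyKey, outcome);
    return outcome;
  }
}
```

A recorded compensation_failed is also returned as-is instead of re-running the compensation, on the reading that a failed recovery escalates to a human and does not loop.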

The reversal is recorded as a first-class decision whose linkage fields make the causal relation explicit, rather than implied by chain order:

// Proposed DecisionRecord linkage fields.
interface DecisionRecord {
  record_hash: string;
  prev_record_hash?: string;      // audit-chain continuity
 
  reverses_call_id?: string;      // the action call this decision reverses
  reverses_mutation_ref?: string; // the specific mutation being undone
  reverses_record_hash?: string;  // causal recovery relation, not just order
}

prev_record_hash preserves audit-chain continuity. reverses_record_hash / reverses_mutation_ref capture that this decision causally reverses that one — a distinction a chain hash alone cannot express.

Reversal is time-bounded. If reversal_expires_at has passed, there is no safe automated recovery. The loop does not silently give up; it transitions the mutation to manual_resolution_required and escalates.
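Putting the window check and the recovery approval mode together, the decision the loop makes for one mutation could be sketched as a pure function. `RecoverableMutation` merges fields that the proposal splits across MutationRef and ReversalSpec, and `planRecovery` and the `RecoveryPlan` variants are hypothetical. Treating an undeclared approval mode as "human" is an assumption, chosen to match the cautious default in the mapping function above.

```typescript
interface RecoverableMutation {
  mutation_id: string;
  reversal_capability?: string;
  reversal_expires_at?: string; // ISO-8601 timestamp
  recovery_approval_mode?: "auto" | "human" | "dual_control";
}

type RecoveryPlan =
  | { kind: "run_compensation"; capability: string }
  | { kind: "request_approval"; capability: string; mode: "human" | "dual_control" }
  | { kind: "manual_resolution_required"; reason: string };

function planRecovery(m: RecoverableMutation, now: Date): RecoveryPlan {
  if (!m.reversal_capability) {
    return { kind: "manual_resolution_required", reason: "no reversal handle bound" };
  }
  if (m.reversal_expires_at !== undefined) {
    const expires = Date.parse(m.reversal_expires_at);
    // An unparseable expiry is treated as expired: never guess a window.
    if (isNaN(expires) || expires <= now.getTime()) {
      return { kind: "manual_resolution_required", reason: "reversal window expired" };
    }
  }
  const mode = m.recovery_approval_mode ?? "human"; // assumed cautious default
  if (mode === "auto") {
    return { kind: "run_compensation", capability: m.reversal_capability };
  }
  return { kind: "request_approval", capability: m.reversal_capability, mode };
}
```

Every path returns an explicit plan, so there is no branch in which the loop gives up without saying so.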


8. The irreversibility budget

The autonomy budget bounds how much uncertainty, cost, and risk a run may consume. It has a sibling the Action plane needs: a budget on how much un-undoable consequence a run may accumulate.

Call it the irreversibility budget. Every committed action debits it by an amount weighted by reversal class and blast radius — reversible actions cost ~0, compensable actions cost their recovery latency and residual harm, irreversible actions cost their full blast radius.

// V0: a single scalar. Good enough to ship a first guardrail.
interface IrreversibilityBudget {
  max_debt: number;       // ceiling of un-undoable consequence for this run
  debt_used: number;      // accumulated, mutated under lock by the runtime
}
 
function debit(b: IrreversibilityBudget, spec: ReversalSpec, value: number): "ok" | "escalate" {
  const weight = spec.class === "reversible" ? 0
               : spec.class === "compensable" ? 0.1
               : 1; // irreversible
  if (b.debt_used + weight * value > b.max_debt) return "escalate";
  b.debt_used += weight * value;
  return "ok";
}

A single scalar is the V0. Real systems quickly need a vector budget, because a thousand rupees of refund exposure is not interchangeable with a thousand customer emails or one irreversible data disclosure. The dimensions that behave differently — and therefore need their own ceilings — are at least:

| Budget dimension | Bounds | Example ceiling |
| --- | --- | --- |
| Money | Value moved/committed per run | ₹5,000 |
| User communication | Customer-visible messages sent | 1 per run |
| External commitments | Bookings, orders, contracts placed | 1 per run |
| Data exposure | Records/fields disclosed externally | 0 by default |
| Legal / regulatory | Filings, attestations, irreversible state changes | human-only |

When a run would exceed any ceiling, the runtime does not deny the work outright; it escalates to a human, exactly as the autonomy budget does when confidence runs out. The budget reframes the deployment question from the binary “is this agent allowed to act?” to the graduated “how much irreversible consequence of each kind has this agent earned the right to commit before a human must step in?”
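A vector version of the V0 scalar budget could be sketched as follows. The dimension names are adapted from the table, `VectorBudget` and `debitVector` are hypothetical, and the legal / regulatory row is left out because the table marks it human-only rather than giving a numeric ceiling.

```typescript
type BudgetDimension =
  | "money"
  | "user_communication"
  | "external_commitments"
  | "data_exposure";

type VectorBudget = Record<BudgetDimension, { ceiling: number; used: number }>;

// Debit several dimensions at once; escalate if ANY ceiling would be exceeded.
function debitVector(
  budget: VectorBudget,
  debits: Partial<Record<BudgetDimension, number>>,
): "ok" | "escalate" {
  const dims = Object.keys(debits) as BudgetDimension[];
  // Check every dimension first so a rejected action leaves the budget untouched.
  for (const d of dims) {
    if (budget[d].used + (debits[d] ?? 0) > budget[d].ceiling) return "escalate";
  }
  for (const d of dims) {
    budget[d].used += debits[d] ?? 0;
  }
  return "ok";
}
```

Checking all dimensions before mutating any of them keeps the debit atomic: an action that fits the money ceiling but not the communication ceiling consumes neither.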


9. Worked example: the drifted refund

Take the exact scenario the approval-gates work leaves unresolved. A support agent is asked to refund an order. Evidence at decision time: order not shipped. Policy: refunds on unshipped orders are pre-approved up to ₹5,000. The refund is admissible, authorized, and executed. Prevention did its job perfectly.

Ninety seconds later the warehouse system reports the order shipped. The assumption behind the refund is now false. Here is what each architecture does next.

Prevention-only system. Nothing. The refund was valid when it ran; there is no loop watching for the assumption to break. The discrepancy surfaces days later in reconciliation, by which point the customer has the goods and the money.

Reversibility-aware system. The refund capability is registered compensable, with compensation_capability: "adp_payments.reverse_refund", recovery_approval_mode: "human" (reversing a refund touches the customer, so it is not auto-run), a reversal_window_ms matching the PSP’s settlement cutoff, and a blast-radius cap of ₹5,000/run. Its execution bound a typed MutationRef with assumption_refs: ["order_not_shipped"] and the PSP external_transaction_ref. The shipment event invalidates the order_not_shipped assumption recorded in the DecisionRecord. The recovery loop fires, checks the reversal is still in window, resolves the bound compensation, and — because recovery_approval_mode is human — surfaces it to an operator with both facts rather than silently reversing. Direct monetary exposure was capped at ₹5,000 the entire time (support, reconciliation, and reputational cost are separate), and had the window already closed, the mutation would have moved to manual_resolution_required instead.
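The trigger step in this example, a shipment event invalidating the order_not_shipped assumption, can be sketched as a lookup over tracked mutations. `TrackedMutation`, `WorldEvent`, and `affectedMutations` are hypothetical names, and how an event is mapped to the assumptions it contradicts is assumed to be decided elsewhere.

```typescript
interface TrackedMutation {
  mutation_id: string;
  subject_ref: string;
  assumption_refs: string[];
}

interface WorldEvent {
  subject_ref: string;
  invalidates: string[]; // assumption ids this event contradicts
}

// Mutations on the same subject that relied on an assumption the event breaks.
function affectedMutations(
  event: WorldEvent,
  tracked: TrackedMutation[],
): TrackedMutation[] {
  return tracked.filter(
    (m) =>
      m.subject_ref === event.subject_ref &&
      m.assumption_refs.some((a) => event.invalidates.indexOf(a) !== -1),
  );
}

// Usage: the refund relied on order_not_shipped; the label change did not.
const trackedMutations: TrackedMutation[] = [
  { mutation_id: "mut_refund_1", subject_ref: "order_123", assumption_refs: ["order_not_shipped"] },
  { mutation_id: "mut_label_7", subject_ref: "order_123", assumption_refs: ["address_verified"] },
];
const shippedEvent: WorldEvent = { subject_ref: "order_123", invalidates: ["order_not_shipped"] };
const toRecover = affectedMutations(shippedEvent, trackedMutations); // only mut_refund_1
```

Each returned mutation would then go through the window and approval checks before anything is reversed.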

Same model. Same prompt. Same prevention stack. The only difference is that one system was built to survive being wrong.


10. Failure modes and countermeasures

| Failure mode | Example | Reversibility countermeasure |
| --- | --- | --- |
| Silent irreversibility | A new capability ships side-effecting but with no reversal class declared | Default-deny registration: side effects without a ReversalSpec fail to register |
| Phantom undo | “Reversal” capability exists but does not actually restore state | verification.success_condition checked by observation, like any other action |
| Compensation storm | A buggy recovery loop fires reversals repeatedly | Idempotency key derived from the mutation; compensate exactly once |
| Recovery bypass | An “undo” path that skips policy | Reversal is a governed ToolCall under recovery_approval_mode, with its own decision |
| Over-privileged recovery | Auto-reversing a customer-visible action | Recovery declares its own approval mode; sensitive reversals route to human/dual-control |
| Expired-window recovery | Automated refund reversal after settlement closed | reversal_expires_at; transition to manual_resolution_required |
| Unbounded blast radius | Compensable, but one run can move ₹5,00,000 | Vector blast-radius caps; debit the irreversibility budget |
| Orphaned mutation | Effect committed, no reversal handle retained | Bind a typed MutationRef at execution, not reconstructed from logs |
| Irreversible-by-default deletes | Hard delete of the only copy | Soft delete + tombstone TTL; prior-value capture before overwrite |

11. Reversibility readiness checklist

Run this as a design review for any agent that causes external effects. A strong prevention stack should not be allowed to pass it alone.

| Area | Must hold |
| --- | --- |
| Classification | Every side-effecting capability declares a reversal class |
| Reversal binding | Compensable/reversible calls bind a typed MutationRef with reversal args and assumption refs |
| Lifecycle | Mutations track state from pending_commit through compensated / manual_resolution_required |
| Expiry | Reversal handles carry windows; expired handles escalate instead of failing silently |
| Irreversible handling | Irreversible capabilities have a commit hold or a mandatory human gate |
| Recovery loop | A post-execution loop watches invalidated assumption_refs and runs in-window compensations |
| Governed reversal | Compensations run under their own recovery_approval_mode, not the original’s |
| Verification | Reversals are confirmed by success_condition, not assumed from a 200 OK |
| Blast radius | Per-run vector caps on count, value, subjects, and scope |
| Irreversibility budget | The run accrues un-undoable-consequence debt per dimension and escalates at the ceiling |

12. How ContextOS models reversibility

ContextOS already owns the surfaces reversibility needs; this is an extension of the contract, not a new subsystem.

  • The Adapter Mesh is where a capability declares approval_mode today; the ReversalSpec of §4 belongs in the same registration, so reversal class, recovery approval, and expiry are surfaced into the Context Pack alongside the tool schema.
  • The reference ToolResult.mutation_refs field is the binding point; promoting it from opaque string breadcrumbs to resolvable MutationRef pointers turns the audit trail into an executable, time-bounded recovery control plane.
  • The Tool Gateway that re-evaluates every call at execute time is the natural place to run the recovery loop, because compensation is just another governed call — held to its own approval mode.
  • The DecisionRecord already has outcome and a prev_record_hash chain; adding reverses_mutation_ref and assumption_refs records a reversal as a first-class, causally-linked decision.
  • The irreversibility budget sits beside RunBudget and the autonomy budget as a runtime control that escalates rather than silently caps.

Consistent with how this spec evolves, the primitives borrowed here are established prior art, applied to a new surface:

  • Microsoft, Compensating Transaction pattern — compensation is application-specific, idempotent, may not restore exact state, and can require manual intervention. learn.microsoft.com
  • AWS Prescriptive Guidance, Saga pattern — a sequence of local transactions with compensating transactions that run when a step fails; complexity scales with the number of services. docs.aws.amazon.com
  • OpenAI, A Practical Guide to Building Agents — tool-risk assessment should weigh read/write access, reversibility, permissions, and financial impact, with human oversight for sensitive or irreversible actions. openai.com

The contribution is binding these to the model-driven Action plane so that “can this be undone, by when, by whom, and verified how?” becomes a declared property of every capability, checked at registration and enforced at runtime.


What this changes

For years the agent-safety frontier has been a contest to make the boundary smarter: better policy, tighter constraints, sharper injection defenses. That work is real and it is not finished. But it optimizes a single term in the expected-harm equation, and the one it optimizes can never reach zero.

Reversibility optimizes the other two terms. Neither becomes zero — compensation has latency and residual harm — but both can be bounded, observed, and reduced. A fleet whose worst committed action is bounded, compensable, verified, and watched by a recovery loop is safe to grant more autonomy than a fleet that merely tries very hard not to make mistakes — because the first fleet has a plan for being wrong and the second is betting it never will be.

Build the boundary so the bad action is unlikely. Build for reversibility so that when it happens anyway, it is a footnote and not a headline.

