Skip to content
Press / to search

Failure Modes and Recovery Playbooks

Layered failure handling across the five planes — typed verdicts, replay, recovery.

Reference DesignLast reviewed: Edit on GitHub
At a glance

Reference Architecture · Cognitive Core

Failures in ContextOS are typed verdicts, not exceptions. Every plane handles its own class of failure and emits a structured envelope that the Critic and Decision Record can carry. This page is the canonical recovery map for common failure modes by plane.

Intelligence plane

  • Edge missing evidence_ref — write rejected at the Knowledge Substrate boundary; emit ingest error; do not allow into the snapshot.
  • Stale Knowledge Graph snapshot — read returns the stale snapshot_version; reconciliation on every read flips reads to the new pin.
  • Entity-resolution false positive — quarantine into the candidate review queue; never auto-merge below the confidence threshold.
  • Embedding drift — re-embed with the pinned model version against the current snapshot; backfill index in place; alert on coverage delta.
  • Cross-tenant traversal — denied at the storage layer; emit security event; runbook escalates to security on-call.
  • Capture log unbounded — tier-by-tier compaction; alert if compaction lag exceeds budget.

Context plane

  • Pack pinned by name without version — runtime refuses at boundary; return structured error to caller.
  • Pack references missing ontology / decision spec / adapter — fail closed at compile; the Compiler does not start.
  • Bucket truncation under budget — emit budget_report.bucket_truncations[bucket] = true explicitly; the Planner must handle.
  • Enforced JsonLogic rule throws on malformed expression — fail closed at compile, emit policy_eval_error, and block governed action. Non-security scoring rules may be marked non_enforcing and skipped only when the policy bundle explicitly declares that behavior.
  • Token estimator drift between Compiler and runtime tokenizer — reconcile at the edge; alert if drift exceeds tolerance.

Decision plane

  • Plan contains disallowed tools or missing approvals — Critic emits replan at verify; capped by max_replan_attempts.
  • Loop guard trip — repeated identical tool calls or no-progress reflection short-circuits with structured loop_detected reason; routed back to Planner with reason populated.
  • Run Budget exhausted — Critic emits escalate; Decision Record status = ESCALATED; session checkpoint persisted for resume.
  • Subagent lane mutates parent effects — invariant violation; lane terminated; emit security event.
  • Background session resumes against a different pack version — refuse with pack_version_mismatch; require explicit migration.

Action plane

  • Tool schema drift between registered and live endpoint — Tool Gateway returns failed with error_code = VALIDATION_ERROR; Executor returns control to Planner under re-plan budget.
  • Approval gate timeoutToolResultEnvelope.status = paused; session is resumable by session_id within the gate TTL.
  • Approval gate deniedDecision Record status = REJECTED; no side effect performed; full audit retained.
  • Idempotency cache miss after retry — adapter sees a duplicate; mitigated by adapter-side dedup contract; runbook reconciles via reversal_token if available.
  • Egress rate limit reached — Tool Gateway emits failed with error_code = RATE_LIMITED; Critic decides between retry-with-backoff and escalate.
  • Cached read-only alias serves stale capability after a policy change — alias cache invalidated on bundle promotion; if missed, the alias’s first call re-validates against the live registry.

Trust plane

  • Policy decision cache stale after bundle promotion — content-addressed bundle hashing forces fresh evaluation.
  • Trace gap because a custom adapter omitted W3C headers — caught by trace-coverage assertion; release-gate failure.
  • judge model drift after upgrade — re-baseline golden sets; pin judge model and rubric.
  • Audit store unavailable — every governed run hits this on the critical path; replicate; degraded mode is to refuse destructive actions.
  • Cross-tenant denial at read — emit security event; do not return any tenant data, even redacted.

Operator surface

  • Operator action without WebAuthn assertion — refused; runbook investigates account compromise.
  • Promotion bypass attempt — refused by RBAC; emit security event.
  • Approval-gate answer with mismatched evidence snapshot hash — refused; runbook re-takes the snapshot and re-prompts the approver.

Cross-plane recovery

  • Replay is the universal recovery substrate: pin the pack version + snapshot + recorded transcripts and re-derive the verdict offline. Use it for post-incident analysis, regression test cases, and improvement-proposal validation.
  • Frozen evidence snapshots at every approval gate make resume safe even after operator interruption.
  • The improvement loop consumes failed runs and operator corrections as typed proposals; review and ship under the same change-control process as packs and policies.