Failure Modes and Recovery Playbooks
Layered failure handling across the five planes — typed verdicts, replay, recovery.
At a glance
Reference Architecture · Cognitive Core
Failures in ContextOS are typed verdicts, not exceptions. Every plane handles its own class of failure and emits a structured envelope that the Critic and Decision Record can carry. This page is the canonical recovery map for common failure modes by plane.
Intelligence plane
- Edge missing
evidence_ref— write rejected at the Knowledge Substrate boundary; emit ingest error; do not allow into the snapshot. - Stale Knowledge Graph snapshot — read returns the stale
snapshot_version; reconciliation on every read flips reads to the new pin. - Entity-resolution false positive — quarantine into the candidate review queue; never auto-merge below the confidence threshold.
- Embedding drift — re-embed with the pinned model version against the current snapshot; backfill index in place; alert on coverage delta.
- Cross-tenant traversal — denied at the storage layer; emit security event; runbook escalates to security on-call.
- Capture log unbounded — tier-by-tier compaction; alert if compaction lag exceeds budget.
Context plane
- Pack pinned by name without version — runtime refuses at boundary; return structured error to caller.
- Pack references missing ontology / decision spec / adapter — fail closed at compile; the Compiler does not start.
- Bucket truncation under budget — emit
budget_report.bucket_truncations[bucket] = trueexplicitly; the Planner must handle. - Enforced JsonLogic rule throws on malformed expression — fail closed at compile, emit
policy_eval_error, and block governed action. Non-security scoring rules may be markednon_enforcingand skipped only when the policy bundle explicitly declares that behavior. - Token estimator drift between Compiler and runtime tokenizer — reconcile at the edge; alert if drift exceeds tolerance.
Decision plane
- Plan contains disallowed tools or missing approvals — Critic emits
replanat verify; capped bymax_replan_attempts. - Loop guard trip — repeated identical tool calls or no-progress reflection short-circuits with structured
loop_detectedreason; routed back to Planner with reason populated. - Run Budget exhausted — Critic emits
escalate; Decision Recordstatus = ESCALATED; session checkpoint persisted for resume. - Subagent lane mutates parent effects — invariant violation; lane terminated; emit security event.
- Background session resumes against a different pack version — refuse with
pack_version_mismatch; require explicit migration.
Action plane
- Tool schema drift between registered and live endpoint — Tool Gateway returns
failedwitherror_code = VALIDATION_ERROR; Executor returns control to Planner under re-plan budget. - Approval gate timeout —
ToolResultEnvelope.status = paused; session is resumable bysession_idwithin the gate TTL. - Approval gate denied — Decision Record
status = REJECTED; no side effect performed; full audit retained. - Idempotency cache miss after retry — adapter sees a duplicate; mitigated by adapter-side dedup contract; runbook reconciles via
reversal_tokenif available. - Egress rate limit reached — Tool Gateway emits
failedwitherror_code = RATE_LIMITED; Critic decides between retry-with-backoff andescalate. - Cached read-only alias serves stale capability after a policy change — alias cache invalidated on bundle promotion; if missed, the alias’s first call re-validates against the live registry.
Trust plane
- Policy decision cache stale after bundle promotion — content-addressed bundle hashing forces fresh evaluation.
- Trace gap because a custom adapter omitted W3C headers — caught by trace-coverage assertion; release-gate failure.
- judge model drift after upgrade — re-baseline golden sets; pin judge model and rubric.
- Audit store unavailable — every governed run hits this on the critical path; replicate; degraded mode is to refuse
destructiveactions. - Cross-tenant denial at read — emit security event; do not return any tenant data, even redacted.
Operator surface
- Operator action without WebAuthn assertion — refused; runbook investigates account compromise.
- Promotion bypass attempt — refused by RBAC; emit security event.
- Approval-gate answer with mismatched evidence snapshot hash — refused; runbook re-takes the snapshot and re-prompts the approver.
Cross-plane recovery
- Replay is the universal recovery substrate: pin the pack version + snapshot + recorded transcripts and re-derive the verdict offline. Use it for post-incident analysis, regression test cases, and improvement-proposal validation.
- Frozen evidence snapshots at every approval gate make resume safe even after operator interruption.
- The improvement loop consumes failed runs and operator corrections as typed proposals; review and ship under the same change-control process as packs and policies.