
Reference Architecture

The canonical five-plane ContextOS architecture: runtime topology, contracts, plane boundaries, trust invariants, execution flow, standards alignment, and implementation checklist.

At a glance

Executive summary

ContextOS is the governed decision runtime for production AI agents. It does not try to make a model deterministic. It makes the system around the model deterministic where production systems need determinism: context selection, model invocation, provider routing, tool exposure, policy enforcement, approval routing, trace propagation, evidence capture, decision records, replay, and improvement promotion.

The architecture decomposes the runtime into five planes:

Plane | Responsibility | Output artifact
Intelligence | Shared meaning: ontology, identity, graph, memory, evidence | evidence refs, promoted memory, pinned snapshots
Context | Per-request compilation of bounded state | CompiledContext
Decision | Bounded Planner / Executor / Critic loop | Plan, verdicts, DecisionRecord
Action | Governed external effects through adapters | ToolEnvelope, tool transcripts
Trust | Policy, identity, approvals, evaluation, observability, replay | controls, approvals, scorecards, replay packets

The model sits inside this architecture. It is not the architecture. Model calls cross the AI Gateway / LLM Router; external effects still cross the Tool Gateway.

The same plane boundaries also define the improvement surface. Autotune and reviewer agents may search over harness variants, but only within the fields each plane declares tunable:

Plane | Tunable in architecture | Release invariant
Intelligence | source mappings, retrieval constraints, ontology additions, memory promotion candidates | pinned snapshots, provenance, CEID stability, classification rules
Context | retrieval settings, bucket budgets, compression, prompt fragments | complete manifests, required evidence, redaction, tool eligibility
Decision | planner templates, Critic rubrics, re-plan budgets, lane limits | typed plans, approval gates, loop guards, replayable verdicts
Action | adapter retries, circuit breakers, cached read-only aliases, compatible adapter routing | schemas, approval-mode maxima, credentials, idempotency
Trust | evaluator thresholds, sampling, replay sets, rollout gates | safety/policy floors, human approval, append-only audit

This is why improvement is a control-plane concern, not a model-side feature. A candidate can be generated automatically; promotion is still a governed release.

What this architecture is for

This reference is written for platform teams building or evaluating a ContextOS runtime. It answers six questions:

Question | Architecture answer
What owns business meaning? | Intelligence plane. Ontology, CEIDs, knowledge snapshots, and promoted memory.
What decides what the model sees? | Context plane. Context Pack Compiler, manifests, runtime controls, and budget report.
What decides what happens next? | Decision plane. Planner proposes, Critic verifies, Executor runs approved steps.
What calls model providers? | AI Gateway / LLM Router. Provider-neutral calls, governed route selection, fallback, and token/cost telemetry.
What touches external systems? | Action plane. Every effect crosses the Tool Gateway.
What makes this governable? | Trust plane. Policy, identity, approval gates, scorecards, traces, replay, and release gates.

This page is not a tutorial. For a step-by-step MVP, start with Quickstart. For one request end-to-end, read How It Works.

Architectural thesis

Most failed agent architectures collapse four concerns into one prompt: context, reasoning, tools, and governance. That works for demos and fails in production because the organization cannot answer:

  • Which facts did the agent rely on?
  • Which policies were evaluated?
  • Which tools were exposed, and why?
  • Which human or system identity authorized the action?
  • What exactly would happen if we replay the run?
  • Which change caused a regression?

ContextOS separates those concerns into typed planes and turns every important boundary into a contract.

High-level architecture

                         TRUST PLANE
 Policy Engine | Identity | Approvals | Evaluators | OTEL | Replay | Improvement
--------------------------------------------------------------------------------
                         ACTION PLANE
       Tool Gateway | MCP | A2A | OpenAPI | custom adapters | idempotency
--------------------------------------------------------------------------------
                         DECISION PLANE
       Planner | Executor | Critic | AI Gateway | LLM Router | sessions | Decision Catalog
--------------------------------------------------------------------------------
                         CONTEXT PLANE
       Context Pack | Compiler | token budgets | manifests | runtime controls
--------------------------------------------------------------------------------
                       INTELLIGENCE PLANE
       Ontology | Identity Layer | Knowledge Graph | GraphRAG | Memory Fabric

The planes are stacked by dependency direction. Higher planes may constrain lower-plane behavior; lower planes must not bypass higher-plane controls. For example, an adapter cannot self-authorize a destructive action, and a Planner cannot introduce a tool that the Compiler did not surface.

The three ledgers

A production agent run should leave three ledgers. If any ledger is missing, the run is not auditable.

Ledger | What it records | Primary owner
Context ledger | Pack version, graph snapshot, retrieved evidence, memory recall, policy and tool manifests, budget truncations | Context plane
Effect ledger | Tool calls, tool results, credentials used, approval gates, idempotency keys, side-effect status | Action + Trust planes
Decision ledger | DecisionSpec binding, outcome, evidence refs, policy decisions, approvals, scorecard, replay pointer | Decision + Trust planes

The DecisionRecord indexes all three. It is the durable audit artifact, not a decorative log line.
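
The auditability rule can be sketched as a simple presence check. This is a minimal illustration, assuming the record keeps one reference per ledger; `LedgerIndex` and its field names are hypothetical and not part of the ContextOS contracts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerIndex:
    """Hypothetical index a DecisionRecord might keep over the three ledgers."""
    context_ledger_ref: Optional[str] = None
    effect_ledger_ref: Optional[str] = None
    decision_ledger_ref: Optional[str] = None

    def missing_ledgers(self) -> list[str]:
        # Report which ledger refs are absent, in a stable order.
        names = ("context_ledger_ref", "effect_ledger_ref", "decision_ledger_ref")
        return [n for n in names if getattr(self, n) is None]

    def is_auditable(self) -> bool:
        # A run is auditable only when all three ledgers are present.
        return not self.missing_ledgers()


partial = LedgerIndex(context_ledger_ref="ctx:run_1", decision_ledger_ref="dec:run_1")
print(partial.is_auditable(), partial.missing_ledgers())
```

A release gate could refuse to close a run while `missing_ledgers()` is non-empty, rather than discovering the gap during an audit.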

Core invariants

These invariants are more important than any individual component implementation.

Invariant | Production implication
Context is compiled, not hand-assembled. | Prompt text is an output of the Compiler, not the runtime contract.
Model calls cross the AI Gateway. | Provider choice, fallback, redaction, residency, token/cost telemetry, and route audit stay outside Planner and Critic code.
Tools are surfaced, not discovered ad hoc by the model. | The Tool Gateway only accepts tools present in the tool_manifest.
Policy is evaluated outside agent code. | The model may propose; the boundary decides.
Evidence precedes governed action. | required_evidence must resolve before network, delegated, or destructive effects.
Identity is dual. | Human delegation and agent workload identity travel together.
Budgets are enforced by the runtime. | Token, cost, wall-clock, tool-call, and replan limits are typed controls.
Memory is promoted before reuse. | Captured observations cannot silently become future context.
Replay is designed in. | Pack, request, policy, snapshot, model profile, route decisions, tool transcripts, and evaluator set are pinned.
Improvement is gated. | Corrections become proposals that pass replay and review before promotion.
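
The "evidence precedes governed action" invariant can be sketched as a pre-execution gate. This is an illustration under stated assumptions: the function names are hypothetical, and evidence is modeled as plain string ids that either have or have not resolved to an evidence ref.

```python
# Approval modes that the invariant gates on resolved evidence.
GOVERNED_MODES = {"network", "delegated", "destructive"}


def unresolved_evidence(required_evidence: list[str], resolved_refs: set[str]) -> list[str]:
    """Return required evidence ids that have not resolved to an evidence ref."""
    return [e for e in required_evidence if e not in resolved_refs]


def may_execute(approval_mode: str, required_evidence: list[str], resolved_refs: set[str]) -> bool:
    # read_only and local_write are not gated on evidence in this sketch.
    if approval_mode not in GOVERNED_MODES:
        return True
    return not unresolved_evidence(required_evidence, resolved_refs)


print(may_execute("destructive", ["order_record", "refund_policy"], {"order_record"}))
```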

Cross-cutting contracts

RunContext
  run_id
  trace_id
  session_id
  tenant_id
  user.delegation
  agent.workload_identity
  intent
  locale
  safety_mode
  run_budget
 
RunBudget
  total_tokens
  bucket_tokens{business,policy,tool,evidence,memory,session}
  max_tool_calls
  max_replan_attempts
  wall_clock_ms
  max_cost_cents
  atomic_usage{tokens,tool_calls,latency_ms,cost_cents}
 
ApprovalMode
  read_only | local_write | network | delegated | destructive
 
ContextPack
  contract_meta
  pack_meta
  intelligence_refs
  business_context
  policy_layer
  tooling_layer
  decision_layer
  memory_layer
  evaluation_layer
  tone_and_comms
 
CompiledContext
  compiled_prompt
  manifests{policy,tool,evidence}
  runtime_controls{must_refuse,must_escalate,approval_gates_active,redaction_rules_active}
  budget_report
  context_ledger
 
ToolEnvelope
  ToolCallEnvelope{tool_call_id,run_id,capability_id,args,trace_id,idempotency_key,evidence_refs}
  ToolResultEnvelope{tool_call_id,capability_id,status,output,error,citations,mutations,policy_decision_id,latency_ms}
 
DecisionRecord
  record_id
  decision_key
  decision_version
  status
  actor
  subject_ids
  outputs
  evidence_refs
  policy_decisions
  approvals
  controls_active
  budget_usage
  trace_id
  replay_id

Contracts move across planes. Components do not reach into each other’s private state.
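
A RunBudget is only a control if something enforces it. The sketch below shows one way a runtime guard might charge usage against typed limits. It covers a subset of the RunBudget fields; `BudgetExhausted` and `charge` are hypothetical names, and the all-or-nothing charge is a design assumption rather than documented behavior.

```python
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a charge would exceed a typed limit."""


@dataclass
class RunBudget:
    total_tokens: int
    max_tool_calls: int
    max_cost_cents: int
    usage: dict = field(
        default_factory=lambda: {"tokens": 0, "tool_calls": 0, "cost_cents": 0}
    )

    def charge(self, tokens: int = 0, tool_calls: int = 0, cost_cents: int = 0) -> None:
        limits = {
            "tokens": self.total_tokens,
            "tool_calls": self.max_tool_calls,
            "cost_cents": self.max_cost_cents,
        }
        delta = {"tokens": tokens, "tool_calls": tool_calls, "cost_cents": cost_cents}
        # Check every limit before mutating, so a rejected charge leaves usage unchanged.
        for key, amount in delta.items():
            if self.usage[key] + amount > limits[key]:
                raise BudgetExhausted(key)
        for key, amount in delta.items():
            self.usage[key] += amount


budget = RunBudget(total_tokens=1000, max_tool_calls=2, max_cost_cents=50)
budget.charge(tokens=600, tool_calls=1)
```

Checking before mutating keeps the usage counters trustworthy after a refusal, which matters if the same counters are later copied into `budget_usage` on the DecisionRecord.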

Canonical execution contract

invokeAgent(request_envelope, run_context)
  -> resolve pack refs, tenant, identity, intent, safety mode
  -> compile(packs, request, run_context) -> CompiledContext
  -> bind DecisionSpec
  -> route model judgment through AI Gateway / LLM Router
  -> loop {
       planner(CompiledContext)         -> Plan
       critic.verify(Plan)              -> ok | replan | reject | escalate
       executor(Plan, ToolGateway)      -> step_results + evidence_refs
       critic.score(step_results)       -> accept | retry | replan | escalate
       consolidate(effects, evidence)   -> memory_proposals
     }
  -> emit DecisionRecord
  -> emit replay packet and scorecard

Every component is either a participant in this loop, a registry consulted by a participant, or an operator surface over the artifacts the loop emits.
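
The loop in the execution contract can be sketched with plain callables standing in for the Planner, Critic, and Executor. This is a simplified illustration, not the runtime's implementation: it treats a `retry` score the same as `replan`, omits memory consolidation, and uses a hypothetical `budget_exhausted` status for the terminal budget verdict.

```python
from typing import Callable


def run_loop(
    compiled_context: dict,
    planner: Callable[[dict], dict],
    verify: Callable[[dict], str],
    execute: Callable[[dict], list],
    score: Callable[[list], str],
    max_replan_attempts: int,
) -> dict:
    """Bounded Planner / Critic / Executor loop; returns a minimal record."""
    attempts = 0
    while attempts <= max_replan_attempts:
        plan = planner(compiled_context)
        verdict = verify(plan)
        if verdict in ("reject", "escalate"):
            return {"status": verdict, "attempts": attempts}
        if verdict == "ok":
            # Only a verified plan reaches the executor.
            outcome = score(execute(plan))
            if outcome in ("accept", "escalate"):
                return {"status": outcome, "attempts": attempts}
        # Reaching here means the verdict was "replan" or the score asked for another pass.
        attempts += 1
    return {"status": "budget_exhausted", "attempts": attempts}


verdicts = iter(["replan", "ok"])
result = run_loop(
    {}, lambda c: {"steps": []}, lambda p: next(verdicts),
    lambda p: ["done"], lambda r: "accept", max_replan_attempts=2,
)
print(result)
```

The loop guard lives in the runtime, not in the planner, so a planner that keeps proposing rejected plans still terminates.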

Runtime topology

ContextOS separates authoring and control concerns from hot-path execution.

Layer | Components | Writes | Reads
Control plane | Pack registry, policy bundle registry, decision catalog, adapter registry, evaluator registry, approval configuration | signed versions, rollout state, kill switches | runtime config, release gates
Runtime plane | Conversation manager, Compiler, Orchestrator, Planner, Executor, Critic, AI Gateway, LLM Router, Tool Gateway | run state, route decisions, tool transcripts, DecisionRecords | active pack refs, policies, model profiles, routing rules, tools, graph snapshots
Intelligence substrate | ontology service, entity resolver, graph store, retrieval service, memory fabric | snapshots, evidence refs, promoted memories | compile-time evidence and recall
Trust and ops plane | policy engine, approval queue, trace collector, scorecard service, replay harness, improvement queue | approvals, scorecards, incidents, proposals | traces, DecisionRecords, run artifacts

Deployment rule

Do not deploy the Compiler, AI Gateway, Tool Gateway, and Policy Engine as optional libraries inside agent code. They are platform services or platform-controlled modules because they enforce the boundary. Agent code may call them; it must not replace them.

Tool Gateway names the Action-plane pattern. Tool Manager is the concrete component implementation.

Plane responsibilities

Intelligence plane

The Intelligence plane owns durable meaning. It turns enterprise data into stable identities, typed relationships, evidence refs, and memory that can be safely reused.

Primitive | Contract | Source
Ontology | versioned entity and relationship schema | Ontology
Identity Layer | CEIDs for audit, SIDs for ML features, workload identity for agents | Identity Layer
Knowledge Graph | evidence-bound graph, snapshot pinning, GraphRAG retrieval | Knowledge Graph
Memory Fabric | capture -> candidate -> review -> promoted memory | Memory, Memory Fabric

Owns:

  • canonical entity identity,
  • evidence provenance,
  • knowledge snapshots,
  • memory promotion state.

Does not own:

  • deciding which facts enter a specific prompt,
  • authorizing external actions,
  • final decision outcomes.
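
The Memory Fabric contract (capture -> candidate -> review -> promoted) reads naturally as a small state machine. The sketch below assumes forward-only, single-step transitions and does not model a rejection path; `MemoryState`, `advance`, and `usable_as_context` are hypothetical names.

```python
from enum import Enum


class MemoryState(Enum):
    CAPTURED = "capture"
    CANDIDATE = "candidate"
    IN_REVIEW = "review"
    PROMOTED = "promoted"


# Only forward, single-step transitions are allowed in this sketch.
_NEXT = {
    MemoryState.CAPTURED: MemoryState.CANDIDATE,
    MemoryState.CANDIDATE: MemoryState.IN_REVIEW,
    MemoryState.IN_REVIEW: MemoryState.PROMOTED,
}


def advance(state: MemoryState) -> MemoryState:
    """Move a memory item one step along the promotion workflow."""
    if state not in _NEXT:
        raise ValueError(f"{state.name} is terminal")
    return _NEXT[state]


def usable_as_context(state: MemoryState) -> bool:
    # Only promoted memory may be recalled into a future compile.
    return state is MemoryState.PROMOTED


state = advance(advance(MemoryState.CAPTURED))
print(state, usable_as_context(state))
```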

Context plane

The Context plane owns per-request compilation. It converts a pinned Context Pack plus RunContext into a CompiledContext envelope.

Primitive | Contract | Source
Context Pack | versioned declarative contract for a workflow | Context Pack
ContextPackCompiler | deterministic compile pipeline | Cognitive Core
Token Budget Allocator | budget allocation and truncation report | Cognitive Core
Runtime Controls | active refusals, escalations, approval gates, redaction rules | API Contracts

Owns:

  • policy/tool/evidence manifests,
  • prompt assembly,
  • context budget accounting,
  • truncation visibility.

Does not own:

  • live tool execution,
  • approval decisions,
  • memory promotion.
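
Budget accounting with visible truncation can be sketched as a deterministic first-fit pass over each bucket. This is one possible allocation policy, chosen for illustration; the source does not specify the allocator's algorithm, and the `allocate` function and its input shape are assumptions.

```python
def allocate(
    bucket_budgets: dict[str, int],
    items: dict[str, list[tuple[str, int]]],
) -> dict:
    """First-fit per bucket: keep each item, in order, if it still fits the bucket budget."""
    kept, report = {}, {}
    for bucket, budget in bucket_budgets.items():
        used, kept_ids, dropped_ids = 0, [], []
        for item_id, tokens in items.get(bucket, []):
            if used + tokens <= budget:
                kept_ids.append(item_id)
                used += tokens
            else:
                # Dropped items are recorded, never silently discarded.
                dropped_ids.append(item_id)
        kept[bucket] = kept_ids
        report[bucket] = {"budget": budget, "used": used, "truncated": dropped_ids}
    return {"kept": kept, "budget_report": report}


out = allocate(
    {"evidence": 100, "memory": 50},
    {"evidence": [("e1", 60), ("e2", 50), ("e3", 30)], "memory": [("m1", 80)]},
)
print(out["budget_report"])
```

Because the pass depends only on its inputs and their order, the same pinned inputs produce the same manifest and the same truncation report on replay.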

Decision plane

The Decision plane owns the bounded loop. It turns CompiledContext into a plan, verifies it, executes approved steps, scores the result, and emits a typed decision.

Primitive | Contract | Source
Planner / Executor / Critic | plan, verify, execute, score | Orchestration
Subagent lanes | isolated sub-runs with independent budgets | Orchestration
Background sessions | resumable durable execution | Orchestration
AI Gateway / LLM Router | provider-neutral model invocation, route selection, fallback, route audit | AI Gateway and LLM Router
Decision Catalog | DecisionSpec registry and decision binding | Decision Catalog
Decision Record Store | replayable records, evidence refs, approvals, controls, lineage, trace ids | Decision Record
Intent-Task Catalog | intent taxonomy and task templates | Intent-Task Catalog

Owns:

  • plan structure,
  • Critic verdicts,
  • loop control,
  • terminal DecisionRecord emission.

Does not own:

  • direct API calls,
  • policy truth,
  • graph mutation.

Action plane

The Action plane owns external effects. It converts tool intents into validated, authorized, traced calls.

Primitive | Contract | Source
Tool Gateway pattern | policy-bound tool execution boundary | Adapter Mesh
Tool Manager | concrete Tool Gateway implementation | Tool Manager
Adapter Registry | capabilities, schemas, auth mode, approval mode | Adapter Mesh
MCP / A2A / OpenAPI / custom adapters | protocol adapters behind one envelope | Adapter Mesh
Idempotency | write-class calls carry stable idempotency keys | Adapter Mesh
Approval gates | propose -> approve -> execute | Governance

Owns:

  • schema validation for tool args and results,
  • credential exchange,
  • tool transcript capture,
  • side-effect idempotency.

Does not own:

  • deciding that a risky action is allowed,
  • inventing capabilities outside the registry,
  • storing final business decisions.
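
A registry-backed pre-flight check for the Action plane might look like the sketch below. Several things here are assumptions: the risk ordering of the five approval modes, the rule that anything above `read_only` counts as write-class, the capability names, and the error strings.

```python
from enum import IntEnum
from typing import Optional


class ApprovalMode(IntEnum):
    # Ordering from least to most risky is an assumption of this sketch.
    READ_ONLY = 0
    LOCAL_WRITE = 1
    NETWORK = 2
    DELEGATED = 3
    DESTRUCTIVE = 4


# Hypothetical registry: capability id -> maximum approval mode it may run under.
REGISTRY = {
    "orders.lookup": ApprovalMode.READ_ONLY,
    "payments.refund": ApprovalMode.DESTRUCTIVE,
}


def validate_call(
    capability_id: str,
    requested: ApprovalMode,
    idempotency_key: Optional[str] = None,
) -> list[str]:
    """Return a list of violations; an empty list means the call may proceed to policy."""
    if capability_id not in REGISTRY:
        return ["unknown_capability"]
    errors = []
    if requested > REGISTRY[capability_id]:
        errors.append("exceeds_approval_mode_maximum")
    if requested > ApprovalMode.READ_ONLY and not idempotency_key:
        errors.append("missing_idempotency_key")
    return errors


print(validate_call("orders.lookup", ApprovalMode.NETWORK))
```

Passing this check does not authorize the call. It only establishes that the call is well-formed enough for the Trust plane to decide on.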

Trust plane

The Trust plane owns control over the other four planes. It makes the runtime governable.

Primitive | Contract | Source
Policy Engine | deterministic policy decisions outside model code | Governance
Approval modes | five-tier action-risk taxonomy | Governance
Evaluators | policy, utility, latency, safety, economics | Evaluation and Observability
Trace propagation | W3C trace context and OTEL spans | Evaluation and Observability
Replay Harness | re-derive verdicts from pinned artifacts | Evaluation and Observability
Improvement Loop | insights, strategy proposals, feedback, autotune | Improvement Loop

Owns:

  • policy decisions,
  • approval state,
  • scorecards and release gates,
  • trace and replay requirements,
  • promotion of improvement proposals.

Does not own:

  • arbitrary business logic hidden in prompts,
  • unreviewed automatic self-modification.

Plane dependency rules

Rule | Rationale
Context may read Intelligence, but may not mutate it during compilation. | Compile stays deterministic and replayable.
Decision may read Context, but cannot add tools not present in tool_manifest. | Planning remains bounded by compiled state.
Decision may call model providers only through AI Gateway / LLM Router. | Provider drift, fallback, cost, residency, and route audit stay governed.
Action may execute tools only through Tool Gateway. | External effects stay governed and traced.
Trust may constrain all planes. | Policy, identity, approvals, and evaluation are cross-cutting controls.
Intelligence writes happen through promotion workflows. | Memory and graph state cannot be poisoned by a single run.
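
The tool_manifest rule is the easiest of these to check mechanically. A minimal sketch, assuming each plan step carries a `capability_id` and the manifest is a set of surfaced capability ids (both shapes are assumptions, as is the function name):

```python
def tools_outside_manifest(plan_steps: list[dict], tool_manifest: set[str]) -> list[str]:
    """Return capability ids used by the plan that the Compiler did not surface."""
    seen, offending = set(), []
    for step in plan_steps:
        cap = step["capability_id"]
        if cap not in tool_manifest and cap not in seen:
            # Report each offending capability once, in first-use order.
            offending.append(cap)
            seen.add(cap)
    return offending


plan = [
    {"capability_id": "orders.lookup"},
    {"capability_id": "shell.exec"},
    {"capability_id": "shell.exec"},
]
print(tools_outside_manifest(plan, {"orders.lookup", "payments.refund"}))
```

Running the same check in both the Critic and the Tool Gateway gives two independent chances to reject a plan that names an unsurfaced tool.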

Reference flow: support refund

The refund workflow is the reference example because it exercises all production boundaries:

  1. User request enters with a RunContext.
  2. Context Pack ctxpack.support@x.y.z and graph snapshot are pinned.
  3. Compiler emits CompiledContext with policy, tool, and evidence manifests.
  4. Planner proposes lookup, policy eval, and refund steps.
  5. Critic verifies tool availability, args, approval mode, and required evidence.
  6. Tool Gateway executes read tools and freezes evidence for the risky write.
  7. Approval gate authorizes or denies the destructive refund.
  8. Executor calls the payment adapter with idempotency key and trace context.
  9. Critic scores the completed run.
  10. Runtime emits DecisionRecord, replay packet, and memory proposals.

For the complete transcript, see Workflow Examples and How It Works.
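
Step 8 depends on the idempotency key being stable across retries. One way to get that property is to derive the key from the call's identity, as in the sketch below. The derivation is an assumption for illustration; the source does not say how ContextOS computes idempotency keys, and the argument values are made up.

```python
import hashlib
import json


def idempotency_key(run_id: str, capability_id: str, args: dict) -> str:
    """Derive a stable key: the same run, capability, and args give the same key."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(
        {"run_id": run_id, "capability_id": capability_id, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = idempotency_key("run_1", "payments.refund", {"order": "ord_881", "amount_cents": 4200})
retry = idempotency_key("run_1", "payments.refund", {"amount_cents": 4200, "order": "ord_881"})
print(first == retry)
```

A retried refund then presents the same key to the payment adapter, which can deduplicate instead of refunding twice.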

Trust architecture

Tenant boundary
  storage, graph, memory, pack registry, traces, and tool credentials are tenant scoped
 
Identity boundary
  user delegation and agent workload identity are both present on every governed call
 
Policy boundary
  policy decisions are evaluated before compile exposure and before tool execution
 
Approval boundary
  network, delegated, and destructive actions can freeze evidence and wait for an approver
 
Audit boundary
  policy decisions, approvals, tool transcripts, scorecards, and traces bind to one trace_id
 
Replay boundary
  request, pack, policy, graph snapshot, model profile, route decision, tool transcripts, evaluator set, and model config are pinned

See Security and Compliance for the detailed control map.

Failure semantics

Failures must be typed. Silent fallback is a production bug.

Failure | Boundary that catches it | Required outcome
Unknown intent | Intent / Risk Classifier | reject or operator clarification
Missing required evidence | Critic verify | replan or escalate
Tool not in manifest | Critic verify / Tool Gateway | reject
Tool arg schema mismatch | Tool Gateway | protocol error and no side effect
Approval timeout | Approval Queue | escalate or denied verdict
Policy denial | Policy Engine | refuse or escalate
Budget exhaustion | RunBudget guard | terminal budget verdict
No eligible model route | AI Gateway / LLM Router | fail closed or escalate
Unsafe tool output | Critic score / output validation | retry, replan, or fail closed
Replay mismatch | Replay Harness | block promotion and open incident
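
Typed failure handling can be sketched as a lookup that refuses anything the table does not list. The snake_case keys are hypothetical encodings of the rows above, and defaulting to `fail_closed` for unlisted cases is an assumption of this sketch.

```python
# Failure type -> outcomes the failure table permits (string encodings are illustrative).
ALLOWED_OUTCOMES = {
    "unknown_intent": {"reject", "operator_clarification"},
    "missing_required_evidence": {"replan", "escalate"},
    "tool_not_in_manifest": {"reject"},
    "tool_arg_schema_mismatch": {"protocol_error"},
    "approval_timeout": {"escalate", "denied"},
    "policy_denial": {"refuse", "escalate"},
    "budget_exhaustion": {"terminal_budget_verdict"},
    "no_eligible_model_route": {"fail_closed", "escalate"},
    "unsafe_tool_output": {"retry", "replan", "fail_closed"},
    "replay_mismatch": {"block_promotion"},
}


def resolve(failure: str, proposed_outcome: str) -> str:
    """Return the proposed outcome if the table allows it, else fail closed."""
    allowed = ALLOWED_OUTCOMES.get(failure)
    if allowed is None or proposed_outcome not in allowed:
        # An untyped failure or an unlisted outcome never falls through silently.
        return "fail_closed"
    return proposed_outcome


print(resolve("policy_denial", "retry"))
```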

Observability and AgentOps

Every production run should be observable at four levels.

Level | Required signals
Trace | W3C traceparent, span hierarchy, plane and component names, parent-child tool spans
Logs | structured run events, policy decisions, approval lifecycle, errors, redactions
Metrics | latency, model token/cost use, route decisions, tool calls, approval wait time, evaluator scores, replay pass rate
Artifacts | CompiledContext, Plan, RoutingDecision, tool transcripts, DecisionRecord, replay packet, scorecard

Recommended span attributes:

contextos.run_id
contextos.session_id
contextos.tenant_id
contextos.intent
contextos.context_pack_ref
contextos.policy_bundle_ids
contextos.approval_mode_required
contextos.approval_mode_effective
contextos.decision_key
contextos.decision_record_id
contextos.replay_id

Tail sampling should force retention for runs that cross approval gates, fail evaluator thresholds, produce incidents, or affect durable business state.
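
Two pieces of this section are easy to sketch: parsing the W3C `traceparent` header and the forced-retention predicate for tail sampling. The parser handles the version `00` layout only. The flag names on the run dict are hypothetical stand-ins for the four retention conditions.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex), all lowercase.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

RETENTION_FLAGS = (
    "crossed_approval_gate",
    "failed_evaluator_threshold",
    "produced_incident",
    "affected_durable_state",
)


def parse_traceparent(header: str) -> dict:
    m = _TRACEPARENT.fullmatch(header)
    if m is None:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero ids are invalid under the W3C Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError("invalid traceparent")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def must_retain(run: dict) -> bool:
    """Force retention when any of the four conditions holds for the run."""
    return any(run.get(flag) for flag in RETENTION_FLAGS)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], must_retain({"crossed_approval_gate": True}))
```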

Standards alignment

ContextOS uses existing standards where they fit and adds only the agent-runtime contracts those standards do not define.

Concern | External standard or guidance | ContextOS use
Distributed trace identity | W3C Trace Context | trace_id, parent spans, tool spans, replay correlation
Telemetry model | OpenTelemetry and semantic conventions | spans, metrics, logs, resource and attribute conventions
Workload identity | SPIFFE | agent workload identity format and trust-domain separation
Delegation and token exchange | OAuth 2.0 Token Exchange, RFC 8693 | user delegation, actor/subject distinction, scoped credentials
HTTP API description | OpenAPI Specification | adapter schemas, operation metadata, security schemes
AI tool protocol | Model Context Protocol | one adapter class behind the Tool Gateway, with ContextOS adding policy, approval, and audit envelopes
GenAI risk taxonomy | OWASP GenAI Security Project | prompt injection, excessive agency, insecure plugin/tool design, data exposure, output handling
AI risk governance | NIST AI RMF Core | govern, map, measure, manage reflected through policy, evaluator, release, and improvement loops

Alignment does not mean delegation. MCP or OpenAPI can describe a tool. ContextOS still decides whether that tool is exposed, whether it can execute, which identity it uses, which evidence it must cite, and how the action is replayed.

Control-plane lifecycle

Lifecycle step | Required artifact | Gate
Author | Context Pack, policy bundle, DecisionSpec, adapter capability, model profile, routing rule | schema lint
Review | architecture, security, data, evaluation review | reviewer verdicts
Publish | signed immutable version | registry signature
Roll out | tenant and environment pin | release gate
Execute | run artifacts and traces | runtime guards
Evaluate | scorecard and replay packet | evaluator thresholds
Improve | proposal from feedback, incident, or autotune | replay and review before promotion
Roll back | prior pack/policy/model/tool version pin | replay determinism check

Multi-tenant isolation

Tenant isolation is not only a database filter. It applies to every artifact:

Artifact | Isolation requirement
Context Pack | tenant or environment pin; signed publisher; immutable version
Graph snapshot | tenant-scoped snapshot ref; no cross-tenant traversal without explicit policy
Memory | tenant, subject, consent, classification, and retention gates
Model profile / RoutingDecision | tenant policy, residency, capability, and retention gates
Tool credential | tenant-scoped credential exchange with short-lived tokens
Trace | tenant-scoped trace access; redaction before export
DecisionRecord | subject IDs and evidence refs must not leak across tenant boundaries
Replay packet | pinned to tenant-owned or explicitly shared artifacts

Reference contracts

Contract | Source of truth
invokeAgent, ToolCallEnvelope, ToolResultEnvelope, DecisionRecord | API Contracts
ContextPack schema | Context Pack
DecisionSpec | Decision Catalog
DecisionRecord | Decision Record
model invocation and routing decisions | AI Gateway and LLM Router
Intent and TaskTemplate | Intent-Task Catalog
memory write proposals and review queue | Memory Fabric
component-level reference pages | Component Inventory

Implementation checklist

For a new tenant or workflow, do not call the runtime production-ready until every row is true.

Area | Check
Ontology | entity types, relationship types, CEID format, and evidence refs are declared
Context Pack | pack is signed, versioned, immutable, and pinned by environment
Policy | bundle is outside agent code and evaluated before risky action
Models | model profiles and routing rules are signed, pinned, residency-aware, and replay-gated
Tools | every capability has schema, auth mode, approval mode, and idempotency behavior
Decision | every governed action binds to a DecisionSpec with required evidence
Identity | user delegation and agent workload identity are both present on tool calls
Budget | token, cost, tool-call, wall-clock, and replan limits are enforced
Memory | capture, candidate, review, promotion, consent, and contradiction checks exist
Observability | traces, logs, metrics, scorecards, and artifacts are joined by trace ID
Replay | request, pack, policy, snapshot, tools, evaluator set, and model config are pinned
Release | golden replay and evaluator thresholds gate promotion
Rollback | prior versions can be re-pinned without schema migration drama

Anti-patterns

Anti-pattern | Why it fails
Direct provider calls from planners or evaluators | bypasses routing policy, residency checks, fallback controls, token/cost telemetry, and route audit
Direct adapter calls from the model | bypasses policy, identity, approval, schema validation, and trace capture
Hand-built prompts as the source of truth | hides context selection, truncation, and runtime controls
Free-form final answers for governed actions | loses DecisionSpec binding, evidence refs, and replay
Tool descriptions trusted as policy | tool metadata can describe behavior; it cannot authorize behavior
Memory writes on every run | turns temporary observations and injected content into future context
One global agent identity | destroys attribution between user delegation and agent workload identity
Evaluation only on model quality | misses policy, safety, cost, latency, evidence, and tool-risk regressions
Prompt edits after incidents | creates unreviewed behavior drift instead of replayable proposals
Cross-tenant shared traces by default | leaks business context and evidence refs

Roadmap notes

  • Plane primitives are stable contracts; individual services may evolve.
  • New patterns should be validated against working systems before promotion into this reference.
  • Major changes follow the same change-control process as Context Packs, policy bundles, DecisionSpecs, and evaluator sets.

Appendix A: Component inventory

Component | Plane | Owner doc
Conversation Manager | Decision | components/conversation-manager
Intent / Risk Classifier | Decision | components/intent-risk-classifier
Intent-Task Catalog | Decision | implementation/intent-task-catalog
Context Pack Compiler | Context | components/context-pack-compiler
Policy Engine | Trust | components/policy-engine
Orchestrator | Decision | components/orchestrator
AI Gateway / LLM Router | Decision | reference/ai-gateway-llm-router
Tool Gateway pattern | Action | Adapter Mesh
Tool Manager | Action | components/tool-manager
Decision Catalog | Decision | implementation/decision-catalog
Knowledge Substrate | Intelligence | foundations/knowledge-graph
Memory Fabric | Intelligence | implementation/memory-fabric
Identity Layer | Intelligence + Trust | foundations/identity-layer
Evaluation Engine | Trust | components/evaluation-engine
Observability | Trust | components/observability
Admin Console | Trust | components/admin-console

Appendix B: Naming conventions

  • Planes: Intelligence, Context, Decision, Action, Trust.
  • Primitives: PascalCase (RunContext, ContextPack, CompiledContext, DecisionRecord, ToolEnvelope).
  • Enum values: snake_case (read_only, local_write, network, delegated, destructive).
  • Identifiers: <scope>:<type>:<id> when human-readable (order:ord_881, customer:cus_77).
  • Trace attributes: contextos.<plane>.<attribute> when plane-specific; contextos.run_id and contextos.decision_record_id when global.
  • Version refs: <artifact_id>@<semver> for Context Packs, policy bundles, evaluator sets, and DecisionSpecs.
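
The version-ref convention can be checked with a small parser. This sketch accepts only plain `major.minor.patch` versions, not semver pre-release or build metadata, and the allowed `artifact_id` characters are an assumption. The version number in the example is made up.

```python
import re
from typing import NamedTuple


class VersionRef(NamedTuple):
    artifact_id: str
    major: int
    minor: int
    patch: int


# artifact_id charset is an assumption; the convention only fixes the <id>@<semver> shape.
_REF = re.compile(r"([A-Za-z0-9_.\-]+)@(\d+)\.(\d+)\.(\d+)")


def parse_version_ref(ref: str) -> VersionRef:
    m = _REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not an <artifact_id>@<semver> ref: {ref!r}")
    return VersionRef(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))


ref = parse_version_ref("ctxpack.support@1.4.2")
print(ref.artifact_id, (ref.major, ref.minor, ref.patch))
```

Parsing into integers lets a rollback compare pins numerically, so `1.10.0` sorts after `1.9.0`.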