Reference Architecture

The canonical five-plane ContextOS architecture: runtime topology, contracts, plane boundaries, trust invariants, execution flow, standards alignment, and implementation checklist.

Reference Design. Last reviewed: 2026-05-12

At a glance

Executive summary

ContextOS is the governed decision runtime for production AI agents. It does not try to make a model deterministic. It makes the system around the model deterministic where production systems need determinism: context selection, model invocation, provider routing, tool exposure, policy enforcement, approval routing, trace propagation, evidence capture, decision records, replay, and improvement promotion.

The architecture decomposes the runtime into five planes:

Plane	Responsibility	Output artifact
Intelligence	Shared meaning: ontology, identity, graph, memory, evidence	evidence refs, promoted memory, pinned snapshots
Context	Per-request compilation of bounded state	`CompiledContext`
Decision	Bounded Planner / Executor / Critic loop	`Plan`, verdicts, `DecisionRecord`
Action	Governed external effects through adapters	`ToolEnvelope`, tool transcripts
Trust	Policy, identity, approvals, evaluation, observability, replay	controls, approvals, scorecards, replay packets

The model sits inside this architecture. It is not the architecture. Model calls cross the AI Gateway / LLM Router; external effects still cross the Tool Gateway.

The same plane boundaries also define the improvement surface. Autotune and reviewer agents may search over harness variants, but only within the fields each plane declares tunable:

Plane	Tunable in architecture	Release invariant
Intelligence	source mappings, retrieval constraints, ontology additions, memory promotion candidates	pinned snapshots, provenance, CEID stability, classification rules
Context	retrieval settings, bucket budgets, compression, prompt fragments	complete manifests, required evidence, redaction, tool eligibility
Decision	planner templates, Critic rubrics, re-plan budgets, lane limits	typed plans, approval gates, loop guards, replayable verdicts
Action	adapter retries, circuit breakers, cached read-only aliases, compatible adapter routing	schemas, approval-mode maxima, credentials, idempotency
Trust	evaluator thresholds, sampling, replay sets, rollout gates	safety/policy floors, human approval, append-only audit

This is why improvement is a control-plane concern, not a model-side feature. A candidate can be generated automatically; promotion is still a governed release.
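The tunable/invariant split in the table above can be checked mechanically before a candidate reaches review. The following is a minimal sketch under stated assumptions, not ContextOS code: the keys in `TUNABLE` are hypothetical field names loosely derived from the table, and `check_candidate` is an illustrative helper.

```python
# Hypothetical per-plane allowlist of tunable fields (names are assumptions
# loosely derived from the "Tunable in architecture" column).
TUNABLE = {
    "intelligence": {"source_mappings", "retrieval_constraints"},
    "context": {"retrieval_settings", "bucket_budgets", "compression", "prompt_fragments"},
    "decision": {"planner_templates", "critic_rubrics", "replan_budgets", "lane_limits"},
    "action": {"adapter_retries", "circuit_breakers"},
    "trust": {"evaluator_thresholds", "sampling", "replay_sets", "rollout_gates"},
}


def check_candidate(changes: dict[str, set[str]]) -> list[str]:
    """Return every changed field that its plane does not declare tunable."""
    violations = []
    for plane, fields in changes.items():
        allowed = TUNABLE.get(plane, set())
        for name in sorted(fields - allowed):
            violations.append(f"{plane}.{name}")
    return violations


candidate = {
    "context": {"bucket_budgets"},             # declared tunable
    "action": {"adapter_retries", "schemas"},  # "schemas" is a release invariant
}
violations = check_candidate(candidate)
```

A non-empty `violations` list would block the candidate before any replay or review effort is spent on it.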

What this architecture is for

This reference is written for platform teams building or evaluating a ContextOS runtime. It answers six questions:

Question	Architecture answer
What owns business meaning?	Intelligence plane. Ontology, CEIDs, knowledge snapshots, and promoted memory.
What decides what the model sees?	Context plane. Context Pack Compiler, manifests, runtime controls, and budget report.
What decides what happens next?	Decision plane. Planner proposes, Critic verifies, Executor runs approved steps.
What calls model providers?	AI Gateway / LLM Router. Provider-neutral calls, governed route selection, fallback, and token/cost telemetry.
What touches external systems?	Action plane. Every effect crosses the Tool Gateway.
What makes this governable?	Trust plane. Policy, identity, approval gates, scorecards, traces, replay, and release gates.

This page is not a tutorial. For a step-by-step MVP, start with Quickstart. For one request end-to-end, read How It Works.

Architectural thesis

Most failed agent architectures collapse four concerns into one prompt: context, reasoning, tools, and governance. That works for demos and fails in production because the organization cannot answer:

Which facts did the agent rely on?
Which policies were evaluated?
Which tools were exposed, and why?
Which human or system identity authorized the action?
What exactly would happen if we replay the run?
Which change caused a regression?

ContextOS separates those concerns into typed planes and turns every important boundary into a contract.

High-level architecture

                         TRUST PLANE
 Policy Engine | Identity | Approvals | Evaluators | OTEL | Replay | Improvement
--------------------------------------------------------------------------------
                         ACTION PLANE
       Tool Gateway | MCP | A2A | OpenAPI | custom adapters | idempotency
--------------------------------------------------------------------------------
                         DECISION PLANE
       Planner | Executor | Critic | AI Gateway | LLM Router | sessions | Decision Catalog
--------------------------------------------------------------------------------
                         CONTEXT PLANE
       Context Pack | Compiler | token budgets | manifests | runtime controls
--------------------------------------------------------------------------------
                       INTELLIGENCE PLANE
       Ontology | Identity Layer | Knowledge Graph | GraphRAG | Memory Fabric

The planes are stacked by dependency direction. Higher planes may constrain lower-plane behavior; lower planes must not bypass higher-plane controls. For example, an adapter cannot self-authorize a destructive action. Each plane is also bound by what the plane below it emits: a Planner cannot introduce a tool that the Compiler did not surface.

The three ledgers

A production agent run should leave three ledgers. If any ledger is missing, the run is not auditable.

Ledger	What it records	Primary owner
Context ledger	Pack version, graph snapshot, retrieved evidence, memory recall, policy and tool manifests, budget truncations	Context plane
Effect ledger	Tool calls, tool results, credentials used, approval gates, idempotency keys, side-effect status	Action + Trust planes
Decision ledger	DecisionSpec binding, outcome, evidence refs, policy decisions, approvals, scorecard, replay pointer	Decision + Trust planes

The DecisionRecord indexes all three. It is the durable audit artifact, not a decorative log line.
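One way to make "missing ledger means not auditable" testable is a completeness check on the record. This is a sketch only: the three `*_ledger_ref` fields are hypothetical names for however a DecisionRecord points at its ledgers, and are not taken from the contract listing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecordIndex:
    """Hypothetical slice of a DecisionRecord: pointers to the three ledgers."""
    record_id: str
    context_ledger_ref: Optional[str]
    effect_ledger_ref: Optional[str]
    decision_ledger_ref: Optional[str]


def missing_ledgers(record: DecisionRecordIndex) -> list[str]:
    """List the ledgers a record fails to reference; empty means auditable."""
    refs = {
        "context": record.context_ledger_ref,
        "effect": record.effect_ledger_ref,
        "decision": record.decision_ledger_ref,
    }
    return [name for name, ref in refs.items() if not ref]


# A record whose effect ledger was never written.
record = DecisionRecordIndex("rec_1", "ctx_led_1", None, "dec_led_1")
```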

Core invariants

These invariants are more important than any individual component implementation.

Invariant	Production implication
Context is compiled, not hand-assembled.	Prompt text is an output of the Compiler, not the runtime contract.
Model calls cross the AI Gateway.	Provider choice, fallback, redaction, residency, token/cost telemetry, and route audit stay outside Planner and Critic code.
Tools are surfaced, not discovered ad hoc by the model.	The Tool Gateway only accepts tools present in the `tool_manifest`.
Policy is evaluated outside agent code.	The model may propose; the boundary decides.
Evidence precedes governed action.	`required_evidence` must resolve before `network`, `delegated`, or `destructive` effects.
Identity is dual.	Human delegation and agent workload identity travel together.
Budgets are enforced by the runtime.	Token, cost, wall-clock, tool-call, and replan limits are typed controls.
Memory is promoted before reuse.	Captured observations cannot silently become future context.
Replay is designed in.	Pack, request, policy, snapshot, model profile, route decisions, tool transcripts, and evaluator set are pinned.
Improvement is gated.	Corrections become proposals that pass replay and review before promotion.
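Two of these invariants (tools are surfaced, evidence precedes governed action) reduce to a small admission check at the boundary. The sketch below is illustrative: `admit_tool_call`, its parameters, and the reason strings are assumptions, and the capability names are hypothetical.

```python
# Approval modes that require resolved evidence before any effect.
GOVERNED_MODES = {"network", "delegated", "destructive"}


def admit_tool_call(capability_id, approval_mode, tool_manifest,
                    required_evidence, resolved_evidence):
    """Return (admitted, reason); both checks run before any side effect."""
    if capability_id not in tool_manifest:
        return False, "tool_not_in_manifest"
    if approval_mode in GOVERNED_MODES:
        unresolved = set(required_evidence) - set(resolved_evidence)
        if unresolved:
            return False, "missing_required_evidence"
    return True, "ok"


# Hypothetical manifest for a support workflow.
manifest = {"orders.lookup", "payments.refund"}
```

Read-only calls skip the evidence check in this sketch because the invariant only names `network`, `delegated`, and `destructive` effects.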

Cross-cutting contracts

RunContext
  run_id
  trace_id
  session_id
  tenant_id
  user.delegation
  agent.workload_identity
  intent
  locale
  safety_mode
  run_budget
 
RunBudget
  total_tokens
  bucket_tokens{business,policy,tool,evidence,memory,session}
  max_tool_calls
  max_replan_attempts
  wall_clock_ms
  max_cost_cents
  atomic_usage{tokens,tool_calls,latency_ms,cost_cents}
 
ApprovalMode
  read_only | local_write | network | delegated | destructive
 
ContextPack
  contract_meta
  pack_meta
  intelligence_refs
  business_context
  policy_layer
  tooling_layer
  decision_layer
  memory_layer
  evaluation_layer
  tone_and_comms
 
CompiledContext
  compiled_prompt
  manifests{policy,tool,evidence}
  runtime_controls{must_refuse,must_escalate,approval_gates_active,redaction_rules_active}
  budget_report
  context_ledger
 
ToolEnvelope
  ToolCallEnvelope{tool_call_id,run_id,capability_id,args,trace_id,idempotency_key,evidence_refs}
  ToolResultEnvelope{tool_call_id,capability_id,status,output,error,citations,mutations,policy_decision_id,latency_ms}
 
DecisionRecord
  record_id
  decision_key
  decision_version
  status
  actor
  subject_ids
  outputs
  evidence_refs
  policy_decisions
  approvals
  controls_active
  budget_usage
  trace_id
  replay_id

Contracts move across planes. Components do not reach into each other’s private state.
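As an illustration of budgets as typed, runtime-enforced controls, the sketch below models a subset of `RunBudget` with an all-or-nothing `charge` operation. The class layout, the `charge` method, and the `BudgetExhausted` exception are assumptions; only the limit and usage field names come from the contract listing above.

```python
import threading
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a charge would cross a typed limit."""


@dataclass
class RunBudget:
    total_tokens: int
    max_tool_calls: int
    max_cost_cents: int
    usage: dict = field(
        default_factory=lambda: {"tokens": 0, "tool_calls": 0, "cost_cents": 0}
    )
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def charge(self, tokens: int = 0, tool_calls: int = 0, cost_cents: int = 0) -> None:
        """Record usage atomically; refuse the whole charge if any limit would be crossed."""
        limits = {
            "tokens": self.total_tokens,
            "tool_calls": self.max_tool_calls,
            "cost_cents": self.max_cost_cents,
        }
        delta = {"tokens": tokens, "tool_calls": tool_calls, "cost_cents": cost_cents}
        with self._lock:
            for key, amount in delta.items():
                if self.usage[key] + amount > limits[key]:
                    raise BudgetExhausted(key)
            for key, amount in delta.items():
                self.usage[key] += amount


budget = RunBudget(total_tokens=1000, max_tool_calls=2, max_cost_cents=50)
budget.charge(tokens=400, tool_calls=1, cost_cents=10)
budget.charge(tokens=400, tool_calls=1, cost_cents=10)
try:
    budget.charge(tool_calls=1)  # a third tool call would exceed max_tool_calls
    exhausted = None
except BudgetExhausted as exc:
    exhausted = str(exc)
```

Checking every limit before applying any increment keeps a refused charge from leaving partial usage behind.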

Canonical execution contract

invokeAgent(request_envelope, run_context)
  -> resolve pack refs, tenant, identity, intent, safety mode
  -> compile(packs, request, run_context) -> CompiledContext
  -> bind DecisionSpec
  -> route model judgment through AI Gateway / LLM Router
  -> loop {
       planner(CompiledContext)         -> Plan
       critic.verify(Plan)              -> ok | replan | reject | escalate
       executor(Plan, ToolGateway)      -> step_results + evidence_refs
       critic.score(step_results)       -> accept | retry | replan | escalate
       consolidate(effects, evidence)   -> memory_proposals
     }
  -> emit DecisionRecord
  -> emit replay packet and scorecard

Every component is either a participant in this loop, a registry consulted by a participant, or an operator surface over the artifacts the loop emits.
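The loop portion of the contract can be sketched as a plain function with injected participants. This is an assumption-laden simplification: `run_loop` and the `budget_exhausted` verdict are hypothetical, `retry` is treated the same as `replan`, and memory consolidation and DecisionRecord emission are omitted.

```python
def run_loop(compiled_context, planner, critic_verify, executor, critic_score,
             max_replan_attempts):
    """Drive plan -> verify -> execute -> score until a terminal verdict
    or until the replan budget is spent."""
    replans = 0
    while True:
        plan = planner(compiled_context)
        verdict = critic_verify(plan)
        if verdict in ("reject", "escalate"):
            return verdict
        if verdict == "ok":
            results = executor(plan)
            verdict = critic_score(results)
            if verdict in ("accept", "escalate"):
                return verdict
        # Remaining verdicts (replan, retry) consume the replan budget.
        replans += 1
        if replans > max_replan_attempts:
            return "budget_exhausted"


# Stub participants: the Critic asks for one replan, then approves.
planner_calls = []
verify_verdicts = iter(["replan", "ok"])
outcome = run_loop(
    compiled_context={},
    planner=lambda ctx: planner_calls.append(1) or "plan",
    critic_verify=lambda plan: next(verify_verdicts),
    executor=lambda plan: ["step_result"],
    critic_score=lambda results: "accept",
    max_replan_attempts=2,
)
```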

Runtime topology

ContextOS separates authoring and control concerns from hot-path execution.

Layer	Components	Writes	Reads
Control plane	Pack registry, policy bundle registry, decision catalog, adapter registry, evaluator registry, approval configuration	signed versions, rollout state, kill switches	runtime config, release gates
Runtime plane	Conversation manager, Compiler, Orchestrator, Planner, Executor, Critic, AI Gateway, LLM Router, Tool Gateway	run state, route decisions, tool transcripts, DecisionRecords	active pack refs, policies, model profiles, routing rules, tools, graph snapshots
Intelligence substrate	ontology service, entity resolver, graph store, retrieval service, memory fabric	snapshots, evidence refs, promoted memories	compile-time evidence and recall
Trust and ops plane	policy engine, approval queue, trace collector, scorecard service, replay harness, improvement queue	approvals, scorecards, incidents, proposals	traces, DecisionRecords, run artifacts

Deployment rule

Do not deploy the Compiler, AI Gateway, Tool Gateway, or Policy Engine as optional libraries inside agent code. They are platform services or platform-controlled modules because they enforce the boundary. Agent code may call them; it must not replace them.

Tool Gateway names the Action-plane pattern. Tool Manager is the concrete component implementation.

Plane responsibilities

Intelligence plane

The Intelligence plane owns durable meaning. It turns enterprise data into stable identities, typed relationships, evidence refs, and memory that can be safely reused.

Primitive	Contract	Source
Ontology	versioned entity and relationship schema	Ontology
Identity Layer	CEIDs for audit, SIDs for ML features, workload identity for agents	Identity Layer
Knowledge Graph	evidence-bound graph, snapshot pinning, GraphRAG retrieval	Knowledge Graph
Memory Fabric	capture -> candidate -> review -> promoted memory	Memory, Memory Fabric

Owns:

canonical entity identity,
evidence provenance,
knowledge snapshots,
memory promotion state.

Does not own:

deciding which facts enter a specific prompt,
authorizing external actions,
final decision outcomes.
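The Memory Fabric contract (capture -> candidate -> review -> promoted memory) behaves like a small state machine. The sketch below assumes hypothetical state names, including a `rejected` terminal state that the table does not mention.

```python
# Allowed transitions; state names are assumptions based on the contract row above.
ALLOWED = {
    "captured": {"candidate"},
    "candidate": {"in_review", "rejected"},
    "in_review": {"promoted", "rejected"},
    "promoted": set(),
    "rejected": set(),
}


def advance(state: str, target: str) -> str:
    """Move a memory item to the next state, refusing any skipped step."""
    if target not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target


def is_recallable(state: str) -> bool:
    """Only promoted memory may be recalled into future context."""
    return state == "promoted"


state = "captured"
for target in ("candidate", "in_review", "promoted"):
    state = advance(state, target)
```

Because there is no `captured -> promoted` edge, a single run cannot turn an observation into reusable memory without passing review.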

Context plane

The Context plane owns per-request compilation. It converts a pinned Context Pack plus RunContext into a CompiledContext envelope.

Primitive	Contract	Source
Context Pack	versioned declarative contract for a workflow	Context Pack
ContextPackCompiler	deterministic compile pipeline	Cognitive Core
Token Budget Allocator	budget allocation and truncation report	Cognitive Core
Runtime Controls	active refusals, escalations, approval gates, redaction rules	API Contracts

Owns:

policy/tool/evidence manifests,
prompt assembly,
context budget accounting,
truncation visibility.

Does not own:

live tool execution,
approval decisions,
memory promotion.
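Budget accounting with truncation visibility might look like the following sketch. The `allocate` function and the shape of its report are assumptions; the bucket names are a subset of `bucket_tokens` in the `RunBudget` contract.

```python
def allocate(bucket_limits: dict[str, int], requested: dict[str, int]) -> dict:
    """Cap each bucket at its limit and report every truncation instead of hiding it."""
    granted, truncations = {}, {}
    for bucket, want in requested.items():
        limit = bucket_limits.get(bucket, 0)
        granted[bucket] = min(want, limit)
        if want > limit:
            truncations[bucket] = want - limit  # tokens dropped from this bucket
    return {
        "granted": granted,
        "truncations": truncations,
        "total": sum(granted.values()),
    }


report = allocate(
    bucket_limits={"policy": 200, "evidence": 500, "memory": 100},
    requested={"policy": 150, "evidence": 800, "memory": 100},
)
```

Here the evidence bucket asked for 800 tokens against a limit of 500, so the report records 300 truncated tokens rather than silently shortening the prompt.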

Decision plane

The Decision plane owns the bounded loop. It turns CompiledContext into a plan, verifies it, executes approved steps, scores the result, and emits a typed decision.

Primitive	Contract	Source
Planner / Executor / Critic	plan, verify, execute, score	Orchestration
Subagent lanes	isolated sub-runs with independent budgets	Orchestration
Background sessions	resumable durable execution	Orchestration
AI Gateway / LLM Router	provider-neutral model invocation, route selection, fallback, route audit	AI Gateway and LLM Router
Decision Catalog	DecisionSpec registry and decision binding	Decision Catalog
Decision Record Store	replayable records, evidence refs, approvals, controls, lineage, trace ids	Decision Record
Intent-Task Catalog	intent taxonomy and task templates	Intent-Task Catalog

Owns:

plan structure,
Critic verdicts,
loop control,
terminal DecisionRecord emission.

Does not own:

direct API calls,
policy truth,
graph mutation.

Action plane

The Action plane owns external effects. It converts tool intents into validated, authorized, traced calls.

Primitive	Contract	Source
Tool Gateway pattern	policy-bound tool execution boundary	Adapter Mesh
Tool Manager	concrete Tool Gateway implementation	Tool Manager
Adapter Registry	capabilities, schemas, auth mode, approval mode	Adapter Mesh
MCP / A2A / OpenAPI / custom adapters	protocol adapters behind one envelope	Adapter Mesh
Idempotency	write-class calls carry stable idempotency keys	Adapter Mesh
Approval gates	propose -> approve -> execute	Governance

Owns:

schema validation for tool args and results,
credential exchange,
tool transcript capture,
side-effect idempotency.

Does not own:

deciding that a risky action is allowed,
inventing capabilities outside the registry,
storing final business decisions.
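Side-effect idempotency depends on a key that stays stable across retries of the same logical call. The derivation below is one possible scheme, not the documented one: hashing `run_id`, `capability_id`, and canonicalized args is an assumption, as are the `idem_` prefix, the in-memory store, and the `payments.refund` capability name.

```python
import hashlib
import json


def idempotency_key(run_id: str, capability_id: str, args: dict) -> str:
    """Derive a stable key: same run, capability, and args always yield the same key."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    material = f"{run_id}|{capability_id}|{canonical}".encode("utf-8")
    return "idem_" + hashlib.sha256(material).hexdigest()[:32]


_results: dict = {}  # in-memory stand-in for a durable idempotency store


def execute_once(key: str, effect):
    """Run the effect only the first time a key is seen; replay the stored result after."""
    if key not in _results:
        _results[key] = effect()
    return _results[key]


k1 = idempotency_key("run_1", "payments.refund", {"order": "order:ord_881", "amount_cents": 1200})
k2 = idempotency_key("run_1", "payments.refund", {"amount_cents": 1200, "order": "order:ord_881"})
k3 = idempotency_key("run_1", "payments.refund", {"order": "order:ord_881", "amount_cents": 1300})
```

Sorting keys before hashing means argument order does not change the key, while any change in argument values does.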

Trust plane

The Trust plane owns control over the other four planes. It makes the runtime governable.

Primitive	Contract	Source
Policy Engine	deterministic policy decisions outside model code	Governance
Approval modes	five-tier action-risk taxonomy	Governance
Evaluators	policy, utility, latency, safety, economics	Evaluation and Observability
Trace propagation	W3C trace context and OTEL spans	Evaluation and Observability
Replay Harness	re-derive verdicts from pinned artifacts	Evaluation and Observability
Improvement Loop	insights, strategy proposals, feedback, autotune	Improvement Loop

Owns:

policy decisions,
approval state,
scorecards and release gates,
trace and replay requirements,
promotion of improvement proposals.

Does not own:

arbitrary business logic hidden in prompts,
unreviewed automatic self-modification.

Plane dependency rules

Rule	Rationale
Context may read Intelligence, but may not mutate it during compilation.	Compile stays deterministic and replayable.
Decision may read Context, but cannot add tools not present in `tool_manifest`.	Planning remains bounded by compiled state.
Decision may call model providers only through AI Gateway / LLM Router.	Provider drift, fallback, cost, residency, and route audit stay governed.
Action may execute tools only through Tool Gateway.	External effects stay governed and traced.
Trust may constrain all planes.	Policy, identity, approvals, and evaluation are cross-cutting controls.
Intelligence writes happen through promotion workflows.	Memory and graph state cannot be poisoned by a single run.

Reference flow: support refund

The refund workflow is the reference example because it exercises all production boundaries:

User request enters with a RunContext.
Context Pack ctxpack.support@x.y.z and graph snapshot are pinned.
Compiler emits CompiledContext with policy, tool, and evidence manifests.
Planner proposes lookup, policy eval, and refund steps.
Critic verifies tool availability, args, approval mode, and required evidence.
Tool Gateway executes read tools and freezes evidence for the risky write.
Approval gate authorizes or denies the destructive refund.
Executor calls the payment adapter through the Tool Gateway with an idempotency key and trace context.
Critic scores the completed run.
Runtime emits DecisionRecord, replay packet, and memory proposals.

For the complete transcript, see Workflow Examples and How It Works.

Trust architecture

Tenant boundary
  storage, graph, memory, pack registry, traces, and tool credentials are tenant scoped
 
Identity boundary
  user delegation and agent workload identity are both present on every governed call
 
Policy boundary
  policies are evaluated before context is exposed at compile time and before tool execution
 
Approval boundary
  network, delegated, and destructive actions can freeze evidence and wait for an approver
 
Audit boundary
  policy decisions, approvals, tool transcripts, scorecards, and traces bind to one trace_id
 
Replay boundary
  request, pack, policy, graph snapshot, model profile, route decision, tool transcripts, evaluator set, and model config are pinned

See Security and Compliance for the detailed control map.
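The replay boundary can be approximated by fingerprinting the pinned artifact refs and refusing to proceed when any pin is absent. This is a sketch under assumptions: `replay_fingerprint` and the snake_case field names are hypothetical, and the pin values are placeholder strings.

```python
import hashlib
import json

# Artifact classes the replay boundary lists as pinned (key names are assumptions).
PINNED_FIELDS = (
    "request", "pack", "policy", "graph_snapshot", "model_profile",
    "route_decision", "tool_transcripts", "evaluator_set", "model_config",
)


def replay_fingerprint(pins: dict) -> str:
    """Hash the pinned refs; refuse to fingerprint a run with any unpinned artifact."""
    missing = [name for name in PINNED_FIELDS if name not in pins]
    if missing:
        raise ValueError(f"unpinned artifacts: {missing}")
    canonical = json.dumps({name: pins[name] for name in PINNED_FIELDS}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder refs, one per pinned field.
pins = {name: f"{name}-ref-1" for name in PINNED_FIELDS}
original = replay_fingerprint(pins)
drifted = replay_fingerprint({**pins, "policy": "policy-ref-2"})
```

A replay whose fingerprint differs from the recorded one has drifted in at least one pinned input, which is the condition a replay mismatch is meant to surface.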

Failure semantics

Failures must be typed. Silent fallback is a production bug.

Failure	Boundary that catches it	Required outcome
Unknown intent	Intent / Risk Classifier	`reject` or operator clarification
Missing required evidence	Critic verify	`replan` or `escalate`
Tool not in manifest	Critic verify / Tool Gateway	`reject`
Tool arg schema mismatch	Tool Gateway	protocol error and no side effect
Approval timeout	Approval Queue	`escalate` or `denied` verdict
Policy denial	Policy Engine	`refuse` or `escalate`
Budget exhaustion	RunBudget guard	terminal budget verdict
No eligible model route	AI Gateway / LLM Router	fail closed or `escalate`
Unsafe tool output	Critic score / output validation	retry, replan, or fail closed
Replay mismatch	Replay Harness	block promotion and open incident
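Typed failures can be kept honest with a lookup that has no default branch. The sketch below encodes a few rows of the table; the snake_case failure keys, the `clarify` and `fail_closed` outcome names, and the `UntypedFailure` exception are assumptions.

```python
# A subset of the failure table: failure type -> outcomes the boundary may return.
FAILURE_OUTCOMES = {
    "unknown_intent": ("reject", "clarify"),
    "missing_required_evidence": ("replan", "escalate"),
    "tool_not_in_manifest": ("reject",),
    "approval_timeout": ("escalate", "denied"),
    "policy_denial": ("refuse", "escalate"),
    "no_eligible_model_route": ("fail_closed", "escalate"),
}


class UntypedFailure(Exception):
    """Raised for a failure with no declared outcome; there is no silent fallback."""


def allowed_outcomes(failure_type: str) -> tuple:
    """Look up the outcomes permitted for a failure type."""
    try:
        return FAILURE_OUTCOMES[failure_type]
    except KeyError:
        raise UntypedFailure(failure_type) from None


def check_outcome(failure_type: str, outcome: str) -> bool:
    """True when a component's chosen outcome is one the table permits."""
    return outcome in allowed_outcomes(failure_type)
```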

Observability and AgentOps

Every production run should be observable at four levels.

Level	Required signals
Trace	W3C `traceparent`, span hierarchy, plane and component names, parent-child tool spans
Logs	structured run events, policy decisions, approval lifecycle, errors, redactions
Metrics	latency, model token/cost use, route decisions, tool calls, approval wait time, evaluator scores, replay pass rate
Artifacts	`CompiledContext`, Plan, `RoutingDecision`, tool transcripts, DecisionRecord, replay packet, scorecard

Recommended span attributes:

contextos.run_id
contextos.session_id
contextos.tenant_id
contextos.intent
contextos.context_pack_ref
contextos.policy_bundle_ids
contextos.approval_mode_required
contextos.approval_mode_effective
contextos.decision_key
contextos.decision_record_id
contextos.replay_id

Tail sampling should force retention for runs that cross approval gates, fail evaluator thresholds, produce incidents, or affect durable business state.
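The forced-retention rule can be expressed as a predicate that a tail sampler evaluates once a run completes. The run-summary field names below are hypothetical; only the four retention conditions come from the text.

```python
def must_retain(run: dict) -> bool:
    """Force retention for runs matching any of the four conditions."""
    return bool(
        run.get("crossed_approval_gate")         # run waited on or passed an approval gate
        or run.get("failed_evaluator_threshold")  # at least one evaluator below threshold
        or run.get("incident_ids")                # run produced one or more incidents
        or run.get("mutations")                   # run changed durable business state
    )


routine_read = {"crossed_approval_gate": False, "mutations": []}
refund_run = {"crossed_approval_gate": True, "mutations": ["order:ord_881"]}
```

Runs that match none of the conditions remain subject to whatever probabilistic sampling the collector applies.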

Standards alignment

ContextOS uses existing standards where they fit and adds only the agent-runtime contracts those standards do not define.

Concern	External standard or guidance	ContextOS use
Distributed trace identity	W3C Trace Context	`trace_id`, parent spans, tool spans, replay correlation
Telemetry model	OpenTelemetry and semantic conventions	spans, metrics, logs, resource and attribute conventions
Workload identity	SPIFFE	agent workload identity format and trust-domain separation
Delegation and token exchange	OAuth 2.0 Token Exchange, RFC 8693	user delegation, actor/subject distinction, scoped credentials
HTTP API description	OpenAPI Specification	adapter schemas, operation metadata, security schemes
AI tool protocol	Model Context Protocol	one adapter class behind the Tool Gateway, with ContextOS adding policy, approval, and audit envelopes
GenAI risk taxonomy	OWASP GenAI Security Project	prompt injection, excessive agency, insecure plugin/tool design, data exposure, output handling
AI risk governance	NIST AI RMF Core	govern, map, measure, manage reflected through policy, evaluator, release, and improvement loops

Alignment does not mean delegation. MCP or OpenAPI can describe a tool. ContextOS still decides whether that tool is exposed, whether it can execute, which identity it uses, which evidence it must cite, and how the action is replayed.

Control-plane lifecycle

Lifecycle step	Required artifact	Gate
Author	Context Pack, policy bundle, DecisionSpec, adapter capability, model profile, routing rule	schema lint
Review	architecture, security, data, evaluation review	reviewer verdicts
Publish	signed immutable version	registry signature
Roll out	tenant and environment pin	release gate
Execute	run artifacts and traces	runtime guards
Evaluate	scorecard and replay packet	evaluator thresholds
Improve	proposal from feedback, incident, or autotune	replay and review before promotion
Roll back	prior pack/policy/model/tool version pin	replay determinism check

Multi-tenant isolation

Tenant isolation is not only a database filter. It applies to every artifact:

Artifact	Isolation requirement
Context Pack	tenant or environment pin; signed publisher; immutable version
Graph snapshot	tenant-scoped snapshot ref; no cross-tenant traversal without explicit policy
Memory	tenant, subject, consent, classification, and retention gates
Model profile / RoutingDecision	tenant policy, residency, capability, and retention gates
Tool credential	tenant-scoped credential exchange with short-lived tokens
Trace	tenant-scoped trace access; redaction before export
DecisionRecord	subject IDs and evidence refs must not leak across tenant boundaries
Replay packet	pinned to tenant-owned or explicitly shared artifacts

Reference contracts

Contract	Source of truth
`invokeAgent`, `ToolCallEnvelope`, `ToolResultEnvelope`, `DecisionRecord`	API Contracts
`ContextPack` schema	Context Pack
`DecisionSpec`	Decision Catalog
`DecisionRecord`	Decision Record
model invocation and routing decisions	AI Gateway and LLM Router
`Intent` and `TaskTemplate`	Intent-Task Catalog
memory write proposals and review queue	Memory Fabric
component-level reference pages	Component Inventory

Implementation checklist

For a new tenant or workflow, do not call the runtime production-ready until every row is true.

Area	Check
Ontology	entity types, relationship types, CEID format, and evidence refs are declared
Context Pack	pack is signed, versioned, immutable, and pinned by environment
Policy	bundle is outside agent code and evaluated before risky action
Models	model profiles and routing rules are signed, pinned, residency-aware, and replay-gated
Tools	every capability has schema, auth mode, approval mode, and idempotency behavior
Decision	every governed action binds to a DecisionSpec with required evidence
Identity	user delegation and agent workload identity are both present on tool calls
Budget	token, cost, tool-call, wall-clock, and replan limits are enforced
Memory	capture, candidate, review, promotion, consent, and contradiction checks exist
Observability	traces, logs, metrics, scorecards, and artifacts are joined by trace ID
Replay	request, pack, policy, snapshot, tools, evaluator set, and model config are pinned
Release	golden replay and evaluator thresholds gate promotion
Rollback	prior versions can be re-pinned without requiring a schema migration

Anti-patterns

Anti-pattern	Why it fails
Direct provider calls from planners or evaluators	bypasses routing policy, residency checks, fallback controls, token/cost telemetry, and route audit
Direct adapter calls from the model	bypasses policy, identity, approval, schema validation, and trace capture
Hand-built prompts as the source of truth	hides context selection, truncation, and runtime controls
Free-form final answers for governed actions	loses DecisionSpec binding, evidence refs, and replay
Tool descriptions trusted as policy	tool metadata can describe behavior; it cannot authorize behavior
Memory writes on every run	turns temporary observations and injected content into future context
One global agent identity	destroys attribution between user delegation and agent workload identity
Evaluation only on model quality	misses policy, safety, cost, latency, evidence, and tool-risk regressions
Prompt edits after incidents	creates unreviewed behavior drift instead of replayable proposals
Cross-tenant shared traces by default	leaks business context and evidence refs

Roadmap notes

Plane primitives are stable contracts; individual services may evolve.
New patterns should be validated against working systems before promotion into this reference.
Major changes follow the same change-control process as Context Packs, policy bundles, DecisionSpecs, and evaluator sets.

Appendix A: Component inventory

Component	Plane	Owner doc
Conversation Manager	Decision	components/conversation-manager
Intent / Risk Classifier	Decision	components/intent-risk-classifier
Intent-Task Catalog	Decision	implementation/intent-task-catalog
Context Pack Compiler	Context	components/context-pack-compiler
Policy Engine	Trust	components/policy-engine
Orchestrator	Decision	components/orchestrator
AI Gateway / LLM Router	Decision	reference/ai-gateway-llm-router
Tool Gateway pattern	Action	Adapter Mesh
Tool Manager	Action	components/tool-manager
Decision Catalog	Decision	implementation/decision-catalog
Knowledge Substrate	Intelligence	foundations/knowledge-graph
Memory Fabric	Intelligence	implementation/memory-fabric
Identity Layer	Intelligence + Trust	foundations/identity-layer
Evaluation Engine	Trust	components/evaluation-engine
Observability	Trust	components/observability
Admin Console	Trust	components/admin-console

Appendix B: Naming conventions

Planes: Intelligence, Context, Decision, Action, Trust.
Primitives: PascalCase (RunContext, ContextPack, CompiledContext, DecisionRecord, ToolEnvelope).
Enum values: snake_case (read_only, local_write, network, delegated, destructive).
Identifiers: <scope>:<type>:<id> when human-readable (order:ord_881, customer:cus_77).
Trace attributes: contextos.<plane>.<attribute> when plane-specific; contextos.run_id and contextos.decision_record_id when global.
Version refs: <artifact_id>@<semver> for Context Packs, policy bundles, evaluator sets, and DecisionSpecs.
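The version-ref convention is simple enough to validate at registry boundaries. The sketch below is an assumption about how strict that validation might be: it accepts only core `MAJOR.MINOR.PATCH` versions (no pre-release or build metadata), and the allowed character set for `artifact_id` is a guess. The version number in the example is illustrative.

```python
import re

# <artifact_id>@<semver>, restricted here to core MAJOR.MINOR.PATCH.
VERSION_REF = re.compile(
    r"^(?P<artifact_id>[A-Za-z0-9_.\-]+)@"
    r"(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)$"
)


def parse_version_ref(ref: str) -> tuple:
    """Split a version ref into (artifact_id, (major, minor, patch))."""
    match = VERSION_REF.match(ref)
    if match is None:
        raise ValueError(f"not a pinned version ref: {ref!r}")
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return match["artifact_id"], version


artifact_id, version = parse_version_ref("ctxpack.support@1.4.2")
```

Rejecting floating refs such as `@latest` at parse time keeps unpinned artifacts out of replay packets.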