Abstract
Large language model agents are moving from interactive demos into workflows that retrieve private knowledge, call tools, remember user and business facts, and produce side effects. The research literature has made fast progress on individual capabilities: retrieval-augmented generation grounds model outputs in external memory, GraphRAG improves corpus-level sensemaking, ReAct and Toolformer show that models can interleave reasoning with action and API calls, MemGPT and generative agents explore memory architectures, Reflexion and ACE show how feedback can improve behavior without changing model weights, and the Meta-Harness preprint argues that the surrounding harness can be optimized as a first-class artifact.
Those advances also expose a systems problem. A production agent is not just a model with a longer prompt. It is a distributed decision system whose behavior depends on context selection, policy, identity, tool schemas, retrieval snapshots, memory promotion, approval gates, traces, replay, and release governance. If those artifacts are implicit, every incident becomes a model incident. If they are explicit, typed, and replayable, the system can be operated.
This paper frames ContextOS as a governed decision runtime for production agents. The central claim is not that ContextOS makes stochastic models deterministic. It makes the system around the model deterministic where production systems need determinism: context compilation, tool exposure, policy enforcement, evidence capture, approval routing, audit, replay, and improvement promotion. The paper synthesizes current research, maps it onto the five ContextOS planes, proposes runtime invariants, and outlines an evaluation protocol that could falsify or validate the architecture in real deployments.

Visual summary of ContextOS as the control plane around stochastic agent runtimes.
Keywords
Agentic AI, context engineering, retrieval-augmented generation, GraphRAG, tool use, memory, prompt injection, AI governance, replay, evaluation harnesses, ContextOS.
1. Scope and Method
This is a research-backed architecture paper for the public ContextOS spec. It analyzes:
- The ContextOS public documentation, especially Docs Overview, Reference Architecture, Agentic Context Engineering, Governance, Memory, and Evaluation and Observability.
- The typed reference implementation under
src/lib/contextos/, especiallytypes.tsandcompiler.ts. - Primary research and standards sources on retrieval, knowledge graphs, agent reasoning, tool use, memory, context optimization, prompt injection, and AI risk management.
The paper does not claim that the public repo is a production runtime or that ContextOS has already been benchmarked against all cited systems. The repo itself defines ContextOS as the public spec and documentation surface plus a small TypeScript reference implementation of the Context Pack compiler. The empirical claim here is therefore structured as a research thesis: ContextOS should be evaluated as a control-plane architecture for agentic systems, not as another prompt pattern or model wrapper.
2. The Problem: Agents Collapse Too Many Concerns Into One Prompt
Most early agent architectures combine four distinct concerns inside a single prompt:
- What the agent should know.
- What the agent should do next.
- Which tools the agent can call.
- Which policies and approvals govern the action.
That collapse is manageable in demos because the blast radius is low and the operator can inspect failure manually. It fails in enterprise settings because the organization needs precise answers after every important run:
- Which facts did the agent rely on?
- Which policy version was active?
- Which tools were exposed, and why?
- Which evidence was required before a side effect?
- Which human or workload identity authorized the call?
- Which artifact changed when quality regressed?
- Can the run be replayed without live side effects?
Research supports the need to separate these concerns. RAG shows that model behavior improves when parametric memory is combined with explicit non-parametric memory for knowledge-intensive tasks [1]. GraphRAG shows that graph structure can improve answers over large private corpora, especially for global sensemaking questions [2]. ReAct shows the value of interleaving reasoning traces with actions against external environments [3]. Toolformer shows that language models can learn to call APIs when external tools are useful [4]. MemGPT frames context-window pressure as a memory-management problem inspired by operating systems [7]. ACE and Meta-Harness push the same conclusion further: the context or harness around the model is an object of optimization, not incidental glue [10, 11].
The security literature makes the negative case. Indirect prompt injection arises because LLM applications blur the line between data and instructions [12]. A 2026 web-scale empirical preprint found prompt-injection instructions already present in webpages and HTTP responses, including hidden and non-rendered locations aimed at machines rather than humans [13]. OWASP treats prompt injection, excessive agency, sensitive information disclosure, supply-chain weakness, and output handling as separate but connected GenAI risks [14]. The shared conclusion is clear: the model cannot be the security boundary.
ContextOS is a proposed answer to that systems problem.
3. ContextOS Thesis
ContextOS is best understood as a governed decision runtime for production AI agents. It decomposes the runtime into five planes:
| Plane | Research problem it absorbs | ContextOS artifact |
|---|---|---|
| Intelligence | What is known, remembered, normalized, and retrieved | ontology, knowledge graph, identity, promoted memory, evidence refs |
| Context | What this run is allowed to see | ContextPack, CompiledContext, bucket manifests, budget report |
| Decision | How the agent reasons, plans, verifies, and reaches a verdict | Plan, Critic verdicts, DecisionRecord |
| Action | How external systems are touched | ToolEnvelope, Tool Gateway, adapter transcript |
| Trust | What is allowed, observable, replayable, and promotable | policy bundles, approval gates, scorecards, replay packets |
The architectural thesis is:
Agent reliability improves when stochastic model calls are embedded inside deterministic, typed, replayable control surfaces.
This does not require pretending that model outputs are deterministic. It requires deterministic handling of the surfaces that surround them:
- Context selection is compiled from versioned packs.
- Tool exposure is derived from registry, permissions, policy, and safety mode.
- Policy evaluation happens outside model text.
- Approval-mode tiers are fixed and enumerable:
read_only,local_write,network,delegated,destructive. - Memory enters future context only after capture, extraction, review, promotion, consent, and contradiction checks.
- Decisions are recorded as typed artifacts with evidence refs, policy decisions, approvals, budget use, trace id, and replay pointer.
- Improvement proposals pass replay, review, and staged rollout before promotion.
This turns an agent from an opaque prompt loop into an operated system.
4. Related Work
4.1 Retrieval and Evidence
Lewis et al. introduced retrieval-augmented generation for knowledge-intensive NLP, combining a parametric seq2seq model with a dense vector index of external documents [1]. The important lesson for ContextOS is not simply “retrieve documents.” It is that factuality, provenance, and updateability improve when the system can reach beyond model weights.
GraphRAG extends this idea by using an LLM-derived graph index and community summaries to answer broader questions over private corpora [2]. This matters for ContextOS because enterprise evidence is rarely a flat list of passages. It is a set of entities, relationships, policies, events, identities, and provenance chains. ContextOS places this work in the Intelligence plane, where ontology, knowledge graph, evidence refs, and pinned snapshots are owned separately from per-request compilation.
The implication: retrieval is not the runtime contract. Retrieval is a supply function. The contract is the evidence manifest that says which sources were eligible, which were included, which were omitted, and why.
4.2 Reasoning and Acting
ReAct demonstrates that reasoning and acting can be interleaved so an LLM can plan, gather information, update its state, and handle exceptions [3]. Toolformer demonstrates that models can learn when and how to call APIs [4]. Tree of Thoughts shows that deliberate search over multiple reasoning paths can improve difficult problem solving [5].
These papers justify the Decision plane but do not by themselves solve production governance. In a production runtime, an action proposal is not enough. The system must decide whether the tool was surfaced, whether arguments satisfy schema and policy constraints, whether the approval mode is permitted, whether required evidence is present, and whether the call can be audited and replayed.
ContextOS therefore treats ReAct-style loops as inner decision mechanics. They sit inside typed boundaries, not above them.
4.3 Memory and Learning From Feedback
Generative Agents showed that agents can store observations, synthesize reflections, and retrieve memories to guide future behavior [6]. MemGPT developed an operating-system-inspired memory model to move information between context tiers under limited context windows [7]. Reflexion showed that language agents can improve through verbal feedback stored in episodic memory rather than weight updates [8].
These works support the claim that durable agent behavior needs memory. They also expose a risk: memory that writes itself directly into future context becomes a contamination surface. A poisoned memory, stale preference, unconsented personal fact, or contradicted business rule can become a future instruction.
ContextOS responds with promotion-aware memory. Raw captures are not compiled. Candidates are reviewed. Promoted records carry provenance, classification, consent, contradiction state, and recall filters. The compiler sees promoted memory only, and even then only under tenant, role, classification, freshness, and intent filters.
4.4 Context and Harness Optimization
ACE treats context as an evolving playbook that accumulates strategies through generation, reflection, and curation while avoiding context collapse [10]. The Meta-Harness preprint argues that LLM system performance depends not only on model weights but on the harness: the code that chooses what to store, retrieve, and present to the model [11].
These papers are especially aligned with ContextOS. They shift the optimization target away from only prompts and toward the surrounding system. ContextOS adds a governance constraint: optimization can propose changes to retrieval settings, bucket budgets, prompt fragments, memory recall filters, evaluator thresholds, or rollout gates, but promotion remains a Trust-plane decision. Search can discover better candidates; it cannot lower safety floors or auto-promote itself.
4.5 Security, Standards, and Governance
Indirect prompt injection research shows that arbitrary retrieved content can act like an instruction stream to the model [12]. A later empirical study found such instructions in the wild across webpages and HTTP responses [13]. OWASP’s GenAI project organizes these risks into operational categories for LLM applications [14]. NIST’s AI Risk Management Framework frames AI risk as a lifecycle activity across design, development, use, and evaluation [15].
Tool and agent interoperability standards such as MCP and A2A are necessary but insufficient. MCP standardizes how applications expose context, resources, prompts, and tools [16]. A2A standardizes agent-to-agent task interaction, discovery, transport, and authentication concerns [17]. W3C Trace Context and OpenTelemetry standardize propagation and observability mechanisms [18, 19]. ContextOS should not replace these. It should govern how they are admitted into a decision runtime.
The distinction is important. MCP can expose a tool. ContextOS decides whether this run can see that tool, under which identity, at which approval mode, with which argument constraints, and how the call is recorded.
The same distinction applies to agent frameworks. Frameworks can help teams build graphs, crews, tool-using agents, routing logic, and multi-agent workflows. They do not by themselves answer the governance questions ContextOS treats as first-class: what evidence is required before action, which memory is eligible for recall, which policy bundle applies, what approval mode binds the tool, how the run is replayed after an incident, and which artifact must change when behavior regresses. ContextOS is therefore not a replacement for agent frameworks or interoperability protocols. It is the control plane that decides how their capabilities enter a governed run.
5. The ContextOS Architecture Model
5.1 Intelligence Plane
The Intelligence plane owns the slow-moving substrate of meaning:
- ontology
- knowledge graph
- identity model
- embedding keys and source contracts
- promoted memory
- evidence snapshots
The research link is direct. RAG and GraphRAG need external knowledge. Generative Agents, MemGPT, and Reflexion need memory. ContextOS makes those capabilities governable by requiring provenance, classification, consent, contradiction handling, and snapshot pinning.
The key invariant is:
No unpinned, uncited, unpromoted knowledge becomes durable runtime context.
That is stricter than common RAG systems. A vector hit is not automatically evidence. It must become an evidence ref with source, timestamp, snapshot, and eligibility.
5.2 Context Plane
The Context plane owns per-request compilation. The source artifact is the versioned ContextPack; the runtime artifact is the CompiledContext.
The reference compiler follows an eight-stage shape:
- Intent classification.
- Policy resolution.
- Tool surfacing.
- Evidence retrieval.
- Memory recall.
- Token budget allocation.
- Bucket assembly.
- Manifests and runtime controls.
The important design choice is that prompt text is an output, not the contract. The contract includes:
- compiled prompt
- policy manifest
- tool manifest
- evidence manifest
- context blocks by bucket
- runtime controls
- budget report
- omission and truncation diagnostics
The compiler separates buckets such as business, policy, tool, evidence, memory, and session. This lets the runtime measure pressure and failure by bucket. A missing evidence block is different from a missing session summary. A policy omission is different from memory truncation. The Decision plane should not have to infer those differences from a long prompt string.
The key invariant is:
The model sees only what the compiler emitted, and the compiler emits a manifest explaining what it emitted.
5.3 Decision Plane
The Decision plane owns the bounded loop:
planner(CompiledContext) -> Plan
critic.verify(Plan) -> ok | replan | reject
executor(Plan, ToolGateway) -> step_results
critic.score(step_results) -> accept | retry | replan | escalate
consolidate(effects, evidence) -> memory_proposalsThis plane benefits from ReAct-style interleaving of reasoning and action, Tree-of-Thought-style exploration, and Reflexion-style feedback. But ContextOS requires that the loop end in a typed DecisionRecord, not just a response.
The DecisionRecord is the audit index over the run. It binds:
- decision key and version
- status
- actor and subject ids
- outputs
- evidence refs
- policy decisions
- approvals
- controls active
- budget usage
- trace id
- replay id
The key invariant is:
The Decision plane may propose actions and verdicts, but it does not own policy, identity, memory promotion, or side-effect execution.
5.4 Action Plane
The Action plane owns governed external effects. It routes tool calls through a Tool Gateway and adapter mesh. Adapters may speak MCP, A2A, OpenAPI, internal function contracts, database protocols, or custom interfaces, but the runtime-facing shape is a ToolEnvelope.
This is where ContextOS turns tool-use research into production tool governance. Toolformer asks when and how a model can call APIs. ContextOS asks a wider set of questions:
- Was this tool surfaced in the
tool_manifest? - Is this capability within the run’s
safety_mode? - Does the effective approval mode allow it?
- Are required arguments present and within constraints?
- Is the destination allowed?
- Is there a valid user delegation or workload identity?
- Is there an idempotency key?
- Will the result carry trace context and evidence refs?
The key invariant is:
Every external effect crosses the Tool Gateway; a tool the compiler did not surface does not exist to the model for this run.
5.5 Trust Plane
The Trust plane owns policy, approvals, evaluation, observability, replay, and improvement gates. It sits over the other planes because trust constraints appear at every boundary.
This plane exists because prompt-side governance is structurally weak. If a policy exists only as text inside a prompt, retrieved adversarial text can compete with it. If tool authorization exists only in the model’s behavior, tool use becomes persuasion. If memory promotion is automatic, a poisoned observation can become future state.
ContextOS moves those decisions outside the model:
- policy bundles are deterministic rules scoped by intent and risk
- approval gates bind high-risk actions to named approvers and frozen evidence snapshots
- evaluation scorecards measure Policy, Utility, Latency, Safety, and Economics
- replay pins pack version, graph snapshot, request envelope, tool transcripts, model profile, route decisions, and evaluator set
- improvement outputs are proposals, not self-applying changes
The key invariant is:
The model proposes; deterministic boundaries decide.
6. What ContextOS Is Not
The boundary is easier to understand by stating the non-goals explicitly.
| ContextOS is not | ContextOS is |
|---|---|
| A prompt template | A runtime contract system |
| A vector database | A governed context and decision layer |
| An agent framework replacement | A control plane over agent frameworks |
| An MCP or A2A replacement | A governance layer over tool and agent interoperability |
| A claim of model determinism | Deterministic boundaries around stochastic models |
| An automatic self-improvement loop | Gated improvement through replay, evals, review, and release controls |
| A log viewer | A typed audit and replay substrate |
This matters because most confusion about agent infrastructure comes from putting every component in the same category. A framework may orchestrate reasoning. A protocol may expose tools or agents. A model provider may produce candidate plans. A vector store may retrieve candidate evidence. ContextOS governs which of those candidates become eligible runtime artifacts and records why.
7. Formalized Runtime Contract
7.1 Minimum Viable ContextOS
A practical first implementation does not need every plane at full maturity. It needs the smallest contract set that makes a run attributable, governable, and replayable:
| Contract | Why it matters |
|---|---|
RunContext | Names who is acting, for whom, under which tenant, role, safety mode, trace, and budget |
ContextPack | Declares context sources, tool eligibility, policies, memory rules, and evaluator settings |
CompiledContext | Captures what the model actually sees, including manifests and budget report |
ToolManifest | Lists which tools and capabilities are visible for this run |
PolicyDecision | Records why something was allowed, denied, escalated, or routed to approval |
ToolEnvelope | Standardizes every tool call and result, including auth, constraints, idempotency, trace, and evidence refs |
DecisionRecord | Stores the final outcome, evidence, approvals, controls, budgets, and trace binding |
ReplayPacket | Pins the inputs needed to reproduce or debug the run later |
That MVP is deliberately narrow. It does not require a perfect planner, a full enterprise knowledge graph, or a large evaluator stack on day one. It requires that every meaningful action has a typed context, a typed authorization path, and a typed record after completion.
7.2 Runtime Flow Mental Model
Without a governed runtime, the common flow is:
User -> Prompt -> LLM -> Tool Call -> Side EffectThat shape hides the authority boundary. Context, policy, memory, and tool eligibility are all implicit.
With ContextOS, the flow becomes:
User Request
-> RunContext
-> Context Compiler
-> CompiledContext + ToolManifest + PolicyManifest + EvidenceManifest
-> Planner / Critic / Verifier
-> Tool Gateway
-> DecisionRecord + ReplayPacket + ScorecardThis is the smallest useful mental model for leaders and builders. ContextOS turns an agent from a black-box prompt loop into an operated system.
Figure: ContextOS as the Control Plane Around Agent Frameworks
Agent Framework / Model Runtime
|
v
+--------------------+
| Context Compiler |
+--------------------+
|
v
+--------------------+ +----------------+
| CompiledContext | ---> | Planner/Critic |
| ToolManifest | +----------------+
| PolicyManifest | |
| EvidenceManifest | v
+--------------------+ +----------------+
| Tool Gateway |
+----------------+
|
v
+----------------+
| DecisionRecord |
| ReplayPacket |
| Scorecard |
+----------------+A ContextOS run can be modeled as:
Run = (Request, RunContext, ContextPack, EvidenceSnapshot, PromotedMemory, PolicyBundleSet, ToolRegistry)The Context compiler is a function:
compile(Request, RunContext, ContextPack, EvidenceSnapshot, PromotedMemory)
-> CompiledContextFor a replayable runtime, compile should be referentially transparent over pinned inputs:
same inputs -> same CompiledContextThe model call is allowed to be stochastic:
model(CompiledContext, ModelProfile, SamplingConfig) -> CandidatePlanBut the surrounding runtime must bind it:
verify(CandidatePlan, CompiledContext, PolicyManifest, ToolManifest, RunBudget)
-> ok | replan | reject | escalate
execute(VerifiedStep, ToolGateway, RunContext)
-> ToolResult
record(RunArtifacts)
-> DecisionRecordThis division is the core of the architecture. Determinism is required at policy, manifest, tool surface, budget accounting, approval gating, trace propagation, and replay boundaries. It is not required inside the generative model’s internal token path.
8. Runtime Invariants
Invariant 1: Context Is Compiled, Not Hand-Assembled
A prompt assembled ad hoc at runtime cannot explain its own source priority, omissions, redactions, policy version, tool eligibility, or budget pressure. A CompiledContext can.
Operational test:
- Given a
trace_id, an operator can recover the pack version, evidence refs, tool manifest, policy manifest, budget report, and runtime controls used by that run.
Invariant 2: Tool Surface Is Precomputed
The surfaced tool set must be:
Registry intersection Permissions minus Prohibitions, filtered by ApprovalModeThe model should not discover arbitrary tools during reasoning. A prompt injection that names a non-surfaced tool should fail because the tool is absent from the manifest.
Operational test:
- For every tool call, the adapter id and capability appear in the run’s
tool_manifest.
Invariant 3: Approval Modes Are Monotone Risk Tiers
The five canonical modes are read_only, local_write, network, delegated, and destructive. A run may filter out capabilities above its safety_mode; policy may further restrict actions; no runtime path should invent a sixth mode or silently upgrade authority.
Operational test:
- A
read_onlyrun cannot emit anetwork,delegated, ordestructiveside effect, even if the model requests it.
Invariant 4: Evidence Precedes Governed Action
For network, delegated, and destructive effects, required evidence must resolve before execution. A missing evidence ref is a runtime condition, not a model uncertainty hidden in text.
Operational test:
- A destructive call without required evidence returns a typed denial or escalation and records the missing evidence requirement.
Invariant 5: Memory Must Be Promoted Before Recall
Memory captured from prior runs is not automatically eligible for future context. It must pass capture, candidate extraction, review, promotion, consent, contradiction checks, and recall filters.
Operational test:
- No raw capture id appears in a compiled memory bucket. Only promoted records can enter.
Invariant 6: Policy Lives Outside Agent Code
The model may summarize policy; it may not enforce policy. Enforcement belongs to deterministic policy evaluation at compile, plan, and execute boundaries.
Operational test:
- Each allow, deny, approval, or escalation decision records a
policy_decision_id, matched rule ids, and normalized claims.
Invariant 7: Replay Is Designed In
Replay cannot be added after an incident. It requires pinned pack versions, graph snapshots, tool transcripts, model profiles, route decisions, evaluator versions, and trace ids.
Operational test:
- A historical run can be replayed without re-executing external tools and can reproduce the expected
DecisionRecordor report a precise replay gap.
Invariant 8: Improvement Is Gated
Autotune, reviewer agents, and harness optimizers can propose changes. They cannot silently promote changes that weaken policy, redaction, approval, evidence, or replay invariants.
Operational test:
- Every promoted pack, policy, tool, evaluator, or memory-rule change has a proposal, replay scorecard, review decision, rollout stage, and rollback target.
9. Why This Architecture Matches the Research Trajectory
The research field has been converging on a clear pattern: model quality depends heavily on the external system that surrounds the model.
RAG and GraphRAG externalize knowledge [1, 2]. ReAct, Toolformer, SWE-agent, and related agent work externalize action interfaces [3, 4, 9]. MemGPT externalizes memory management [7]. Reflexion and ACE externalize learning into feedback and evolving context [8, 10]. Meta-Harness externalizes optimization into the harness [11]. Prompt-injection research externalizes the need for security boundaries [12, 13].
ContextOS is a synthesis layer over that trajectory. It does not reject any of those capabilities. It assigns them to control surfaces:
| Research capability | Common implementation | ContextOS treatment |
|---|---|---|
| RAG | retrieve chunks into prompt | evidence refs in a manifest with provenance and snapshot |
| GraphRAG | graph summaries and entity relations | Intelligence-plane knowledge substrate with pinned graph snapshots |
| ReAct | interleaved reasoning and action | Decision-plane loop under Critic and Tool Gateway |
| Toolformer | model-selected API calls | pre-surfaced tools with approval modes and argument constraints |
| Generative Agents | observation, reflection, planning memory | promotion-aware memory with consent and contradiction checks |
| MemGPT | virtual context and memory tiers | explicit memory tiers plus compiler-controlled recall |
| Reflexion | feedback stored as episodic memory | corrections become governed memory and improvement signals |
| ACE | evolving contexts | Context Pack changes as versioned, replayed proposals |
| Meta-Harness | optimize harness code | Improvement Loop over declared tunable surfaces |
| Prompt injection research | retrieved content can attack behavior | untrusted text remains evidence, not authority |
This mapping is the strongest scientific argument for ContextOS. The architecture is not an isolated invention. It is a control-plane reading of where the literature already points.
10. Evaluation Protocol
A scientific claim needs a falsifiable evaluation plan. A ContextOS implementation should be measured against baselines such as:
- Prompt-only agent.
- RAG agent with unconstrained tool calling.
- ReAct-style agent with tool schemas but no deterministic policy boundary.
- Memory-enabled agent without promotion gates.
- Harness-optimized agent without Trust-plane release gates.
- ContextOS-style runtime with all five planes active.
10.1 Datasets and Workloads
The evaluation should include at least four workload classes:
- Knowledge-intensive decisions, such as support refunds, regulated account changes, or policy Q&A.
- Tool-using workflows, such as order lookup, calendar scheduling, ticket updates, code edits, or incident triage.
- Long-running memory workflows, such as customer support personalization or recurring operations analysis.
- Adversarial workflows, including indirect prompt injection in retrieved documents, tool results, webpages, and memory candidates.
Each workload should have golden cases, adversarial cases, and production-like noisy cases.
10.2 Metrics
Metrics should map to ContextOS scorecard dimensions:
| Dimension | Metric examples |
|---|---|
| Policy | rule violation rate, approval-gate honored rate, denied-but-attempted risky call rate |
| Utility | task success, decision correctness, evidence-supported answer rate |
| Latency | p50/p95/p99 compile, decision, tool, and end-to-end latency |
| Safety | redaction success, prompt-injection execution rate, hallucinated citation rate |
| Economics | tokens per decision, tool calls per decision, cost per verified success |
| Replay | replay match rate, replay gap rate, artifact completeness |
| Operability | mean time to locate fault plane, number of artifacts needed for incident diagnosis |
The operability metrics matter because ContextOS is an operating architecture. If it improves offline task success but does not improve diagnosis, replay, and release governance, the core thesis is only partially supported.
10.3 Ablation Matrix
A useful ablation matrix removes one control at a time:
| Ablation | Expected failure mode |
|---|---|
| Remove Context plane | prompt grows informal; omissions and source priority become invisible |
| Remove tool manifest | model attempts unavailable or unauthorized tools |
| Remove approval modes | risk taxonomy fragments across workflows |
| Remove promotion-aware memory | stale or poisoned captures return as future context |
| Remove Trust-plane policy | prompt instructions become the effective security boundary |
| Remove replay pinning | incidents cannot be reproduced after source or model changes |
| Remove release gates | harness optimization can overfit search sets or weaken safety floors |
The architecture is validated if each control removal produces the predicted class of regression and if the full system reduces those regressions without unacceptable latency or cost.
10.4 Prompt-Injection Evaluation
Prompt injection should not be evaluated only by asking whether the final answer contains malicious text. The more important measure is whether an injected instruction causes an unauthorized effect.
Test cases should inject adversarial instructions into:
- retrieved documents
- HTML comments and metadata
- tool results
- user-uploaded files
- memory candidates
- third-party records
For each case, measure:
- Did the model repeat or follow the instruction?
- Did the Critic catch the unauthorized plan?
- Did the Tool Gateway deny unauthorized calls?
- Was the denial recorded with policy decision id and trace id?
- Did the memory promotion pipeline reject or quarantine the injected content?
The expected ContextOS result is not “the model never proposes the bad action.” The expected result is “the boundary refuses to execute it.”
10.5 Replay Evaluation
Replay should be tested as a first-class target:
- Run a golden workflow.
- Persist request envelope, pack version, policy bundle, graph snapshot, promoted memory refs, tool transcript, model profile, route decision, evaluator version, and trace id.
- Re-run replay without live tool execution.
- Compare
CompiledContext, Critic verdict, budget report, andDecisionRecord.
Failures should be classified:
- input not pinned
- nondeterministic compiler behavior
- model profile unavailable
- tool transcript incomplete
- evaluator version drift
- policy bundle unavailable
- graph snapshot unavailable
That classification is the difference between “replay failed” and “we know which contract broke.”
11. A Concrete Example: High-Risk Refund
Consider a refund workflow with a user asking for a high-value refund.
In a prompt-only system, the prompt might say:
Follow refund policy. Ask for approval for high-value refunds. Do not issue unsafe refunds.That looks reasonable until policy changes, a retrieved page contains stale limits, a PDF includes an injected instruction, or the model decides the user sounds trustworthy.
In ContextOS, the run is structured:
RunContextcarries tenant, user claims, agent identity, trace id, locale, safety mode, and budget.- The
ContextPackpoints to refund policy bundles, eligible tools, evidence requirements, memory rules, and evaluator settings. - The compiler resolves policy and tool surface. For example,
payments.issue_refundmay be present only underdestructivewith a named approval gate. - Required evidence includes order lookup, identity verification, refund history, fraud signal, and policy citation.
- The planner proposes a refund plan.
- The Critic verifies required evidence, tool eligibility, argument bounds, and approval gate.
- The Tool Gateway re-evaluates policy at execution time.
- The
DecisionRecordstores evidence refs, approvals, policy decision ids, controls active, budget usage, and replay id. - Any correction becomes a memory or improvement proposal, not an invisible prompt edit.
The model still does useful work: extracting intent, drafting explanations, proposing plans, summarizing evidence, and interacting with tools through the runtime. But the authority boundary is outside the model.
12. Research Hypotheses
The ContextOS architecture yields several testable hypotheses.
H1: Manifested Context Improves Incident Diagnosis
Runs with CompiledContext manifests should reduce the time required to identify whether a failure originated in retrieval, policy, memory, tool surfacing, budget pressure, or model reasoning.
Measurement:
- Mean time to fault-plane localization on seeded incidents.
- Number of artifacts needed to explain a wrong decision.
- Fraction of incidents with complete evidence lineage.
H2: Tool Surface Narrowing Reduces Unauthorized Effects
Agents with precomputed tool manifests and gateway enforcement should have lower unauthorized-effect rates under indirect prompt injection than agents that expose broad tool schemas directly to the model.
Measurement:
- Unauthorized tool-call execution rate.
- Unauthorized tool-call attempt rate.
- Denial audit completeness.
H3: Promotion-Aware Memory Reduces Long-Horizon Contamination
Memory systems that require promotion, consent, and contradiction checks should reduce future-run errors caused by stale, poisoned, or unconsented memory.
Measurement:
- Contaminated recall rate.
- Contradiction surfacing rate.
- Memory-related rollback rate.
H4: Replay-Gated Harness Optimization Beats Prompt-Only Iteration
Harness optimization constrained by replay and release gates should improve utility while preserving policy and safety floors better than free-form prompt iteration.
Measurement:
- Utility delta on held-out tests.
- Safety and policy regression rate.
- Search-set to test-set generalization gap.
- Rollback frequency.
H5: Five-Plane Ownership Reduces Cross-Team Ambiguity
When artifacts are owned by planes, operational teams should assign fixes more precisely: pack change, policy rule, tool manifest, memory promotion rule, evaluator threshold, or planner strategy rule.
Measurement:
- Percentage of incidents assigned to one plane and one artifact class.
- Postmortem action-item specificity.
- Repeat incident rate by class.
13. Limitations
ContextOS is not magic around an unreliable model. It is a runtime architecture with its own costs and risks.
First, the architecture adds operational overhead. Pack versions, manifests, policy bundles, evidence refs, replay packets, and scorecards must be maintained. Small teams may not need the full model until they have real blast radius.
Second, deterministic boundaries can still be misconfigured. A permissive policy rule is still permissive. A tool registered at the wrong approval mode is still dangerous. ContextOS makes the error visible and replayable; it does not make bad governance impossible.
Third, LLM-as-judge evaluation can drift. ContextOS mitigates this by pinning judge models, rubrics, and golden sets, but evaluator governance remains a hard problem.
Fourth, replay is only as complete as its pinned inputs. If a graph snapshot, model profile, tool transcript, or policy bundle is missing, replay becomes partial. The architecture should report that gap visibly rather than pretending the run is reproducible.
Fifth, standards interoperability creates boundary confusion. MCP, A2A, OpenAPI, W3C Trace Context, and OpenTelemetry solve different pieces. Treating any one of them as the whole governance model would be a category error. ContextOS depends on those protocols but must still define authority, approval, and audit.
Finally, the public repo is a spec and reference compiler. The strongest version of this paper requires empirical validation across production workloads, adversarial suites, and longitudinal memory tasks.
14. Design Implications
For platform engineers, ContextOS suggests a concrete build order:
- Define
RunContext,ApprovalMode,ContextPack,CompiledContext,ToolEnvelope, andDecisionRecordas stable contracts. - Build a deterministic compiler that emits manifests and budget reports before focusing on advanced planners.
- Put every external effect behind a Tool Gateway with schema validation, identity, idempotency, and trace propagation.
- Move policy into versioned bundles evaluated outside model text.
- Implement replay before broad rollout, not after the first incident.
- Treat memory as a promotion pipeline rather than a vector store write.
- Add scorecards and release gates before running automated harness optimization.
For product and governance teams, the implication is also concrete:
- A new tool is not just an integration. It is a capability with an approval mode and audit contract.
- A new memory class is not just personalization. It is a retention, consent, contradiction, and recall policy.
- A new prompt is not just copy. It is part of a pack with test coverage and replay impact.
- A model upgrade is not just a provider change. It is a release candidate that must replay prior decision records and safety cases.
15. Conclusion
The agent research literature has already moved beyond the idea that better prompts alone make reliable agents. Strong systems retrieve external knowledge, use tools, manage memory, learn from feedback, optimize harnesses, and defend against adversarial context. The missing layer is an operating model that makes those capabilities governable together.
ContextOS is that operating model. It divides the runtime into Intelligence, Context, Decision, Action, and Trust; compiles per-run context into typed manifests; routes side effects through governed tool envelopes; records decisions with evidence and approvals; and treats improvement as a gated release process.
The scientific value of ContextOS is not that it introduces every individual technique. It is that it names the boundaries between them. Those boundaries are what make agentic systems debuggable, auditable, replayable, and safe enough to improve.
References
[1] Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020.
[2] Darren Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization”, arXiv:2404.16130.
[3] Shunyu Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”, ICLR 2023.
[4] Timo Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools”, arXiv:2302.04761.
[5] Shunyu Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, NeurIPS 2023.
[6] Joon Sung Park et al., “Generative Agents: Interactive Simulacra of Human Behavior”, arXiv:2304.03442.
[7] Charles Packer et al., “MemGPT: Towards LLMs as Operating Systems”, arXiv:2310.08560.
[8] Noah Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning”, arXiv:2303.11366.
[9] John Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”, arXiv:2405.15793.
[10] Qizheng Zhang et al., “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models”, ICLR 2026.
[11] Yoonho Lee et al., “Meta-Harness: End-to-End Optimization of Model Harnesses”, arXiv preprint 2603.28052.
[12] Kai Greshake et al., “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, arXiv:2302.12173.
[13] Soheil Khodayari et al., “Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives”, arXiv preprint 2604.27202.
[14] OWASP, “Top 10 for Large Language Model Applications”, OWASP GenAI Security Project.
[15] NIST, “AI Risk Management Framework”, National Institute of Standards and Technology.
[16] Model Context Protocol, “Specification, version 2025-11-25”.
[17] A2A Protocol, “Specification, version 1.0.0”.
[18] W3C, “Trace Context”, W3C Recommendation.
[19] OpenTelemetry, “OpenTelemetry Specification Overview”.
