ContextOS: A Research-Grounded Architecture for Governed Agent Runtimes

Abstract

Large language model agents are moving from interactive demos into workflows that retrieve private knowledge, call tools, remember user and business facts, and produce side effects. The research literature has made fast progress on individual capabilities: retrieval-augmented generation grounds model outputs in external memory, GraphRAG improves corpus-level sensemaking, ReAct and Toolformer show that models can interleave reasoning with action and API calls, MemGPT and generative agents explore memory architectures, Reflexion and ACE show how feedback can improve behavior without changing model weights, and the Meta-Harness preprint argues that the surrounding harness can be optimized as a first-class artifact.

Those advances also expose a systems problem. A production agent is not just a model with a longer prompt. It is a distributed decision system whose behavior depends on context selection, policy, identity, tool schemas, retrieval snapshots, memory promotion, approval gates, traces, replay, and release governance. If those artifacts are implicit, every incident becomes a model incident. If they are explicit, typed, and replayable, the system can be operated.

This paper frames ContextOS as a governed decision runtime for production agents. The central claim is not that ContextOS makes stochastic models deterministic. It makes the system around the model deterministic where production systems need determinism: context compilation, tool exposure, policy enforcement, evidence capture, approval routing, audit, replay, and improvement promotion. The paper synthesizes current research, maps it onto the five ContextOS planes, proposes runtime invariants, and outlines an evaluation protocol that could falsify or validate the architecture in real deployments.

Figure: Visual summary of ContextOS as the control plane around stochastic agent runtimes. The infographic shows the five planes, the minimum viable contracts, the runtime invariants, the scorecards, and the thesis that the model proposes while deterministic boundaries decide.

Keywords

Agentic AI, context engineering, retrieval-augmented generation, GraphRAG, tool use, memory, prompt injection, AI governance, replay, evaluation harnesses, ContextOS.

1. Scope and Method

This is a research-backed architecture paper for the public ContextOS spec. It analyzes:

The ContextOS public documentation, especially Docs Overview, Reference Architecture, Agentic Context Engineering, Governance, Memory, and Evaluation and Observability.
The typed reference implementation under src/lib/contextos/, especially types.ts and compiler.ts.
Primary research and standards sources on retrieval, knowledge graphs, agent reasoning, tool use, memory, context optimization, prompt injection, and AI risk management.

The paper does not claim that the public repo is a production runtime or that ContextOS has already been benchmarked against all cited systems. The repo itself defines ContextOS as the public spec and documentation surface plus a small TypeScript reference implementation of the Context Pack compiler. The empirical claim here is therefore structured as a research thesis: ContextOS should be evaluated as a control-plane architecture for agentic systems, not as another prompt pattern or model wrapper.

2. The Problem: Agents Collapse Too Many Concerns Into One Prompt

Most early agent architectures combine four distinct concerns inside a single prompt:

What the agent should know.
What the agent should do next.
Which tools the agent can call.
Which policies and approvals govern the action.

That collapse is manageable in demos because the blast radius is small and the operator can inspect failures manually. It fails in enterprise settings because the organization needs precise answers after every important run:

Which facts did the agent rely on?
Which policy version was active?
Which tools were exposed, and why?
Which evidence was required before a side effect?
Which human or workload identity authorized the call?
Which artifact changed when quality regressed?
Can the run be replayed without live side effects?

Research supports the need to separate these concerns. RAG shows that model behavior improves when parametric memory is combined with explicit non-parametric memory for knowledge-intensive tasks [1]. GraphRAG shows that graph structure can improve answers over large private corpora, especially for global sensemaking questions [2]. ReAct shows the value of interleaving reasoning traces with actions against external environments [3]. Toolformer shows that language models can learn to call APIs when external tools are useful [4]. MemGPT frames context-window pressure as a memory-management problem inspired by operating systems [7]. ACE and Meta-Harness push the same conclusion further: the context or harness around the model is an object of optimization, not incidental glue [10, 11].

The security literature makes the negative case. Indirect prompt injection arises because LLM applications blur the line between data and instructions [12]. A 2026 web-scale empirical preprint found prompt-injection instructions already present in webpages and HTTP responses, including hidden and non-rendered locations aimed at machines rather than humans [13]. OWASP treats prompt injection, excessive agency, sensitive information disclosure, supply-chain weakness, and output handling as separate but connected GenAI risks [14]. The shared conclusion is clear: the model cannot be the security boundary.

ContextOS is a proposed answer to that systems problem.

3. ContextOS Thesis

ContextOS is best understood as a governed decision runtime for production AI agents. It decomposes the runtime into five planes:

Plane	Research problem it absorbs	ContextOS artifact
Intelligence	What is known, remembered, normalized, and retrieved	ontology, knowledge graph, identity, promoted memory, evidence refs
Context	What this run is allowed to see	`ContextPack`, `CompiledContext`, bucket manifests, budget report
Decision	How the agent reasons, plans, verifies, and reaches a verdict	`Plan`, Critic verdicts, `DecisionRecord`
Action	How external systems are touched	`ToolEnvelope`, Tool Gateway, adapter transcript
Trust	What is allowed, observable, replayable, and promotable	policy bundles, approval gates, scorecards, replay packets

The architectural thesis is:

Agent reliability improves when stochastic model calls are embedded inside deterministic, typed, replayable control surfaces.

This does not require pretending that model outputs are deterministic. It requires deterministic handling of the surfaces that surround them:

Context selection is compiled from versioned packs.
Tool exposure is derived from registry, permissions, policy, and safety mode.
Policy evaluation happens outside model text.
Approval-mode tiers are fixed and enumerable: read_only, local_write, network, delegated, destructive.
Memory enters future context only after capture, extraction, review, promotion, consent, and contradiction checks.
Decisions are recorded as typed artifacts with evidence refs, policy decisions, approvals, budget use, trace id, and replay pointer.
Improvement proposals pass replay, review, and staged rollout before promotion.

This turns an agent from an opaque prompt loop into an operated system.

4.1 Retrieval and Evidence

Lewis et al. introduced retrieval-augmented generation for knowledge-intensive NLP, combining a parametric seq2seq model with a dense vector index of external documents [1]. The important lesson for ContextOS is not simply “retrieve documents.” It is that factuality, provenance, and updateability improve when the system can reach beyond model weights.

GraphRAG extends this idea by using an LLM-derived graph index and community summaries to answer broader questions over private corpora [2]. This matters for ContextOS because enterprise evidence is rarely a flat list of passages. It is a set of entities, relationships, policies, events, identities, and provenance chains. ContextOS places this work in the Intelligence plane, where ontology, knowledge graph, evidence refs, and pinned snapshots are owned separately from per-request compilation.

The implication: retrieval is not the runtime contract. Retrieval is a supply function. The contract is the evidence manifest that says which sources were eligible, which were included, which were omitted, and why.
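As a minimal sketch of that contract, an evidence manifest can be modeled as a record of eligible, included, and omitted sources. The type names, the `OmissionReason` values, and `buildEvidenceManifest` are hypothetical illustrations and are not taken from the reference implementation.

```typescript
// Hypothetical shapes for illustration only; not taken from types.ts.
type OmissionReason = "over_budget" | "stale" | "classification_blocked";

interface EvidenceRef {
  sourceId: string;
  snapshotId: string;
  retrievedAt: string; // ISO-8601 timestamp
}

interface EvidenceManifest {
  eligible: EvidenceRef[];
  included: EvidenceRef[];
  omitted: { ref: EvidenceRef; reason: OmissionReason }[];
}

// Keep the first `maxIncluded` ranked candidates and record every other
// eligible source as an explicit omission with a reason.
function buildEvidenceManifest(ranked: EvidenceRef[], maxIncluded: number): EvidenceManifest {
  const included = ranked.slice(0, maxIncluded);
  const omitted = ranked
    .slice(maxIncluded)
    .map((ref) => ({ ref, reason: "over_budget" as OmissionReason }));
  return { eligible: ranked, included, omitted };
}
```

The point of the sketch is that an omitted source is a recorded fact with a reason, not something an operator has to infer from the prompt.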

4.2 Reasoning and Acting

ReAct demonstrates that reasoning and acting can be interleaved so an LLM can plan, gather information, update its state, and handle exceptions [3]. Toolformer demonstrates that models can learn when and how to call APIs [4]. Tree of Thoughts shows that deliberate search over multiple reasoning paths can improve difficult problem solving [5].

These papers justify the Decision plane but do not by themselves solve production governance. In a production runtime, an action proposal is not enough. The system must decide whether the tool was surfaced, whether arguments satisfy schema and policy constraints, whether the approval mode is permitted, whether required evidence is present, and whether the call can be audited and replayed.

ContextOS therefore treats ReAct-style loops as inner decision mechanics. They sit inside typed boundaries, not above them.

4.3 Memory and Learning From Feedback

Generative Agents showed that agents can store observations, synthesize reflections, and retrieve memories to guide future behavior [6]. MemGPT developed an operating-system-inspired memory model to move information between context tiers under limited context windows [7]. Reflexion showed that language agents can improve through verbal feedback stored in episodic memory rather than weight updates [8].

These works support the claim that durable agent behavior needs memory. They also expose a risk: memory that writes itself directly into future context becomes a contamination surface. A poisoned memory, stale preference, unconsented personal fact, or contradicted business rule can become a future instruction.

ContextOS responds with promotion-aware memory. Raw captures are not compiled. Candidates are reviewed. Promoted records carry provenance, classification, consent, contradiction state, and recall filters. The compiler sees promoted memory only, and even then only under tenant, role, classification, freshness, and intent filters.

4.4 Context and Harness Optimization

ACE treats context as an evolving playbook that accumulates strategies through generation, reflection, and curation while avoiding context collapse [10]. The Meta-Harness preprint argues that LLM system performance depends not only on model weights but on the harness: the code that chooses what to store, retrieve, and present to the model [11].

These papers are especially aligned with ContextOS. They shift the optimization target away from only prompts and toward the surrounding system. ContextOS adds a governance constraint: optimization can propose changes to retrieval settings, bucket budgets, prompt fragments, memory recall filters, evaluator thresholds, or rollout gates, but promotion remains a Trust-plane decision. Search can discover better candidates; it cannot lower safety floors or auto-promote itself.

4.5 Security, Standards, and Governance

Indirect prompt injection research shows that arbitrary retrieved content can act like an instruction stream to the model [12]. A later empirical study found such instructions in the wild across webpages and HTTP responses [13]. OWASP’s GenAI project organizes these risks into operational categories for LLM applications [14]. NIST’s AI Risk Management Framework frames AI risk as a lifecycle activity across design, development, use, and evaluation [15].

Tool and agent interoperability standards such as MCP and A2A are necessary but insufficient. MCP standardizes how applications expose context, resources, prompts, and tools [16]. A2A standardizes agent-to-agent task interaction, discovery, transport, and authentication concerns [17]. W3C Trace Context and OpenTelemetry standardize propagation and observability mechanisms [18, 19]. ContextOS should not replace these. It should govern how they are admitted into a decision runtime.

The distinction is important. MCP can expose a tool. ContextOS decides whether this run can see that tool, under which identity, at which approval mode, with which argument constraints, and how the call is recorded.

The same distinction applies to agent frameworks. Frameworks can help teams build graphs, crews, tool-using agents, routing logic, and multi-agent workflows. They do not by themselves answer the governance questions ContextOS treats as first-class: what evidence is required before action, which memory is eligible for recall, which policy bundle applies, what approval mode binds the tool, how the run is replayed after an incident, and which artifact must change when behavior regresses. ContextOS is therefore not a replacement for agent frameworks or interoperability protocols. It is the control plane that decides how their capabilities enter a governed run.

5. The ContextOS Architecture Model

5.1 Intelligence Plane

The Intelligence plane owns the slow-moving substrate of meaning:

ontology
knowledge graph
identity model
embedding keys and source contracts
promoted memory
evidence snapshots

The research link is direct. RAG and GraphRAG need external knowledge. Generative Agents, MemGPT, and Reflexion need memory. ContextOS makes those capabilities governable by requiring provenance, classification, consent, contradiction handling, and snapshot pinning.

The key invariant is:

No unpinned, uncited, unpromoted knowledge becomes durable runtime context.

That is stricter than common RAG systems. A vector hit is not automatically evidence. It must become an evidence ref with source, timestamp, snapshot, and eligibility.

5.2 Context Plane

The Context plane owns per-request compilation. The source artifact is the versioned ContextPack; the runtime artifact is the CompiledContext.

The reference compiler follows an eight-stage shape:

Intent classification.
Policy resolution.
Tool surfacing.
Evidence retrieval.
Memory recall.
Token budget allocation.
Bucket assembly.
Manifests and runtime controls.

The important design choice is that prompt text is an output, not the contract. The contract includes:

compiled prompt
policy manifest
tool manifest
evidence manifest
context blocks by bucket
runtime controls
budget report
omission and truncation diagnostics

The compiler separates buckets such as business, policy, tool, evidence, memory, and session. This lets the runtime measure pressure and failure by bucket. A missing evidence block is different from a missing session summary. A policy omission is different from memory truncation. The Decision plane should not have to infer those differences from a long prompt string.

The key invariant is:

The model sees only what the compiler emitted, and the compiler emits a manifest explaining what it emitted.
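Per-bucket accounting can be sketched as follows. The bucket names come from the text above; `ContextBlock`, `BucketReport`, `assembleBuckets`, and the greedy admission rule are assumptions made for illustration, not the behavior of the reference compiler.

```typescript
// Bucket names follow the text; every other name here is a hypothetical sketch.
type Bucket = "business" | "policy" | "tool" | "evidence" | "memory" | "session";

interface ContextBlock {
  id: string;
  bucket: Bucket;
  tokens: number;
}

interface BucketReport {
  budget: number;
  used: number;
  omitted: string[]; // ids of blocks dropped under budget pressure
}

const BUCKETS: readonly Bucket[] = ["business", "policy", "tool", "evidence", "memory", "session"];

// Admit blocks in input order while each bucket's token budget allows it,
// recording what was dropped so omissions stay visible per bucket.
function assembleBuckets(blocks: ContextBlock[], budgets: Record<Bucket, number>) {
  const report = {} as Record<Bucket, BucketReport>;
  for (const b of BUCKETS) {
    report[b] = { budget: budgets[b], used: 0, omitted: [] };
  }
  const emitted: ContextBlock[] = [];
  for (const block of blocks) {
    const r = report[block.bucket];
    if (r.used + block.tokens <= r.budget) {
      r.used += block.tokens;
      emitted.push(block);
    } else {
      r.omitted.push(block.id);
    }
  }
  return { emitted, report };
}
```

With a report of this shape, a dropped evidence block and a dropped session summary show up as different entries rather than as an unexplained shorter prompt.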

5.3 Decision Plane

The Decision plane owns the bounded loop:

planner(CompiledContext) -> Plan
critic.verify(Plan) -> ok | replan | reject
executor(Plan, ToolGateway) -> step_results
critic.score(step_results) -> accept | retry | replan | escalate
consolidate(effects, evidence) -> memory_proposals

This plane benefits from ReAct-style interleaving of reasoning and action, Tree-of-Thought-style exploration, and Reflexion-style feedback. But ContextOS requires that the loop end in a typed DecisionRecord, not just a response.
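A simplified sketch of the bounded loop is shown here. It covers only the plan, verify, and execute steps and leaves out `critic.score` and `consolidate`. The `LoopDeps` interface, the `maxAttempts` bound, and the choice to escalate when the bound is exhausted are assumptions for illustration.

```typescript
// Simplified sketch: plan -> verify -> execute, with a hard bound on replans.
type Verdict = "ok" | "replan" | "reject";

interface Plan {
  steps: string[];
}

interface LoopDeps {
  planner: (attempt: number) => Plan;
  verify: (plan: Plan) => Verdict;
  execute: (step: string) => string;
}

type LoopOutcome =
  | { status: "completed"; results: string[]; attempts: number }
  | { status: "rejected" | "escalated"; attempts: number };

function runBoundedLoop(deps: LoopDeps, maxAttempts: number): LoopOutcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const plan = deps.planner(attempt);
    const verdict = deps.verify(plan);
    if (verdict === "reject") {
      return { status: "rejected", attempts: attempt };
    }
    if (verdict === "ok") {
      const results = plan.steps.map((step) => deps.execute(step));
      return { status: "completed", results, attempts: attempt };
    }
    // verdict === "replan": try again until the attempt bound is reached.
  }
  // Assumption: an exhausted replan budget escalates instead of failing silently.
  return { status: "escalated", attempts: maxAttempts };
}
```

Every exit path returns a typed outcome, which is the property a DecisionRecord would build on.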

The DecisionRecord is the audit index over the run. It binds:

decision key and version
status
actor and subject ids
outputs
evidence refs
policy decisions
approvals
controls active
budget usage
trace id
replay id

The key invariant is:

The Decision plane may propose actions and verdicts, but it does not own policy, identity, memory promotion, or side-effect execution.

5.4 Action Plane

The Action plane owns governed external effects. It routes tool calls through a Tool Gateway and adapter mesh. Adapters may speak MCP, A2A, OpenAPI, internal function contracts, database protocols, or custom interfaces, but the runtime-facing shape is a ToolEnvelope.

This is where ContextOS turns tool-use research into production tool governance. Toolformer asks when and how a model can call APIs. ContextOS asks a wider set of questions:

Was this tool surfaced in the tool_manifest?
Is this capability within the run’s safety_mode?
Does the effective approval mode allow it?
Are required arguments present and within constraints?
Is the destination allowed?
Is there a valid user delegation or workload identity?
Is there an idempotency key?
Will the result carry trace context and evidence refs?

The key invariant is:

Every external effect crosses the Tool Gateway; a tool the compiler did not surface does not exist to the model for this run.
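A few of the gateway questions above can be sketched as a deterministic admission check. The five modes are taken from the text. The `ToolCall` and `GatewayDecision` shapes, the `admit` function, and the rule that any non-read-only call needs an idempotency key are assumptions for illustration.

```typescript
// The five modes come from the text; the types and checks are an illustrative sketch.
const MODES = ["read_only", "local_write", "network", "delegated", "destructive"] as const;
type ApprovalMode = (typeof MODES)[number];

interface ToolCall {
  toolId: string;
  mode: ApprovalMode;
  idempotencyKey?: string;
}

type GatewayDecision =
  | { admitted: true }
  | { admitted: false; reason: "not_surfaced" | "mode_exceeds_safety" | "missing_idempotency_key" };

// Deterministic admission check, evaluated outside the model.
function admit(
  call: ToolCall,
  toolManifest: ReadonlySet<string>,
  safetyMode: ApprovalMode,
): GatewayDecision {
  if (!toolManifest.has(call.toolId)) {
    return { admitted: false, reason: "not_surfaced" };
  }
  if (MODES.indexOf(call.mode) > MODES.indexOf(safetyMode)) {
    return { admitted: false, reason: "mode_exceeds_safety" };
  }
  // Assumption: any call that can write must carry an idempotency key.
  if (call.mode !== "read_only" && !call.idempotencyKey) {
    return { admitted: false, reason: "missing_idempotency_key" };
  }
  return { admitted: true };
}
```

A production gateway would also check argument constraints, destinations, and identity; the sketch only shows that each refusal is a typed result rather than model text.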

5.5 Trust Plane

The Trust plane owns policy, approvals, evaluation, observability, replay, and improvement gates. It sits over the other planes because trust constraints appear at every boundary.

This plane exists because prompt-side governance is structurally weak. If a policy exists only as text inside a prompt, retrieved adversarial text can compete with it. If tool authorization exists only in the model’s behavior, tool use becomes persuasion. If memory promotion is automatic, a poisoned observation can become future state.

ContextOS moves those decisions outside the model:

policy bundles are deterministic rules scoped by intent and risk
approval gates bind high-risk actions to named approvers and frozen evidence snapshots
evaluation scorecards measure Policy, Utility, Latency, Safety, and Economics
replay pins pack version, graph snapshot, request envelope, tool transcripts, model profile, route decisions, and evaluator set
improvement outputs are proposals, not self-applying changes

The key invariant is:

The model proposes; deterministic boundaries decide.

6. What ContextOS Is Not

The boundary is easier to understand by stating the non-goals explicitly.

ContextOS is not	ContextOS is
A prompt template	A runtime contract system
A vector database	A governed context and decision layer
An agent framework replacement	A control plane over agent frameworks
An MCP or A2A replacement	A governance layer over tool and agent interoperability
A claim of model determinism	Deterministic boundaries around stochastic models
An automatic self-improvement loop	Gated improvement through replay, evals, review, and release controls
A log viewer	A typed audit and replay substrate

This matters because most confusion about agent infrastructure comes from putting every component in the same category. A framework may orchestrate reasoning. A protocol may expose tools or agents. A model provider may produce candidate plans. A vector store may retrieve candidate evidence. ContextOS governs which of those candidates become eligible runtime artifacts and records why.

7. Formalized Runtime Contract

7.1 Minimum Viable ContextOS

A practical first implementation does not need every plane at full maturity. It needs the smallest contract set that makes a run attributable, governable, and replayable:

Contract	Why it matters
`RunContext`	Names who is acting, for whom, under which tenant, role, safety mode, trace, and budget
`ContextPack`	Declares context sources, tool eligibility, policies, memory rules, and evaluator settings
`CompiledContext`	Captures what the model actually sees, including manifests and budget report
`ToolManifest`	Lists which tools and capabilities are visible for this run
`PolicyDecision`	Records why something was allowed, denied, escalated, or routed to approval
`ToolEnvelope`	Standardizes every tool call and result, including auth, constraints, idempotency, trace, and evidence refs
`DecisionRecord`	Stores the final outcome, evidence, approvals, controls, budgets, and trace binding
`ReplayPacket`	Pins the inputs needed to reproduce or debug the run later

That MVP is deliberately narrow. It does not require a perfect planner, a full enterprise knowledge graph, or a large evaluator stack on day one. It requires that every meaningful action has a typed context, a typed authorization path, and a typed record after completion.

7.2 Runtime Flow Mental Model

Without a governed runtime, the common flow is:

User -> Prompt -> LLM -> Tool Call -> Side Effect

That shape hides the authority boundary. Context, policy, memory, and tool eligibility are all implicit.

With ContextOS, the flow becomes:

User Request
  -> RunContext
  -> Context Compiler
  -> CompiledContext + ToolManifest + PolicyManifest + EvidenceManifest
  -> Planner / Critic / Verifier
  -> Tool Gateway
  -> DecisionRecord + ReplayPacket + Scorecard

This is the smallest useful mental model for leaders and builders. ContextOS turns an agent from a black-box prompt loop into an operated system.

Figure: ContextOS as the Control Plane Around Agent Frameworks

Agent Framework / Model Runtime
        |
        v
+--------------------+
| Context Compiler   |
+--------------------+
        |
        v
+--------------------+      +----------------+
| CompiledContext    | ---> | Planner/Critic |
| ToolManifest       |      +----------------+
| PolicyManifest     |              |
| EvidenceManifest   |              v
+--------------------+      +----------------+
                            | Tool Gateway   |
                            +----------------+
                                    |
                                    v
                            +----------------+
                            | DecisionRecord |
                            | ReplayPacket   |
                            | Scorecard      |
                            +----------------+

A ContextOS run can be modeled as:

Run = (Request, RunContext, ContextPack, EvidenceSnapshot, PromotedMemory, PolicyBundleSet, ToolRegistry)

The Context compiler is a function:

compile(Request, RunContext, ContextPack, EvidenceSnapshot, PromotedMemory)
  -> CompiledContext

For a replayable runtime, compile should be referentially transparent over pinned inputs:

same inputs -> same CompiledContext

The model call is allowed to be stochastic:

model(CompiledContext, ModelProfile, SamplingConfig) -> CandidatePlan

But the surrounding runtime must bind it:

verify(CandidatePlan, CompiledContext, PolicyManifest, ToolManifest, RunBudget)
  -> ok | replan | reject | escalate
 
execute(VerifiedStep, ToolGateway, RunContext)
  -> ToolResult
 
record(RunArtifacts)
  -> DecisionRecord

This division is the core of the architecture. Determinism is required at policy, manifest, tool surface, budget accounting, approval gating, trace propagation, and replay boundaries. It is not required inside the generative model’s internal token path.
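One way to test the referential-transparency requirement on compile is to fingerprint the pinned inputs and the resulting CompiledContext with a canonical, key-order-independent serialization. The `canonicalize` and `fingerprint` helpers below are hypothetical, and the 32-bit FNV-1a hash is used only to keep the sketch dependency-free; a real runtime would more plausibly use a cryptographic digest.

```typescript
// Serialize a JSON-like value with object keys sorted, so that logically equal
// inputs produce identical strings regardless of key insertion order.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return value === undefined ? "null" : JSON.stringify(value);
}

// 32-bit FNV-1a over the canonical string (illustrative, not collision-resistant).
function fingerprint(value: unknown): string {
  const text = canonicalize(value);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}
```

A replay harness could then assert that two compilations over the same pinned inputs yield the same fingerprint, and treat any mismatch as nondeterministic compiler behavior.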

8. Runtime Invariants

Invariant 1: Context Is Compiled, Not Hand-Assembled

A prompt assembled ad hoc at runtime cannot explain its own source priority, omissions, redactions, policy version, tool eligibility, or budget pressure. A CompiledContext can.

Operational test:

Given a trace_id, an operator can recover the pack version, evidence refs, tool manifest, policy manifest, budget report, and runtime controls used by that run.

Invariant 2: Tool Surface Is Precomputed

The surfaced tool set must be:

(Registry ∩ Permissions) \ Prohibitions, filtered by ApprovalMode

The model should not discover arbitrary tools during reasoning. A prompt injection that names a non-surfaced tool should fail because the tool is absent from the manifest.

Operational test:

For every tool call, the adapter id and capability appear in the run’s tool_manifest.
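The set expression above can be sketched directly. The `ToolSpec` shape and the `surfaceTools` function are hypothetical, and the tool ids other than payments.issue_refund are invented for the example.

```typescript
const MODE_ORDER = ["read_only", "local_write", "network", "delegated", "destructive"] as const;
type ApprovalMode = (typeof MODE_ORDER)[number];

interface ToolSpec {
  id: string;
  mode: ApprovalMode; // the approval tier this tool requires
}

// (Registry ∩ Permissions) \ Prohibitions, then drop tools above the run's safety mode.
function surfaceTools(
  registry: ToolSpec[],
  permissions: ReadonlySet<string>,
  prohibitions: ReadonlySet<string>,
  safetyMode: ApprovalMode,
): string[] {
  const ceiling = MODE_ORDER.indexOf(safetyMode);
  return registry
    .filter((t) => permissions.has(t.id) && !prohibitions.has(t.id))
    .filter((t) => MODE_ORDER.indexOf(t.mode) <= ceiling)
    .map((t) => t.id)
    .sort(); // stable ordering keeps the manifest reproducible
}
```

Because the result is computed before the model runs, a tool named only in injected text never appears in the manifest and cannot be resolved at call time.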

Invariant 3: Approval Modes Are Monotone Risk Tiers

The five canonical modes are read_only, local_write, network, delegated, and destructive. A run may filter out capabilities above its safety_mode; policy may further restrict actions; no runtime path should invent a sixth mode or silently upgrade authority.

Operational test:

A read_only run cannot emit a local_write, network, delegated, or destructive side effect, even if the model requests it.

Invariant 4: Evidence Precedes Governed Action

For network, delegated, and destructive effects, required evidence must resolve before execution. A missing evidence ref is a runtime condition, not a model uncertainty hidden in text.

Operational test:

A destructive call without required evidence returns a typed denial or escalation and records the missing evidence requirement.
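A minimal sketch of this gate follows. The `EvidenceGate` type and `checkRequiredEvidence` function are hypothetical, and the evidence kinds in the usage note are illustrative labels rather than identifiers from the spec.

```typescript
type EvidenceGate =
  | { allowed: true }
  | { allowed: false; reason: "missing_evidence"; missing: string[] };

// `resolved` maps an evidence kind to the evidence ref that satisfied it.
function checkRequiredEvidence(
  required: readonly string[],
  resolved: ReadonlyMap<string, string>,
): EvidenceGate {
  const missing = required.filter((kind) => !resolved.has(kind));
  if (missing.length === 0) {
    return { allowed: true };
  }
  // The denial names exactly which requirements were unmet, so it can be recorded.
  return { allowed: false, reason: "missing_evidence", missing };
}
```

For example, a call that requires `order_lookup`, `identity_verification`, and `fraud_signal` but has resolved only `order_lookup` would be denied with the other two kinds listed as missing.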

Invariant 5: Memory Must Be Promoted Before Recall

Memory captured from prior runs is not automatically eligible for future context. It must pass capture, candidate extraction, review, promotion, consent, contradiction checks, and recall filters.

Operational test:

No raw capture id appears in a compiled memory bucket. Only promoted records can enter.
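The operational test can be sketched as a recall filter. The lifecycle states and the `MemoryRecord` fields are assumed names that compress the pipeline described above; freshness, role, classification, and intent filters are omitted for brevity.

```typescript
// Assumed lifecycle states; the text describes capture, extraction, review, and promotion.
type MemoryState = "captured" | "candidate" | "promoted" | "rejected";

interface MemoryRecord {
  id: string;
  state: MemoryState;
  tenant: string;
  consentGranted: boolean;
  contradicted: boolean;
}

// Only promoted, consented, uncontradicted records for the same tenant are
// eligible to enter the compiled memory bucket.
function recallEligible(records: MemoryRecord[], tenant: string): string[] {
  return records
    .filter((r) => r.state === "promoted")
    .filter((r) => r.tenant === tenant && r.consentGranted && !r.contradicted)
    .map((r) => r.id);
}
```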

Invariant 6: Policy Lives Outside Agent Code

The model may summarize policy; it may not enforce policy. Enforcement belongs to deterministic policy evaluation at compile, plan, and execute boundaries.

Operational test:

Each allow, deny, approval, or escalation decision records a policy_decision_id, matched rule ids, and normalized claims.
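A deterministic evaluator that produces such a record might look like the sketch below. The `Rule`, `Claims`, and `PolicyDecision` shapes are hypothetical, and both the precedence order (deny over approval over allow) and the default-deny fallback are assumptions rather than rules stated in the spec.

```typescript
type Effect = "allow" | "deny" | "require_approval";

interface Claims {
  role: string;
  amount: number;
}

interface Rule {
  id: string;
  effect: Effect;
  matches: (claims: Claims) => boolean;
}

interface PolicyDecision {
  policyDecisionId: string;
  effect: Effect;
  matchedRuleIds: string[];
  claims: Claims; // the normalized claims the rules were evaluated against
}

function evaluatePolicy(rules: Rule[], claims: Claims, policyDecisionId: string): PolicyDecision {
  const matched = rules.filter((r) => r.matches(claims));
  const has = (e: Effect) => matched.some((r) => r.effect === e);
  // Assumed precedence: deny > require_approval > allow; nothing matched means deny.
  let effect: Effect = "deny";
  if (has("deny")) effect = "deny";
  else if (has("require_approval")) effect = "require_approval";
  else if (has("allow")) effect = "allow";
  return { policyDecisionId, effect, matchedRuleIds: matched.map((r) => r.id), claims };
}
```

The same claims and rule set always yield the same decision and the same matched rule ids, which is what makes the decision auditable.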

Invariant 7: Replay Is Designed In

Replay cannot be added after an incident. It requires pinned pack versions, graph snapshots, tool transcripts, model profiles, route decisions, evaluator versions, and trace ids.

Operational test:

A historical run can be replayed without re-executing external tools and can reproduce the expected DecisionRecord or report a precise replay gap.

Invariant 8: Improvement Is Gated

Autotune, reviewer agents, and harness optimizers can propose changes. They cannot silently promote changes that weaken policy, redaction, approval, evidence, or replay invariants.

Operational test:

Every promoted pack, policy, tool, evaluator, or memory-rule change has a proposal, replay scorecard, review decision, rollout stage, and rollback target.
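The operational test can be expressed as a pre-promotion check. The `ChangeProposal` fields mirror the artifacts listed above, but the field names, the numeric scorecard, and the floor comparison are assumptions made for this sketch.

```typescript
interface ChangeProposal {
  proposalId: string;
  replayScorecard?: { policy: number; safety: number };
  reviewDecision?: "approved" | "rejected";
  rolloutStage?: string;
  rollbackTarget?: string;
}

// Returns every reason the proposal cannot be promoted; an empty list means it may proceed.
function promotionBlockers(
  proposal: ChangeProposal,
  floors: { policy: number; safety: number },
): string[] {
  const blockers: string[] = [];
  if (!proposal.replayScorecard) {
    blockers.push("missing_replay_scorecard");
  } else {
    if (proposal.replayScorecard.policy < floors.policy) blockers.push("policy_below_floor");
    if (proposal.replayScorecard.safety < floors.safety) blockers.push("safety_below_floor");
  }
  if (proposal.reviewDecision !== "approved") blockers.push("not_approved");
  if (!proposal.rolloutStage) blockers.push("missing_rollout_stage");
  if (!proposal.rollbackTarget) blockers.push("missing_rollback_target");
  return blockers;
}
```

An optimizer can generate as many proposals as it likes; only those with an empty blocker list are candidates for rollout.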

9. Why This Architecture Matches the Research Trajectory

The research field has been converging on a clear pattern: model quality depends heavily on the external system that surrounds the model.

RAG and GraphRAG externalize knowledge [1, 2]. ReAct, Toolformer, SWE-agent, and related agent work externalize action interfaces [3, 4, 9]. MemGPT externalizes memory management [7]. Reflexion and ACE externalize learning into feedback and evolving context [8, 10]. Meta-Harness externalizes optimization into the harness [11]. Prompt-injection research shows that the security boundary must also sit outside the model [12, 13].

ContextOS is a synthesis layer over that trajectory. It does not reject any of those capabilities. It assigns them to control surfaces:

Research capability	Common implementation	ContextOS treatment
RAG	retrieve chunks into prompt	evidence refs in a manifest with provenance and snapshot
GraphRAG	graph summaries and entity relations	Intelligence-plane knowledge substrate with pinned graph snapshots
ReAct	interleaved reasoning and action	Decision-plane loop under Critic and Tool Gateway
Toolformer	model-selected API calls	pre-surfaced tools with approval modes and argument constraints
Generative Agents	observation, reflection, planning memory	promotion-aware memory with consent and contradiction checks
MemGPT	virtual context and memory tiers	explicit memory tiers plus compiler-controlled recall
Reflexion	feedback stored as episodic memory	corrections become governed memory and improvement signals
ACE	evolving contexts	Context Pack changes as versioned, replayed proposals
Meta-Harness	optimize harness code	Improvement Loop over declared tunable surfaces
Prompt injection research	retrieved content can attack behavior	untrusted text remains evidence, not authority

This mapping is the strongest scientific argument for ContextOS. The architecture is not an isolated invention. It is a control-plane reading of where the literature already points.

10. Evaluation Protocol

A scientific claim needs a falsifiable evaluation plan. A ContextOS implementation should be measured against baselines such as:

Prompt-only agent.
RAG agent with unconstrained tool calling.
ReAct-style agent with tool schemas but no deterministic policy boundary.
Memory-enabled agent without promotion gates.
Harness-optimized agent without Trust-plane release gates.
ContextOS-style runtime with all five planes active.

10.1 Datasets and Workloads

The evaluation should include at least four workload classes:

Knowledge-intensive decisions, such as support refunds, regulated account changes, or policy Q&A.
Tool-using workflows, such as order lookup, calendar scheduling, ticket updates, code edits, or incident triage.
Long-running memory workflows, such as customer support personalization or recurring operations analysis.
Adversarial workflows, including indirect prompt injection in retrieved documents, tool results, webpages, and memory candidates.

Each workload should have golden cases, adversarial cases, and production-like noisy cases.

10.2 Metrics

Metrics should map to the five ContextOS scorecard dimensions, extended with replay and operability measures:

Dimension	Metric examples
Policy	rule violation rate, approval-gate honored rate, denied-but-attempted risky call rate
Utility	task success, decision correctness, evidence-supported answer rate
Latency	p50/p95/p99 compile, decision, tool, and end-to-end latency
Safety	redaction success, prompt-injection execution rate, hallucinated citation rate
Economics	tokens per decision, tool calls per decision, cost per verified success
Replay	replay match rate, replay gap rate, artifact completeness
Operability	mean time to locate fault plane, number of artifacts needed for incident diagnosis

The operability metrics matter because ContextOS is an operating architecture. If it improves offline task success but does not improve diagnosis, replay, and release governance, the core thesis is only partially supported.

10.3 Ablation Matrix

A useful ablation matrix removes one control at a time:

Ablation	Expected failure mode
Remove Context plane	prompt grows informal; omissions and source priority become invisible
Remove tool manifest	model attempts unavailable or unauthorized tools
Remove approval modes	risk taxonomy fragments across workflows
Remove promotion-aware memory	stale or poisoned captures return as future context
Remove Trust-plane policy	prompt instructions become the effective security boundary
Remove replay pinning	incidents cannot be reproduced after source or model changes
Remove release gates	harness optimization can overfit search sets or weaken safety floors

The architecture is validated if each control removal produces the predicted class of regression and if the full system reduces those regressions without unacceptable latency or cost.

10.4 Prompt-Injection Evaluation

Prompt injection should not be evaluated only by asking whether the final answer contains malicious text. The more important measure is whether an injected instruction causes an unauthorized effect.

Test cases should inject adversarial instructions into:

retrieved documents
HTML comments and metadata
tool results
user-uploaded files
memory candidates
third-party records

For each case, measure:

Did the model repeat or follow the instruction?
Did the Critic catch the unauthorized plan?
Did the Tool Gateway deny unauthorized calls?
Was the denial recorded with policy decision id and trace id?
Did the memory promotion pipeline reject or quarantine the injected content?

The expected ContextOS result is not “the model never proposes the bad action.” The expected result is “the boundary refuses to execute it.”
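The distinction between a proposed attack and an executed one can be made measurable. In this sketch, `InjectionCaseResult` and `summarizeInjectionRun` are hypothetical names, and the three rates are one possible way to report the questions listed above.

```typescript
interface InjectionCaseResult {
  modelProposedAttack: boolean;        // the model repeated or followed the injected instruction
  unauthorizedEffectExecuted: boolean; // a side effect actually crossed the boundary
  denialRecorded: boolean;             // a denial was logged with policy decision id and trace id
}

function summarizeInjectionRun(results: InjectionCaseResult[]) {
  const total = results.length;
  const rate = (n: number) => (total === 0 ? 0 : n / total);
  const proposed = results.filter((r) => r.modelProposedAttack).length;
  const executed = results.filter((r) => r.unauthorizedEffectExecuted).length;
  // Cases where the model proposed the attack but the boundary stopped it.
  const blocked = results.filter((r) => r.modelProposedAttack && !r.unauthorizedEffectExecuted);
  const recorded = blocked.filter((r) => r.denialRecorded).length;
  return {
    proposalRate: rate(proposed),
    executionRate: rate(executed),
    // Vacuously 1 when there was nothing to block.
    denialRecordedRate: blocked.length === 0 ? 1 : recorded / blocked.length,
  };
}
```

Under the thesis of this paper, `executionRate` is the number that should approach zero, while `proposalRate` is allowed to stay above it.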

10.5 Replay Evaluation

Replay should be tested as a first-class target:

Run a golden workflow.
Persist request envelope, pack version, policy bundle, graph snapshot, promoted memory refs, tool transcript, model profile, route decision, evaluator version, and trace id.
Re-run the workflow in replay mode, without live tool execution.
Compare CompiledContext, Critic verdict, budget report, and DecisionRecord.

Failures should be classified:

input not pinned
nondeterministic compiler behavior
model profile unavailable
tool transcript incomplete
evaluator version drift
policy bundle unavailable
graph snapshot unavailable

That classification is the difference between “replay failed” and “we know which contract broke.”
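
One way to produce that classification is sketched below. The packet keys and the function signature are hypothetical; the labels reuse the failure classes listed above. The sketch detects only a missing tool transcript, not a partially recorded one, and it cannot separate compiler nondeterminism from an unpinned input without further diffing.

```python
# Hypothetical packet keys mapped to the failure class reported when the pin is absent.
MISSING_LABELS = {
    "policy_bundle": "policy bundle unavailable",
    "graph_snapshot": "graph snapshot unavailable",
    "model_profile": "model profile unavailable",
    "tool_transcript": "tool transcript incomplete",
}


def classify_replay(packet, current_evaluator, original_ctx, replayed_ctx):
    """Return the failure classes for one replay attempt; an empty list means a match."""
    failures = [
        label for key, label in MISSING_LABELS.items() if packet.get(key) is None
    ]
    if packet.get("evaluator_version") != current_evaluator:
        failures.append("evaluator version drift")
    # Only blame the compiler or pinning when every known pin was present.
    if not failures and original_ctx != replayed_ctx:
        failures.append("nondeterministic compiler behavior or input not pinned")
    return failures
```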

11. A Concrete Example: High-Risk Refund

Consider a refund workflow in which a user asks for a high-value refund.

In a prompt-only system, the prompt might say:

Follow refund policy. Ask for approval for high-value refunds. Do not issue unsafe refunds.

That looks reasonable until policy changes, a retrieved page contains stale limits, a PDF includes an injected instruction, or the model decides the user sounds trustworthy.

In ContextOS, the run is structured:

RunContext carries tenant, user claims, agent identity, trace id, locale, safety mode, and budget.
The ContextPack points to refund policy bundles, eligible tools, evidence requirements, memory rules, and evaluator settings.
The compiler resolves policy and tool surface. For example, payments.issue_refund may be present only under the destructive approval mode, with a named approval gate.
Required evidence includes order lookup, identity verification, refund history, fraud signal, and policy citation.
The planner proposes a refund plan.
The Critic verifies required evidence, tool eligibility, argument bounds, and approval gate.
The Tool Gateway re-evaluates policy at execution time.
The DecisionRecord stores evidence refs, approvals, policy decision ids, controls active, budget usage, and replay id.
Any correction becomes a memory or improvement proposal, not an invisible prompt edit.

The model still does useful work: extracting intent, drafting explanations, proposing plans, summarizing evidence, and interacting with tools through the runtime. But the authority boundary is outside the model.
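
The execution-time check in this example can be sketched as a deterministic function that never reads model text. Everything here is an illustrative assumption: the `ToolCall` shape, the manifest layout, the evidence keys, the `max_amount` bound, and the gate name used in the usage note are hypothetical, and a real gateway would also validate schemas, identity, and idempotency.

```python
from dataclasses import dataclass
from typing import Optional

# Evidence requirements taken from the refund example; the key spellings are assumed.
REQUIRED_EVIDENCE = frozenset({
    "order_lookup", "identity_verification", "refund_history",
    "fraud_signal", "policy_citation",
})


@dataclass(frozen=True)
class ToolCall:
    tool: str
    amount: float
    evidence: frozenset
    approval_id: Optional[str] = None


def gateway_decide(call, manifest, max_amount):
    """Re-evaluate a proposed call at execution time; returns (decision, detail)."""
    entry = manifest.get(call.tool)
    if entry is None:
        return ("deny", "tool not in manifest")
    missing = REQUIRED_EVIDENCE - call.evidence
    if missing:
        return ("deny", "missing evidence: " + ", ".join(sorted(missing)))
    if call.amount > max_amount:
        return ("deny", "amount exceeds policy bound")
    if entry["approval_mode"] == "destructive" and call.approval_id is None:
        return ("needs_approval", entry["approval_gate"])
    return ("allow", "")
```

With a manifest entry such as `{"payments.issue_refund": {"approval_mode": "destructive", "approval_gate": "refund-manager"}}`, a fully evidenced call without an approval id is routed to the gate instead of being executed, regardless of what the planner proposed.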

12. Research Hypotheses

The ContextOS architecture yields several testable hypotheses.

H1: Manifested Context Improves Incident Diagnosis

Runs with CompiledContext manifests should reduce the time required to identify whether a failure originated in retrieval, policy, memory, tool surfacing, budget pressure, or model reasoning.

Measurement:

Mean time to fault-plane localization on seeded incidents.
Number of artifacts needed to explain a wrong decision.
Fraction of incidents with complete evidence lineage.

H2: Tool Surface Narrowing Reduces Unauthorized Effects

Agents with precomputed tool manifests and gateway enforcement should have lower unauthorized-effect rates under indirect prompt injection than agents that expose broad tool schemas directly to the model.

Measurement:

Unauthorized tool-call execution rate.
Unauthorized tool-call attempt rate.
Denial audit completeness.
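
These three measurements could be computed from a log of proposed tool calls roughly as follows. The record keys and the choice of denominators (all proposed calls for the two rates, denied unauthorized calls for audit completeness) are assumptions made for this sketch, not definitions from the ContextOS spec.

```python
def h2_metrics(calls):
    """Compute H2 rates from dicts with boolean keys: authorized, executed, audited."""
    total = len(calls)
    unauthorized = [c for c in calls if not c["authorized"]]
    denied = [c for c in unauthorized if not c["executed"]]
    executed_unauthorized = sum(1 for c in unauthorized if c["executed"])
    audited_denials = sum(1 for c in denied if c["audited"])
    return {
        # Share of proposed calls that were unauthorized, whether or not they ran.
        "attempt_rate": len(unauthorized) / total if total else 0.0,
        # Share of proposed calls that were unauthorized and still executed.
        "execution_rate": executed_unauthorized / total if total else 0.0,
        # Share of denied unauthorized calls that left an audit record.
        "denial_audit_completeness": audited_denials / len(denied) if denied else 1.0,
    }
```

Reporting the attempt rate and the execution rate separately keeps the model's behavior distinct from the boundary's behavior.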

H3: Promotion-Aware Memory Reduces Long-Horizon Contamination

Memory systems that require promotion, consent, and contradiction checks should reduce future-run errors caused by stale, poisoned, or unconsented memory.

Measurement:

Contaminated recall rate.
Contradiction surfacing rate.
Memory-related rollback rate.
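
The promotion step that H3 relies on can be sketched as a small decision function. The candidate fields, the trust labels, and the verdict strings are hypothetical; the contradiction check here is a naive same-key, different-value comparison that stands in for whatever a real pipeline would use.

```python
def promote(candidate, promoted):
    """Decide whether a captured memory candidate may enter promoted memory."""
    if not candidate["consent"]:
        return "reject: no consent"
    if candidate["source_trust"] == "untrusted":
        return "quarantine: untrusted source"
    for memory in promoted:
        same_key = memory["key"] == candidate["key"]
        if same_key and memory["value"] != candidate["value"]:
            # Surface the conflict instead of silently overwriting.
            return "hold: contradicts promoted memory"
    return "promote"
```

Counting how often each verdict occurs gives a direct input to the contradiction surfacing rate listed above.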

H4: Replay-Gated Harness Optimization Beats Prompt-Only Iteration

Harness optimization constrained by replay and release gates should improve utility while preserving policy and safety floors better than free-form prompt iteration.

Measurement:

Utility delta on held-out tests.
Safety and policy regression rate.
Search-set to test-set generalization gap.
Rollback frequency.
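
A release gate over these measurements might look like the sketch below. The metric keys and the `max_gap` default of 0.05 are illustrative assumptions; real thresholds would come from the scorecards of a given deployment.

```python
def release_gate(candidate, baseline, max_gap=0.05):
    """Return (approved, reasons) for a harness candidate compared with a baseline."""
    reasons = []
    if candidate["test_utility"] <= baseline["test_utility"]:
        reasons.append("no held-out utility gain")
    if candidate["safety_pass_rate"] < baseline["safety_pass_rate"]:
        reasons.append("safety floor regressed")
    if candidate["policy_pass_rate"] < baseline["policy_pass_rate"]:
        reasons.append("policy floor regressed")
    # A large search-to-test gap suggests the candidate overfit the search set.
    gap = candidate["search_utility"] - candidate["test_utility"]
    if gap > max_gap:
        reasons.append("search-to-test gap too large")
    return (not reasons, reasons)
```

The design choice is that a utility gain never offsets a floor regression: any single reason blocks the release.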

H5: Five-Plane Ownership Reduces Cross-Team Ambiguity

When artifacts are owned by planes, operational teams should assign fixes more precisely: pack change, policy rule, tool manifest, memory promotion rule, evaluator threshold, or planner strategy rule.

Measurement:

Percentage of incidents assigned to one plane and one artifact class.
Postmortem action-item specificity.
Repeat incident rate by class.

13. Limitations

ContextOS is not a magic wrapper around an unreliable model. It is a runtime architecture with its own costs and risks.

First, the architecture adds operational overhead. Pack versions, manifests, policy bundles, evidence refs, replay packets, and scorecards must be maintained. Small teams may not need the full architecture until their agents have real blast radius.

Second, deterministic boundaries can still be misconfigured. A permissive policy rule is still permissive. A tool registered at the wrong approval mode is still dangerous. ContextOS makes the error visible and replayable; it does not make bad governance impossible.

Third, LLM-as-judge evaluation can drift. ContextOS mitigates this by pinning judge models, rubrics, and golden sets, but evaluator governance remains a hard problem.

Fourth, replay is only as complete as its pinned inputs. If a graph snapshot, model profile, tool transcript, or policy bundle is missing, replay becomes partial. The architecture should report that gap visibly rather than pretending the run is reproducible.

Fifth, standards interoperability creates boundary confusion. MCP, A2A, OpenAPI, W3C Trace Context, and OpenTelemetry solve different pieces. Treating any one of them as the whole governance model would be a category error. ContextOS depends on those protocols but must still define authority, approval, and audit.

Finally, the public repo is a spec and reference compiler. The strongest version of this paper requires empirical validation across production workloads, adversarial suites, and longitudinal memory tasks.

14. Design Implications

For platform engineers, ContextOS suggests a concrete build order:

Define RunContext, ApprovalMode, ContextPack, CompiledContext, ToolEnvelope, and DecisionRecord as stable contracts.
Build a deterministic compiler that emits manifests and budget reports before focusing on advanced planners.
Put every external effect behind a Tool Gateway with schema validation, identity, idempotency, and trace propagation.
Move policy into versioned bundles evaluated outside model text.
Implement replay before broad rollout, not after the first incident.
Treat memory as a promotion pipeline rather than a vector store write.
Add scorecards and release gates before running automated harness optimization.
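
As a starting point for the first step, two of those contracts can be sketched as immutable records. The field lists follow the refund example in Section 11, but the field types, the token-based budget unit, and the exact names are assumptions; the actual ContextOS schemas may differ.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class RunContext:
    tenant: str
    user_claims: tuple
    agent_identity: str
    trace_id: str
    locale: str
    safety_mode: str
    budget_tokens: int  # assumed unit; the source only says "budget"


@dataclass(frozen=True)
class DecisionRecord:
    trace_id: str
    evidence_refs: tuple
    approvals: tuple
    policy_decision_ids: tuple
    controls_active: tuple
    budget_used_tokens: int  # assumed unit
    replay_id: str


def is_immutable(record) -> bool:
    """Check that a contract instance rejects in-place mutation."""
    try:
        record.trace_id = "tampered"
    except FrozenInstanceError:
        return True
    return False
```

Freezing the records is one way to keep a contract stable within a run: a component that wants a different context must construct a new one, which leaves a visible trail.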

For product and governance teams, the implication is also concrete:

A new tool is not just an integration. It is a capability with an approval mode and audit contract.
A new memory class is not just personalization. It is a retention, consent, contradiction, and recall policy.
A new prompt is not just copy. It is part of a pack with test coverage and replay impact.
A model upgrade is not just a provider change. It is a release candidate that must replay prior decision records and safety cases.

15. Conclusion

The agent research literature has already moved beyond the idea that better prompts alone make reliable agents. Strong systems retrieve external knowledge, use tools, manage memory, learn from feedback, optimize harnesses, and defend against adversarial context. The missing layer is an operating model that makes those capabilities governable together.

ContextOS is that operating model. It divides the runtime into Intelligence, Context, Decision, Action, and Trust; compiles per-run context into typed manifests; routes side effects through governed tool envelopes; records decisions with evidence and approvals; and treats improvement as a gated release process.

The scientific value of ContextOS is not that it introduces every individual technique. It is that it names the boundaries between them. Those boundaries are what make agentic systems debuggable, auditable, replayable, and safe enough to improve.