Harness engineering for production AI agents

Run AI agents on production systems with replayable decisions.

ContextOS gives teams the harness around the model: compiled task context, enforced policy, governed tool use, validation before completion, and an audit record that can be replayed after the run. It keeps the core promise of agents while making their decisions inspectable, reversible, and improvable.

Audit your agent harness against its own code, or see how one run works and build the Quickstart.

Built for platform, product, and applied AI teams moving agents from prototypes into governed workflows.

Figure: ContextOS runtime. Abstract visual showing context compilation, policy gates, tool boundaries, decision records, and replay trace rails.

One run moves through a compiled context packet, deterministic policy checks, a bounded decision loop, governed tools, and a replayable record.

ContextPack · Policy gate · Decision loop · Tool Gateway · DecisionRecord · Replay
Five layers in one run
  1. Intelligence

    Evidence, ontology, memory, and identity handles.

  2. Context

    Compile the request into a budgeted ContextPack.

  3. Decision

    Plan, execute, and critic-check inside bounds.

  4. Action

    Route every external effect through the Tool Gateway.

  5. Trust

    Emit policy, approval, trace, replay, and record state.

Replayable run
run_2026_05_09_7f3a
policy passed · tools gated · replay ready
Execution path
Compile: ContextPack v42
Policy: network gated
Tool: MCP envelope
Critic: evidence pass
Record: trace linked
Replay: snapshot pinned
DecisionRecord (typed)
evidence_refs: 14
approval_mode: network
critic_verdict: pass
replay_bundle: ready
trace_id: otel-7f3a
proposal_state: none
Context: pinned pack + snapshot
Action: identity-bound tool call
Learning: correction becomes proposal

Compile the right context

Every request starts from a versioned Context Pack with evidence, memory, tool, policy, and budget manifests.

Govern the decision loop

Planner / Executor / Critic runs under deterministic budgets, approval-mode tiers, and policy decisions outside the model.

Record replayable decisions

Every answer or action emits a DecisionRecord with evidence refs, approvals, controls, tool envelopes, and trace ID.

Improve without prompt drift

Failures and operator corrections become typed proposals that pass replay, review, and release gates before promotion.

What happens on every run

From request to replayable DecisionRecord

A ContextOS run starts as an invokeAgent envelope and exits as a typed DecisionRecord. In between, the harness compiles context, verifies policy, limits tools, scores the output, and records the evidence needed for audit and replay. Long-running work resumes by session_id against pinned pack + snapshot.

Compile
Plan
Critic-verify
Execute
Critic-score
Consolidate
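
The six stages above can be read as one function from request envelope to record. The following is a minimal sketch under stated assumptions: `InvokeAgent`, `DecisionRecord`, `run_once`, their fields, and the stubbed stage bodies are hypothetical names chosen for illustration, not the ContextOS API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InvokeAgent:
    """Hypothetical request envelope; field names are assumptions."""
    session_id: str
    request: str
    pack_version: str


@dataclass
class DecisionRecord:
    """Hypothetical typed record emitted at the end of a run."""
    session_id: str
    pack_version: str
    stages: list = field(default_factory=list)
    critic_verdict: str = "pending"
    answer: str = ""


def run_once(envelope: InvokeAgent, allowed_tools: set) -> DecisionRecord:
    record = DecisionRecord(envelope.session_id, envelope.pack_version)

    # Compile: pin the context the run is allowed to see.
    context = {"request": envelope.request, "pack": envelope.pack_version}
    record.stages.append("compile")

    # Plan: a stub planner that proposes a single tool step.
    plan = [{"tool": "search", "args": {"q": context["request"]}}]
    record.stages.append("plan")

    # Critic-verify: reject the plan before anything executes.
    record.stages.append("critic_verify")
    if any(step["tool"] not in allowed_tools for step in plan):
        record.critic_verdict = "blocked"
        return record

    # Execute: stubbed results stand in for governed tool calls.
    results = [f"result for {step['args']['q']}" for step in plan]
    record.stages.append("execute")

    # Critic-score: a trivial check that every step produced output.
    record.critic_verdict = "pass" if all(results) else "fail"
    record.stages.append("critic_score")

    # Consolidate: fold the results into the answer on the record.
    record.answer = "; ".join(results)
    record.stages.append("consolidate")
    return record
```

In this sketch a plan that names a tool outside `allowed_tools` still yields a record, with `critic_verdict` set to `"blocked"` and no `execute` stage, so even a refused run leaves something to audit.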
The harness architecture

Five planes keep agent systems understandable

ContextOS keeps the model inside a larger execution environment. Intelligence stores what the system can know. Context chooses what this request may use. Decision turns context into a bounded plan. Action mediates external effects. Trust governs the other four with policy, evaluation, replay, and improvement.

One signed RunContext across the lifecycle

run_id, trace_id, session_id, tenant_id, user delegation, agent workload identity, safety_mode, and run_budget travel through every contract. Pressure signals flow alongside as Functional State.
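
One way to picture a signed RunContext is an immutable record plus a keyed digest over its canonical form. The sketch below assumes HMAC-SHA256 over sorted-key JSON; the text does not say which signing scheme ContextOS uses, and the field types are guesses.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunContext:
    """Field names follow the list above; the types are assumptions."""
    run_id: str
    trace_id: str
    session_id: str
    tenant_id: str
    user_delegation: str
    agent_workload_identity: str
    safety_mode: str
    run_budget: int


def sign(ctx: RunContext, key: bytes) -> str:
    # Canonical JSON (sorted keys) so the same context always signs the same.
    payload = json.dumps(asdict(ctx), sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(ctx: RunContext, signature: str, key: bytes) -> bool:
    # Constant-time comparison avoids leaking how much of a signature matched.
    return hmac.compare_digest(sign(ctx, key), signature)
```

Because the dataclass is frozen, a downstream component cannot quietly raise `run_budget`; it would have to build a new context, and that context would fail `verify` against the original signature.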

Risk is typed before a tool is callable

Every capability carries two labels. Risk is one of read_only · local_write · network · delegated · destructive. Kind is one of observe · recall · think_support · act · verify. Together they bound what the capability can do.
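
The two axes map naturally onto a pair of enums. The sketch below is illustrative only: it assumes the risk values are ordered from least to most dangerous and that an approval mode admits every capability at or below its own tier, neither of which the text states. `Capability` and `is_callable` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = 0
    LOCAL_WRITE = 1
    NETWORK = 2
    DELEGATED = 3
    DESTRUCTIVE = 4


class Kind(Enum):
    OBSERVE = "observe"
    RECALL = "recall"
    THINK_SUPPORT = "think_support"
    ACT = "act"
    VERIFY = "verify"


@dataclass(frozen=True)
class Capability:
    name: str
    risk: Risk
    kind: Kind


def is_callable(cap: Capability, approval_mode: Risk) -> bool:
    # Assumption: tiers are ordered, and a mode admits its own tier and below.
    return cap.risk.value <= approval_mode.value
```

Under these assumptions, a run approved at the `NETWORK` tier could call a `READ_ONLY` recall tool but not a `DESTRUCTIVE` act tool, and the check happens before the tool is ever offered to the model.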

One Tool Gateway for every external effect

MCP, A2A, OpenAPI, custom adapters, chat, email, voice, API, webhook, and SMS all use one envelope shape, one policy boundary, and OTEL trace propagation end-to-end.
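
A rough sketch of the "one envelope, one policy boundary" idea follows. `ToolEnvelope`, `ToolGateway`, and `PolicyDenied` are hypothetical names, the envelope fields are assumptions, and the adapters are plain callables rather than real MCP or OpenAPI clients.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolEnvelope:
    """Hypothetical envelope; the real shape is not given in the text."""
    protocol: str          # e.g. "mcp", "a2a", "openapi"
    tool: str
    args: Dict[str, Any]
    trace_id: str


class PolicyDenied(Exception):
    pass


class ToolGateway:
    def __init__(self, policy: Callable[[ToolEnvelope], bool]) -> None:
        self._policy = policy
        self._adapters: Dict[str, Callable[[ToolEnvelope], Any]] = {}

    def register(self, protocol: str, adapter: Callable[[ToolEnvelope], Any]) -> None:
        self._adapters[protocol] = adapter

    def call(self, envelope: ToolEnvelope) -> Dict[str, Any]:
        # One policy boundary: checked before any adapter is selected.
        if not self._policy(envelope):
            raise PolicyDenied(envelope.tool)
        result = self._adapters[envelope.protocol](envelope)
        # The trace ID travels with the result so spans can be linked.
        return {"trace_id": envelope.trace_id, "result": result}
```

The design point is that adapters never see a request the policy has not already allowed, so adding a new channel means registering an adapter rather than writing another policy check.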

Replay is a property of the harness, not a log-search exercise. Pinned pack + pinned snapshot + recorded request reproduces the verdict, and the Critic re-derives its decision without re-executing tools.
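
Replay without re-execution can be sketched as: check that the supplied pack and snapshot match the pinned digests, then run a deterministic critic over the recorded tool outputs. Everything named here (`ReplayBundle`, `digest`, the stand-in `critic` rule) is an assumption made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List


def digest(obj) -> str:
    # Content hash over canonical JSON, used to pin pack and snapshot.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReplayBundle:
    """Hypothetical bundle: pinned inputs plus recorded tool outputs."""
    pack_digest: str
    snapshot_digest: str
    request: str
    tool_outputs: List[str]
    verdict: str


def critic(request: str, tool_outputs: List[str]) -> str:
    # Deterministic stand-in: pass only if every output cites evidence.
    cited = tool_outputs and all("evidence:" in out for out in tool_outputs)
    return "pass" if cited else "fail"


def replay(bundle: ReplayBundle, pack: Dict, snapshot: Dict) -> bool:
    # Refuse to replay against anything other than the pinned inputs.
    if digest(pack) != bundle.pack_digest or digest(snapshot) != bundle.snapshot_digest:
        raise ValueError("pack or snapshot does not match the pinned digests")
    # Re-derive the verdict from recorded outputs; no tool is re-executed.
    return critic(bundle.request, bundle.tool_outputs) == bundle.verdict
```

A `True` result means the recorded verdict is reproducible from the pinned inputs; a `False` result signals that the critic logic has changed since the run.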
Figure: Five-plane architecture. ContextOS architecture visual with embedded labels for Intelligence, Context, Decision, Action, Trust, and a cross-cutting RunContext rail.

RunContext crosses Intelligence, Context, Decision, Action, and Trust so one execution vocabulary survives from request to replay.

Intelligence · Context · Decision · Action · Trust · RunContext
Five layers
  1. Intelligence

    Stable facts, identity, evidence, and memory.

  2. Context

    Per-request compilation and budgeted context.

  3. Decision

    Bounded planning, execution, and critique.

  4. Action

    Governed tools, adapters, and external effects.

  5. Trust

    Policy, audit, replay, and evaluator control.

The five planes

The operating model behind the runtime

Each plane owns a clear responsibility, with spec docs for interfaces, failure modes, operational concerns, and evaluation metrics. That keeps harness engineering concrete enough for platform teams to build, review, and operate.

Substrate of meaning

Intelligence plane

Ontology · Identity Layer · Knowledge Graph · Promotion-aware Memory with Consent Records.

Per-request compilation

Context plane

Context Pack · ContextPackCompiler · Token Budget Allocator · Runtime Controls · Functional State.

Bounded execution loop

Decision plane

Cognitive loop · Planner / Executor / Critic · Subagent lanes · Durable Background Sessions · Decision Catalog.

Governed external effects

Action plane

Tool Gateway · MCP / A2A / OpenAPI · Capability Classification · Skills · Approval-mode tiers · Sandbox Profiles.

Control over the other four

Trust plane

Policy outside agent code · evaluators · OTEL traces · Replay · Improvement Loop.

Improvement loop

Corrections become release-gated harness upgrades

Prompt edits are not an improvement loop. ContextOS turns failed runs, operator corrections, evaluator regressions, and pressure signals into typed proposals. Nothing auto-applies: every proposed change runs through replay, review, and the same promotion path as packs and policies.
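
The "nothing auto-applies" rule can be modeled as a small state machine in which a proposal only moves forward one gate at a time. The state names and the `Proposal` class below are hypothetical; the text names the gates (replay, review, promotion) but not their representation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProposalState(Enum):
    DRAFT = "draft"
    REPLAYED = "replayed"
    REVIEWED = "reviewed"
    PROMOTED = "promoted"
    REJECTED = "rejected"


# Each gate admits a proposal only from the state directly before it.
_NEXT = {
    ProposalState.DRAFT: ProposalState.REPLAYED,
    ProposalState.REPLAYED: ProposalState.REVIEWED,
    ProposalState.REVIEWED: ProposalState.PROMOTED,
}


@dataclass
class Proposal:
    source: str            # e.g. "operator_correction", "failed_run"
    change: str
    state: ProposalState = ProposalState.DRAFT
    history: List[str] = field(default_factory=list)

    def advance(self, gate_passed: bool) -> ProposalState:
        # Promoted and rejected proposals are terminal.
        if self.state not in _NEXT:
            raise ValueError(f"proposal is already {self.state.value}")
        self.state = _NEXT[self.state] if gate_passed else ProposalState.REJECTED
        self.history.append(self.state.value)
        return self.state
```

There is no method that jumps straight to `PROMOTED`, which is the point: a change reaches production only by passing each gate in order, and the `history` list records the path it took.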

Insight Synthesizer

Surfaces recurring patterns across runs as typed Insight records — failure clusters, blocked decisions, common detours.

Strategy Compiler

Converts validated feedback into reusable strategy rules at the right runtime layer (prompt, planner, retrieval, tool selection, memory recall, budget).

Feedback Store

First-class, append-only store for operator corrections and tips with provenance — cited by future improvements.

Chief-of-Staff

Scans due tasks, open loops, queue backlog, and gate latency to propose proactive operational notes.

Research Queue

Enqueues autonomous research tasks against scoped knowledge gaps; produces typed Knowledge Patches for review.

Autotune

Proposes prompt, retrieval, or budget changes against a target evaluator metric. Gated by golden replay before review.

Trust-grade primitives

Governance belongs in the runtime, not in a prompt

Sandbox profiles, consent records, functional state, durable sessions, and tool approvals are typed primitives. They compose with the canonical contract so trust has the same shape across every run, every tenant, and every integration channel.

Sandbox profiles

Typed, signed contracts for code execution. Pinned image, no host mount, no inbound network, hard caps. Bound per capability.
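
As a sketch of what "typed" might mean here, the constraints above can be expressed as a frozen record with a validation step. `SandboxProfile`, its field names, and its default limits are assumptions; signing is left out of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxProfile:
    """Hypothetical profile; field names and limits are assumptions."""
    image_digest: str              # pinned image, referenced by digest
    host_mounts: tuple = ()        # must stay empty
    inbound_network: bool = False
    cpu_seconds: int = 30
    memory_mb: int = 512

    def validate(self) -> None:
        if not self.image_digest.startswith("sha256:"):
            raise ValueError("image must be pinned by digest, not by tag")
        if self.host_mounts:
            raise ValueError("host mounts are not allowed")
        if self.inbound_network:
            raise ValueError("inbound network is not allowed")
        if self.cpu_seconds <= 0 or self.memory_mb <= 0:
            raise ValueError("hard caps must be positive")
```

Calling `validate()` when a profile is bound to a capability would reject a profile that names an image by tag such as `python:latest`, since a tag can move while a digest cannot.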

Consent records

First-class, append-only consent gating PII memory promotion at the candidate stage. Revocation honored on next recall.
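
The "revocation honored on next recall" behavior falls out of an append-only log in which the latest event wins and consent is checked at read time. The names below (`ConsentEvent`, `ConsentLog`, `recall`, the `"pii.memory"` scope) are hypothetical; the same `allows` check could gate promotion of a candidate memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConsentEvent:
    subject_id: str
    scope: str         # e.g. "pii.memory"
    granted: bool      # False records a revocation


class ConsentLog:
    """Append-only: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[ConsentEvent] = []

    def append(self, event: ConsentEvent) -> None:
        self._events.append(event)

    def allows(self, subject_id: str, scope: str) -> bool:
        # The latest matching event wins; no event means no consent.
        for event in reversed(self._events):
            if event.subject_id == subject_id and event.scope == scope:
                return event.granted
        return False


def recall(memory: List[dict], log: ConsentLog) -> List[dict]:
    # Consent is re-checked on every recall, so a revocation takes
    # effect the next time memory is read.
    return [
        item for item in memory
        if not item["pii"] or log.allows(item["subject_id"], "pii.memory")
    ]
```

Because a revocation is just another appended event, the log keeps the full history of grants and withdrawals for audit.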

Functional state

Typed per-turn pressure signals (budget / loop / evidence / gate / conflict) read by the Critic — never by the model.
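
A minimal sketch of how pressure signals might steer the Critic while staying out of model input. The 0.0 to 1.0 scale, the threshold, and the signal-to-action mapping are all assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FunctionalState:
    """Per-turn pressure signals; the 0.0 to 1.0 scale is an assumption."""
    budget: float = 0.0
    loop: float = 0.0
    evidence: float = 0.0
    gate: float = 0.0
    conflict: float = 0.0


# Hypothetical mapping from the dominant signal to a Critic action.
_ACTIONS = {
    "budget": "stop",
    "loop": "replan",
    "evidence": "gather_evidence",
    "gate": "await_approval",
    "conflict": "escalate",
}


def critic_action(state: FunctionalState, threshold: float = 0.8) -> str:
    # The strongest signal decides, but only once it crosses the threshold.
    name, value = max(asdict(state).items(), key=lambda kv: kv[1])
    return _ACTIONS[name] if value >= threshold else "continue"


def build_prompt(request: str, state: FunctionalState) -> str:
    # The state is accepted here only to make the omission explicit:
    # pressure signals are never rendered into model input.
    return f"Task: {request}"
```

Keeping the signals on the Critic side means the model cannot learn to argue with its own budget; the harness decides when to stop or replan.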

Background sessions

Hour-scale durable runs with checkpoint after every Critic verdict. Resume by session_id against pinned pack + snapshot.
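
Checkpoint-and-resume can be sketched with a small file-backed store keyed by session_id. This is an assumption-heavy illustration: `CheckpointStore` and `run_session` are hypothetical, the verdict logic is a stub, and a production store would need to be durable and shared rather than local JSON files.

```python
import json
from pathlib import Path
from typing import List, Optional


class CheckpointStore:
    """File-backed stand-in for a durable checkpoint store."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def save(self, session_id: str, state: dict) -> None:
        (self._root / f"{session_id}.json").write_text(json.dumps(state))

    def load(self, session_id: str) -> dict:
        path = self._root / f"{session_id}.json"
        if not path.exists():
            return {"step": 0, "verdicts": []}
        return json.loads(path.read_text())


def run_session(session_id: str, steps: List[str], store: CheckpointStore,
                pack_version: str, stop_after: Optional[int] = None) -> dict:
    state = store.load(session_id)
    # Resume only against the pack the session was pinned to.
    if state.get("pack_version", pack_version) != pack_version:
        raise ValueError("session is pinned to a different pack version")
    state["pack_version"] = pack_version
    executed = 0
    while state["step"] < len(steps):
        if stop_after is not None and executed >= stop_after:
            break  # simulate a pause or crash partway through
        # Stub verdict: a non-empty step counts as a pass.
        state["verdicts"].append("pass" if steps[state["step"]] else "fail")
        state["step"] += 1
        store.save(session_id, state)  # checkpoint after every verdict
        executed += 1
    return state
```

A second call with the same session_id picks up at the saved step instead of starting over, and a call with a different pack version is refused rather than silently mixing contexts.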

  • MCP, A2A, OpenAPI, custom — one envelope shape; one policy boundary.
  • Skills compose capabilities + prompt fragments + their own evaluation suites.
  • User delegation + agent workload identity (SPIFFE-style) on every call.
  • Semantic discovery — tools resolved from intent, not catalog dumps.
  • Cached read-only aliases for prompt density; alias cache invalidated on bundle promotion.
  • Idempotency keys, retries, circuit breakers, dead-letter handling.
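
The last bullet combines several reliability patterns. The sketch below shows three of them (idempotency keys, retries with exponential backoff, and a dead-letter list) in one hypothetical `IdempotentCaller`; circuit breaking is omitted to keep the example short, and none of these names come from ContextOS.

```python
import time
from typing import Any, Callable, Dict, List


class TransientError(Exception):
    pass


class IdempotentCaller:
    def __init__(self, max_attempts: int = 3, base_delay: float = 0.0) -> None:
        self._results: Dict[str, Any] = {}
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self.dead_letters: List[str] = []

    def call(self, key: str, fn: Callable[[], Any]) -> Any:
        # A repeated key returns the stored result instead of re-running.
        if key in self._results:
            return self._results[key]
        for attempt in range(self._max_attempts):
            try:
                result = fn()
            except TransientError:
                # Exponential backoff between attempts.
                time.sleep(self._base_delay * (2 ** attempt))
                continue
            self._results[key] = result
            return result
        # Retries exhausted: park the key for operator review.
        self.dead_letters.append(key)
        raise TransientError(f"gave up on {key}")
```

With an idempotency key per tool call, a retried or replayed request cannot send the same email twice: the second call with the same key returns the first result.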

Before ContextOS

Agent prototypes depend on prompt discipline, scattered tool wrappers, and manual review. When something goes wrong, teams reconstruct the run from logs and memory.

With ContextOS

The harness compiles the context, enforces policy, gates tools, validates output, emits a DecisionRecord, and routes failures into a typed improvement loop.

Operational result

Teams can answer what the agent saw, which policy allowed it, which tool it called, why the verdict passed, how to replay it, and what change would prevent a repeat failure.

Make the agent run explainable before it acts.

Start with the harness: bounded context, governed loops, typed decisions, replay, and release-gated improvement by default.