Harness engineering for production AI agents

Run AI agents on production systems with replayable decisions.

ContextOS gives teams the harness around the model: compiled task context, enforced policy, governed tool use, validation before completion, and an audit record that can be replayed after the run. It keeps the core promise of agents while making their decisions inspectable, reversible, and improvable.

Audit your agent harness against its own code, or see how one run works and build the Quickstart.

Built for platform, product, and applied AI teams moving agents from prototypes into governed workflows.

Figure: ContextOS runtime. One run moves through a compiled context packet, deterministic policy checks, a bounded decision loop, governed tools, and a replayable record.
Stages: ContextPack · Policy gate · Decision loop · Tool Gateway · DecisionRecord · Replay
Five layers in one run
01 Intelligence: Evidence, ontology, memory, and identity handles.
02 Context: Compile the request into a budgeted ContextPack.
03 Decision: Plan, execute, and critic-check inside bounds.
04 Action: Route every external effect through the Tool Gateway.
05 Trust: Emit policy, approval, trace, replay, and record state.

Replayable run

run_2026_05_09_7f3a

policy passed · tools gated · replay ready

Execution path

Compile: ContextPack v42
Policy: network gated
Tool: MCP envelope
Critic: evidence pass
Record: trace linked
Replay: snapshot pinned

DecisionRecord (typed)

evidence_refs: 14
approval_mode: network
critic_verdict: pass
replay_bundle: ready
trace_id: otel-7f3a
proposal_state: none

Context: pinned pack + snapshot

Action: identity-bound tool call

Learning: correction becomes proposal

Compile the right context

Every request starts from a versioned Context Pack with evidence, memory, tool, policy, and budget manifests.

Govern the decision loop

Planner / Executor / Critic runs under deterministic budgets, approval-mode tiers, and policy decisions outside the model.
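A minimal sketch of what "deterministic budgets outside the model" can mean in practice. The names RunBudget and run_bounded_loop, the two budget fields, and the toy planner, executor, and critic are all hypothetical; ContextOS's actual loop is not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunBudget:
    max_steps: int     # hard cap on executed steps (assumed field)
    max_retries: int   # hard cap on Critic-rejected attempts (assumed field)


def run_bounded_loop(
    plan: Callable[[], List[str]],
    execute: Callable[[str], str],
    critic: Callable[[str, str], bool],
    budget: RunBudget,
) -> dict:
    """Run each planned step until the Critic accepts it or a budget runs out.

    Budget checks are plain integer comparisons made by the harness,
    so they do not depend on anything the model says.
    """
    steps_used = 0
    rejections = 0
    outputs: List[str] = []
    for step in plan():
        while True:
            if steps_used >= budget.max_steps:
                return {"status": "halted", "reason": "step_budget", "outputs": outputs}
            result = execute(step)
            steps_used += 1
            if critic(step, result):
                outputs.append(result)
                break
            rejections += 1
            if rejections > budget.max_retries:
                return {"status": "halted", "reason": "retry_budget", "outputs": outputs}
    return {"status": "completed", "reason": None, "outputs": outputs}


# Toy stand-ins for the model-backed components.
steps = ["look up order", "draft reply"]
completed = run_bounded_loop(
    plan=lambda: steps,
    execute=lambda step: f"done: {step}",
    critic=lambda step, out: out.startswith("done"),
    budget=RunBudget(max_steps=5, max_retries=1),
)
halted = run_bounded_loop(
    plan=lambda: steps,
    execute=lambda step: f"done: {step}",
    critic=lambda step, out: True,
    budget=RunBudget(max_steps=1, max_retries=1),
)
```

In this sketch a run that exhausts its budget returns a typed "halted" result with the partial outputs instead of raising, so the caller can still record what happened.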

Record replayable decisions

Every answer or action emits a DecisionRecord with evidence refs, approvals, controls, tool envelopes, and trace ID.
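To make the idea of a typed record concrete, here is one possible shape for a DecisionRecord. The field names follow the items listed above, but the types, the class itself, and the example values are assumptions rather than the published ContextOS schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DecisionRecord:
    """Illustrative shape only; not the published ContextOS schema."""
    run_id: str
    trace_id: str
    evidence_refs: Tuple[str, ...]    # IDs of evidence the answer relied on
    approvals: Tuple[str, ...]        # approvals granted during the run
    controls: Tuple[str, ...]         # runtime controls that were in force
    tool_envelopes: Tuple[str, ...]   # IDs of recorded tool-call envelopes
    critic_verdict: str               # e.g. "pass" or "fail"
    proposal_state: Optional[str] = None

    def to_json(self) -> str:
        """Serialize with sorted keys so equal records produce equal text."""
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionRecord(
    run_id="run_2026_05_09_7f3a",
    trace_id="otel-7f3a",
    evidence_refs=("ev-001", "ev-002"),   # hypothetical evidence IDs
    approvals=("network",),
    controls=("token_budget",),
    tool_envelopes=("env-01",),
    critic_verdict="pass",
)
restored = json.loads(record.to_json())
```

Freezing the dataclass means a record cannot be edited after it is emitted, which is the property an audit trail needs.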

Improve without prompt drift

Failures and operator corrections become typed proposals that pass replay, review, and release gates before promotion.

What happens on every run

From request to replayable DecisionRecord

A ContextOS run starts as an invokeAgent envelope and exits as a typed DecisionRecord. In between, the harness compiles context, verifies policy, limits tools, scores the output, and records the evidence needed for audit and replay. Long-running work resumes by session_id against pinned pack + snapshot.

Compile → Plan → Critic-verify → Execute → Critic-score → Consolidate

The harness architecture

Five planes keep agent systems understandable

ContextOS keeps the model inside a larger execution environment. Intelligence stores what the system can know. Context chooses what this request may use. Decision turns context into a bounded plan. Action mediates external effects. Trust governs the other four with policy, evaluation, replay, and improvement.

One signed RunContext across the lifecycle

run_id, trace_id, session_id, tenant_id, user delegation, agent workload identity, safety_mode, and run_budget travel through every contract. Pressure signals flow alongside as Functional State.
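The page does not say how a RunContext is signed. As one plausible sketch, the fields could be serialized canonically and authenticated with HMAC-SHA256 from Python's standard library; the key handling, field values, and function names below are assumptions for illustration.

```python
import hashlib
import hmac
import json


def canonical_bytes(run_context: dict) -> bytes:
    """Serialize with sorted keys so the same fields always sign the same bytes."""
    return json.dumps(run_context, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_run_context(run_context: dict, key: bytes) -> str:
    return hmac.new(key, canonical_bytes(run_context), hashlib.sha256).hexdigest()


def verify_run_context(run_context: dict, signature: str, key: bytes) -> bool:
    expected = sign_run_context(run_context, key)
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(expected, signature)


key = b"demo-key-not-for-production"   # hypothetical; real keys belong in a secret store
run_context = {
    "run_id": "run_2026_05_09_7f3a",
    "trace_id": "otel-7f3a",
    "session_id": "sess-01",           # hypothetical value
    "tenant_id": "tenant-a",           # hypothetical value
    "safety_mode": "standard",         # hypothetical value
    "run_budget": {"max_steps": 20},   # hypothetical shape
}
signature = sign_run_context(run_context, key)

tampered = dict(run_context, tenant_id="tenant-b")
```

Each contract that receives the context can call verify_run_context before trusting it; a changed tenant_id, as in tampered, no longer matches the signature.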

Risk is typed before a tool is callable

Every capability is classified on two axes before it can be called. Risk: read_only · local_write · network · delegated · destructive. Kind: observe · recall · think_support · act · verify. Together they bound every capability.
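The two axes map naturally onto enums. The sketch below assumes the risk labels are listed from least to most risky and that an approval mode acts as a ceiling on risk; both are assumptions, as are the capability names.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Risk(IntEnum):
    # Assumption: the listed order runs from least to most risky.
    READ_ONLY = 0
    LOCAL_WRITE = 1
    NETWORK = 2
    DELEGATED = 3
    DESTRUCTIVE = 4


class Kind(Enum):
    OBSERVE = "observe"
    RECALL = "recall"
    THINK_SUPPORT = "think_support"
    ACT = "act"
    VERIFY = "verify"


@dataclass(frozen=True)
class Capability:
    name: str
    risk: Risk
    kind: Kind


def is_callable(capability: Capability, approval_mode: Risk) -> bool:
    """Hypothetical rule: a tool is callable only up to the approved risk tier."""
    return capability.risk <= approval_mode


fetch_url = Capability("fetch_url", Risk.NETWORK, Kind.OBSERVE)       # hypothetical tool
drop_table = Capability("drop_table", Risk.DESTRUCTIVE, Kind.ACT)     # hypothetical tool

allowed = [c.name for c in (fetch_url, drop_table) if is_callable(c, Risk.NETWORK)]
```

Because the check is a comparison on typed values, it can run in the policy layer before the tool is ever offered to the model.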

One Tool Gateway for every external effect

MCP, A2A, OpenAPI, custom adapters, chat, email, voice, API, webhook, and SMS all use one envelope shape, one policy boundary, and OTEL trace propagation end-to-end.

Replay is a property of the harness, not a log-search exercise. Pinned pack + pinned snapshot + recorded request reproduces the verdict, and the Critic re-derives its decision without re-executing tools.
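A rough sketch of replay without re-execution: the Critic is run again over recorded inputs, and tool results are read from the recording rather than produced by new calls. The bundle layout, the function names, and the toy Critic rule are hypothetical.

```python
from typing import Callable, Dict, List


def replay_verdict(
    bundle: Dict,
    critic: Callable[[str, List[str], List[Dict]], str],
) -> Dict:
    """Re-derive the Critic verdict from a recorded bundle.

    Tool results come from the recording; nothing in this function calls a tool.
    """
    verdict = critic(bundle["request"], bundle["evidence_refs"], bundle["tool_results"])
    return {
        "pack_version": bundle["pack_version"],
        "snapshot_id": bundle["snapshot_id"],
        "verdict": verdict,
        "matches_original": verdict == bundle["recorded_verdict"],
    }


def toy_critic(request: str, evidence_refs: List[str], tool_results: List[Dict]) -> str:
    """Toy rule: pass only if evidence is cited and every recorded tool result is ok."""
    all_ok = all(result["status"] == "ok" for result in tool_results)
    return "pass" if evidence_refs and all_ok else "fail"


bundle = {
    "pack_version": 42,
    "snapshot_id": "snap-9",                      # hypothetical ID
    "request": "refund order 1234",               # hypothetical request
    "evidence_refs": ["ev-001"],
    "tool_results": [{"tool": "lookup_order", "status": "ok"}],
    "recorded_verdict": "pass",
}
replayed = replay_verdict(bundle, toy_critic)
```

If matches_original comes back false, either the Critic changed or the bundle was not fully pinned, and both are worth investigating.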

Figure: Five-plane architecture. RunContext crosses Intelligence, Context, Decision, Action, and Trust so one execution vocabulary survives from request to replay.
Five layers
01 Intelligence: Stable facts, identity, evidence, and memory.
02 Context: Per-request compilation and budgeted context.
03 Decision: Bounded planning, execution, and critique.
04 Action: Governed tools, adapters, and external effects.
05 Trust: Policy, audit, replay, and evaluator control.

The five planes

The operating model behind the runtime

Each plane owns a clear responsibility, with spec docs for interfaces, failure modes, operational concerns, and evaluation metrics. That keeps harness engineering concrete enough for platform teams to build, review, and operate.

Substrate of meaning

Intelligence plane

Ontology · Identity Layer · Knowledge Graph · Promotion-aware Memory with Consent Records.

Per-request compilation

Context plane

Context Pack · ContextPackCompiler · Token Budget Allocator · Runtime Controls · Functional State.

Bounded execution loop

Decision plane

Cognitive loop · Planner / Executor / Critic · Subagent lanes · Durable Background Sessions · Decision Catalog.

Governed external effects

Action plane

Tool Gateway · MCP / A2A / OpenAPI · Capability Classification · Skills · Approval-mode tiers · Sandbox Profiles.

Control over the other four

Trust plane

Policy outside agent code · evaluators · OTEL traces · Replay · Improvement Loop.

Improvement loop

Corrections become release-gated harness upgrades

Prompt edits are not an improvement loop. ContextOS turns failed runs, operator corrections, evaluator regressions, and pressure signals into typed proposals. Nothing auto-applies: every proposed change runs through replay, review, and the same promotion path as packs and policies.
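One way to picture "nothing auto-applies" is a proposal object that can only move forward one gate at a time. The gate names come from the text; the Proposal class, its fields, and the strict ordering are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

GATES = ("replay", "review", "release")   # assumed to run in this order


@dataclass
class Proposal:
    kind: str                               # e.g. "prompt", "retrieval", "budget"
    summary: str
    passed: List[str] = field(default_factory=list)

    @property
    def state(self) -> str:
        """A proposal is promoted only after every gate has been passed."""
        return "promoted" if len(self.passed) == len(GATES) else "proposed"

    def pass_gate(self, gate: str) -> None:
        """Record a passed gate; skipping ahead or repeating a gate is rejected."""
        if len(self.passed) == len(GATES):
            raise ValueError("proposal is already promoted")
        expected = GATES[len(self.passed)]
        if gate != expected:
            raise ValueError(f"expected gate {expected!r}, got {gate!r}")
        self.passed.append(gate)


proposal = Proposal(kind="retrieval", summary="raise top_k for policy questions")
state_before = proposal.state
for gate in GATES:
    proposal.pass_gate(gate)
state_after = proposal.state
```

A proposal created from an operator correction therefore starts as "proposed" and cannot reach "promoted" by skipping replay.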

Insight Synthesizer

Surfaces recurring patterns across runs as typed Insight records — failure clusters, blocked decisions, common detours.

Strategy Compiler

Converts validated feedback into reusable strategy rules at the right runtime layer (prompt, planner, retrieval, tool selection, memory recall, budget).

Feedback Store

First-class, append-only store for operator corrections and tips with provenance — cited by future improvements.

Chief-of-Staff

Scans due tasks, open loops, queue backlog, and gate latency to propose proactive operational notes.

Research Queue

Enqueues autonomous research tasks against scoped knowledge gaps; produces typed Knowledge Patches for review.

Autotune

Proposes prompt, retrieval, or budget changes against a target evaluator metric. Gated by golden replay before review.

Trust-grade primitives

Governance belongs in the runtime, not in a prompt

Sandbox profiles, consent records, functional state, durable sessions, and tool approvals are typed primitives. They compose with the canonical contract so trust has the same shape across every run, every tenant, and every integration channel.

Sandbox profiles

Typed, signed contracts for code execution. Pinned image, no host mount, no inbound network, hard caps. Bound per capability.

Consent records

First-class, append-only consent gating PII memory promotion at the candidate stage. Revocation honored on next recall.
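A small sketch of append-only consent with revocation that takes effect on the next recall. The ConsentLog class, the event tuple layout, and the memory dictionaries are hypothetical, and the sketch filters at recall time only; it does not model the candidate-stage promotion gate.

```python
from typing import Dict, List, Tuple


class ConsentLog:
    """Append-only: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[Tuple[str, str, str]] = []   # (subject, purpose, action)

    def append(self, subject: str, purpose: str, action: str) -> None:
        if action not in ("grant", "revoke"):
            raise ValueError(f"unknown action: {action!r}")
        self._events.append((subject, purpose, action))

    def is_granted(self, subject: str, purpose: str) -> bool:
        """The latest matching event wins; no event means no consent."""
        granted = False
        for event_subject, event_purpose, action in self._events:
            if event_subject == subject and event_purpose == purpose:
                granted = action == "grant"
        return granted


def recall(memories: List[Dict], log: ConsentLog) -> List[Dict]:
    """Return non-PII memories plus PII memories that still have consent."""
    return [
        m for m in memories
        if not m["pii"] or log.is_granted(m["subject"], m["purpose"])
    ]


log = ConsentLog()
memories = [
    {"text": "prefers email", "pii": True, "subject": "user-1", "purpose": "support"},
    {"text": "product is v2", "pii": False, "subject": "user-1", "purpose": "support"},
]
log.append("user-1", "support", "grant")
before_revocation = [m["text"] for m in recall(memories, log)]
log.append("user-1", "support", "revoke")
after_revocation = [m["text"] for m in recall(memories, log)]
```

Because consent is evaluated at read time against the full log, a revocation needs no cleanup job to become effective.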

Functional state

Typed per-turn pressure signals (budget / loop / evidence / gate / conflict) read by the Critic — never by the model.

Background sessions

Hour-scale durable runs with checkpoint after every Critic verdict. Resume by session_id against pinned pack + snapshot.
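The checkpoint-and-resume behavior can be sketched with an in-memory store standing in for durable storage. SessionStore, run_session, the stop_after interruption switch, and the toy verdicts are all hypothetical.

```python
from typing import Dict, List, Optional


class SessionStore:
    """In-memory stand-in for durable checkpoint storage."""

    def __init__(self) -> None:
        self._checkpoints: Dict[str, dict] = {}

    def checkpoint(self, session_id: str, pack_version: int, snapshot_id: str,
                   next_step: int, verdicts: List[str]) -> None:
        self._checkpoints[session_id] = {
            "pack_version": pack_version,
            "snapshot_id": snapshot_id,
            "next_step": next_step,
            "verdicts": list(verdicts),   # copy so later appends do not alter it
        }

    def resume(self, session_id: str) -> Optional[dict]:
        return self._checkpoints.get(session_id)


def run_session(store: SessionStore, session_id: str, steps: List[str],
                pack_version: int, snapshot_id: str,
                stop_after: Optional[int] = None) -> List[str]:
    """Run steps, checkpointing after each verdict; resume if a checkpoint exists."""
    saved = store.resume(session_id)
    if saved is None:
        start, verdicts = 0, []
    else:
        if (saved["pack_version"], saved["snapshot_id"]) != (pack_version, snapshot_id):
            raise ValueError("resume must use the pinned pack and snapshot")
        start, verdicts = saved["next_step"], list(saved["verdicts"])
    for index in range(start, len(steps)):
        if stop_after is not None and index >= stop_after:
            break                                   # simulate an interruption
        verdicts.append(f"{steps[index]}:pass")     # toy Critic verdict
        store.checkpoint(session_id, pack_version, snapshot_id, index + 1, verdicts)
    return verdicts


store = SessionStore()
steps = ["gather", "draft", "verify"]
interrupted = run_session(store, "sess-1", steps, 42, "snap-9", stop_after=2)
resumed = run_session(store, "sess-1", steps, 42, "snap-9")
```

The resumed run picks up at the third step instead of repeating the first two, and it refuses to continue against a different pack or snapshot.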

MCP, A2A, OpenAPI, custom — one envelope shape; one policy boundary.
Skills compose capabilities + prompt fragments + their own evaluation suites.
User delegation + agent workload identity (SPIFFE-style) on every call.
Semantic discovery — tools resolved from intent, not catalog dumps.
Cached read-only aliases for prompt density; alias cache invalidated on bundle promotion.
Idempotency keys, retries, circuit breakers, dead-letter handling.
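Two of the reliability items in the list above, idempotency keys and circuit breakers, can be sketched in a few lines. GatewaySketch is a hypothetical class, not the Tool Gateway's real interface, and retries and dead-letter handling are left out to keep it short.

```python
from typing import Any, Callable, Dict


class GatewaySketch:
    """Toy gateway: caches results by idempotency key and trips after repeated failures."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self._results: Dict[str, Any] = {}
        self._consecutive_failures = 0
        self._failure_threshold = failure_threshold

    @property
    def circuit_open(self) -> bool:
        return self._consecutive_failures >= self._failure_threshold

    def call(self, idempotency_key: str, tool: Callable[[], Any]) -> Any:
        # A repeated key returns the stored result instead of repeating the effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        if self.circuit_open:
            raise RuntimeError("circuit open: tool calls are suspended")
        try:
            result = tool()
        except Exception:
            self._consecutive_failures += 1
            raise
        self._consecutive_failures = 0
        self._results[idempotency_key] = result
        return result


gateway = GatewaySketch(failure_threshold=2)
sent = []


def send_email() -> str:
    """Hypothetical side-effecting tool."""
    sent.append("welcome")
    return "sent"


first = gateway.call("email:user-1:welcome", send_email)
second = gateway.call("email:user-1:welcome", send_email)   # served from the key cache
```

With the same key, the email is sent once even though the call is made twice, which is what makes retries safe for side-effecting tools.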

Before ContextOS

Agent prototypes depend on prompt discipline, scattered tool wrappers, and manual review. When something goes wrong, teams reconstruct the run from logs and memory.

With ContextOS

The harness compiles the context, enforces policy, gates tools, validates output, emits a DecisionRecord, and routes failures into a typed improvement loop.

Operational result

Teams can answer what the agent saw, which policy allowed it, which tool it called, why the verdict passed, how to replay it, and what change would prevent a repeat failure.

Make the agent run explainable before it acts.

Start with the harness: bounded context, governed loops, typed decisions, replay, and release-gated improvement by default.