Harness Engineering

The discipline of building the controlled execution environment around AI agents — how every ContextOS primitive composes into a governed, observable, reversible, searchable runtime.

Foundational Spec. Last reviewed: 2026-05-09

At a glance

Planes: Trust, Context, Decision, Action, Intelligence. The Trust plane is the control surface over the other four.

The cross-plane discipline of building a controlled execution environment around the model — every primitive composes into a governed, observable, replayable runtime.

Inputs

Intent / task contract from product surface or upstream agent
RunContext (user, agent, tenant, claims, budget, safety mode)
Pinned Context Pack / policy bundle / tool registry versions

Outputs

CompiledContext per request
DecisionRecord per turn (with evidence_refs, approvals, controls_active, trace_id)
OTEL trace bundle per run
Scorecard per run on the five evaluators
Improvement-loop proposals per pattern

Canonical types

Harness
CompiledContext
DecisionRecord
RunContext
ContextPack

Harness Engineering is the cross-plane discipline that ContextOS encodes. It is the practice of building the controlled execution environment around an AI agent — the code that decides what context the model sees, which tools it can call, what it must justify, what gets logged, what gets rolled back, and how all of that improves over time.

Traditional software engineering asks, "How do we write good code?" Harness Engineering asks, "How do we create a governed system where agents can understand context, use tools, validate outputs, observe failures, recover safely, and continuously improve?"

The model reasons. The harness governs execution.

Without a harness, agents behave like clever but inconsistent assistants. With a harness, agents become repeatable, observable, policy-governed execution units. The five planes of ContextOS — Intelligence, Context, Decision, Action, Trust — are the harness, decomposed into primitives that can be specified, versioned, evaluated, and replayed.

Definition

A harness is a stateful program that wraps a language model and determines, at every step, what context it sees, what tools it can call, and what is preserved from the result. Harness Engineering is the discipline of designing, evaluating, and continuously improving that program.

Concretely, a harness is the union of:

Capability	ContextOS primitive
Decide what to put in front of the model	Context Pack compiled into a `CompiledContext`
Decide what the model is allowed to do	Policy Engine + `ApprovalMode` tiers
Decide which external effects are reachable	Adapter Mesh / Tool Gateway
Decide which outputs are acceptable	Evaluators + `DecisionRecord`
Decide what to remember and what to forget	Promotion-aware Memory
Decide how to recover when something goes wrong	Failure Playbooks + replay
Decide how the harness itself improves	Improvement Loop

A ContextOS deployment is not “a model plus a few prompts” — it is a harness with named, typed, governed components. Every guarantee the platform makes is a property of the harness, not of the model.

Why it exists

Single-prompt and “model-plus-tools” architectures decay quickly under enterprise constraints. The bottleneck shifts from model intelligence to execution control:

Can the system ensure that the answer or action is correct, safe, validated, observable, reversible, and compliant?
Can a regression be detected before it ships, and rolled back when it does?
Can the harness improve from its own failures without a human rewriting prompts?

Harness Engineering exists because three things are true at once: model behavior changes (drift, upgrades, new vendors); the world the model acts on changes (data, policies, suppliers); and the cost of an unbounded mistake is high. The only durable response is to externalize control from the model into a versioned, executable harness — and then to engineer that harness as deliberately as the rest of the platform.

The Stanford / MIT / KRAFTON Meta-Harness work (Lee et al., 2026) makes this concrete: changing the harness around a fixed model can produce a 6× swing on the same benchmark. The harness, not just the weights, is the optimization target.

See it running. The discipline below is implemented end-to-end in SecondBrain, an open-source local-first agent OS. Every primitive in this doc (AGENTS.md, the harness/ repo layout, the eight properties, the rollout stages, reviewer profiles) has a concrete code path you can clone, run, and inspect with `make quickstart-docker`.

How it works

Every ContextOS request passes through a harness in the same shape:

intent / task contract
        ↓
[Intelligence plane]   evidence + identity + memory recall
        ↓
[Context plane]        Context Pack → CompiledContext (eight stages)
        ↓
[Decision plane]       Planner → Critic.verify → Executor → Critic.score → Consolidate
        ↓
[Action plane]         Tool Gateway: approval-mode-bound tool calls only
        ↓
[Trust plane]          policy + evidence + audit + telemetry on every step
        ↓
DecisionRecord (evidence_refs, approvals, controls_active, trace_id)

The harness is the whole pipeline plus the control surfaces around it: the registry of tools, the policy bundles, the evaluator suite, the trace store, the replay harness, the review queue, and the improvement-loop primitives that consume their output.
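The pipeline shape above can be sketched as an ordered list of plane functions with a Trust-plane hook on every step. This is a minimal illustration, not the ContextOS runtime: `RunState`, `run_harness`, and the placeholder plane bodies are hypothetical names introduced here. Only the plane order and the DecisionRecord field names come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """Mutable state threaded through the planes (hypothetical shape)."""
    intent: str
    values: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)  # Trust plane: one entry per step


Plane = Callable[[RunState], None]


def intelligence(state: RunState) -> None:
    state.values["evidence_refs"] = ["ev:order:123"]  # placeholder recall


def context(state: RunState) -> None:
    state.values["compiled_context"] = {
        "intent": state.intent,
        "evidence_refs": state.values["evidence_refs"],
    }


def decision(state: RunState) -> None:
    state.values["plan"] = ["lookup_order"]  # placeholder planner output


def action(state: RunState) -> None:
    state.values["tool_results"] = {step: "ok" for step in state.values["plan"]}


PIPELINE = [
    ("intelligence", intelligence),
    ("context", context),
    ("decision", decision),
    ("action", action),
]


def run_harness(intent: str, trace_id: str):
    state = RunState(intent=intent)
    for name, plane in PIPELINE:
        plane(state)
        # The Trust plane observes every step instead of running once at the end.
        state.audit.append({"plane": name, "trace_id": trace_id})
    record = {
        "evidence_refs": state.values["evidence_refs"],
        "approvals": [],
        "controls_active": ["audit"],
        "trace_id": trace_id,
    }
    return record, state.audit


record, audit = run_harness("support.order.lookup", trace_id="trace-001")
```

Keeping the planes as plain functions over one state object makes the order explicit and lets a replay re-run the same list against a pinned input.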

Eight properties every harness must guarantee

Every harness ContextOS specifies must make every agent action:

Property	Mechanism
Context-aware	The agent sees the right task-specific information — no more, no less. Compiled by the Context Pack pipeline.
Policy-governed	The agent cannot violate safety, compliance, or business rules. Enforced by the Policy Engine and `ApprovalMode` tiers.
Tool-controlled	The agent can only use approved tools with declared schemas, capability classes, and side-effect classifications. Enforced by the Tool Gateway.
Validated	Output is checked through tests, evaluators, and rules before it counts as “done.” Provided by Evaluators and the `DecisionRecord` contract.
Observable	Every decision and action is traceable end-to-end. Provided by OTEL-first tracing in Evaluation and Observability.
Reversible	Failures can be rolled back, retracted, or compensated. Provided by replay, reversal tokens, idempotency keys, and the Failure Playbooks typed-verdict map.
Measurable	Quality, cost, latency, safety, and business impact are tracked per intent and per pack version. Provided by the Metrics Glossary.
Continuously improving	Failures upgrade the harness, not just the output. Provided by the Improvement Loop.

These eight properties are not aspirations. Each maps to a typed primitive that must be present, versioned, and replayable for the harness to be considered production-grade.
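One way to make "must be present" checkable is a small readiness check over the pinned primitives. The sketch below assumes a flat mapping from primitive name to pinned version. The key names (`context_pack`, `policy_engine`, and so on) are hypothetical labels for the mechanisms in the table, not ContextOS identifiers.

```python
# Property -> primitive that must be pinned (hypothetical key names).
REQUIRED_PRIMITIVES = {
    "context_aware": "context_pack",
    "policy_governed": "policy_engine",
    "tool_controlled": "tool_gateway",
    "validated": "evaluators",
    "observable": "tracing",
    "reversible": "failure_playbooks",
    "measurable": "metrics",
    "continuously_improving": "improvement_loop",
}


def unmet_properties(pinned: dict) -> list:
    """Return the properties whose backing primitive has no pinned version."""
    return [
        prop
        for prop, primitive in REQUIRED_PRIMITIVES.items()
        if not pinned.get(primitive)
    ]


pinned_versions = {name: "1.0.0" for name in REQUIRED_PRIMITIVES.values()}
del pinned_versions["tracing"]
gaps = unmet_properties(pinned_versions)
```

Here `gaps` lists `observable`, because the tracing primitive has no pinned version.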

The reference architecture

                  Human / Product / Architect
                            |
                            v
                    Intent / Task Contract
                            |
                            v
+-------------------------------------------------------+
|                 Harness Control Plane                 |
|-------------------------------------------------------|
| Context Builder | Policy Engine | Tool Registry       |
| Eval Engine     | Review Agents | Approval Gates      |
| Memory Layer    | Observability | Rollback Manager    |
| Improvement Loop                                      |
+-------------------------------------------------------+
                            |
                            v
                    Agent Execution Runtime
                            |
         +------------------+------------------+
         |                  |                  |
         v                  v                  v
   Read tools        Reversible writes    Sensitive actions
   GraphRAG          Memory updates       Payment / booking
   Lookup            Draft saves          / cancellation
                            |
                            v
                Validation + Telemetry + Feedback
                            |
                            v
                    DecisionRecord + Replay

The control plane is the long-lived, governed surface; the runtime is what executes per request; tools are the only path to external effect; and every step is observable and reversible by design.

The harness as a search target

A subtle but load-bearing claim: the harness itself is a target of optimization, not just an output of design. ContextOS treats every harness component — pack version, policy bundle, evaluator suite, planner skill, retrieval rule — as a versioned artifact that lives in change control alongside the code.

The Improvement Loop makes this concrete. Operator corrections, failed runs, and approval-gate overrides become typed proposals. Proposals run against golden replays. Proposals that pass scorecard guardrails are promoted by an approver. Over time, the harness becomes the cumulative product of validated change — not a static prompt that ages.

This framing aligns with the Meta-Harness result: with full access to prior code, scores, and execution traces, an outer-loop search can discover better harnesses than hand engineering. ContextOS does not require an autonomous outer-loop optimizer, but it ships the substrate one needs: an experience store with code + scores + traces, a Pareto-frontier evaluator (Policy / Utility / Latency / Safety / Economics), and a release-gated promotion path.
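A Pareto-frontier evaluator over the five axes can be sketched in a few lines. The sketch assumes that policy, utility, and safety are higher-is-better and that latency and cost are lower-is-better. The `Scorecard` class, the candidate names, and the scores are illustrative values, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    policy: float      # higher is better
    utility: float     # higher is better
    latency_ms: float  # lower is better
    safety: float      # higher is better
    cost_usd: float    # lower is better (the Economics axis)

    def as_maximized(self) -> tuple:
        # Negate lower-is-better axes so every axis reads "higher is better".
        return (self.policy, self.utility, -self.latency_ms,
                self.safety, -self.cost_usd)


def dominates(a: Scorecard, b: Scorecard) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one."""
    pa, pb = a.as_maximized(), b.as_maximized()
    no_worse = all(x >= y for x, y in zip(pa, pb))
    strictly_better = any(x > y for x, y in zip(pa, pb))
    return no_worse and strictly_better


def pareto_front(candidates: dict) -> list:
    """Names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, card in candidates.items()
        if not any(
            dominates(other, card)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


candidates = {
    "pack-a": Scorecard(1.0, 0.90, 1800, 1.0, 0.0090),
    "pack-b": Scorecard(1.0, 0.94, 1830, 1.0, 0.0091),
    "pack-c": Scorecard(1.0, 0.90, 2000, 1.0, 0.0100),
}
front = pareto_front(candidates)
```

`pack-c` drops out because `pack-a` is at least as good on every axis. `pack-a` and `pack-b` both stay on the frontier because they trade utility against latency and cost, which is the choice an approver still has to make.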

The five planes, viewed as a harness

Every plane is a slice of the harness with its own ownership, contract, and failure modes:

Plane	What it owns in the harness
Intelligence	The substrate the harness retrieves from. Ontology, knowledge graph, memory, identity.
Context	The compiler that turns intent + evidence + memory into a `CompiledContext` the model can act on.
Decision	The bounded loop (Planner → Critic.verify → Executor → Critic.score → Consolidate) that converts a `CompiledContext` into a `DecisionRecord`.
Action	The Tool Gateway that is the only path to external effect. Every tool is typed, owned, side-effect-classified, and approval-mode-bound.
Trust	The control surface over the other four. Policy, evaluation, observability, improvement, governance.

Cross-cutting primitives (RunContext, RunBudget, ApprovalMode, ContextPack, CompiledContext, DecisionRecord, ToolEnvelope) are the typed seams between planes. They are the harness’s wire format.
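As a rough illustration of what typed seams look like, the sketch below models three of these primitives as immutable Python dataclasses. The field lists for `RunContext` and `DecisionRecord` follow the names used on this page. The `RunBudget` fields and the two `ApprovalMode` members shown are assumptions, and the real tier list may be longer.

```python
from dataclasses import dataclass, FrozenInstanceError
from enum import Enum


class ApprovalMode(Enum):
    # Only the two tiers that appear in this page's example; others are omitted.
    READ_ONLY = "read_only"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class RunBudget:
    max_tokens: int       # hypothetical field
    max_tool_calls: int   # hypothetical field


@dataclass(frozen=True)
class RunContext:
    user: str
    agent: str
    tenant: str
    claims: tuple
    budget: RunBudget
    safety_mode: str


@dataclass(frozen=True)
class DecisionRecord:
    evidence_refs: tuple
    approvals: tuple
    controls_active: tuple
    trace_id: str


ctx = RunContext(
    user="u_1",
    agent="support-agent",
    tenant="acme",
    claims=("refund:request",),
    budget=RunBudget(max_tokens=20_000, max_tool_calls=8),
    safety_mode="strict",
)

try:
    ctx.tenant = "other"  # seams are immutable once a run starts
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the types means a plane cannot quietly rewrite what an earlier plane handed it, which keeps replay comparisons meaningful.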

Repo-local harness layout

A ContextOS-aligned engineering repo should expose its harness on disk so that humans and agents can both navigate it:

repo/
  AGENTS.md                       # primary navigation for agents
  ARCHITECTURE.md
  SECURITY.md
  RELIABILITY.md
 
  docs/
    decision-records/             # ADRs, dated, append-only
    execution-plans/
      active/
      completed/
    runbooks/
    known-limitations/
 
  harness/
    packs/                        # Context Pack bundles, versioned
    policies/                     # JsonLogic / DSL bundles
    tools/                        # Tool manifests with schema + ownership
    evals/                        # golden sets, scenario rubrics
    fixtures/                     # synthetic scenarios for simulation
    validators/                   # interface tests run before full eval
    observability/                # trace schemas, span attribute conventions
    reviewers/                    # review-agent skills + rubrics
    skills/                       # planner / executor / critic skills
    feedback/                     # captured corrections + lineage

Two principles govern this layout:

Anything the agent must follow is visible, versioned, and machine-checkable — close to the code, not in a wiki.
The proposer can always find prior experience by reading the filesystem — every prior run leaves a directory containing code, scores, and traces. This is the substrate the Improvement Loop consumes, and it is the same substrate Meta-Harness-style outer loops require.
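The second principle implies that prior experience can be loaded with nothing more than a directory walk. The sketch below assumes a hypothetical per-run layout (`<run_id>/scores.json` plus a `traces/` directory). The page does not specify file names, so treat them as placeholders.

```python
import json
import tempfile
from pathlib import Path


def load_experience(root: Path) -> list:
    """Collect one summary per prior run directory that has recorded scores."""
    runs = []
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        scores_file = run_dir / "scores.json"
        if not scores_file.exists():
            continue  # incomplete run: nothing to learn from yet
        runs.append({
            "run_id": run_dir.name,
            "scores": json.loads(scores_file.read_text()),
            "has_traces": (run_dir / "traces").is_dir(),
        })
    return runs


# Usage: build a throwaway store with one complete run and one incomplete run.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run-001" / "traces").mkdir(parents=True)
    (root / "run-001" / "scores.json").write_text(json.dumps({"policy": 1.0}))
    (root / "run-002").mkdir()
    experience = load_experience(root)
```

Reading the raw directory keeps scores and traces together for the proposer, rather than reducing them to a summary.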

A minimal AGENTS.md is a navigation file, not a prompt dump:

# AGENTS.md
 
You are operating inside this engineering repository.
 
Start here:
1. Read ARCHITECTURE.md for system boundaries.
2. Read SECURITY.md before touching PII, auth, payment, or sensitive data.
3. Read RELIABILITY.md before changing runtime flows.
4. Read docs/execution-plans/active/ for active work.
5. Run `make verify` before opening a PR.
6. Do not bypass policy gates.
7. Every change must include tests, telemetry, and rollback notes.
 
Harness layout:
- harness/packs/        — Context Pack bundles (versioned)
- harness/policies/     — policy bundles; JsonLogic; do not edit without ADR
- harness/tools/        — tool manifests; schema-checked in CI
- harness/evals/        — golden sets; replay before any pack change
- harness/reviewers/    — reviewer-agent skills; one per concern

The Meta-Harness paper finds that a short, well-written skill that constrains outputs and names forbidden actions, while leaving the proposer free to inspect anything, is the strongest lever on harness search quality. AGENTS.md plays the same role for any coding agent operating in the repo.

Feature-flagged harness rollout

A new harness component (a pack version, a policy change, a tool addition, a planner skill) is never released to all traffic at once. The canonical rollout stages:

Stage	What is true	Gate to advance
`0%_shadow`	Runs in parallel; emits scorecard but does not affect outcome	Scorecard delta within bounds for ≥ N runs
`1%_internal`	Internal users only; full telemetry	No regression on safety / policy guardrails
`5%_low_risk`	Limited cohort; low-blast-radius intents only	Adoption rate of corrections trending down
`25%_monitored`	Broader rollout; tail-based sampling on every escalation	Evaluator scorecards stable across cohorts
`100%`	Full rollout; previous version pinned for replay	Two-week canary period clean

Every stage carries a kill switch: a single control-plane operation that pins the prior version and re-routes traffic to it. Replay against the pinned snapshot must reproduce the prior DecisionRecord byte-for-byte; this is the contract that makes rollback meaningful.
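The stage ladder, the kill switch, and the byte-for-byte replay check can be sketched together. The stage names come from the table above. The `Rollout` class, its method names, and the choice of canonical JSON (sorted keys, fixed separators) as the byte representation are assumptions made for this illustration.

```python
import json
from dataclasses import dataclass

STAGES = ["0%_shadow", "1%_internal", "5%_low_risk", "25%_monitored", "100%"]


@dataclass
class Rollout:
    candidate: str
    prior: str
    stage_index: int = 0
    killed: bool = False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self, gate_passed: bool) -> str:
        """Move one stage forward only when the current gate has passed."""
        if self.killed:
            raise RuntimeError("rollout was killed; start a new rollout")
        if gate_passed and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.stage

    def kill(self) -> str:
        """Kill switch: pin the prior version and route traffic back to it."""
        self.killed = True
        return self.prior

    def live_version(self) -> str:
        # Version that can affect outcomes for the enrolled cohort.
        # In shadow the candidate only emits scorecards, so the prior stays live.
        if self.killed or self.stage == "0%_shadow":
            return self.prior
        return self.candidate


def canonical_bytes(record: dict) -> bytes:
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def replay_matches(original: dict, replayed: dict) -> bool:
    return canonical_bytes(original) == canonical_bytes(replayed)


rollout = Rollout(candidate="pack-5.2.0", prior="pack-5.1.0")
rollout.advance(gate_passed=True)   # shadow -> 1%_internal
rollout.advance(gate_passed=False)  # gate failed: stays at 1%_internal
pinned = rollout.kill()
```

A byte-level comparison only works if serialization is deterministic, which is why the sketch fixes key order and separators before encoding.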

Reviewer agents

Specialized review agents inspect every proposed change before it reaches a human approver. They are not a replacement for human judgment; they are a way to make humans focus on judgment instead of repetitive checks. The full taxonomy is documented in Reviewer Agents; the seven canonical roles are:

Reviewer	Concern
Architecture	Layering, plane boundaries, dependency direction
Security	PII, secrets, auth, injection, sandbox profile
Reliability	Timeouts, retries, fallbacks, idempotency, rollback
Product	User experience, edge cases, intent fidelity
Data	Event schema, evidence coverage, analytics impact
Cost	Tokens, infra cost, retrieval cost, run-budget headroom
Compliance	Audit, consent, regulated actions, ApprovalMode binding

Reviewer output is a typed envelope (status, findings, severity, policy_id, recommendation) that lands in the same change-control queue as Improvement Loop proposals.
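A typed envelope with these five fields might look like the sketch below. The field names come from the sentence above. The extra `reviewer` field, the status values (`pass`, `block`), and the severity labels are hypothetical, since the page does not enumerate them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewEnvelope:
    reviewer: str                 # hypothetical extra field: which role emitted it
    status: str                   # assumed values: "pass" or "block"
    findings: tuple
    severity: str                 # assumed values: "info", "low", "medium", "high"
    policy_id: Optional[str]
    recommendation: str


def blocking_reviewers(envelopes: list) -> list:
    """Reviewer roles whose findings must be resolved before human approval."""
    return sorted(e.reviewer for e in envelopes if e.status == "block")


envelopes = [
    ReviewEnvelope("security", "pass", (), "info", None, "no action"),
    ReviewEnvelope(
        "cost", "block", ("run budget exceeded on replay",), "high",
        "POL_BUDGET_HEADROOM",  # hypothetical policy id
        "reduce retrieval fan-out",
    ),
]
blocked_by = blocking_reviewers(envelopes)
```

Because every reviewer emits the same shape, the change-control queue can sort and filter findings without knowing which role produced them.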

Interfaces

Inputs

Intent / task contract (from product surface or upstream agent)
RunContext (user, agent, tenant, claims, budget, safety mode)
Context Pack version pin
Policy bundle version pin
Tool registry version pin

Outputs

CompiledContext (per request, by the Context Pack compiler)
DecisionRecord (per turn, with evidence_refs, approvals, controls_active, trace_id)
OTEL trace bundle (per run)
Scorecard (per run, on the five evaluators)
Improvement-loop proposals (per pattern, typed)

Versioned artifacts

Context Pack bundle (packs/)
Policy bundle (policies/)
Tool manifest (tools/)
Evaluator suite + golden sets (evals/)
Planner / Executor / Critic skills (skills/)
Reviewer skills (reviewers/)

Failure modes

One giant prompt file. Becomes stale and ignored; the harness loses any property that was supposed to live in it.
No tool schema. Agents misuse tools; the Tool Gateway cannot enforce side-effect classification or approval binding.
No policy engine. Unsafe actions reach production; “the model knows the rules” is never a safety property.
Only manual review. Does not scale; the review queue grows with every additional intent.
Only happy-path evals. Real-world failures escape because the golden set captures success, not the long tail of edge cases.
No telemetry. Failures cannot be debugged; there is nothing for the Improvement Loop to consume.
No feature flags. Rollout becomes a binary cutover; a bad change hits 100% before it can be detected.
No rollback. A small issue becomes an incident because the prior version is not pinned for replay.
No domain ontology. Agent output remains generic; evidence is unbound; replay is non-deterministic.
No ownership. The harness decays — no team owns the policy bundle, the eval set, or the tool registry, so they age silently.
Compressed feedback to the optimizer. The Improvement Loop or any outer-loop search reads only summaries and scalar scores, losing the diagnostic signal needed to identify confounds (Meta-Harness Section 4.1 ablation).
Search-set / test-set leakage. Goldens and scenarios used for tuning are also used for release gating; scorecards stop being honest.
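The leakage failure in the last item can be ruled out structurally by assigning each scenario to exactly one set from a stable hash of its identifier. This is one possible technique, not necessarily the one ContextOS uses. The function name, the 20% test fraction, and the scenario ids are assumptions.

```python
import hashlib


def split_of(scenario_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a scenario to 'search' or 'test'."""
    digest = hashlib.sha256(scenario_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "search"


scenario_ids = [f"scenario-{i:03d}" for i in range(200)]
search_set = {s for s in scenario_ids if split_of(s) == "search"}
test_set = {s for s in scenario_ids if split_of(s) == "test"}
```

Because the assignment depends only on the id, adding new scenarios never moves an existing one across the boundary.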

Operational concerns

Pack, policy, tool, and eval bundles are versioned independently; a release is a tuple of pinned versions, not a single SHA.
Every long-running session has a kill switch wired to the rollback manager; the kill switch operates on the pinned-version tuple, not on individual primitives.
Trace retention bands track data classification: traces touching regulated data follow the data’s retention rule, not the platform default.
Run-budget headroom is measured per intent; intents trending toward budget exhaustion trigger an Autotune proposal before they fail under load.
The reviewer-agent suite is itself a harness component; reviewer skills are versioned and golden-set-evaluated like any other.
Search-set and test-set are kept disjoint by construction; the proposer (human or automated) never sees test-set scores during iteration.
Lightweight interface validation runs before any full evaluation: import the candidate, instantiate, call its public methods on a tiny fixture, and require it to pass. Most malformed candidates die in seconds.
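The interface-validation step in the last item can be sketched with the standard `inspect` module. The sketch assumes, for simplicity, that every public method of a candidate takes a single fixture argument. The candidate classes and the fixture are hypothetical.

```python
import inspect


def validate_interface(candidate_cls, fixture) -> list:
    """Instantiate the candidate and call each public method on a tiny fixture.

    Returns a list of error strings; an empty list means the candidate passed.
    """
    try:
        instance = candidate_cls()
    except Exception as exc:
        return [f"instantiate: {exc!r}"]

    errors = []
    for name, method in inspect.getmembers(instance, predicate=inspect.ismethod):
        if name.startswith("_"):
            continue  # only the public surface is checked
        try:
            method(fixture)
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
    return errors


class GoodCandidate:
    def compile(self, fixture):
        return {"intent": fixture["intent"]}


class BrokenCandidate:
    def compile(self, fixture):
        return fixture["missing_key"]  # raises KeyError on the fixture


FIXTURE = {"intent": "support.refund.execute"}
good_errors = validate_interface(GoodCandidate, FIXTURE)
broken_errors = validate_interface(BrokenCandidate, FIXTURE)
```

The check costs one instantiation and a few calls, so it can run before any full evaluation is scheduled.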

Evaluation metrics

Engineering metrics for the harness itself:

Metric	Why it matters
Pack adoption rate	Fraction of intents on the latest pack version
Eval pass rate	Fraction of candidates clearing the release gate on first submission
Policy violation rate	Per-intent, per-tier; should trend to zero as the harness matures
Human intervention rate	Fraction of decisions that crossed a human approval gate
Mean time to repair	Time from regression detection to released fix
Change failure rate	Fraction of released proposals retired within 90 days
Rollback rate	Fraction of releases that triggered a kill switch
Cost per verified success	Tokens + tool calls + infra, scoped to runs that passed the scorecard
Lead time reduction	Reduction in time from operator correction to released StrategyRule

Product metrics carried by the harness’s DecisionRecord and replay log are documented in the Metrics Glossary.
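Three of the run-level metrics in the table can be computed from a list of run summaries. The record keys below (`cost_usd`, `passed_scorecard`, `human_gate`, `policy_violation`) and the sample values are hypothetical. "Cost per verified success" is read here as the total cost of runs that passed the scorecard, divided by the number of such runs.

```python
def harness_metrics(runs: list) -> dict:
    """Compute run-level harness metrics from per-run summaries."""
    total = len(runs)
    verified = [r for r in runs if r["passed_scorecard"]]
    verified_cost = sum(r["cost_usd"] for r in verified)
    return {
        "human_intervention_rate":
            sum(1 for r in runs if r["human_gate"]) / total if total else 0.0,
        "policy_violation_rate":
            sum(1 for r in runs if r["policy_violation"]) / total if total else 0.0,
        # Undefined when no run passed, so report None rather than 0.
        "cost_per_verified_success":
            verified_cost / len(verified) if verified else None,
    }


runs = [
    {"cost_usd": 0.01, "passed_scorecard": True, "human_gate": False, "policy_violation": False},
    {"cost_usd": 0.03, "passed_scorecard": True, "human_gate": True, "policy_violation": False},
    {"cost_usd": 0.05, "passed_scorecard": False, "human_gate": False, "policy_violation": True},
    {"cost_usd": 0.02, "passed_scorecard": False, "human_gate": False, "policy_violation": False},
]
metrics = harness_metrics(runs)
```

Under this reading, failed runs contribute nothing to the numerator or the denominator. A team that wants failed attempts to count against the metric would divide the total cost of all runs by the number of verified runs instead.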

Example

A condensed end-to-end harness flow for a sensitive action:

1. Intent: "support.refund.execute" arrives with RunContext.
2. Context plane: Context Pack v5.2.0 compiles a CompiledContext (8 stages).
   - intelligence_refs:  customer + order + prior corrections
   - policy_layer:       ApprovalMode=destructive, requires evidence(refund_window)
   - tooling_layer:      adp_payments.issue_refund (destructive), adp_policy.eval (read_only)
   - evaluation_layer:   scorecard target { policy>=1.0, safety>=1.0 }
3. Decision plane: Planner proposes [eval, refund]; Critic.verify checks
   evidence coverage and ApprovalMode binding before execution.
4. Trust plane: Policy Engine evaluates JsonLogic bundle; rule
   R_HIGH_VALUE_REQUIRES_APPROVAL requires approval gate GATE_FINANCE_APPROVAL.
5. Action plane: Tool Gateway blocks adp_payments.issue_refund pending approval token.
6. Reviewer agents (security, compliance, cost) emit findings; all clear.
7. Approver signs the gate; reversal token issued alongside the refund call.
8. DecisionRecord written: status=COMPLETED, evidence_refs=[...],
   approvals=[GATE_FINANCE_APPROVAL/op_22], reversal_token=rev_x7y, trace_id=...
9. Scorecard: policy=1.00, utility=0.94, safety=1.00, latency=1830ms, cost=$0.0091.
10. Replay against pack v5.2.0 reproduces the DecisionRecord exactly.

The same intent on pack v5.1.0 would produce a different DecisionRecord only where v5.2.0’s policy bundle changed; replay isolates the cause.
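Steps 4 through 8 of the flow can be sketched as a gateway that refuses a gated tool call until the approval is recorded. The tool, rule, and gate names come from the example above. The `ToolGateway` class, the refund threshold, and the plain-Python predicate standing in for the JsonLogic rule are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field

HIGH_VALUE_THRESHOLD = 100.00  # hypothetical cutoff for R_HIGH_VALUE_REQUIRES_APPROVAL


class ApprovalRequired(Exception):
    """Raised when a tool call is blocked pending an approval gate."""


def required_gates(tool: str, args: dict) -> list:
    # Stand-in for the policy bundle: a plain predicate, not real JsonLogic.
    if tool == "adp_payments.issue_refund" and args["amount"] >= HIGH_VALUE_THRESHOLD:
        return ["GATE_FINANCE_APPROVAL"]
    return []


@dataclass
class ToolGateway:
    approvals: dict = field(default_factory=dict)  # gate -> approver id

    def approve(self, gate: str, approver: str) -> None:
        self.approvals[gate] = approver

    def call(self, tool: str, args: dict) -> dict:
        gates = required_gates(tool, args)
        missing = [g for g in gates if g not in self.approvals]
        if missing:
            raise ApprovalRequired(", ".join(missing))
        return {
            "tool": tool,
            "status": "COMPLETED",
            "approvals": [f"{g}/{self.approvals[g]}" for g in gates],
            # Issued with the call so the effect can be compensated later.
            "reversal_token": "rev_" + uuid.uuid4().hex[:8],
        }


gateway = ToolGateway()
refund = {"order_id": "o_1", "amount": 250.00}
try:
    gateway.call("adp_payments.issue_refund", refund)
    blocked = False
except ApprovalRequired:
    blocked = True

gateway.approve("GATE_FINANCE_APPROVAL", "op_22")
result = gateway.call("adp_payments.issue_refund", refund)
```

The first call is blocked. After the approver signs the gate, the same call completes and carries both the approval reference and a reversal token.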

Common misconceptions

The harness is not a prompt. The prompt is one input to the Context Pack compiler. Most harness behavior lives outside any prompt: policy, gates, evidence binding, evaluation, replay.
Harness Engineering is not “prompt engineering done well.” Prompt engineering is one tactic inside one stage of the Context Pack compiler. Harness Engineering covers what the model sees, what it can do, what counts as done, and how the system improves.
The model is not the optimization target. The harness is. Changing the harness around a fixed model can produce 6× swings on the same benchmark; weight changes are the smaller lever in most enterprise settings.
“Reviewed by a human” is not a sufficient guarantee. Reviewer agents and golden replays catch the patterns humans drift on; humans approve, but the harness is what enforces.
Improvement is not auto-pilot. Every Improvement Loop primitive emits proposals. Proposals run through the same release gate as packs and policies. Nothing auto-applies without an approver, except where policy explicitly permits it for a specific class.
The harness is not a one-time build. It is a long-lived versioned artifact whose components age, decay, and must be re-baselined on a cadence. Treat it like any other production system.