Harness Engineering
The discipline of building the controlled execution environment around AI agents — how every ContextOS primitive composes into a governed, observable, reversible, searchable runtime.
Harness Engineering is the cross-plane discipline that ContextOS encodes. It is the practice of building the controlled execution environment around an AI agent — the code that decides what context the model sees, which tools it can call, what it must justify, what gets logged, what gets rolled back, and how all of that improves over time.
Traditional software engineering asks, “How do we write good code?” Harness Engineering asks, “How do we create a governed system where agents can understand context, use tools, validate outputs, observe failures, recover safely, and continuously improve?”
The model reasons. The harness governs execution.
Without a harness, agents behave like clever but inconsistent assistants. With a harness, agents become repeatable, observable, policy-governed execution units. The five planes of ContextOS — Intelligence, Context, Decision, Action, Trust — are the harness, decomposed into primitives that can be specified, versioned, evaluated, and replayed.
Definition
A harness is a stateful program that wraps a language model and determines, at every step, what context it sees, what tools it can call, and what is preserved from the result. Harness Engineering is the discipline of designing, evaluating, and continuously improving that program.
Concretely, a harness is the union of:
| Capability | ContextOS primitive |
|---|---|
| Decide what to put in front of the model | Context Pack compiled into a CompiledContext |
| Decide what the model is allowed to do | Policy Engine + ApprovalMode tiers |
| Decide which external effects are reachable | Adapter Mesh / Tool Gateway |
| Decide which outputs are acceptable | Evaluators + DecisionRecord |
| Decide what to remember and what to forget | Promotion-aware Memory |
| Decide how to recover when something goes wrong | Failure Playbooks + replay |
| Decide how the harness itself improves | Improvement Loop |
A ContextOS deployment is not “a model plus a few prompts” — it is a harness with named, typed, governed components. Every guarantee the platform makes is a property of the harness, not of the model.
Why it exists
Single-prompt and “model-plus-tools” architectures decay quickly under enterprise constraints. The bottleneck shifts from model intelligence to execution control:
- Can the system ensure that the answer or action is correct, safe, validated, observable, reversible, and compliant?
- Can a regression be detected before it ships, and rolled back when it does?
- Can the harness improve from its own failures without a human rewriting prompts?
Harness Engineering exists because three things are true at once: model behavior changes (drift, upgrades, new vendors); the world the model acts on changes (data, policies, suppliers); and the cost of an unbounded mistake is high. The only durable response is to externalize control from the model into a versioned, executable harness — and then to engineer that harness as deliberately as the rest of the platform.
The Stanford / MIT / KRAFTON Meta-Harness work (Lee et al., 2026) makes this concrete: changing the harness around a fixed model can produce a 6× swing on the same benchmark. The harness, not just the weights, is the optimization target.
See it running. The discipline below is implemented end-to-end in SecondBrain, an open-source local-first agent OS. Every primitive in this doc (AGENTS.md, the `harness/` repo layout, the eight properties, the rollout stages, reviewer profiles) has a concrete code path you can clone, run, and inspect with `make quickstart-docker`.
How it works
Every ContextOS request passes through a harness in the same shape:
intent / task contract
↓
[Intelligence plane] evidence + identity + memory recall
↓
[Context plane] Context Pack → CompiledContext (eight stages)
↓
[Decision plane] Planner → Critic.verify → Executor → Critic.score → Consolidate
↓
[Action plane] Tool Gateway: approval-mode-bound tool calls only
↓
[Trust plane] policy + evidence + audit + telemetry on every step
↓
DecisionRecord (evidence_refs, approvals, controls_active, trace_id)

The harness is the whole pipeline plus the control surfaces around it: the registry of tools, the policy bundles, the evaluator suite, the trace store, the replay harness, the review queue, and the improvement-loop primitives that consume their output.
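As a rough illustration of the pipeline above, the sketch below threads one request through the five planes and emits a record carrying the fields named in the diagram. The `run_harness` function, its arguments, and the stubbed plane logic are hypothetical and are not ContextOS APIs; only the record field names come from this page.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class DecisionRecord:
    intent: str
    status: str
    evidence_refs: list[str] = field(default_factory=list)
    approvals: list[str] = field(default_factory=list)
    controls_active: list[str] = field(default_factory=list)
    trace_id: str = ""


def run_harness(intent: str, allowed_tools: set[str], plan: list[str]) -> DecisionRecord:
    """Hypothetical walk through the five planes for a single request."""
    trace_id = uuid.uuid4().hex
    # Intelligence plane: recall evidence (stubbed as one fixed reference).
    evidence = [f"evidence:{intent}"]
    # Context plane: compile only what this intent needs.
    compiled = {"intent": intent, "evidence": evidence, "tools": sorted(allowed_tools)}
    # Decision plane: verify the plan against the compiled context before execution.
    blocked = [step for step in plan if step not in compiled["tools"]]
    # Action plane: a plan that reaches for an unlisted tool never executes.
    status = "BLOCKED" if blocked else "COMPLETED"
    # Trust plane: record which controls were active for this run.
    return DecisionRecord(intent, status, evidence, [], ["tool_allowlist"], trace_id)
```

The point of the sketch is the shape: every request ends in a record, including requests that were blocked.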
Eight properties every harness must guarantee
Every harness ContextOS specifies must make every agent action:
| Property | Mechanism |
|---|---|
| Context-aware | The agent sees the right task-specific information — no more, no less. Compiled by the Context Pack pipeline. |
| Policy-governed | The agent cannot violate safety, compliance, or business rules. Enforced by the Policy Engine and ApprovalMode tiers. |
| Tool-controlled | The agent can only use approved tools with declared schemas, capability classes, and side-effect classifications. Enforced by the Tool Gateway. |
| Validated | Output is checked through tests, evaluators, and rules before it counts as “done.” Provided by Evaluators and the DecisionRecord contract. |
| Observable | Every decision and action is traceable end-to-end. Provided by OTEL-first tracing in Evaluation and Observability. |
| Reversible | Failures can be rolled back, retracted, or compensated. Provided by replay, reversal tokens, idempotency keys, and the Failure Playbooks typed-verdict map. |
| Measurable | Quality, cost, latency, safety, and business impact are tracked per intent and per pack version. Provided by the Metrics Glossary. |
| Continuously improving | Failures upgrade the harness, not just the output. Provided by the Improvement Loop. |
These eight properties are not aspirations. Each maps to a typed primitive that must be present, versioned, and replayable for the harness to be considered production-grade.
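One way to make that requirement machine-checkable is a manifest check that fails when any property lacks a pinned primitive. This is a minimal sketch: the manifest keys and the `missing_properties` helper are assumptions made for illustration, not part of ContextOS.

```python
# Property -> manifest key of the primitive that backs it (keys are hypothetical).
REQUIRED_PROPERTIES = {
    "context_aware": "context_pack",
    "policy_governed": "policy_bundle",
    "tool_controlled": "tool_registry",
    "validated": "evaluator_suite",
    "observable": "trace_schema",
    "reversible": "failure_playbooks",
    "measurable": "metrics_glossary",
    "continuously_improving": "improvement_loop",
}


def missing_properties(manifest: dict[str, str]) -> list[str]:
    """Return the properties whose backing primitive has no pinned version."""
    return sorted(
        prop
        for prop, primitive in REQUIRED_PROPERTIES.items()
        if not manifest.get(primitive)
    )
```

A harness whose manifest returns a non-empty list would not count as production-grade under the rule above.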
The reference architecture
               Human / Product / Architect
                            |
                            v
                 Intent / Task Contract
                            |
                            v
+-------------------------------------------------------+
|                 Harness Control Plane                 |
|-------------------------------------------------------|
| Context Builder | Policy Engine   | Tool Registry     |
| Eval Engine     | Review Agents   | Approval Gates    |
| Memory Layer    | Observability   | Rollback Manager  |
| Improvement Loop                                      |
+-------------------------------------------------------+
                            |
                            v
                 Agent Execution Runtime
                            |
         +------------------+------------------+
         |                  |                  |
         v                  v                  v
    Read tools      Reversible writes  Sensitive actions
    GraphRAG        Memory updates     Payment / booking
    Lookup          Draft saves        / cancellation
                            |
                            v
            Validation + Telemetry + Feedback
                            |
                            v
                 DecisionRecord + Replay

The control plane is the long-lived, governed surface; the runtime is what executes per request; tools are the only path to external effect; and every step is observable and reversible by design.
The harness as a search target
A subtle but load-bearing claim: the harness itself is a target of optimization, not just an output of design. ContextOS treats every harness component — pack version, policy bundle, evaluator suite, planner skill, retrieval rule — as a versioned artifact that lives in change control alongside the code.
The Improvement Loop makes this concrete. Operator corrections, failed runs, and approval-gate overrides become typed proposals. Proposals run against golden replays. Proposals that pass scorecard guardrails are promoted by an approver. Over time, the harness becomes the cumulative product of validated change — not a static prompt that ages.
This framing aligns with the Meta-Harness result: with full access to prior code, scores, and execution traces, an outer-loop search can discover better harnesses than hand engineering. ContextOS does not require an autonomous outer-loop optimizer, but it ships the substrate one needs: an experience store with code + scores + traces, a Pareto-frontier evaluator (Policy / Utility / Latency / Safety / Economics), and a release-gated promotion path.
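A Pareto-frontier evaluator over the five dimensions can be sketched in a few lines. The `Scorecard` type and its field names are assumptions for illustration (with cost standing in for Economics); the sketch only shows the dominance rule, not how ContextOS implements its evaluator.

```python
from typing import NamedTuple


class Scorecard(NamedTuple):
    name: str
    policy: float      # higher is better
    utility: float     # higher is better
    safety: float      # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better (stands in for Economics)


def _as_maximize(s: Scorecard) -> tuple[float, ...]:
    # Flip the sign of "lower is better" axes so every axis is maximized.
    return (s.policy, s.utility, s.safety, -s.latency_ms, -s.cost_usd)


def dominates(a: Scorecard, b: Scorecard) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    va, vb = _as_maximize(a), _as_maximize(b)
    return all(x >= y for x, y in zip(va, vb)) and any(x > y for x, y in zip(va, vb))


def pareto_frontier(candidates: list[Scorecard]) -> list[Scorecard]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

Candidates on the frontier are the ones worth sending to an approver; a dominated candidate is worse or equal everywhere, so promoting it cannot be justified by any trade-off.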
The five planes, viewed as a harness
Every plane is a slice of the harness with its own ownership, contract, and failure modes:
| Plane | What it owns in the harness |
|---|---|
| Intelligence | The substrate the harness retrieves from. Ontology, knowledge graph, memory, identity. |
| Context | The compiler that turns intent + evidence + memory into a CompiledContext the model can act on. |
| Decision | The bounded loop (Planner → Critic.verify → Executor → Critic.score → Consolidate) that converts a CompiledContext into a DecisionRecord. |
| Action | The Tool Gateway that is the only path to external effect. Every tool is typed, owned, side-effect-classified, and approval-mode-bound. |
| Trust | The control surface over the other four. Policy, evaluation, observability, improvement, governance. |
Cross-cutting primitives (RunContext, RunBudget, ApprovalMode, ContextPack, CompiledContext, DecisionRecord, ToolEnvelope) are the typed seams between planes. They are the harness’s wire format.
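To make the "wire format" idea concrete, here is a minimal sketch of three of these seams as immutable Python types. The type names come from this page; every field beyond those listed for RunContext, the two `ApprovalMode` values (taken from the example later in this doc), and the `needs_approval` helper are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalMode(Enum):
    # Only the two tiers that appear in this doc's example; a real set may be larger.
    READ_ONLY = "read_only"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class RunBudget:
    max_tokens: int       # hypothetical field
    max_tool_calls: int   # hypothetical field


@dataclass(frozen=True)
class RunContext:
    user: str
    agent: str
    tenant: str
    claims: tuple[str, ...]
    budget: RunBudget
    safety_mode: str


@dataclass(frozen=True)
class ToolEnvelope:
    tool: str
    approval_mode: ApprovalMode
    args: tuple[tuple[str, str], ...] = ()  # hypothetical: frozen key/value pairs


def needs_approval(envelope: ToolEnvelope) -> bool:
    """Destructive calls must cross an approval gate before they execute."""
    return envelope.approval_mode is ApprovalMode.DESTRUCTIVE
```

Freezing the types is a design choice in this sketch: a plane can read what it was handed but cannot quietly rewrite it for the next plane.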
Repo-local harness layout
A ContextOS-aligned engineering repo should expose its harness on disk so that humans and agents can both navigate it:
repo/
  AGENTS.md                # primary navigation for agents
  ARCHITECTURE.md
  SECURITY.md
  RELIABILITY.md
  docs/
    decision-records/      # ADRs, dated, append-only
    execution-plans/
      active/
      completed/
    runbooks/
    known-limitations/
  harness/
    packs/                 # Context Pack bundles, versioned
    policies/              # JsonLogic / DSL bundles
    tools/                 # Tool manifests with schema + ownership
    evals/                 # golden sets, scenario rubrics
    fixtures/              # synthetic scenarios for simulation
    validators/            # interface tests run before full eval
    observability/         # trace schemas, span attribute conventions
    reviewers/             # review-agent skills + rubrics
    skills/                # planner / executor / critic skills
    feedback/              # captured corrections + lineage

Two principles govern this layout:
- Anything the agent must follow is visible, versioned, and machine-checkable — close to the code, not in a wiki.
- The proposer can always find prior experience by reading the filesystem — every prior run leaves a directory containing code, scores, and traces. This is the substrate the Improvement Loop consumes, and it is the same substrate Meta-Harness-style outer loops require.
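The first principle ("machine-checkable") can be enforced with a small layout check in CI. The sketch below is an assumption about how such a check might look; `EXPECTED_PATHS` lists only a subset of the layout above, and `missing_paths` is a hypothetical helper.

```python
from pathlib import Path
import tempfile

# Subset of the layout described above; extend as needed.
EXPECTED_PATHS = (
    "AGENTS.md",
    "harness/packs",
    "harness/policies",
    "harness/tools",
    "harness/evals",
    "harness/reviewers",
)


def missing_paths(repo_root: Path, expected: tuple[str, ...] = EXPECTED_PATHS) -> list[str]:
    """Return the expected harness paths that do not exist under repo_root."""
    return [rel for rel in expected if not (repo_root / rel).exists()]


# Demo on a throwaway directory that only contains part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AGENTS.md").write_text("# AGENTS.md\n", encoding="utf-8")
    (root / "harness" / "packs").mkdir(parents=True)
    demo_missing = missing_paths(root)

print(demo_missing)
```

A non-empty result would fail the build, which keeps the on-disk harness from drifting away from what agents are told to expect.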
A minimal AGENTS.md is a navigation file, not a prompt dump:
# AGENTS.md
You are operating inside this engineering repository.
Start here:
1. Read ARCHITECTURE.md for system boundaries.
2. Read SECURITY.md before touching PII, auth, payment, or sensitive data.
3. Read RELIABILITY.md before changing runtime flows.
4. Read docs/execution-plans/active/ for active work.
5. Run `make verify` before opening a PR.
6. Do not bypass policy gates.
7. Every change must include tests, telemetry, and rollback notes.
Harness layout:
- harness/packs/: Context Pack bundles (versioned)
- harness/policies/: policy bundles; JsonLogic; do not edit without ADR
- harness/tools/: tool manifests; schema-checked in CI
- harness/evals/: golden sets; replay before any pack change
- harness/reviewers/: reviewer-agent skills; one per concern

The Meta-Harness paper finds that a short, well-written skill that constrains outputs and forbidden actions, while leaving the proposer free to inspect anything, is the strongest lever on harness search quality. AGENTS.md plays the same role for any coding agent operating in the repo.
Feature-flagged harness rollout
A new harness component (a pack version, a policy change, a tool addition, a planner skill) is never released to all traffic at once. The canonical rollout stages:
| Stage | What is true | Gate to advance |
|---|---|---|
| `0%_shadow` | Runs in parallel; emits scorecard but does not affect outcome | Scorecard delta within bounds for ≥ N runs |
| `1%_internal` | Internal users only; full telemetry | No regression on safety / policy guardrails |
| `5%_low_risk` | Limited cohort; low-blast-radius intents only | Adoption rate of corrections trending down |
| `25%_monitored` | Broader rollout; tail-based sampling on every escalation | Evaluator scorecards stable across cohorts |
| `100%` | Full rollout; previous version pinned for replay | Two-week canary period clean |
Every stage carries a kill switch: a single control-plane operation that pins the prior version and re-routes traffic to it. Replay against the pinned snapshot must reproduce the prior DecisionRecord byte-for-byte; this is the contract that makes rollback meaningful.
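The stage ladder, the kill switch, and the replay contract can be sketched together. The `Rollout` class and its methods are hypothetical, and comparing SHA-256 digests of canonical JSON is an assumed stand-in for the byte-for-byte comparison; only the stage names come from the table above.

```python
import hashlib
import json

STAGES = ["0%_shadow", "1%_internal", "5%_low_risk", "25%_monitored", "100%"]


class Rollout:
    """Tracks one candidate version as it moves through the rollout stages."""

    def __init__(self, prior_version: str, candidate_version: str) -> None:
        self.prior = prior_version
        self.candidate = candidate_version
        self.stage_index = 0
        self.killed = False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self, gate_passed: bool) -> str:
        """Move one stage forward only if the current stage's gate passed."""
        if self.killed:
            raise RuntimeError("rollout was killed; start a new rollout")
        if gate_passed and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.stage

    def kill(self) -> None:
        """Kill switch: pin the prior version for all traffic."""
        self.killed = True

    def version_for_cohort(self) -> str:
        """Version that decides outcomes for traffic inside the rollout cohort."""
        if self.killed or self.stage == "0%_shadow":
            return self.prior  # shadow runs never affect the outcome
        return self.candidate


def record_digest(record: dict) -> str:
    """Digest of a canonical serialization (sorted keys, fixed separators)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_matches(original: dict, replayed: dict) -> bool:
    return record_digest(original) == record_digest(replayed)
```

Canonical serialization matters here: without a fixed key order, two identical records could serialize differently and a correct replay would look like a mismatch.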
Reviewer agents
Specialized review agents inspect every proposed change before it reaches a human approver. They are not a replacement for human judgment; they are a way to make humans focus on judgment instead of repetitive checks. The full taxonomy is documented in Reviewer Agents; the seven canonical roles are:
| Reviewer | Concern |
|---|---|
| Architecture | Layering, plane boundaries, dependency direction |
| Security | PII, secrets, auth, injection, sandbox profile |
| Reliability | Timeouts, retries, fallbacks, idempotency, rollback |
| Product | User experience, edge cases, intent fidelity |
| Data | Event schema, evidence coverage, analytics impact |
| Cost | Tokens, infra cost, retrieval cost, run-budget headroom |
| Compliance | Audit, consent, regulated actions, ApprovalMode binding |
Reviewer output is a typed envelope (status, findings, severity, policy_id, recommendation) that lands in the same change-control queue as Improvement Loop proposals.
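A minimal sketch of that envelope follows. The five field names after `reviewer` come from the sentence above; the `reviewer` field, the severity levels, and the `triage` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical severity scale, ordered from least to most severe.
SEVERITIES = ("info", "warning", "blocking")


@dataclass
class ReviewEnvelope:
    reviewer: str                      # hypothetical: which reviewer role emitted this
    status: str                        # e.g. "clear" or "findings"
    findings: list[str] = field(default_factory=list)
    severity: str = "info"
    policy_id: Optional[str] = None
    recommendation: str = ""

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def triage(envelopes: list[ReviewEnvelope]) -> str:
    """Summarize reviewer output for the change-control queue."""
    if any(e.severity == "blocking" for e in envelopes):
        return "blocked"
    if any(e.findings for e in envelopes):
        return "needs_human_review"
    return "all_clear"
```

Because every reviewer emits the same shape, the queue can sort and filter findings without knowing which reviewer produced them.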
Interfaces
Inputs
- Intent / task contract (from product surface or upstream agent)
- `RunContext` (user, agent, tenant, claims, budget, safety mode)
- Context Pack version pin
- Policy bundle version pin
- Tool registry version pin
Outputs
- `CompiledContext` (per request, by the Context Pack compiler)
- `DecisionRecord` (per turn, with evidence_refs, approvals, controls_active, trace_id)
- OTEL trace bundle (per run)
- Scorecard (per run, on the five evaluators)
- Improvement-loop proposals (per pattern, typed)
Versioned artifacts
- Context Pack bundle (`packs/`)
- Policy bundle (`policies/`)
- Tool manifest (`tools/`)
- Evaluator suite + golden sets (`evals/`)
- Planner / Executor / Critic skills (`skills/`)
- Reviewer skills (`reviewers/`)
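Because these artifacts are versioned independently, a release can be modeled as a tuple of pins. The sketch below assumes a hypothetical `ReleasePin` type covering four of the artifacts and a `changed_components` helper; the version strings are illustrative.

```python
from typing import NamedTuple


class ReleasePin(NamedTuple):
    pack: str
    policy: str
    tools: str
    evals: str


def changed_components(old: ReleasePin, new: ReleasePin) -> list[str]:
    """Name the components whose pinned version differs between two releases."""
    return [name for name in old._fields if getattr(old, name) != getattr(new, name)]


previous = ReleasePin(pack="5.1.0", policy="2.3.0", tools="1.4.0", evals="0.9.0")
current = previous._replace(pack="5.2.0", policy="2.4.0")
print(changed_components(previous, current))
```

Diffing two pins narrows a regression to the components that actually moved, which is what makes replay against a pinned release useful for isolating a cause.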
Failure modes
- One giant prompt file. Becomes stale and ignored; the harness loses any property that was supposed to live in it.
- No tool schema. Agents misuse tools; the Tool Gateway cannot enforce side-effect classification or approval binding.
- No policy engine. Unsafe actions reach production; “the model knows the rules” is never a safety property.
- Only manual review. Does not scale; the review queue grows with every additional intent.
- Only happy-path evals. Real-world failures escape because the golden set captures success, not the long tail of edge cases.
- No telemetry. Failures cannot be debugged; there is nothing for the Improvement Loop to consume.
- No feature flags. Rollout becomes a binary cutover; a bad change hits 100% before it can be detected.
- No rollback. A small issue becomes an incident because the prior version is not pinned for replay.
- No domain ontology. Agent output remains generic; evidence is unbound; replay is non-deterministic.
- No ownership. The harness decays — no team owns the policy bundle, the eval set, or the tool registry, so they age silently.
- Compressed feedback to the optimizer. The Improvement Loop or any outer-loop search reads only summaries and scalar scores, losing the diagnostic signal needed to identify confounds (Meta-Harness Section 4.1 ablation).
- Search-set / test-set leakage. Goldens and scenarios used for tuning are also used for release gating; scorecards stop being honest.
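The last failure mode is cheap to guard against mechanically. The sketch below assigns each scenario to the search set or the test set from a hash of its ID, so the split is stable across runs, and adds an explicit leak check. Both function names and the 20% default are assumptions for illustration.

```python
import hashlib


def assign_split(scenario_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically place a scenario in 'search' or 'test' by hashing its ID."""
    digest = hashlib.sha256(scenario_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "search"


def leaked_ids(search_ids: list[str], test_ids: list[str]) -> set[str]:
    """Scenario IDs that appear in both sets; should always be empty."""
    return set(search_ids) & set(test_ids)
```

Since each ID maps to exactly one split, the two sets cannot overlap as long as every scenario is assigned through `assign_split`; `leaked_ids` catches sets that were assembled some other way.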
Operational concerns
- Pack, policy, tool, and eval bundles are versioned independently; a release is a tuple of pinned versions, not a single SHA.
- Every long-running session has a kill switch wired to the rollback manager; the kill switch operates on the pinned-version tuple, not on individual primitives.
- Trace retention bands track data classification: traces touching regulated data follow the data’s retention rule, not the platform default.
- Run-budget headroom is measured per intent; intents trending toward budget exhaustion trigger an Autotune proposal before they fail under load.
- The reviewer-agent suite is itself a harness component; reviewer skills are versioned and golden-set-evaluated like any other.
- Search-set and test-set are kept disjoint by construction; the proposer (human or automated) never sees test-set scores during iteration.
- Lightweight interface validation runs before any full evaluation: import the candidate, instantiate, call its public methods on a tiny fixture, and require it to pass. Most malformed candidates die in seconds.
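The interface validation described in the last item can be sketched as follows. For brevity the sketch takes the candidate class directly rather than importing it by module path, and the `compile` method name is a hypothetical interface chosen for illustration.

```python
def validate_candidate(candidate_cls, fixture, required_methods=("compile",)):
    """Return a list of problems; an empty list means the candidate may proceed to full eval."""
    try:
        instance = candidate_cls()
    except Exception as exc:  # any constructor failure disqualifies the candidate
        return [f"instantiate failed: {exc!r}"]

    problems = []
    for name in required_methods:
        method = getattr(instance, name, None)
        if not callable(method):
            problems.append(f"missing method: {name}")
            continue
        try:
            method(fixture)
        except Exception as exc:  # a crash on a tiny fixture is a cheap early signal
            problems.append(f"{name} raised: {exc!r}")
    return problems


class GoodCandidate:
    def compile(self, fixture):
        return {"intent": fixture["intent"]}


class BrokenCandidate:
    def compile(self, fixture):
        raise ValueError("no stages configured")


print(validate_candidate(GoodCandidate, {"intent": "support.refund.execute"}))
```

Running this before the evaluator suite keeps expensive golden replays for candidates that at least satisfy the interface.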
Evaluation metrics
Engineering metrics for the harness itself:
| Metric | What it measures |
|---|---|
| Pack adoption rate | Fraction of intents on the latest pack version |
| Eval pass rate | Fraction of candidates clearing the release gate on first submission |
| Policy violation rate | Per-intent, per-tier; should trend to zero as the harness matures |
| Human intervention rate | Fraction of decisions that crossed a human approval gate |
| Mean time to repair | From regression detection to released fix |
| Change failure rate | Released proposals retired within 90 days |
| Rollback rate | Releases that triggered a kill switch |
| Cost per verified success | Tokens + tool calls + infra, scoped to runs that passed the scorecard |
| Lead time reduction | Operator correction → released StrategyRule |
Product metrics carried by the harness’s DecisionRecord and replay log are documented in the Metrics Glossary.
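Two of the engineering metrics above can be computed directly from per-run records. This is a sketch under assumed record keys (`passed_scorecard`, `crossed_approval_gate`, `cost_usd`); the real DecisionRecord and scorecard schemas may differ.

```python
import math


def harness_metrics(runs: list[dict]) -> dict[str, float]:
    """Compute human intervention rate and cost per verified success."""
    if not runs:
        raise ValueError("need at least one run")
    verified = [r for r in runs if r["passed_scorecard"]]
    gated = sum(1 for r in runs if r["crossed_approval_gate"])
    return {
        # Fraction of decisions that crossed a human approval gate.
        "human_intervention_rate": gated / len(runs),
        # Cost is scoped to runs that passed the scorecard, per the table above.
        "cost_per_verified_success": (
            sum(r["cost_usd"] for r in verified) / len(verified)
            if verified
            else math.nan
        ),
    }
```

Returning NaN when no run passed is a deliberate choice in this sketch: reporting zero cost would make a harness with no verified successes look cheap.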
Example
A condensed end-to-end harness flow for a sensitive action:
1. Intent: "support.refund.execute" arrives with RunContext.
2. Context plane: Context Pack v5.2.0 compiles a CompiledContext (8 stages).
   - intelligence_refs: customer + order + prior corrections
   - policy_layer: ApprovalMode=destructive, requires evidence(refund_window)
   - tooling_layer: adp_payments.issue_refund (destructive), adp_policy.eval (read_only)
   - evaluation_layer: scorecard target { policy>=1.0, safety>=1.0 }
3. Decision plane: Planner proposes [eval, refund]; Critic.verify checks
   evidence coverage and ApprovalMode binding before execution.
4. Trust plane: Policy Engine evaluates JsonLogic bundle; rule
   R_HIGH_VALUE_REQUIRES_APPROVAL requires approval gate GATE_FINANCE_APPROVAL.
5. Action plane: Tool Gateway blocks adp_payments.issue_refund pending approval token.
6. Reviewer agents (security, compliance, cost) emit findings; all clear.
7. Approver signs the gate; reversal token issued alongside the refund call.
8. DecisionRecord written: status=COMPLETED, evidence_refs=[...],
   approvals=[GATE_FINANCE_APPROVAL/op_22], reversal_token=rev_x7y, trace_id=...
9. Scorecard: policy=1.00, utility=0.94, safety=1.00, latency=1830ms, cost=$0.0091.
10. Replay against pack v5.2.0 reproduces the DecisionRecord exactly.

The same intent on pack v5.1.0 would produce a different DecisionRecord only where v5.2.0’s policy bundle changed; replay isolates the cause.
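Steps 5 and 7 of this flow (block the destructive call, then release it once the gate is signed) can be sketched as a small gateway. The `ToolGateway` class, its methods, and the `ApprovalRequired` exception are hypothetical; the tool names, gate ID, and approver ID are taken from the example.

```python
class ApprovalRequired(Exception):
    """Raised when a destructive tool is called before its gate is signed."""


class ToolGateway:
    def __init__(self, tool_classes: dict[str, str], required_gate: dict[str, str]) -> None:
        self._tool_classes = tool_classes    # tool name -> "read_only" | "destructive"
        self._required_gate = required_gate  # tool name -> approval gate id
        self._signed_gates: set[str] = set()

    def sign(self, gate_id: str, approver: str) -> str:
        """Record an approver's signature and return an approval reference."""
        self._signed_gates.add(gate_id)
        return f"{gate_id}/{approver}"

    def call(self, tool: str, handler, **kwargs):
        """Run handler only if the tool is registered and, if destructive, approved."""
        if tool not in self._tool_classes:
            raise PermissionError(f"unregistered tool: {tool}")
        gate = self._required_gate.get(tool)
        # A destructive tool with no registered gate can never be approved, so it stays blocked.
        if self._tool_classes[tool] == "destructive" and gate not in self._signed_gates:
            raise ApprovalRequired(gate)
        return handler(**kwargs)


gateway = ToolGateway(
    {"adp_payments.issue_refund": "destructive", "adp_policy.eval": "read_only"},
    {"adp_payments.issue_refund": "GATE_FINANCE_APPROVAL"},
)
```

The model never sees this logic. It can propose the refund, but the call only reaches the payment adapter after `sign` has been invoked for the gate.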
Common misconceptions
- The harness is not a prompt. The prompt is one input to the Context Pack compiler. Most harness behavior lives outside any prompt: policy, gates, evidence binding, evaluation, replay.
- Harness Engineering is not “prompt engineering done well.” Prompt engineering is one tactic inside one stage of the Context Pack compiler. Harness Engineering covers what the model sees, what it can do, what counts as done, and how the system improves.
- The model is not the optimization target. The harness is. Changing the harness around a fixed model can produce 6× swings on the same benchmark; weight changes are the smaller lever in most enterprise settings.
- “Reviewed by a human” is not a sufficient guarantee. Reviewer agents and golden replays catch the patterns humans drift on; humans approve, but the harness is what enforces.
- Improvement is not auto-pilot. Every Improvement Loop primitive emits proposals. Proposals run through the same release gate as packs and policies. Nothing auto-applies without an approver, except where policy explicitly permits it for a specific class.
- The harness is not a one-time build. It is a long-lived versioned artifact whose components age, decay, and must be re-baselined on a cadence. Treat it like any other production system.