The most dangerous sentence in agent engineering is: “we have a harness.”
Most teams do not. They have a prompt, a tool registry, a few evals, a dashboard, and a human escalation path that works when the right person is awake. Those are useful components. They are not yet a production harness.
An Agent Harness is the governed runtime control plane around an agent: context, policy, tools, state, memory, evaluation, telemetry, approvals, rollback, and continuous improvement. It decides what the agent may know, what it may do, which rules it must obey, how every action is observed, how failures are reversed, and how corrections become safer future behavior.
A production agent without a harness is not autonomous software. It is a probabilistic process with access to tools.
This audit separates real harnesses from hopeful wrappers.
The original ContextOS eight harness properties are still the right outcome groups: context-aware, policy-governed, tool-controlled, validated, observable, reversible, measurable, and continuously improving. They are not enough as a checklist. A serious audit needs to inspect the actual runtime surfaces underneath those outcomes.
Why the bar is higher now
Modern agent frameworks are converging on the same lesson: the important runtime work lives outside the prompt.
OpenAI Agents SDK treats tools, handoffs, sessions, context management, guardrails, human-in-the-loop, MCP, and tracing as runtime concerns. Its tracing guide captures LLM generations, tool calls, handoffs, guardrails, and custom events as part of the run record. Its guardrails guide distinguishes input, output, and tool guardrails at different workflow boundaries.
Google ADK puts similar pressure on lifecycle surfaces: callbacks can block model or tool execution, request credentials, manage state deliberately, and save or load artifacts. NIST AI RMF frames AI risk management around Govern, Map, Measure, and Manage, with governance as the cross-cutting function across the lifecycle. OWASP’s LLM Top 10 names prompt injection, sensitive information disclosure, insecure plugin design, excessive agency, and overreliance as core application risks. MCP security guidance calls out confused deputy, token passthrough, SSRF, and session hijacking risks in tool and resource ecosystems. OpenTelemetry gives a vendor-neutral anchor for correlating traces, metrics, and logs across those boundaries.
That is the operating context for this audit. A harness is not a confidence story. It is evidence.
What a harness is not
A harness is not the system prompt. The prompt can describe behavior, but it cannot reliably enforce identity, network boundaries, side effects, redaction, retention, or rollback.
A harness is not just an agent framework. LangGraph, ADK, OpenAI Agents SDK, Semantic Kernel, CrewAI, or custom orchestration can be the execution substrate. The harness is the governed runtime discipline around the agent: manifests, policy, tool envelopes, approval gates, trace spans, evals, replay, and release control.
A harness is not only observability. Traces are necessary, but a trace that records unsafe execution after the fact is not a control. The harness has to prevent, gate, degrade, abort, escalate, replay, and improve.
A harness is not a tool registry. A registry names capabilities. A harness decides which capability is visible, callable, authorized, approved, sandboxed, idempotent, traced, reversible, and safe under this run’s identity and risk class.
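The registry-versus-harness distinction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a ContextOS or framework API: `ToolEntry`, `RunContext`, `visible_tools`, and the three-level risk ordering are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical risk ordering, lowest to highest (an assumption for this sketch).
RISK_ORDER = {"read": 0, "write": 1, "destructive": 2}


@dataclass(frozen=True)
class ToolEntry:
    name: str
    risk_class: str                # one of the RISK_ORDER keys
    allowed_roles: frozenset


@dataclass(frozen=True)
class RunContext:
    role: str
    max_risk: str                  # highest risk class this run may see


def visible_tools(registry, run):
    """Return only the tool names this run's role and risk ceiling allow."""
    ceiling = RISK_ORDER[run.max_risk]
    return [
        t.name
        for t in registry
        if run.role in t.allowed_roles and RISK_ORDER[t.risk_class] <= ceiling
    ]


# The registry names three capabilities.
registry = [
    ToolEntry("search_orders", "read", frozenset({"support", "billing"})),
    ToolEntry("issue_refund", "write", frozenset({"billing"})),
    ToolEntry("delete_account", "destructive", frozenset({"admin"})),
]

# The harness decides which of them a given run can even see.
support_view = visible_tools(registry, RunContext(role="support", max_risk="read"))
```

Visibility is only the first of the decisions listed above. Authorization, approval, sandboxing, and idempotency would each need their own enforcement point.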
The four audit planes
These are audit planes, not a replacement for the ContextOS five-plane model. Use them to group evidence during readiness review.
| Audit plane | What it controls | Typical artifacts |
|---|---|---|
| Control plane | Agent contract, autonomy, context, policy, identity, release tuple | AgentSpec, autonomy matrix, source registry, policy bundle, ReleaseTuple |
| Execution plane | Planning, tools, state, memory, durable workflow, fallback behavior | plan artifact, ToolEnvelope, state machine, checkpoint, memory proposal |
| Risk plane | Data class, privacy, approvals, sandboxing, red-team coverage, incident response | data labels, redaction trace, approval record, egress policy, playbook |
| Learning plane | Offline evals, trajectory evals, telemetry, business metrics, corrections, promotion | EvalScorecard, trace dashboard, replay record, fix queue, promotion gate |
The audit protocol
The old “30-minute audit” is still useful, but only as a smoke test. It should find obvious launch blockers before a demo, beta, or leadership review. It is not the full production audit.
| Audit mode | When to use | Duration | Output |
|---|---|---|---|
| Smoke audit | Before demo, beta, or leadership review | 30 min | Obvious launch blockers |
| Production readiness audit | Before real user launch | 2-4 hours | Full checklist scorecard |
| Red-team audit | Before high-risk tool or action rollout | 1-2 days | Exploit paths, abuse cases, mitigation backlog |
| Post-incident audit | After failure or near miss | Same day | Root cause, failed control, replay evidence, fix queue |
| Quarterly harness audit | For mature production agents | Quarterly | Drift, regressions, new risks, maturity roadmap |
For every mode, inspect at least two runs:
- One ordinary successful run for a representative intent.
- One boundary run that crossed a policy denial, tool error, approval gate, evaluator failure, escalation, rollback, replay, or fallback path.
The happy path tells you what the harness does when nothing is stressed. The boundary path tells you whether the harness exists when it matters.
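The two-run rule is easy to check mechanically before the audit starts. The sketch below assumes each run carries a list of event labels; the `Run` type and the label strings are hypothetical and would map onto whatever your trace schema actually records.

```python
from dataclasses import dataclass, field

# Hypothetical event labels mirroring the boundary conditions listed above.
BOUNDARY_EVENTS = {
    "policy_denial", "tool_error", "approval_gate", "evaluator_failure",
    "escalation", "rollback", "replay", "fallback",
}


@dataclass
class Run:
    run_id: str
    succeeded: bool
    events: list = field(default_factory=list)


def is_boundary(run):
    """A run is a boundary run if it recorded at least one boundary event."""
    return bool(BOUNDARY_EVENTS.intersection(run.events))


def is_ordinary_success(run):
    """An ordinary run succeeded without crossing any boundary."""
    return run.succeeded and not is_boundary(run)


def sample_is_sufficient(runs):
    """True when the sample has one ordinary success and one boundary run."""
    return any(is_ordinary_success(r) for r in runs) and any(
        is_boundary(r) for r in runs
    )
```

For example, a sample of two clean successes would be rejected, because it says nothing about how the harness behaves under stress.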
Scoring
Score each control with the same strict rule.
| Score | Meaning |
|---|---|
| Pass | A script, trace, manifest, dashboard, or record proves the control for a real run. |
| Partial | The control exists, but coverage is incomplete, manual, delayed, undocumented, or not enforced at the right boundary. |
| Fail | The control is absent, unenforced, unverifiable, or only described in prose. |
If it takes more than five minutes to find the evidence, score it as fail. That is not pedantry. A control that cannot be found under pressure will not protect the system under pressure.
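The scoring rule, including the five-minute limit, can be written down as a small function. This is a sketch, not a prescribed implementation: the `Finding` fields are hypothetical, and the several Partial conditions from the table are collapsed into two booleans for brevity.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LOOKUP_SECONDS = 5 * 60  # the five-minute rule


@dataclass
class Finding:
    control: str
    evidence_ref: Optional[str]   # pointer to a script, trace, manifest, dashboard, or record
    lookup_seconds: float         # how long the auditor needed to find the evidence
    enforced_at_boundary: bool    # simplification: control runs at the right boundary
    full_coverage: bool           # simplification: not manual, delayed, or incomplete


def score(finding):
    """Apply the strict rule: no findable artifact means fail."""
    if not finding.evidence_ref or finding.lookup_seconds > MAX_LOOKUP_SECONDS:
        return "fail"
    if finding.enforced_at_boundary and finding.full_coverage:
        return "pass"
    return "partial"
```

Note that evidence found after six minutes scores the same as evidence that does not exist, which is exactly the behavior the rule asks for.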
Severity is separate from pass state.
| Severity | Meaning |
|---|---|
| P0 | Launch blocker for production agents with real users, real tools, money, PII, regulated data, or external side effects. |
| P1 | Required for reliable scale; acceptable only in controlled beta with compensating controls. |
| P2 | Maturity improvement; not always a launch blocker, but needed for enterprise-grade operations. |
The complete checklist
Use this as the working surface for the readiness audit.
| # | Harness facet | Audit question | Minimum pass evidence | Immediate fail signal | Severity |
|---|---|---|---|---|---|
| 1 | Agent charter | Is the agent’s job, scope, user type, and autonomy level explicitly declared? | Versioned AgentSpec with purpose, owner, allowed intents, denied intents, autonomy class | “The prompt describes it” is the only source of truth | P0 |
| 2 | Autonomy boundary | Does the system know what the agent can answer, recommend, decide, or execute? | Autonomy matrix: inform, recommend, draft, execute_with_approval, execute_directly | Same path used for low-risk Q&A and high-risk actions | P0 |
| 3 | Intent taxonomy | Are runs mapped to stable intent IDs? | intent_id, confidence, router decision, fallback intent in trace | Metrics only show global agent success | P1 |
| 4 | Planner/executor split | Is planning separated from execution for non-trivial tasks? | Plan artifact, execution steps, approval gates, state transitions | Model jumps directly from user prompt to tool execution | P0 |
| 5 | Context source registry | Are all eligible context sources declared and governed? | Source registry with owner, freshness, sensitivity, access mode, TTL | Retrieval pulls from undocumented indexes or ad hoc APIs | P0 |
| 6 | Context compiler | Can you prove what context entered the model? | CompiledContext, pack version, source hashes, truncation record, token budget | Runtime string concatenation with no manifest | P0 |
| 7 | Grounding and evidence | Are claims grounded in retrieved or tool-backed evidence? | Evidence manifest with citations, source IDs, and confidence | Final answer includes factual claims with no source lineage | P1 |
| 8 | Context budget control | Does the system control what gets dropped when the context window fills? | Budget policy by source type, truncation reason, priority order | Oldest or random text is dropped silently | P1 |
| 9 | Memory read policy | Is memory retrieval intentional, scoped, and auditable? | Memory query log, memory IDs used, purpose, freshness, consent basis | Agent loads long-term memory by default without reason | P0 |
| 10 | Memory write policy | Are new memories validated before persistence? | Memory write proposal, dedup, sensitivity check, TTL, reviewer or auto-approval rule | Every conversation summary becomes memory | P0 |
| 11 | Contradiction handling | Can memory conflicts be detected and resolved? | Conflict record, recency, source confidence, supersession rule | Old incorrect memory keeps influencing future runs | P1 |
| 12 | Policy engine | Are rules enforced outside the model? | Versioned policy bundle, rule IDs, policy_decision_id, verdict, inputs | “The model has been instructed not to” | P0 |
| 13 | Data classification | Does the harness know what data class it is handling? | Data labels: public, internal, confidential, PII, payment, regulatory | Tool or prompt receives sensitive data without classification | P0 |
| 14 | Privacy controls | Are redaction, minimization, and retention enforced? | Redaction trace, retention policy, purpose limitation, deletion path | PII appears in prompts, traces, or eval sets without controls | P0 |
| 15 | Tool manifest | Are available tools declared, versioned, owned, and scoped to least privilege? | Tool manifest with schema, owner, risk class, timeout, retry, auth mode, plus a tool surface scoped to what the role or intent actually needs | Tool list lives only in the prompt, or every agent gets the full toolbox regardless of task | P0 |
| 16 | Tool Gateway | Are tool calls validated before execution? | Schema validation, argument validation, policy check, approval mode in envelope | Model-named tools can be invoked dynamically | P0 |
| 17 | Tool risk class | Are side effects classified? | Canonical approval mode plus side-effect class such as financial, destructive, or regulated | Refund, cancel, payment, or delete tool treated like search | P0 |
| 18 | Identity and authorization | Does every tool call run under scoped identity? | User or service identity, scoped token, RBAC or ABAC decision, expiry | Shared static credentials used by all agents | P0 |
| 19 | Secret handling | Are credentials isolated from model-visible context? | Secrets vault, no secrets in prompts or logs, scoped runtime injection | API keys or tokens can enter model context | P0 |
| 20 | Network and sandbox controls | Is external access constrained? | Egress allowlist, sandbox policy, file and network restrictions | Agent can call arbitrary URLs or execute arbitrary code | P0 |
| 21 | Human approval | Are approval gates explicit for high-risk actions? | Approval request, approver identity, decision, expiry, reason | Human approval happens informally over Slack or chat | P0 |
| 22 | Escalation path | Can the agent hand off cleanly when confidence or risk is low? | Escalation policy, queue, reason code, transcript and context package | Agent keeps trying after repeated failure | P1 |
| 23 | State machine | Is execution state explicit and durable? | State transitions, checkpoints, event log, current state visible | State exists only in chat history or process memory | P1 |
| 24 | Idempotency | Are repeated tool calls safe? | Idempotency keys for write tools, duplicate detection | Retry can create duplicate booking, refund, ticket, or action | P0 |
| 25 | Durable execution | Can long-running tasks resume after failure? | Checkpoint and replay mechanism, resumable workflow ID | Failure requires restarting from user prompt | P1 |
| 26 | Offline evals | Are changes tested before release? | Golden set, scenario set, regression suite, model/prompt/tool version comparison | Prompt or model changes go live without replay | P0 |
| 27 | Trajectory evals | Does evaluation check the path, not just final answer? | Expected vs actual tool trajectory scored on coverage, precision, resource scope (correct object arguments), and minimality | Final answer judged “good” despite wrong process, or judged on output alone with no path assertion | P1 |
| 28 | Online validation | Can bad outputs be blocked at runtime? | Live critic or evaluator, policy-respect check, safety/utility score before finalization | Evals run only weekly, monthly, or offline | P0 |
| 29 | Red-team coverage | Is the harness tested against adversarial behavior and realistic perturbation? | Tests span indirect injection (planted instructions do not propagate), ambiguous goals (agent clarifies instead of acting irreversibly), and tool errors (#44), plus tool abuse, data leakage, and jailbreak suites | Only happy-path demo queries are tested; injection and ambiguity are untested | P0 |
| 30 | Observability | Can an engineer reconstruct the full run? | Trace spans for context, model, tools, policy, eval, approval, final response | Logs stop at “LLM returned response” | P0 |
| 31 | Standard telemetry | Are traces, logs, and metrics correlated? | Stable trace_id, run_id, intent_id, user_id, session_id, tool_call_id | Logs exist but cannot be joined | P1 |
| 32 | Cost and latency controls | Are token, tool, and runtime costs bounded? | Budget policy, per-intent cost, latency SLO, timeout behavior | Agent loops until budget is exhausted | P1 |
| 33 | Model/provider routing | Is model choice explicit and measurable? | Model routing policy, fallback model, quality/cost/latency comparison | Model changed without traceable release | P1 |
| 34 | Fallback behavior | What happens when model, tool, policy, or eval fails? | Typed fallback: retry, degrade, ask user, escalate, abort | Agent produces generic apology or retries blindly | P1 |
| 35 | Release tuple | Are all moving parts versioned together? | Tuple: prompt, model, policy, tools, context pack, eval suite, memory schema | Prompt, model, and tool changes tracked separately | P0 |
| 36 | Replayability | Can a past run be reconstructed? | Pinned inputs, context, tool outputs, policies, model version, evaluator version | Historical trace cannot be replayed | P0 |
| 37 | Rollback and compensation | Can damage be stopped or reversed? | Rollback command, previous release tuple, compensation path for writes | Rollback means “ask people not to use it” | P0 |
| 38 | Incident response | Is there an agent-specific incident playbook? | Severity matrix, owner, kill switch, escalation channel, postmortem template | No one knows who owns a bad agent action | P0 |
| 39 | Business measurement | Are agent outcomes tied to real impact? | Task success, conversion, deflection, CSAT, revenue, risk, cost by intent/version | Only number of chats and thumbs-up are tracked | P1 |
| 40 | Continuous improvement | Do failures become governed improvements? | Correction -> proposal -> replay -> review -> approval -> promotion -> live monitoring | Fixes happen as unreviewed prompt edits | P1 |
| 41 | Resource and object scope binding | Are tool calls bound to an authorized set of objects, not just a valid schema? | Write and read tools resolve the target object against a per-task or per-user authorized scope (allowlist, ownership check, or task-derived scope), enforced outside the model; out-of-scope object access is denied and logged | A schema-valid call can act on any ID the model emits: the right tool on the wrong customer, file, or record | P0 |
| 42 | Outbound disclosure control | Is what leaves the agent (final answers, tool arguments, handoffs, forwarded content) checked against data class, not just the recipient? | Sensitive fields minimized or redacted before they enter a tool argument, an inter-agent message, or the final response; an egress check proves a classified field did not reach an unauthorized sink | Recipient is authorized but the payload is over-shared: PII, secrets, or out-of-scope records flow through handoff context or the final answer | P0 |
| 43 | Inter-agent communication policy | In a multi-agent harness, are who-may-talk-to-whom, tool ownership, and delegation boundaries declared and enforced? | Communication topology (allowed role-to-role channels), role-local authority, and delegation boundaries are explicit and enforced; N/A for single-agent harnesses | Any agent can call any tool or message any agent; a coordinator oversteps by executing what it should delegate | P0 |
| 44 | Honest failure under tool error | When a tool or backend misbehaves, does the agent report failure instead of fabricating success? | On a tool error, empty result, or junk return, the agent acknowledges the failure in output or state and retries within scope or safely defers | Agent invents a result, claims completion with no supporting tool call, or takes an out-of-scope action after the failure | P1 |
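Control 27 is the one most often left vague, so here is one way its four dimensions could be scored. The formulas are assumptions made for this sketch (the checklist names the dimensions but does not define them), and `Call` and `trajectory_scores` are hypothetical names.

```python
from typing import NamedTuple


class Call(NamedTuple):
    tool: str
    target: str  # the object the call acts on


def trajectory_scores(expected, actual):
    """Score an actual tool trajectory against the expected one (values in 0..1)."""
    if not expected:
        raise ValueError("expected trajectory must not be empty")
    if not actual:
        return dict.fromkeys(
            ("coverage", "precision", "resource_scope", "minimality"), 0.0
        )
    expected_tools = {c.tool for c in expected}
    expected_calls = set(expected)
    # Coverage: share of expected tools that were actually called.
    coverage = len(expected_tools & {c.tool for c in actual}) / len(expected_tools)
    # Precision: share of actual calls that used an expected tool.
    on_expected = [c for c in actual if c.tool in expected_tools]
    precision = len(on_expected) / len(actual)
    # Resource scope: share of those calls that hit an expected (tool, target) pair.
    scope = (
        sum(c in expected_calls for c in on_expected) / len(on_expected)
        if on_expected
        else 0.0
    )
    # Minimality: penalize trajectories longer than the expected one.
    minimality = min(1.0, len(expected) / len(actual))
    return {
        "coverage": coverage,
        "precision": precision,
        "resource_scope": scope,
        "minimality": minimality,
    }


expected = [Call("lookup_order", "o-1"), Call("issue_refund", "o-1")]
actual = [
    Call("lookup_order", "o-1"),
    Call("lookup_order", "o-1"),   # redundant repeat
    Call("issue_refund", "o-2"),   # right tool, wrong object
]
scores = trajectory_scores(expected, actual)
```

In this example coverage and precision are both perfect, yet resource scope and minimality each drop to two thirds. An output-only evaluator would miss that the refund landed on the wrong order.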
The eight properties as outcome groups
The eight properties are still the scorecard executives and product owners can remember. The forty-four controls are how engineering proves them.
| Rollup outcome | Controls it covers |
|---|---|
| Context-aware | Source registry, context compiler, grounding, budget control, memory read/write, contradiction handling |
| Policy-governed | Agent charter, autonomy boundary, policy engine, data classification, privacy controls, human approval, outbound disclosure control, inter-agent communication policy |
| Tool-controlled | Tool manifest, Tool Gateway, side-effect classification, identity, authorization, secrets, sandboxing, resource and object scope binding |
| Validated | Offline evals, trajectory evals, online validation, red-team tests, honest failure under tool error |
| Observable | Traceability, structured telemetry, audit records, event logs |
| Reversible | Idempotency, durable state, replay, rollback, compensation, release tuple |
| Measurable | Intent taxonomy, cost, latency, quality, business metrics, version dashboards |
| Continuously improving | Correction pipeline, golden-set growth, proposal review, promotion gates, live monitoring |
This gives the audit two levels: a full engineering checklist and a compact outcome score.
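A rollup from control scores to outcome scores might look like the sketch below. Two things here are assumptions rather than part of the audit as written: the worst-score-wins rule, and the mapping of outcome names to checklist numbers, which is one reading of the table above. The function name `rollup` is hypothetical.

```python
# Higher rank means worse; used to pick the worst score in a group.
RANK = {"pass": 0, "partial": 1, "fail": 2}


def rollup(outcome_controls, control_scores):
    """Give each outcome the worst score among the controls it covers."""
    result = {}
    for outcome, controls in outcome_controls.items():
        # An unscored control counts as fail: no artifact, no pass.
        result[outcome] = max(
            (control_scores.get(c, "fail") for c in controls),
            key=RANK.__getitem__,
        )
    return result


# Partial mapping for illustration, keyed by checklist number.
outcome_controls = {
    "Validated": [26, 27, 28, 29, 44],
    "Observable": [30, 31],
}
control_scores = {26: "pass", 27: "partial", 28: "pass", 29: "pass", 44: "pass", 30: "pass"}

summary = rollup(outcome_controls, control_scores)
```

Here `Validated` rolls up to partial because of the trajectory evals, and `Observable` rolls up to fail because control 31 was never scored.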
Blocking failures
These should stop production launch immediately unless the agent is fully read-only, isolated from real users, and confined to a controlled beta.
| Blocker | Why it stops launch |
|---|---|
| No explicit autonomy boundary | The system cannot distinguish answer, recommendation, draft, approved execution, and direct execution. |
| Rules enforced only by prompt | Safety, compliance, financial, and privacy controls are suggestions instead of runtime decisions. |
| Tool calls bypass a gateway | The model can reach capabilities that were never resolved, authorized, validated, or approved. |
| No scoped identity for actions | Audit cannot prove which agent, user, service, or delegation chain caused the effect. |
| Tool calls bound only by schema, not object scope | The right tool on the wrong customer, file, or record passes validation and crosses a boundary. |
| Outbound content has no disclosure control | Sensitive data leaks through handoffs, tool arguments, or final answers even to authorized recipients. |
| Secrets can enter prompts or traces | Credential exposure becomes a normal runtime possibility. |
| Sensitive data has no classification | Privacy, retention, redaction, and eval-set rules cannot be enforced. |
| High-risk actions lack approval records | Human oversight cannot be audited or replayed. |
| Write tools lack idempotency | Retries can create duplicate external effects. |
| No online validation before finalization | Bad outputs can ship even when an evaluator would have caught them. |
| No replayable release tuple | Incidents cannot be reconstructed against the versions that actually ran. |
| No rollback or compensation path | The team can observe damage but cannot stop or reverse it. |
| No incident owner or kill switch | Operational response begins with searching for responsibility. |
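Two of these blockers, schema-only binding and missing idempotency, are cheap to close at the gateway. The sketch below is a toy in-memory version under stated assumptions: `ToolGateway`, `ScopeError`, and the `issue_refund` tool are hypothetical, and a real gateway would persist idempotency keys and denial logs durably.

```python
class ScopeError(PermissionError):
    """Raised when a schema-valid call targets an object outside the run's scope."""


class ToolGateway:
    def __init__(self, tools):
        self._tools = tools      # tool name -> callable(target_id, **args)
        self._results = {}       # idempotency key -> stored result
        self.denials = []        # (tool, target_id) pairs that were refused

    def execute(self, tool, target_id, authorized_ids, idempotency_key, **args):
        if tool not in self._tools:
            raise KeyError(f"unknown tool: {tool}")
        # Object scope is checked outside the model, before any side effect.
        if target_id not in authorized_ids:
            self.denials.append((tool, target_id))
            raise ScopeError(f"{tool} may not act on {target_id}")
        # A retry with the same key returns the stored result, not a second effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._tools[tool](target_id, **args)
        self._results[idempotency_key] = result
        return result


ledger = []


def issue_refund(order_id, amount):
    ledger.append((order_id, amount))
    return f"refunded {amount} on {order_id}"


gateway = ToolGateway({"issue_refund": issue_refund})
scope = frozenset({"order-17"})  # derived from the task or user, not from the model

first = gateway.execute("issue_refund", "order-17", scope, "run-1:step-3", amount=20)
retry = gateway.execute("issue_refund", "order-17", scope, "run-1:step-3", amount=20)

try:
    gateway.execute("issue_refund", "order-99", scope, "run-1:step-4", amount=20)
    denied = False
except ScopeError:
    denied = True
```

After this runs, the ledger holds a single refund despite the retry, and the call against `order-99` is refused and logged even though its arguments are schema-valid.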
Evidence bundle
The audit should end with links to the exact evidence inspected. No artifact, no pass.
| Evidence artifact | Required fields |
|---|---|
| AgentSpec | Purpose, owner, autonomy class, allowed intents, denied intents |
| CompiledContext | Context pack version, sources used, omissions, token budget |
| PolicyDecision | Rule IDs, inputs, verdict, enforcement point |
| ToolEnvelope | Tool name, schema version, arguments, risk class, approval mode |
| RunTrace | Model calls, tool calls, handoffs, policy checks, evals, approvals, errors |
| EvalScorecard | Safety, policy, utility, trajectory, latency, cost |
| DecisionRecord | Final decision, evidence, policy verdicts, tool outputs, user-visible response |
| ReleaseTuple | Prompt, model, policy, tool, context, eval, memory versions |
| ReplayRecord | Reconstructability status and replay result |
| FixQueue | Control gap, owner, severity, due date, expected evidence |
ContextOS names some of these artifacts directly: CompiledContext, ToolEnvelope, DecisionRecord, and release-gated evaluation records. Other stacks will use different names. The audit does not require ContextOS terminology. It requires equivalent evidence.
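Checking an artifact for its required fields is a small amount of code. The sketch below covers three of the ten artifacts; the snake_case field names are an assumed serialization of the table's required fields, and `missing_fields` is a hypothetical helper, not a ContextOS function.

```python
# Required fields per artifact, transcribed from the table into assumed key names.
REQUIRED_FIELDS = {
    "AgentSpec": {
        "purpose", "owner", "autonomy_class", "allowed_intents", "denied_intents",
    },
    "PolicyDecision": {"rule_ids", "inputs", "verdict", "enforcement_point"},
    "ToolEnvelope": {
        "tool_name", "schema_version", "arguments", "risk_class", "approval_mode",
    },
}


def missing_fields(artifact_type, artifact):
    """Return the required fields that are absent, None, or an empty string."""
    required = REQUIRED_FIELDS[artifact_type]
    present = {key for key, value in artifact.items() if value not in (None, "")}
    return sorted(required - present)


envelope = {
    "tool_name": "issue_refund",
    "schema_version": "1.2",
    "arguments": {"order_id": "o-1"},
    "risk_class": "financial",
}
gaps = missing_fields("ToolEnvelope", envelope)
```

An envelope with no `approval_mode` would be a Partial at best under the scoring rule, since the approval decision cannot be proven from the record.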
Maturity model
Not every agent needs the same bar on day one. Every agent does need a declared maturity band so the risk conversation is explicit.
| Maturity | Appropriate use | Required controls | Not allowed |
|---|---|---|---|
| Prototype | Internal exploration, no real side effects, synthetic or low-risk data | Agent charter, basic eval set, trace capture, tool sandbox | Real users, PII, money movement, durable memory |
| Controlled beta | Limited users, explicit supervision, compensating controls | P0 controls for touched surfaces, approval gates, offline evals, trace review, fix queue | Direct high-risk execution without human gate |
| Production | Real users, real tools, monitored release lifecycle | Full P0 pass, P1 gaps owned, live validation, replay, rollback, incident playbook | Unversioned prompt/model/tool changes |
| Regulated or high-risk | Regulated data, financial movement, legal, health, security, destructive actions | Full P0/P1 pass, red-team audit, retention policy, evidence retention, formal release governance | Informal approval, undocumented memory, non-replayable action |
The maturity band is not a marketing label. It determines which failures block launch.
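One way to make that concrete is a gate function that takes the declared band and returns the controls blocking launch. This is a simplified reading of the table, not a definitive policy: it treats anything short of pass as a gap, maps the prototype requirements onto four checklist names by assumption, and uses hypothetical names (`Finding`, `launch_blockers`, the band strings).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping of "charter, basic eval set, trace capture, tool sandbox".
PROTOTYPE_CONTROLS = {
    "Agent charter", "Offline evals", "Observability", "Network and sandbox controls",
}


@dataclass
class Finding:
    control: str
    severity: str                 # "P0", "P1", or "P2"
    score: str                    # "pass", "partial", or "fail"
    owner: Optional[str] = None   # who owns the gap, if anyone
    touched: bool = True          # does this agent exercise the surface?


def launch_blockers(band, findings):
    """Return the controls whose gaps block launch for the declared band."""
    blockers = []
    for f in findings:
        if f.score == "pass":
            continue
        if band == "prototype":
            blocks = f.control in PROTOTYPE_CONTROLS
        elif band == "controlled_beta":
            blocks = f.severity == "P0" and f.touched
        elif band == "production":
            # Full P0 pass; P1 gaps are tolerated only when owned.
            blocks = f.severity == "P0" or (f.severity == "P1" and not f.owner)
        elif band == "regulated":
            blocks = f.severity in ("P0", "P1")
        else:
            raise ValueError(f"unknown maturity band: {band}")
        if blocks:
            blockers.append(f.control)
    return blockers
```

The same set of findings can therefore clear a controlled beta and still block a regulated launch, which is why the band has to be declared before the audit is scored.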
A sample smoke audit
The smoke audit is the 30-minute version. Use it before demos, controlled beta gates, or leadership review.
| Minute | Action |
|---|---|
| 0-5 | Pick one successful trace and one boundary trace. Create the scorecard. |
| 5-10 | Inspect AgentSpec, autonomy boundary, intent_id, and plan artifact. |
| 10-15 | Inspect CompiledContext, source registry, evidence manifest, memory access, and budget report. |
| 15-20 | Inspect policy bundle, PolicyDecision, data labels, privacy controls, and approval record. |
| 20-25 | Inspect tool manifest, ToolEnvelope, identity claim, sandbox policy, idempotency key, and fallback behavior. |
| 25-30 | Inspect evaluator result, trace spans, release tuple, replay status, rollback path, and dashboard dimensions. |
The smoke audit will feel rushed the first time. That is the point. Production does not give you a week to discover where evidence lives.
How to read the score
Treat P0 failures as launch blockers. Treat P1 failures as beta constraints or reliability debt with named owners. Treat P2 failures as maturity work unless they combine with a higher-risk surface.
The most common dependency pattern looks like this:
| If this fails | It usually blocks |
|---|---|
| Agent charter | Autonomy, policy, release governance |
| Context compiler | Grounding, validation, observability, replay |
| Policy engine | Tool control, approval, privacy, compliance |
| Identity and authorization | Tool safety, resource-scope binding, incident response, audit |
| Resource and object scope binding | Trustworthy writes, disclosure control, tenancy isolation |
| Inter-agent communication policy | Outbound disclosure, role-local authority, multi-agent safety |
| Observability | Replay, rollback, measurement, incident analysis |
| Release tuple | Offline evals, rollback, regression management |
| Continuous improvement | Sustainable quality and post-incident repair |
Do not launch dozens of workstreams. Pick the most load-bearing failure, fix it well, and rerun the audit. A good harness is built in passes, not declarations.
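Picking the most load-bearing failure can start from the dependency table itself. The sketch below assumes "load-bearing" means "blocks the most downstream items", with ties broken by table order; both choices are assumptions, and `most_load_bearing` is a hypothetical helper.

```python
# Transcribed from the dependency table above, in table order.
BLOCKS = {
    "Agent charter": ["autonomy", "policy", "release governance"],
    "Context compiler": ["grounding", "validation", "observability", "replay"],
    "Policy engine": ["tool control", "approval", "privacy", "compliance"],
    "Identity and authorization": [
        "tool safety", "resource-scope binding", "incident response", "audit",
    ],
    "Resource and object scope binding": [
        "trustworthy writes", "disclosure control", "tenancy isolation",
    ],
    "Inter-agent communication policy": [
        "outbound disclosure", "role-local authority", "multi-agent safety",
    ],
    "Observability": ["replay", "rollback", "measurement", "incident analysis"],
    "Release tuple": ["offline evals", "rollback", "regression management"],
    "Continuous improvement": ["sustainable quality", "post-incident repair"],
}


def most_load_bearing(failed):
    """Return the failed control that blocks the most items, or None."""
    candidates = [control for control in BLOCKS if control in failed]
    if not candidates:
        return None
    # max() keeps the first maximum, so ties resolve in table order.
    return max(candidates, key=lambda control: len(BLOCKS[control]))


next_fix = most_load_bearing({"Release tuple", "Observability", "Agent charter"})
```

A count of downstream items is a crude proxy. In practice the choice should also weigh severity and which surfaces the agent actually touches.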
The stronger standard
This post should no longer be read as “here are eight properties every harness should guarantee.”
The stronger standard is:
Here is the production readiness audit for agent harnesses: forty-four controls grouped into eight outcomes, with evidence required for every pass.
After this audit, the sentence “we have a harness” should either become precise or disappear. Precise sounds like this:
For this trace, we can reconstruct the context, prove policy enforcement, enumerate allowed tools, show identity and approval, validate the trajectory, follow the trace, replay the release tuple, roll back or compensate the effect, compare metrics by intent and version, and promote the correction through review.
A harness is evidence, not confidence.
What to read next
- Run the Harness Audit — this checklist as a runnable Claude Code skill
- Agent Harness Whitepaper
- Harness Engineering
- How to Develop an Agent with an Agent Harness, End to End
- Replay Is the Real Audit Log
