May 9, 2026 · 21 min read

The Agent Harness Audit: A Production Readiness Checklist for Governed AI Agents


The most dangerous sentence in agent engineering is: “we have a harness.”

Most teams do not. They have a prompt, a tool registry, a few evals, a dashboard, and a human escalation path that works when the right person is awake. Those are useful components. They are not yet a production harness.

An Agent Harness is the governed runtime control plane around an agent: context, policy, tools, state, memory, evaluation, telemetry, approvals, rollback, and continuous improvement. It decides what the agent may know, what it may do, which rules it must obey, how every action is observed, how failures are reversed, and how corrections become safer future behavior.

A production agent without a harness is not autonomous software. It is a probabilistic process with access to tools.

This audit separates real harnesses from hopeful wrappers.

Run this audit against your own agent
The 44 controls below ship as a Claude Code skill that reads your real repo and traces, scores each control with file:line evidence, and hands back a fix queue. No artifact, no pass.

The original ContextOS eight harness properties are still the right outcome groups: context-aware, policy-governed, tool-controlled, validated, observable, reversible, measurable, and continuously improving. They are not enough as a checklist. A serious audit needs to inspect the actual runtime surfaces underneath those outcomes.

Why the bar is higher now

Modern agent frameworks are converging on the same lesson: the important runtime work lives outside the prompt.

OpenAI Agents SDK treats tools, handoffs, sessions, context management, guardrails, human-in-the-loop, MCP, and tracing as runtime concerns. Its tracing guide captures LLM generations, tool calls, handoffs, guardrails, and custom events as part of the run record. Its guardrails guide distinguishes input, output, and tool guardrails at different workflow boundaries.

Google ADK puts similar pressure on lifecycle surfaces: callbacks can block model or tool execution, request credentials, manage state deliberately, and save or load artifacts. NIST AI RMF frames AI risk management around Govern, Map, Measure, and Manage, with governance as the cross-cutting function across the lifecycle. OWASP’s LLM Top 10 names prompt injection, sensitive information disclosure, insecure plugin design, excessive agency, and overreliance as core application risks. MCP security guidance calls out confused deputy, token passthrough, SSRF, and session hijacking risks in tool and resource ecosystems. OpenTelemetry gives a vendor-neutral anchor for correlating traces, metrics, and logs across those boundaries.

That is the operating context for this audit. A harness is not a confidence story. It is evidence.

Audit rule: No artifact, no pass.

A team can be sophisticated and still fail the audit if the proof is missing. The audit judges the runtime record, not the architecture diagram.

  • Scope (real run): Inspect an ordinary success and a boundary case, not a demo transcript.
  • Evidence (pinned artifacts): Specs, context, policy decisions, tool envelopes, scorecards, traces, and release tuples.
  • Decision (launch gate): P0 failures block production. P1 failures require beta constraints or compensating controls.
  • Output (fix queue): Every gap needs an owner, due date, severity, expected evidence, and next audit date.

What a harness is not

A harness is not the system prompt. The prompt can describe behavior, but it cannot reliably enforce identity, network boundaries, side effects, redaction, retention, or rollback.

A harness is not just an agent framework. LangGraph, ADK, OpenAI Agents SDK, Semantic Kernel, CrewAI, or custom orchestration can be the execution substrate. The harness is the governed runtime discipline around the agent: manifests, policy, tool envelopes, approval gates, trace spans, evals, replay, and release control.

A harness is not only observability. Traces are necessary, but a trace that records unsafe execution after the fact is not a control. The harness has to prevent, gate, degrade, abort, escalate, replay, and improve.

A harness is not a tool registry. A registry names capabilities. A harness decides which capability is visible, callable, authorized, approved, sandboxed, idempotent, traced, reversible, and safe under this run’s identity and risk class.
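The difference between naming a capability and deciding on it can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `ToolSpec`, `RunContext`, `gate_tool_call`, and the two example tools are hypothetical names introduced here, and a real gateway would also check schema, identity, sandboxing, and idempotency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """One registry entry (hypothetical shape)."""
    name: str
    risk_class: str            # e.g. "read" or "financial"
    allowed_roles: frozenset   # roles that may see and call the tool
    requires_approval: bool = False


@dataclass(frozen=True)
class RunContext:
    """Identity and approvals attached to one run (hypothetical shape)."""
    role: str
    approved_tools: frozenset = frozenset()


def gate_tool_call(registry: dict, tool_name: str, run: RunContext) -> str:
    """Return 'deny', 'needs_approval', or 'allow' for one proposed call."""
    spec = registry.get(tool_name)
    if spec is None:
        return "deny"                       # undeclared tools are never callable
    if run.role not in spec.allowed_roles:
        return "deny"                       # declared, but not for this identity
    if spec.requires_approval and tool_name not in run.approved_tools:
        return "needs_approval"             # callable only after a recorded approval
    return "allow"


# Example registry: the registry names tools, the gate decides per run.
registry = {
    "search_orders": ToolSpec("search_orders", "read", frozenset({"support", "billing"})),
    "issue_refund": ToolSpec("issue_refund", "financial", frozenset({"billing"}),
                             requires_approval=True),
}
```

The same registry yields different verdicts for different runs, which is the point: the decision depends on the run's identity and approvals, not on the tool list alone.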

The four audit planes

These are audit planes, not a replacement for the ContextOS five-plane model. Use them to group evidence during readiness review.

| Audit plane | What it controls | Typical artifacts |
| --- | --- | --- |
| Control plane | Agent contract, autonomy, context, policy, identity, release tuple | AgentSpec, autonomy matrix, source registry, policy bundle, ReleaseTuple |
| Execution plane | Planning, tools, state, memory, durable workflow, fallback behavior | plan artifact, ToolEnvelope, state machine, checkpoint, memory proposal |
| Risk plane | Data class, privacy, approvals, sandboxing, red-team coverage, incident response | data labels, redaction trace, approval record, egress policy, playbook |
| Learning plane | Offline evals, trajectory evals, telemetry, business metrics, corrections, promotion | EvalScorecard, trace dashboard, replay record, fix queue, promotion gate |

The audit protocol

The old “30-minute audit” is still useful, but only as a smoke test. It should find obvious launch blockers before a demo, beta, or leadership review. It is not the full production audit.

| Audit mode | When to use | Duration | Output |
| --- | --- | --- | --- |
| Smoke audit | Before demo, beta, or leadership review | 30 min | Obvious launch blockers |
| Production readiness audit | Before real user launch | 2-4 hours | Full checklist scorecard |
| Red-team audit | Before high-risk tool or action rollout | 1-2 days | Exploit paths, abuse cases, mitigation backlog |
| Post-incident audit | After failure or near miss | Same day | Root cause, failed control, replay evidence, fix queue |
| Quarterly harness audit | For mature production agents | Quarterly | Drift, regressions, new risks, maturity roadmap |

For every mode, inspect at least two runs:

  • One ordinary successful run for a representative intent.
  • One boundary run that crossed a policy denial, tool error, approval gate, evaluator failure, escalation, rollback, replay, or fallback path.

The happy path tells you what the harness does when nothing is stressed. The boundary path tells you whether the harness exists when it matters.

Scoring

Score each control with the same strict rule.

| Score | Meaning |
| --- | --- |
| Pass | A script, trace, manifest, dashboard, or record proves the control for a real run. |
| Partial | The control exists, but coverage is incomplete, manual, delayed, undocumented, or not enforced at the right boundary. |
| Fail | The control is absent, unenforced, unverifiable, or only described in prose. |

If it takes more than five minutes to find the evidence, score it as fail. That is not pedantry. A control that cannot be found under pressure will not protect the system under pressure.
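The scoring rule, including the five-minute limit, can be written down as a small function. This is a sketch under assumptions: the `Evidence` fields and `score_control` are hypothetical names, and reducing "partial" to two booleans is a simplification of the fuller definition in the table.

```python
from dataclasses import dataclass
from typing import Optional

FIVE_MINUTES = 5 * 60  # seconds; the evidence-retrieval limit from the rule above


@dataclass
class Evidence:
    """What the auditor found for one control (hypothetical shape)."""
    artifact_link: Optional[str]   # link to the trace, manifest, or record
    seconds_to_find: float         # how long it took to locate the artifact
    enforced_at_boundary: bool     # control runs at the right runtime boundary
    coverage_complete: bool        # not manual, delayed, or partial


def score_control(ev: Evidence) -> str:
    """Apply 'no artifact, no pass' plus the five-minute rule."""
    if not ev.artifact_link:
        return "fail"              # only described in prose
    if ev.seconds_to_find > FIVE_MINUTES:
        return "fail"              # cannot be found under pressure
    if ev.enforced_at_boundary and ev.coverage_complete:
        return "pass"
    return "partial"
```

Recording the time to find each artifact also gives the fix queue a concrete number to improve between audits.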

Severity is separate from pass state.

| Severity | Meaning |
| --- | --- |
| P0 | Launch blocker for production agents with real users, real tools, money, PII, regulated data, or external side effects. |
| P1 | Required for reliable scale; acceptable only in controlled beta with compensating controls. |
| P2 | Maturity improvement; not always a launch blocker, but needed for enterprise-grade operations. |

The complete checklist

Use this as the working surface for the readiness audit.

| # | Harness facet | Audit question | Minimum pass evidence | Immediate fail signal | Severity |
| --- | --- | --- | --- | --- | --- |
| 1 | Agent charter | Are the agent’s job, scope, user type, and autonomy level explicitly declared? | Versioned AgentSpec with purpose, owner, allowed intents, denied intents, autonomy class | “The prompt describes it” is the only source of truth | P0 |
| 2 | Autonomy boundary | Does the system know what the agent can answer, recommend, decide, or execute? | Autonomy matrix: inform, recommend, draft, execute_with_approval, execute_directly | Same path used for low-risk Q&A and high-risk actions | P0 |
| 3 | Intent taxonomy | Are runs mapped to stable intent IDs? | intent_id, confidence, router decision, fallback intent in trace | Metrics only show global agent success | P1 |
| 4 | Planner/executor split | Is planning separated from execution for non-trivial tasks? | Plan artifact, execution steps, approval gates, state transitions | Model jumps directly from user prompt to tool execution | P0 |
| 5 | Context source registry | Are all eligible context sources declared and governed? | Source registry with owner, freshness, sensitivity, access mode, TTL | Retrieval pulls from undocumented indexes or ad hoc APIs | P0 |
| 6 | Context compiler | Can you prove what context entered the model? | CompiledContext, pack version, source hashes, truncation record, token budget | Runtime string concatenation with no manifest | P0 |
| 7 | Grounding and evidence | Are claims grounded in retrieved or tool-backed evidence? | Evidence manifest with citations, source IDs, and confidence | Final answer includes factual claims with no source lineage | P1 |
| 8 | Context budget control | Does the system control what gets dropped when the context window fills? | Budget policy by source type, truncation reason, priority order | Oldest or random text is dropped silently | P1 |
| 9 | Memory read policy | Is memory retrieval intentional, scoped, and auditable? | Memory query log, memory IDs used, purpose, freshness, consent basis | Agent loads long-term memory by default without reason | P0 |
| 10 | Memory write policy | Are new memories validated before persistence? | Memory write proposal, dedup, sensitivity check, TTL, reviewer or auto-approval rule | Every conversation summary becomes memory | P0 |
| 11 | Contradiction handling | Can memory conflicts be detected and resolved? | Conflict record, recency, source confidence, supersession rule | Old incorrect memory keeps influencing future runs | P1 |
| 12 | Policy engine | Are rules enforced outside the model? | Versioned policy bundle, rule IDs, policy_decision_id, verdict, inputs | “The model has been instructed not to” | P0 |
| 13 | Data classification | Does the harness know what data class it is handling? | Data labels: public, internal, confidential, PII, payment, regulatory | Tool or prompt receives sensitive data without classification | P0 |
| 14 | Privacy controls | Are redaction, minimization, and retention enforced? | Redaction trace, retention policy, purpose limitation, deletion path | PII appears in prompts, traces, or eval sets without controls | P0 |
| 15 | Tool manifest | Are available tools declared, versioned, owned, and scoped to least privilege? | Tool manifest with schema, owner, risk class, timeout, retry, auth mode, plus a tool surface scoped to what the role or intent actually needs | Tool list lives only in the prompt, or every agent gets the full toolbox regardless of task | P0 |
| 16 | Tool Gateway | Are tool calls validated before execution? | Schema validation, argument validation, policy check, approval mode in envelope | Model-named tools can be invoked dynamically | P0 |
| 17 | Tool risk class | Are side effects classified? | Canonical approval mode plus side-effect class such as financial, destructive, or regulated | Refund, cancel, payment, or delete tool treated like search | P0 |
| 18 | Identity and authorization | Does every tool call run under scoped identity? | User or service identity, scoped token, RBAC or ABAC decision, expiry | Shared static credentials used by all agents | P0 |
| 19 | Secret handling | Are credentials isolated from model-visible context? | Secrets vault, no secrets in prompts or logs, scoped runtime injection | API keys or tokens can enter model context | P0 |
| 20 | Network and sandbox controls | Is external access constrained? | Egress allowlist, sandbox policy, file and network restrictions | Agent can call arbitrary URLs or execute arbitrary code | P0 |
| 21 | Human approval | Are approval gates explicit for high-risk actions? | Approval request, approver identity, decision, expiry, reason | Human approval happens informally over Slack or chat | P0 |
| 22 | Escalation path | Can the agent hand off cleanly when confidence is low or risk is high? | Escalation policy, queue, reason code, transcript and context package | Agent keeps trying after repeated failure | P1 |
| 23 | State machine | Is execution state explicit and durable? | State transitions, checkpoints, event log, current state visible | State exists only in chat history or process memory | P1 |
| 24 | Idempotency | Are repeated tool calls safe? | Idempotency keys for write tools, duplicate detection | Retry can create duplicate booking, refund, ticket, or action | P0 |
| 25 | Durable execution | Can long-running tasks resume after failure? | Checkpoint and replay mechanism, resumable workflow ID | Failure requires restarting from user prompt | P1 |
| 26 | Offline evals | Are changes tested before release? | Golden set, scenario set, regression suite, model/prompt/tool version comparison | Prompt or model changes go live without replay | P0 |
| 27 | Trajectory evals | Does evaluation check the path, not just the final answer? | Expected vs actual tool trajectory scored on coverage, precision, resource scope (correct object arguments), and minimality | Final answer judged “good” despite wrong process, or judged on output alone with no path assertion | P1 |
| 28 | Online validation | Can bad outputs be blocked at runtime? | Live critic or evaluator, policy-respect check, safety/utility score before finalization | Evals run only weekly, monthly, or offline | P0 |
| 29 | Red-team coverage | Is the harness tested against adversarial behavior and realistic perturbation? | Tests span indirect injection (planted instructions do not propagate), ambiguous goals (agent clarifies instead of acting irreversibly), and tool errors (#44), plus tool abuse, data leakage, and jailbreak suites | Only happy-path demo queries are tested; injection and ambiguity are untested | P0 |
| 30 | Observability | Can an engineer reconstruct the full run? | Trace spans for context, model, tools, policy, eval, approval, final response | Logs stop at “LLM returned response” | P0 |
| 31 | Standard telemetry | Are traces, logs, and metrics correlated? | Stable trace_id, run_id, intent_id, user_id, session_id, tool_call_id | Logs exist but cannot be joined | P1 |
| 32 | Cost and latency controls | Are token, tool, and runtime costs bounded? | Budget policy, per-intent cost, latency SLO, timeout behavior | Agent loops until budget is exhausted | P1 |
| 33 | Model/provider routing | Is model choice explicit and measurable? | Model routing policy, fallback model, quality/cost/latency comparison | Model changed without traceable release | P1 |
| 34 | Fallback behavior | What happens when model, tool, policy, or eval fails? | Typed fallback: retry, degrade, ask user, escalate, abort | Agent produces generic apology or retries blindly | P1 |
| 35 | Release tuple | Are all moving parts versioned together? | Tuple: prompt, model, policy, tools, context pack, eval suite, memory schema | Prompt, model, and tool changes tracked separately | P0 |
| 36 | Replayability | Can a past run be reconstructed? | Pinned inputs, context, tool outputs, policies, model version, evaluator version | Historical trace cannot be replayed | P0 |
| 37 | Rollback and compensation | Can damage be stopped or reversed? | Rollback command, previous release tuple, compensation path for writes | Rollback means “ask people not to use it” | P0 |
| 38 | Incident response | Is there an agent-specific incident playbook? | Severity matrix, owner, kill switch, escalation channel, postmortem template | No one knows who owns a bad agent action | P0 |
| 39 | Business measurement | Are agent outcomes tied to real impact? | Task success, conversion, deflection, CSAT, revenue, risk, cost by intent/version | Only number of chats and thumbs-up are tracked | P1 |
| 40 | Continuous improvement | Do failures become governed improvements? | Correction -> proposal -> replay -> review -> approval -> promotion -> live monitoring | Fixes happen as unreviewed prompt edits | P1 |
| 41 | Resource and object scope binding | Are tool calls bound to an authorized set of objects, not just a valid schema? | Write and read tools resolve the target object against a per-task or per-user authorized scope (allowlist, ownership check, or task-derived scope), enforced outside the model; out-of-scope object access is denied and logged | A schema-valid call can act on any ID the model emits: the right tool on the wrong customer, file, or record | P0 |
| 42 | Outbound disclosure control | Is what leaves the agent (final answers, tool arguments, handoffs, forwarded content) checked against data class, not just the recipient? | Sensitive fields minimized or redacted before they enter a tool argument, an inter-agent message, or the final response; an egress check proves a classified field did not reach an unauthorized sink | Recipient is authorized but the payload is over-shared: PII, secrets, or out-of-scope records flow through handoff context or the final answer | P0 |
| 43 | Inter-agent communication policy | In a multi-agent harness, are who-may-talk-to-whom, tool ownership, and delegation boundaries declared and enforced? | Communication topology (allowed role-to-role channels), role-local authority, and delegation boundaries are explicit and enforced; N/A for single-agent harnesses | Any agent can call any tool or message any agent; a coordinator oversteps by executing what it should delegate | P0 |
| 44 | Honest failure under tool error | When a tool or backend misbehaves, does the agent report failure instead of fabricating success? | On a tool error, empty result, or junk return, the agent acknowledges the failure in output or state and retries within scope or safely defers | Agent invents a result, claims completion with no supporting tool call, or takes an out-of-scope action after the failure | P1 |
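Control 41 is easy to misread as schema validation, so a small sketch may help. It assumes the authorized scope is a set of object IDs derived outside the model for this task or user; `ScopeViolation`, `bind_object_scope`, `refund_order`, and the order IDs are hypothetical names used only for illustration.

```python
class ScopeViolation(Exception):
    """Raised when a tool call targets an object outside the run's scope."""


def bind_object_scope(authorized_ids: frozenset, requested_id: str, audit_log: list) -> str:
    """Resolve the target against the authorized scope; deny and log otherwise."""
    if requested_id not in authorized_ids:
        # The call may be schema-valid, but the object is not in scope.
        audit_log.append({"event": "scope_denied", "object_id": requested_id})
        raise ScopeViolation(f"{requested_id} is outside the authorized scope")
    return requested_id


def refund_order(order_id: str, authorized_ids: frozenset, audit_log: list) -> dict:
    """Hypothetical write tool: the scope check runs before any side effect."""
    target = bind_object_scope(authorized_ids, order_id, audit_log)
    return {"refunded": target}
```

A run scoped to `frozenset({"order-17"})` can refund `order-17`, while a model-emitted `order-99` raises `ScopeViolation` and leaves a `scope_denied` entry in the log, which is the denied-and-logged evidence the control asks for.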

The eight properties as outcome groups

The eight properties are still the scorecard executives and product owners can remember. The forty-four controls are how engineering proves them.

| Rollup outcome | Controls it covers |
| --- | --- |
| Context-aware | Source registry, context compiler, grounding, budget control, memory read/write, contradiction handling |
| Policy-governed | Agent charter, autonomy boundary, policy engine, data classification, privacy controls, human approval, outbound disclosure control, inter-agent communication policy |
| Tool-controlled | Tool manifest, Tool Gateway, side-effect classification, identity, authorization, secrets, sandboxing, resource and object scope binding |
| Validated | Offline evals, trajectory evals, online validation, red-team tests, honest failure under tool error |
| Observable | Traceability, structured telemetry, audit records, event logs |
| Reversible | Idempotency, durable state, replay, rollback, compensation, release tuple |
| Measurable | Intent taxonomy, cost, latency, quality, business metrics, version dashboards |
| Continuously improving | Correction pipeline, golden-set growth, proposal review, promotion gates, live monitoring |

This gives the audit two levels: a full engineering checklist and a compact outcome score.

Blocking failures

These should stop production launch immediately unless the agent is fully read-only, isolated from real users, and confined to a controlled beta.

| Blocker | Why it stops launch |
| --- | --- |
| No explicit autonomy boundary | The system cannot distinguish answer, recommendation, draft, approved execution, and direct execution. |
| Rules enforced only by prompt | Safety, compliance, financial, and privacy controls are suggestions instead of runtime decisions. |
| Tool calls bypass a gateway | The model can reach capabilities that were never resolved, authorized, validated, or approved. |
| No scoped identity for actions | Audit cannot prove which agent, user, service, or delegation chain caused the effect. |
| Tool calls bound only by schema, not object scope | The right tool on the wrong customer, file, or record passes validation and crosses a boundary. |
| Outbound content has no disclosure control | Sensitive data leaks through handoffs, tool arguments, or final answers even to authorized recipients. |
| Secrets can enter prompts or traces | Credential exposure becomes a normal runtime possibility. |
| Sensitive data has no classification | Privacy, retention, redaction, and eval-set rules cannot be enforced. |
| High-risk actions lack approval records | Human oversight cannot be audited or replayed. |
| Write tools lack idempotency | Retries can create duplicate external effects. |
| No online validation before finalization | Bad outputs can ship even when an evaluator would have caught them. |
| No replayable release tuple | Incidents cannot be reconstructed against the versions that actually ran. |
| No rollback or compensation path | The team can observe damage but cannot stop or reverse it. |
| No incident owner or kill switch | Operational response begins with searching for responsibility. |
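The idempotency blocker is one of the cheaper ones to close. The sketch below assumes the key is derived from the run ID, tool name, and canonicalized arguments, and that results are cached in memory; `idempotency_key` and `WriteToolExecutor` are hypothetical names, and a production version would persist keys in durable storage and forward them to the downstream API where it supports them.

```python
import hashlib
import json


def idempotency_key(run_id: str, tool_name: str, arguments: dict) -> str:
    """Derive a stable key; sort_keys makes argument order irrelevant."""
    payload = json.dumps(
        {"run_id": run_id, "tool": tool_name, "args": arguments}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class WriteToolExecutor:
    """Wraps a side-effecting callable so a retry replays the stored result."""

    def __init__(self, effect):
        self._effect = effect      # the real write, e.g. a refund API call
        self._results = {}         # key -> result (in-memory for this sketch)

    def execute(self, run_id: str, tool_name: str, arguments: dict):
        key = idempotency_key(run_id, tool_name, arguments)
        if key in self._results:
            return self._results[key]   # duplicate detected: no second effect
        result = self._effect(arguments)
        self._results[key] = result
        return result
```

One consequence of this key choice is that an intentionally repeated identical write inside the same run is also deduplicated; a harness that needs that case would add a step or attempt identifier to the key.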

Evidence bundle

The audit should end with links to the exact evidence inspected. No artifact, no pass.

| Evidence artifact | Required fields |
| --- | --- |
| AgentSpec | Purpose, owner, autonomy class, allowed intents, denied intents |
| CompiledContext | Context pack version, sources used, omissions, token budget |
| PolicyDecision | Rule IDs, inputs, verdict, enforcement point |
| ToolEnvelope | Tool name, schema version, arguments, risk class, approval mode |
| RunTrace | Model calls, tool calls, handoffs, policy checks, evals, approvals, errors |
| EvalScorecard | Safety, policy, utility, trajectory, latency, cost |
| DecisionRecord | Final decision, evidence, policy verdicts, tool outputs, user-visible response |
| ReleaseTuple | Prompt, model, policy, tool, context, eval, memory versions |
| ReplayRecord | Reconstructability status and replay result |
| FixQueue | Control gap, owner, severity, due date, expected evidence |
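As one illustration of what an evidence artifact can look like in code, here is a sketch of a release tuple with a fingerprint that can be stamped on every trace. The field names follow the table row; the class shape, the `fingerprint` method, and the version strings in the usage note are assumptions, not ContextOS definitions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseTuple:
    """Versions of every moving part, pinned together (hypothetical shape)."""
    prompt: str
    model: str
    policy: str
    tools: str
    context_pack: str
    eval_suite: str
    memory_schema: str

    def fingerprint(self) -> str:
        """Short, stable hash of the whole tuple for trace and replay records."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

For example, `ReleaseTuple("p-12", "m-3", "pol-7", "tools-4", "ctx-9", "eval-5", "mem-2").fingerprint()` yields a 16-character identifier that changes whenever any single component version changes, so a trace carrying it can be matched to exactly the versions that ran.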

ContextOS names some of these artifacts directly: CompiledContext, ToolEnvelope, DecisionRecord, and release-gated evaluation records. Other stacks will use different names. The audit does not require ContextOS terminology. It requires equivalent evidence.

Maturity model

Not every agent needs the same bar on day one. Every agent does need a declared maturity band so the risk conversation is explicit.

| Maturity | Appropriate use | Required controls | Not allowed |
| --- | --- | --- | --- |
| Prototype | Internal exploration, no real side effects, synthetic or low-risk data | Agent charter, basic eval set, trace capture, tool sandbox | Real users, PII, money movement, durable memory |
| Controlled beta | Limited users, explicit supervision, compensating controls | P0 controls for touched surfaces, approval gates, offline evals, trace review, fix queue | Direct high-risk execution without human gate |
| Production | Real users, real tools, monitored release lifecycle | Full P0 pass, P1 gaps owned, live validation, replay, rollback, incident playbook | Unversioned prompt/model/tool changes |
| Regulated or high-risk | Regulated data, financial movement, legal, health, security, destructive actions | Full P0/P1 pass, red-team audit, retention policy, evidence retention, formal release governance | Informal approval, undocumented memory, non-replayable action |

The maturity band is not a marketing label. It determines which failures block launch.

A sample smoke audit

The smoke audit is the 30-minute version. Use it before demos, controlled beta gates, or leadership review.

| Minute | Action |
| --- | --- |
| 0-5 | Pick one successful trace and one boundary trace. Create the scorecard. |
| 5-10 | Inspect AgentSpec, autonomy boundary, intent_id, and plan artifact. |
| 10-15 | Inspect CompiledContext, source registry, evidence manifest, memory access, and budget report. |
| 15-20 | Inspect policy bundle, PolicyDecision, data labels, privacy controls, and approval record. |
| 20-25 | Inspect tool manifest, ToolEnvelope, identity claim, sandbox policy, idempotency key, and fallback behavior. |
| 25-30 | Inspect evaluator result, trace spans, release tuple, replay status, rollback path, and dashboard dimensions. |

The smoke audit will feel rushed the first time. That is the point. Production does not give you a week to discover where evidence lives.

How to read the score

Treat P0 failures as launch blockers. Treat P1 failures as beta constraints or reliability debt with named owners. Treat P2 failures as maturity work unless they combine with a higher-risk surface.
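Reading the score can be reduced to a small gate function. This sketch is stricter than the sentence above in one respect, and that is an assumption: it counts a "partial" as an open gap, on the reading that a production launch needs a full P0 pass. `launch_decision` and the control names in the usage note are illustrative.

```python
def launch_decision(results):
    """Map (control, severity, score) tuples to a gate verdict and its open gaps.

    Assumption: any score other than "pass" counts as an open gap.
    """
    results = list(results)  # allow generators; we scan twice
    open_p0 = [c for c, sev, score in results if sev == "P0" and score != "pass"]
    open_p1 = [c for c, sev, score in results if sev == "P1" and score != "pass"]
    if open_p0:
        return "blocked", open_p0          # P0 gaps stop production launch
    if open_p1:
        return "beta_only", open_p1        # P1 gaps need owners and constraints
    return "launch", []
```

For instance, `launch_decision([("Policy engine", "P0", "fail"), ("Intent taxonomy", "P1", "fail")])` returns `("blocked", ["Policy engine"])`: the P1 gap still exists, but the P0 gap decides the verdict.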

The most common dependency pattern looks like this:

| If this fails | It usually blocks |
| --- | --- |
| Agent charter | Autonomy, policy, release governance |
| Context compiler | Grounding, validation, observability, replay |
| Policy engine | Tool control, approval, privacy, compliance |
| Identity and authorization | Tool safety, resource-scope binding, incident response, audit |
| Resource and object scope binding | Trustworthy writes, disclosure control, tenancy isolation |
| Inter-agent communication policy | Outbound disclosure, role-local authority, multi-agent safety |
| Observability | Replay, rollback, measurement, incident analysis |
| Release tuple | Offline evals, rollback, regression management |
| Continuous improvement | Sustainable quality and post-incident repair |

Do not launch dozens of workstreams. Pick the most load-bearing failure, fix it well, and rerun the audit. A good harness is built in passes, not declarations.
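One rough way to pick the most load-bearing failure is to rank failed controls by how many downstream concerns the dependency table lists for them. The ranking rule is an assumption introduced here, not part of the audit: it is a crude heuristic, and severity or business context may reasonably override it. `BLOCKS` transcribes the table; `most_load_bearing` is a hypothetical helper.

```python
from typing import Optional

# Transcribed from the dependency table: failed control -> what it usually blocks.
BLOCKS = {
    "Agent charter": ["Autonomy", "policy", "release governance"],
    "Context compiler": ["Grounding", "validation", "observability", "replay"],
    "Policy engine": ["Tool control", "approval", "privacy", "compliance"],
    "Identity and authorization": ["Tool safety", "resource-scope binding",
                                   "incident response", "audit"],
    "Resource and object scope binding": ["Trustworthy writes", "disclosure control",
                                          "tenancy isolation"],
    "Inter-agent communication policy": ["Outbound disclosure", "role-local authority",
                                         "multi-agent safety"],
    "Observability": ["Replay", "rollback", "measurement", "incident analysis"],
    "Release tuple": ["Offline evals", "rollback", "regression management"],
    "Continuous improvement": ["Sustainable quality", "post-incident repair"],
}


def most_load_bearing(failed: set) -> Optional[str]:
    """Return the failed control that blocks the most downstream concerns.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [c for c in sorted(failed) if c in BLOCKS]
    return max(candidates, key=lambda c: len(BLOCKS[c]), default=None)
```

With `{"Agent charter", "Observability"}` failing, the helper returns `"Observability"` because it blocks four listed concerns against three; the team fixes that first and reruns the audit.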

The stronger standard

This post should no longer be read as “here are eight properties every harness should guarantee.”

The stronger standard is:

Here is the production readiness audit for agent harnesses: forty-four controls grouped into eight outcomes, with evidence required for every pass.

After this audit, the sentence “we have a harness” should either become precise or disappear. Precise sounds like this:

For this trace, we can reconstruct the context, prove policy enforcement, enumerate allowed tools, show identity and approval, validate the trajectory, follow the trace, replay the release tuple, roll back or compensate the effect, compare metrics by intent and version, and promote the correction through review.

A harness is evidence, not confidence.
