May 9, 2026
by Piyush · 15 min read

The Eight-Property Harness Audit: A 30-Minute Readiness Test for Production Agents

ContextOS
Harness Engineering
Audit
Production Readiness
Checklist

The dangerous sentence is: “we have a harness.”

Most teams mean something softer. They have a prompt template, a tool registry, a few evals, a dashboard, and a human escalation path that works when the right person is awake. Those are useful pieces. They are not yet a production harness.

A harness is load-bearing only when it can prove the agent saw the right context, obeyed policy outside the model, used only approved tools, passed validation, emitted an audit trail, could be reversed, was measured by intent and version, and improved from failures without tribal memory. If any of those guarantees is implicit, the system is still operating on trust.

This audit turns “we have a harness” into evidence. It is intentionally short enough to run before lunch and concrete enough to survive an incident review.

The eight properties come from the ContextOS Harness Engineering contract. This post is the operator version: what to inspect, what disqualifies the property, and the smallest fix that moves the system back toward a pass.

Field audit
Thirty minutes. Two traces. Three artifacts.
Treat the audit like a production readiness drill: one normal run, one boundary run, and enough evidence that another engineer can reproduce the score without asking what you meant.
  • Input: two trace IDs. One clean success and one run that crossed a policy, tool, eval, approval, replay, or rollback boundary.
  • Method: evidence before opinion. Manifests, envelopes, scorecards, traces, dashboards, and release tuples count. Narrative confidence does not.
  • Output: a ranked fix queue. Three repairs, each tied to the property it unblocks, an owner, and the next audit date.

The output

Do not end the audit with a discussion. End it with three artifacts:

| Artifact | What it contains | Why it matters |
| --- | --- | --- |
| Scorecard | Pass, partial, or fail for each property | Makes readiness visible without interpretation |
| Evidence bundle | Links to the run, manifests, traces, policy decisions, evaluator results, and dashboards inspected | Lets another engineer verify the score later |
| Fix queue | The top three repairs, each with owner, due date, and blocking property | Converts audit pain into change control |

The scorecard is not a maturity model. It is a release instrument. A team can be sophisticated and still fail the audit if the proof is missing. The inverse is also true: a small, plain harness with typed manifests, replayable runs, and boring gates may pass where a grand architecture cannot.
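If it helps to see the three artifacts as data rather than prose, here is a minimal sketch of the shapes involved. The field names and example values are assumptions for illustration, not a ContextOS schema.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyScore:
    name: str                     # e.g. "policy-governed"
    score: str                    # "pass" | "partial" | "fail"
    evidence: list[str] = field(default_factory=list)  # links into the evidence bundle

@dataclass
class Fix:
    description: str
    owner: str
    due: str                      # ISO date
    blocking_property: str        # the property this repair unblocks

@dataclass
class AuditResult:
    trace_ids: list[str]          # one clean run, one boundary run
    scorecard: list[PropertyScore]
    evidence_bundle: list[str]    # URLs to runs, manifests, traces, dashboards
    fix_queue: list[Fix]          # exactly three, ranked

result = AuditResult(
    trace_ids=["trace_ok_123", "trace_boundary_456"],
    scorecard=[PropertyScore("policy-governed", "fail")],
    evidence_bundle=["https://traces.example.com/trace_ok_123"],
    fix_queue=[Fix("Lift refund limit into policy bundle", "priya", "2026-05-23", "policy-governed")],
)
print(result.fix_queue[0].blocking_property)
```

A spreadsheet with the same columns satisfies the same contract; the point is that the output is structured enough for another engineer to verify later.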

The scoring rule

For each property, choose one of three answers.

| Score | Meaning |
| --- | --- |
| Pass | A script, trace, manifest, or dashboard proves the property for a real run. The property is enforced and visible on a real trace. |
| Partial | The property exists, but it depends on manual process, incomplete coverage, delayed checks, or undocumented convention. The shape is there, but it is not load-bearing. |
| Fail | The property is absent, unenforced, unverifiable, or only described in prose. The claim cannot be proved in five minutes from artifacts. |

If it takes more than five minutes to find the evidence, score it as fail. That may sound harsh, but the audit is testing operability as much as architecture. A control that cannot be found under pressure will not protect the system under pressure.

Run the audit on two traces: one ordinary successful run and one run that crossed a boundary - a policy denial, approval gate, evaluator failure, tool error, escalation, rollback, or replay. The happy path tells you what the harness does when nothing is stressed. The boundary path tells you whether the harness exists when it matters.

The checklist table

Use this table as the working surface for the audit. The long sections below explain how to judge each row; this is the version you keep open while inspecting traces.

| Score | Property | Minimum pass evidence | Immediate fail signal | First repair |
| --- | --- | --- | --- | --- |
| pass / partial / fail | Context-aware | Pinned CompiledContext, pack version, evidence manifest, tool manifest, policy manifest, budget report | Runtime prompt assembly with no manifest or truncation record | Move assembly into a typed compiler that emits manifests |
| pass / partial / fail | Policy-governed | Versioned policy bundle, policy_decision_id, rule inputs, rule ids, verdict in the DecisionRecord | "The model knows the rule" is the only enforcement | Lift one high-risk rule into an external policy bundle |
| pass / partial / fail | Tool-controlled | Tool manifest plus ToolEnvelope showing schema, owner, approval mode, policy decision, and exact args | Unknown or deprecated tools can still be invoked | Put tools behind the Gateway and refuse unresolved modes |
| pass / partial / fail | Validated | Live critic/evaluator scorecard with Policy, Utility, Latency, Safety, and Cost dimensions | Evals run only offline or only before deployment | Wire one live evaluator before completion |
| pass / partial / fail | Observable | Trace spans from compile through final DecisionRecord, with stable ids across every control | Logs stop at "model returned" or require human reconstruction | Standardize span attributes and trace every boundary |
| pass / partial / fail | Reversible | Prior release tuple, idempotency key, reversal path, and replayable historical record | Rollback is a manual message or historical inputs were overwritten | Pin release tuples and add idempotency to the first write tool |
| pass / partial / fail | Measurable | Dashboard sliced by intent_id, pack_version, evaluator scores, cost, latency, and business outcome | Only global averages exist | Require intent_id and pack_version on every run row |
| pass / partial / fail | Continuously improving | Correction becomes typed proposal, golden replay, reviewer verdict, approval, promotion, live scorecard | Prompt edits or retro notes are the only learning path | Capture the next correction as structured improvement data |

1. Context-aware

Claim. The agent sees the right task-specific information - no more, no less.

Evidence to collect. Pull the CompiledContext for the chosen trace. It should carry a pinned ContextPack version, an evidence_manifest with source hashes, a policy_manifest, a tool_manifest, and a budget_report that names every bucket and every truncation. You should be able to answer four questions from the artifact alone:

  • Which sources were eligible?
  • Which sources actually entered the context?
  • What was redacted, summarized, or truncated?
  • Which pack version made that decision?

Pass line. The same request can be recompiled from pinned inputs and produce the same context envelope. Every evidence item has provenance. Every omission is explained by a budget, source-priority, or policy rule.

Failure pattern. A monolithic prompt assembled at request time. Retrieval, memory, policy hints, and tool descriptions are concatenated into one string. When the context window fills, the wrong material is cut, and nobody can reconstruct what the model saw.

Smallest fix. Move context assembly into a typed compiler. Start with the eight-stage Context Pack Compiler shape even if several stages are still simple. The first win is not better retrieval. The first win is a manifest that names what happened.
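As a rough illustration of what "a manifest that names what happened" means, here is a minimal sketch of a compiler that emits an evidence manifest and a budget report. The stage names, fields, and truncation rule are assumptions for this example, not the ContextOS Context Pack Compiler API.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class EvidenceItem:
    source_id: str
    content: str      # what actually entered the context
    sha256: str       # hash of the full source, for provenance

@dataclass
class CompiledContext:
    pack_version: str
    evidence_manifest: list[EvidenceItem] = field(default_factory=list)
    budget_report: dict = field(default_factory=dict)   # source_id -> {"used": int, "truncated": bool}

def compile_context(pack_version: str, sources: dict[str, str], budget_chars: int) -> CompiledContext:
    ctx = CompiledContext(pack_version=pack_version)
    remaining = budget_chars
    for source_id, content in sources.items():
        kept = content[:remaining]                       # deterministic truncation rule
        ctx.evidence_manifest.append(
            EvidenceItem(source_id, kept, hashlib.sha256(content.encode()).hexdigest())
        )
        ctx.budget_report[source_id] = {"used": len(kept), "truncated": len(kept) < len(content)}
        remaining -= len(kept)
    return ctx

# Recompiling from the same pinned inputs reproduces the same envelope.
ctx = compile_context("pack@1.4.2",
                      {"kb:refunds": "Refunds over $200 require approval.", "crm:acct_77": "Tier: gold"},
                      budget_chars=40)
print(ctx.budget_report)
```

Even this crude version answers the four questions above from the artifact alone, which is the property being audited.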

2. Policy-governed

Claim. The agent cannot violate safety, compliance, or business rules because allow and deny decisions are enforced outside the model.

Evidence to collect. Open the policy bundle in effect for the trace. It should be versioned, signed or otherwise pinned, and evaluated at the boundary where the decision matters. Then find a policy_decision_id in the DecisionRecord and trace it back to the rule inputs, rule ids, and verdict.

Ask for a run where policy actually fired. A clean run is not enough. You need to see a real deny, require-approval, redact, or downgrade decision.

Pass line. A tool call, answer, or approval step can be blocked by policy without relying on the model to remember the rule. The policy verdict is present in the audit record and replayable from the inputs.

Failure pattern. The system prompt says “do not refund above the limit” or “never expose PII.” The model usually follows it. That is not governance. It is a suggestion in a probabilistic channel.

Smallest fix. Lift the most expensive or most violated rule out of the prompt and into a JsonLogic policy bundle. Add policy_decisions[] to the run record. One enforced rule changes the architecture; the rest become a backlog.
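To make "enforced outside the model" concrete, here is a sketch of one externalized rule producing a replayable decision record. The rule shape mimics JsonLogic, but the tiny evaluator only handles the operator this example needs; it is not a JsonLogic library, and the field names are illustrative.

```python
import uuid
from dataclasses import dataclass

REFUND_RULE = {"id": "refund_limit_v3", "logic": {"<=": [{"var": "refund_amount"}, 200]}}

def evaluate(logic: dict, data: dict) -> bool:
    # Minimal stand-in for a policy engine: resolves {"var": ...} references and applies "<=".
    op, args = next(iter(logic.items()))
    resolve = lambda a: data[a["var"]] if isinstance(a, dict) and "var" in a else a
    left, right = (resolve(a) for a in args)
    if op == "<=":
        return left <= right
    raise ValueError(f"unsupported operator: {op}")

@dataclass
class PolicyDecision:
    policy_decision_id: str
    rule_id: str
    inputs: dict
    verdict: str   # "allow" | "deny"

def decide(inputs: dict) -> PolicyDecision:
    allowed = evaluate(REFUND_RULE["logic"], inputs)
    return PolicyDecision(str(uuid.uuid4()), REFUND_RULE["id"], inputs, "allow" if allowed else "deny")

# The verdict lands in the DecisionRecord and can be replayed from the inputs,
# regardless of what the model happens to remember about the rule.
print(decide({"refund_amount": 350}).verdict)   # -> deny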

3. Tool-controlled

Claim. The agent can call only approved tools with declared schemas, ownership, and side-effect classifications.

Evidence to collect. Inspect the tool manifest. Every capability should declare a name, schema, owner, data class, timeout, retry policy, and approval_mode such as read_only, local_write, network, delegated, or destructive. Now inspect a real ToolEnvelope. The envelope should show the effective approval mode for that call, the policy decision that allowed it, and the exact arguments sent.

Pay special attention to negative space: tools that used to exist, beta tools, admin-only tools, and destructive tools that are easy to misclassify.

Pass line. The Tool Gateway refuses unknown tools, schema-invalid arguments, unresolved approval modes, and calls outside the active manifest. A tool cannot become reachable just because the model named it.

Failure pattern. The registry lives in a prompt, a YAML file nobody validates, or a loose SDK wrapper. Deprecated capabilities remain visible. A destructive operation runs as ordinary network access because nobody gave the gateway a stronger class.

Smallest fix. Put the tool list in a manifest read by the Tool Gateway. Refuse any call whose schema, owner, or approval mode cannot be resolved. This single move removes an entire class of “the model called something it should not have” incidents.
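A rough sketch of that refusal behavior, assuming a manifest-driven gateway; the manifest fields and gateway function are illustrative, not the ContextOS Tool Gateway interface.

```python
from dataclasses import dataclass

TOOL_MANIFEST = {
    "issue_refund": {
        "owner": "payments",
        "approval_mode": "destructive",
        "schema": {"order_id": str, "amount": float},
    },
}

@dataclass
class ToolEnvelope:
    tool_name: str
    args: dict
    approval_mode: str
    policy_decision_id: str

def gateway_call(tool_name: str, args: dict, policy_decision_id: str) -> ToolEnvelope:
    entry = TOOL_MANIFEST.get(tool_name)
    if entry is None:
        raise PermissionError(f"unknown tool: {tool_name}")            # the model naming it is not enough
    if entry.get("approval_mode") is None:
        raise PermissionError(f"unresolved approval mode: {tool_name}")
    for field_name, field_type in entry["schema"].items():
        if not isinstance(args.get(field_name), field_type):
            raise ValueError(f"schema-invalid argument: {field_name}")
    return ToolEnvelope(tool_name, args, entry["approval_mode"], policy_decision_id)

envelope = gateway_call("issue_refund", {"order_id": "o-991", "amount": 42.0}, "pd-123")
print(envelope.approval_mode)   # -> destructive, recorded alongside the exact args
```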

4. Validated

Claim. Output is checked by evaluators, rules, and tests before it counts as done.

Evidence to collect. Walk the trace after the model produced its proposal. What ran before the answer shipped or the action executed? You should see a critic step, evaluator scorecard, policy-respect check, and release-style gate. The scorecard should include at least Policy, Utility, Latency, Safety, and Cost dimensions, and it should be attached to the DecisionRecord.

Do not accept “we have evals” until you know where they run. Quarterly offline evals are useful for model selection. They are not live validation.

Pass line. A bad model output can be stopped after generation and before completion. The stop reason is typed, logged, and visible in the run record.

Failure pattern. The system validates prompts before deployment but not decisions during runtime. The output is treated as done because the model produced it, not because the harness accepted it.

Smallest fix. Wire one evaluator into the live path. Start with a mechanical Policy evaluator that returns 1.0 or 0.0 based on whether the policy verdict was respected. Then add Utility, Safety, Cost, and Latency scoring as separate concerns.
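As a sketch of what that first live evaluator can look like, here is a mechanical policy check that gates completion. The scorecard dimensions follow the post; the scoring logic and gate string are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    policy: float
    utility: float | None = None   # added later as separate concerns
    latency: float | None = None
    safety: float | None = None
    cost: float | None = None

def policy_evaluator(policy_verdict: str, action_taken: bool) -> float:
    # Mechanical check: 1.0 if the run respected the policy verdict, 0.0 otherwise.
    respected = not (policy_verdict == "deny" and action_taken)
    return 1.0 if respected else 0.0

def gate(scorecard: Scorecard) -> str:
    # A bad output can be stopped after generation and before completion,
    # with a typed stop reason that lands in the run record.
    return "blocked:policy_not_respected" if scorecard.policy < 1.0 else "released"

card = Scorecard(policy=policy_evaluator(policy_verdict="deny", action_taken=True))
print(gate(card))   # -> blocked:policy_not_respected
```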

5. Observable

Claim. Every decision and action is traceable end to end.

Evidence to collect. Pick a trace_id from yesterday. Pull the trace bundle. It should cover compile, plan, critic, tool call, policy decision, evaluator, approval, consolidate, and final DecisionRecord. The model output should be one span inside the run, not the run itself.

The trace should use stable identifiers: run_id, trace_id, intent_id, pack_version, policy_bundle_version, tool_name, approval_mode, policy_decision_id, and decision_record_id.

Pass line. An on-call engineer can answer “what happened and why” from the trace without reading chat logs, guessing at runtime state, or asking the model to explain itself after the fact.

Failure pattern. Logs exist but do not compose. JSON appears beside stack traces, policy verdicts are missing, tool arguments are redacted without replacement evidence, and the final answer is disconnected from the controls that shaped it.

Smallest fix. Standardize span attributes and emit them consistently. Adopt W3C Trace Context and the conventions in Evaluation and Observability. You may not need new infrastructure. You probably need cleaner contracts.
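One way to enforce that contract, assuming the team uses the OpenTelemetry Python API (pip install opentelemetry-api): refuse to emit a boundary span without the shared identifiers. The helper and attribute split are illustrative, not a prescribed convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.harness")

REQUIRED = ("run_id", "intent_id", "pack_version")
OPTIONAL = ("trace_id", "policy_bundle_version", "tool_name", "approval_mode",
            "policy_decision_id", "decision_record_id")

def harness_span(name: str, **attrs):
    # Traces only compose if every boundary span carries the same stable identifiers.
    missing = [k for k in REQUIRED if k not in attrs]
    unknown = [k for k in attrs if k not in REQUIRED + OPTIONAL]
    if missing or unknown:
        raise ValueError(f"span {name}: missing={missing} unknown={unknown}")
    span = tracer.start_span(name)
    for key, value in attrs.items():
        span.set_attribute(key, value)
    return span

span = harness_span("policy.decision", run_id="r-1", intent_id="refund",
                    pack_version="pack@1.4.2", policy_decision_id="pd-123")
span.end()
```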

6. Reversible

Claim. Failures can be rolled back, retracted, compensated, or replayed.

Evidence to collect. For every destructive capability, find the reversal token, idempotency key, and named compensation path. For every pack, policy, tool, or evaluator release, find the prior pinned tuple. Then run or inspect a replay that reconstructs a past DecisionRecord from pinned inputs.

Reversibility has two sides: operational rollback and forensic replay. You need both. A system that can roll back but cannot explain what happened is risky. A system that can explain what happened but cannot repair it is also risky.

Pass line. The team can stop new damage, restore the previous harness tuple, compensate the external action where possible, and replay the incident trace against the pinned snapshot.

Failure pattern. Rollback means a Slack message asking people to stop using the feature. Tool writes are not idempotent. Destructive operations have no reversal operation. Historical replay fails because the pack, policy, or knowledge snapshot was overwritten.

Smallest fix. Pin releases as a tuple: pack, policy, tool manifest, evaluator suite. Keep the prior tuple one command away. Add an idempotency key to the most important write tool. Then implement the matching reversal path. The full operating model is in Failure Playbooks.
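A minimal sketch of all three moves together, with made-up names: a pinned release tuple one call away from rollback, an idempotent write tool, and its named compensation path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseTuple:
    pack: str
    policy_bundle: str
    tool_manifest: str
    evaluator_suite: str

RELEASES = [ReleaseTuple("pack@1.4.1", "policy@7", "tools@3", "evals@2"),
            ReleaseTuple("pack@1.4.2", "policy@8", "tools@3", "evals@2")]   # current tuple is last

def rollback() -> ReleaseTuple:
    # The prior tuple is one command away; nothing historical is overwritten.
    RELEASES.pop()
    return RELEASES[-1]

_applied: dict[str, str] = {}   # idempotency_key -> refund_id

def issue_refund(order_id: str, amount: float, idempotency_key: str) -> str:
    if idempotency_key in _applied:          # replaying the call cannot double-charge
        return _applied[idempotency_key]
    refund_id = f"refund-{order_id}"
    _applied[idempotency_key] = refund_id
    return refund_id

def reverse_refund(refund_id: str) -> str:
    # The matching reversal path for the destructive capability above.
    return f"reversal-of-{refund_id}"

print(rollback())
print(issue_refund("o-991", 42.0, "idem-1") == issue_refund("o-991", 42.0, "idem-1"))  # True
```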

7. Measurable

Claim. Quality, cost, latency, safety, and business impact are tracked per intent and per pack version.

Evidence to collect. Open the dashboard for the chosen intent. You should see rows or slices for intent_id, pack_version, policy_bundle_version, model, evaluator scores, cost, latency, deflection or task success, approval rate, and escalation rate. Trend lines should go back far enough to show whether the current release improved or regressed behavior.

Global averages are not evidence. They hide precisely the regressions that hurt production agents: one high-value intent, one policy-sensitive path, one recently changed pack.

Pass line. A team can compare two pack versions for the same intent and know whether quality improved, cost rose, latency regressed, or safety got worse.

Failure pattern. The dashboard says “agent satisfaction is up” while refund approvals, booking cancellations, or regulated disclosures are moving in different directions. The team cannot attribute a regression because the data lacks version dimensions.

Smallest fix. Make intent_id and pack_version required dimensions on every run row. The first dashboard can be crude. The important move is to stop averaging away the thing you need to operate.
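A sketch of that minimum run-row contract and the per-intent, per-version slice it enables. Column names follow the post; the rows and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

RUN_ROWS = [
    {"intent_id": "refund", "pack_version": "pack@1.4.1", "utility": 0.81, "cost_usd": 0.012, "latency_ms": 900},
    {"intent_id": "refund", "pack_version": "pack@1.4.2", "utility": 0.74, "cost_usd": 0.015, "latency_ms": 1200},
]

def ingest(row: dict) -> dict:
    # Required dimensions: a row without them is rejected, not averaged away.
    for key in ("intent_id", "pack_version"):
        if key not in row:
            raise ValueError(f"run row missing required dimension: {key}")
    return row

def slice_by_version(rows, intent_id):
    grouped = defaultdict(list)
    for row in rows:
        if row["intent_id"] == intent_id:
            grouped[row["pack_version"]].append(row)
    return {version: {"utility": mean(r["utility"] for r in rs),
                      "latency_ms": mean(r["latency_ms"] for r in rs)}
            for version, rs in grouped.items()}

for row in RUN_ROWS:
    ingest(row)
print(slice_by_version(RUN_ROWS, "refund"))   # shows whether 1.4.2 improved or regressed the intent
```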

8. Continuously improving

Claim. Failures upgrade the harness, not just the individual answer.

Evidence to collect. Take a real operator correction from last month. Follow it through the Improvement Loop: correction, typed proposal, golden replay, reviewer verdict, approval, promotion, and live scorecard. The trail should show what changed in the harness and why that change was safe to release.

If the story ends with “we edited the prompt” or “we discussed it in retro,” score fail. The system may have learned socially, but the harness did not learn operationally.

Pass line. A production failure becomes a typed artifact that can be replayed, reviewed, promoted, and measured. The next similar run benefits from the correction without depending on the same person remembering the incident.

Failure pattern. Corrections live in Slack, Notion, or memory. Prompt edits bypass release gates. Golden sets do not expand after failures. The next debugging session starts from scratch.

Smallest fix. Capture the next correction as structured data: input, observed behavior, expected behavior, evidence, violated property, proposed harness change, and replay requirement. A spreadsheet is enough for the first week if it preserves the contract.
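A sketch of that contract as a typed record; the field names mirror the list above and the status flow is an assumption about how a team might stage review and promotion.

```python
from dataclasses import dataclass

@dataclass
class Correction:
    input: str
    observed_behavior: str
    expected_behavior: str
    evidence: list[str]             # trace ids, DecisionRecord links
    violated_property: str          # one of the eight properties
    proposed_harness_change: str    # a pack, policy, tool, or evaluator change, not a prompt edit
    replay_requirement: str         # the golden case that must pass before promotion
    status: str = "proposed"        # proposed -> replayed -> reviewed -> approved -> promoted

correction = Correction(
    input="Refund request for $350 on order o-991",
    observed_behavior="Agent approved the refund",
    expected_behavior="Agent escalates refunds above $200",
    evidence=["trace_boundary_456"],
    violated_property="policy-governed",
    proposed_harness_change="Add refund_limit rule to policy bundle v8",
    replay_requirement="golden/refund_over_limit.json must produce a deny verdict",
)
print(correction.violated_property)
```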

How to read the scorecard

The most common mistake is to treat all failures equally. They are not equal. Some properties are upstream of the others.

| If this fails | It usually blocks |
| --- | --- |
| Context-aware | Validated, Observable, Measurable, Replay |
| Policy-governed | Tool-controlled, Validated, Compliance audit |
| Tool-controlled | Reversible, Observable, Approval gates |
| Observable | Reversible, Measurable, Incident response |
| Reversible | Safe rollout, Fast iteration, Regulated actions |
| Measurable | Continuous improvement, Release confidence |

In most teams, the first repair should be one of three things:

  • Policy before prompts when the risk is legal, financial, privacy, or safety.
  • Tool Gateway before tool expansion when the agent can touch external systems.
  • Replay before rapid iteration when releases are frequent and incidents are expensive.

Do not launch eight workstreams. Pick the most load-bearing failure, fix it well, and rerun the audit. A good harness is built in passes, not declarations.

A sample 30-minute run

Here is the rhythm that works in practice:

| Minute | Action |
| --- | --- |
| 0-5 | Pick one successful trace and one boundary trace. Create the scorecard. |
| 5-10 | Inspect CompiledContext, pack version, evidence manifest, and budget report. |
| 10-15 | Inspect policy bundle, policy decisions, and Tool Gateway envelope. |
| 15-20 | Inspect evaluator result, trace spans, and final DecisionRecord. |
| 20-25 | Inspect rollback tuple, reversal path, replay status, and dashboard dimensions. |
| 25-30 | Name the top three fixes with owner and due date. |

The audit will feel rushed the first time. That is the point. Production does not give you a week to discover where evidence lives. The second run is faster because the first run forces the team to make the harness visible.

Closing

A real harness is not a vibe around an agent. It is an execution contract with evidence.

After this audit, the sentence “we have a harness” should either become precise or disappear. Precise sounds like this:

For this trace, we can reconstruct the context, prove policy enforcement, enumerate allowed tools, show validation, follow the trace, roll back the release tuple, compare metrics by intent and version, and promote the correction through replay.

That is the standard. Anything less may still be useful software, but it is not yet a production harness.

Run the audit before you ship. Run it after every meaningful architecture change. Run it cold before the renewal conversation. The value is not the score. The value is forcing the system to prove the guarantees it has been claiming.
