Find the controls your agent is missing before production finds them.
Run /harness-audit from your agent repository. The audit reads real code, prompts, policies, evals, tools, and traces, then scores 40 runtime controls with file:line evidence and a prioritized fix queue.
$ /harness-audit ./agents/refund --traces ./logs/runs
scanning repo...
found LangGraph workflow, tool registry, policy bundle
checking 40 controls across 8 harness properties
result: BLOCKED
band: controlled beta
next fix: validate tool args before dispatch
Tool Gateway, idempotency, replayability, and rollback block launch.
Each verdict points to the implementation or trace that proves it.
Fix the most load-bearing runtime gap first, then rerun the audit.
Eight harness properties, scored from real artifacts.
The audit does not ask whether your system is safe. It looks for the code, config, trace, or evaluator artifact that would make the claim true.
Context-aware
Compiled context matches intent; irrelevant memory, tools, and evidence are trimmed instead of hidden in prose.
Policy-governed
Policy runs outside model code at compile, plan, and tool-execution boundaries.
Tool-controlled
Only declared capabilities are callable; schemas, approval modes, and sandbox profiles are enforced.
Validated
Completion is gated by evaluators, rubrics, and critic verdicts that are versioned like code.
Observable
Runs emit trace IDs, tool envelopes, evidence refs, and scorecard events that can be joined later.
Reversible
Risky writes have idempotency, replay bundles, rollback paths, and compensation records where applicable.
Measurable
Metrics attach to intents, tenants, pack versions, and business outcomes rather than to an undefined notion of agent quality.
Continuously improving
Failures become typed proposals that pass replay, review, and release gates before promotion.
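To make the Tool-controlled property concrete, here is a minimal sketch of a gateway that checks a model-proposed tool call before anything runs. It is an illustration, not the audit's implementation: `ToolSpec`, `ToolCallRejected`, `dispatch`, and the `"auto"`/`"human"` approval modes are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """A declared capability (hypothetical shape, for illustration only)."""
    handler: Callable[..., Any]
    arg_types: dict[str, type]   # required argument names and their types
    approval_mode: str = "auto"  # assumed values: "auto" or "human"


class ToolCallRejected(Exception):
    """Raised when a proposed call fails a gateway check."""


def dispatch(registry: dict[str, ToolSpec], name: str,
             args: dict[str, Any], approved: bool = False) -> Any:
    # 1. Only declared tools are callable.
    spec = registry.get(name)
    if spec is None:
        raise ToolCallRejected(f"undeclared tool: {name}")

    # 2. Argument names must match the declared schema exactly.
    missing = set(spec.arg_types) - set(args)
    unexpected = set(args) - set(spec.arg_types)
    if missing or unexpected:
        raise ToolCallRejected(
            f"{name}: missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )

    # 3. Argument values must have the declared types.
    for key, expected in spec.arg_types.items():
        if not isinstance(args[key], expected):
            raise ToolCallRejected(f"{name}: {key} must be {expected.__name__}")

    # 4. Tools that need human approval do not run without it.
    if spec.approval_mode == "human" and not approved:
        raise ToolCallRejected(f"{name}: human approval required")

    # Only after every check passes does the handler execute.
    return spec.handler(**args)
```

The point of the sketch is the ordering: every rejection happens before `spec.handler` is reached, so model output never dispatches a tool directly.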
A production audit has to inspect the runtime, not the story around it.
Architecture diagrams and checklists are useful, but they cannot prove that a tool call is validated before execution or that a failed write can be replayed. The audit treats the repository as the source of truth.
No artifact, no pass
If the control is only described in prose, it is not counted as implemented.
Pass, partial, or fail with evidence
Partial scores are explicit. They show where a control exists but does not yet cover the full risk surface.
Built for iteration
The fix queue is dependency ordered so teams can make one high-leverage change and rerun quickly.
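One way to picture a dependency-ordered fix queue is a topological sort over the fixes. The sketch below uses Python's standard `graphlib` module; the `fix_queue` helper and the dependency edges between fixes are assumptions made for illustration and do not describe how the audit builds its queue.

```python
from graphlib import TopologicalSorter


def fix_queue(depends_on: dict[str, set[str]]) -> list[str]:
    """Order fixes so that each one comes after the fixes it depends on.

    `depends_on` maps a fix to the set of fixes that must land first.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies between three fixes.
deps = {
    "validate tool args": set(),
    "add idempotency key": {"validate tool args"},
    "persist replay bundle": {"validate tool args", "add idempotency key"},
}

for position, fix in enumerate(fix_queue(deps), start=1):
    print(position, fix)
# 1 validate tool args
# 2 add idempotency key
# 3 persist replay bundle
```

Because the first item in the queue has no unmet dependencies, a team can land it alone and rerun.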
Install once. Run it from any agent repo.
Use the Claude Code command when you want a launch gate, a quarterly harness review, or a focused check after changing tools, policy, memory, or replay.
Install once
Register the ContextOS skills marketplace in Claude Code and install the audit command.
Run from the repo
Point the audit at your agent root, or at a subpath plus traces for the workflow under review.
Fix the P0s
Work the dependency-ordered fix queue, commit evidence, and rerun until blockers clear.
# in Claude Code
/plugin marketplace add contextosai/skills
/plugin install harness-audit@contextosai-skills
# from the root of your agent repository
/harness-audit
# or scope the review to one workflow plus traces
/harness-audit ./agents/refund --traces ./logs/runs
Manual install also works: clone github.com/contextosai/skills and copy skills/harness-audit into your .claude/skills/ directory.
The output is a scorecard and a work queue.
The report is meant to be used by the team that owns the agent: platform engineers, applied AI engineers, PMs, reviewers, and anyone accountable for launch readiness.
# Harness Audit Scorecard - refund-agent
Target: ./agents/refund
Framework: LangGraph
Evidence date: 2026-05-31
Evidenced maturity band: Controlled beta
Claimed band: Production
Launch decision: BLOCKED - 4 P0 failures
Outcome rollup
Policy-governed   pass      Tool-controlled   partial
Validated         partial   Reversible        fail
Observable        pass      Measurable        partial
Blocking failures
#16 Tool Gateway FAIL tools dispatch straight from model output
agents/refund/graph.py:142
#24 Idempotency FAIL retry can double-refund
tools/refund.py:31
#36 Replayability FAIL no pinned inputs; runs cannot be reconstructed
#37 Rollback FAIL refund write has no compensation path
Fix queue
1. Validate tool args and approval_mode before dispatch
2. Add an idempotency key to the refund write
3. Persist one replay bundle per production run
Maturity band
A grounded launch read: prototype, controlled beta, production-ready, or blocked.
P0 blockers
Failures that can create unsafe writes, unreplayable decisions, hidden policy bypasses, or unbounded costs.
Evidence ledger
Every pass and fail cites the file, line, trace, or missing artifact used to make the call.
Rerun loop
The report is shaped for iteration: fix one load-bearing gap, rerun, and compare the result.
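As an example of what closing one such gap can look like, the sketch below shows an idempotency key guarding a refund write, so that a retried call returns the first result instead of refunding twice. `RefundLedger`, its in-memory store, and the `request_id` parameter are hypothetical; a real system would need a durable store with an atomic check-and-set.

```python
import hashlib


class RefundLedger:
    """In-memory stand-in for a refund store (illustration only)."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}

    @staticmethod
    def idempotency_key(order_id: str, amount_cents: int, request_id: str) -> str:
        # The same logical request always hashes to the same key.
        raw = f"{order_id}:{amount_cents}:{request_id}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def refund(self, order_id: str, amount_cents: int, request_id: str) -> dict:
        key = self.idempotency_key(order_id, amount_cents, request_id)
        # A retry with the same key returns the recorded result
        # rather than issuing a second refund.
        if key in self._by_key:
            return self._by_key[key]
        record = {
            "refund_id": f"rf_{len(self._by_key) + 1}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self._by_key[key] = record
        return record

    def total_refunded(self) -> int:
        return sum(r["amount_cents"] for r in self._by_key.values())
```

After a change like this, the rerun has a concrete artifact to cite: the key derivation and the lookup that precedes the write.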
Turn harness risk into a concrete fix queue.
Start with the audit, fix the highest-risk runtime gap, then rerun until your launch claim is backed by artifacts.