Runnable production audit

Find the controls your agent is missing before production finds them.

Run /harness-audit from your agent repository. The audit reads real code, prompts, policies, evals, tools, and traces, then scores 40 runtime controls with file:line evidence and a prioritized fix queue.

40 runtime controls: pass, partial, or fail
8 harness properties: from context to improvement
1 fix queue: ordered by launch risk
Claude Code audit session
evidence required: no artifact, no pass
$ /harness-audit ./agents/refund --traces ./logs/runs

scanning repo...
found LangGraph workflow, tool registry, policy bundle
checking 40 controls across 8 harness properties

result: BLOCKED
band: controlled beta
next fix: validate tool args before dispatch
4 P0 failures

Tool Gateway, idempotency, replayability, and rollback block launch.

File:line evidence

Each verdict points to the implementation or trace that proves it.

Prioritized fix queue

Fix the most load-bearing runtime gap first, then rerun the audit.

What it checks

Eight harness properties, scored from real artifacts.

The audit does not ask whether your system is safe. It looks for the code, config, trace, or evaluator artifact that would make the claim true.

01

Context-aware

Compiled context matches intent; irrelevant memory, tools, and evidence are trimmed instead of hidden in prose.

02

Policy-governed

Policy runs outside model code at compile, plan, and tool-execution boundaries.

03

Tool-controlled

Only declared capabilities are callable; schemas, approval modes, and sandbox profiles are enforced.

04

Validated

Completion is gated by evaluators, rubrics, and critic verdicts that are versioned like code.

05

Observable

Runs emit trace IDs, tool envelopes, evidence refs, and scorecard events that can be joined later.

06

Reversible

Risky writes have idempotency, replay bundles, rollback paths, and compensation records where applicable.

07

Measurable

Metrics attach to intents, tenants, pack versions, and business outcomes instead of a vague notion of agent quality.

08

Continuously improving

Failures become typed proposals that pass replay, review, and release gates before promotion.
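
To make the Tool-controlled property concrete, here is a minimal sketch of a gateway that checks a model-proposed call before dispatching it. It is an illustration under assumptions, not the audit's own implementation: `ToolSpec`, `dispatch`, `ToolCallRejected`, the `"auto"`/`"human"` approval modes, and the `issue_refund` tool are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical declaration of one callable capability."""
    handler: Callable[..., Any]
    arg_types: dict[str, type]   # required argument names and their types
    approval_mode: str = "auto"  # assumed values: "auto" or "human"


class ToolCallRejected(Exception):
    """Raised when a model-proposed call fails a gateway check."""


def dispatch(registry: dict[str, ToolSpec], name: str, args: dict[str, Any],
             approved: bool = False) -> Any:
    """Run a tool only if it is declared, well-typed, and approved."""
    spec = registry.get(name)
    if spec is None:
        raise ToolCallRejected(f"undeclared tool: {name}")
    if set(args) != set(spec.arg_types):
        raise ToolCallRejected(f"missing or unexpected args for {name}")
    for key, expected in spec.arg_types.items():
        if not isinstance(args[key], expected):
            raise ToolCallRejected(f"{name}.{key} must be {expected.__name__}")
    if spec.approval_mode == "human" and not approved:
        raise ToolCallRejected(f"{name} requires human approval")
    return spec.handler(**args)


def issue_refund(order_id: str, amount_cents: int) -> str:
    # Stand-in for a real write; returns a description instead of moving money.
    return f"refunded {amount_cents} on {order_id}"


REGISTRY = {
    "issue_refund": ToolSpec(
        handler=issue_refund,
        arg_types={"order_id": str, "amount_cents": int},
        approval_mode="human",
    ),
}
```

With this registry, `dispatch(REGISTRY, "issue_refund", {"order_id": "o-1", "amount_cents": 500}, approved=True)` runs the handler, while an undeclared tool name, a string amount, or a missing approval raises `ToolCallRejected` before any handler code executes.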

Why this works

A production audit has to inspect the runtime, not the story around it.

Architecture diagrams and checklists are useful, but they cannot prove that a tool call is validated before execution or that a failed write can be replayed. The audit treats the repository as the source of truth.

No artifact, no pass

If the control is only described in prose, it is not counted as implemented.

Pass, partial, or fail with evidence

Partial scores are explicit. They show where a control exists but does not yet cover the full risk surface.
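
One way to read the two rules above is as a small scoring function. The sketch below is an assumption about how such rules could be encoded, not the audit's actual logic; `ControlResult`, its fields, and `verdict` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ControlResult:
    """Hypothetical record for one audited control."""
    control: str
    claimed: str                                        # "pass", "partial", or "fail"
    evidence: list[str] = field(default_factory=list)   # e.g. "tools/refund.py:31"
    uncovered: list[str] = field(default_factory=list)  # risk surface not yet covered


def verdict(result: ControlResult) -> str:
    """Apply 'no artifact, no pass', then downgrade to partial on coverage gaps."""
    if result.claimed == "fail":
        return "fail"
    if not result.evidence:
        return "fail"      # described only in prose: not counted as implemented
    if result.uncovered:
        return "partial"   # the control exists but misses part of the risk surface
    return result.claimed
```

A control claimed as `pass` with an empty `evidence` list scores `fail`; the same control with evidence but a non-empty `uncovered` list scores `partial`.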

Built for iteration

The fix queue is dependency-ordered so teams can make one high-leverage change and rerun quickly.
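
Dependency ordering of this kind can be sketched as a topological sort. The example uses Python's standard `graphlib` module; the fix names and the dependencies between them are made up for illustration, and how the audit actually derives its order is not specified here.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each fix maps to the fixes that must land first.
DEPENDS_ON = {
    "validate tool args": set(),
    "add idempotency key": {"validate tool args"},
    "persist replay bundle": {"add idempotency key"},
}


def fix_queue(deps: dict[str, set[str]]) -> list[str]:
    """Return fixes so that every fix appears after its prerequisites."""
    return list(TopologicalSorter(deps).static_order())
```

For the chain above, `fix_queue(DEPENDS_ON)` returns `["validate tool args", "add idempotency key", "persist replay bundle"]`. A cycle in the dependency map raises `graphlib.CycleError`.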

Run the audit

Install once. Run it from any agent repo.

Use the Claude Code command when you want a launch gate, a quarterly harness review, or a focused check after changing tools, policy, memory, or replay.

Install once

Register the ContextOS skills marketplace in Claude Code and install the audit command.

Run from the repo

Point the audit at your agent root, or at a subpath plus traces for the workflow under review.

Fix the P0s

Work the dependency-ordered fix queue, commit evidence, and rerun until blockers clear.

Commands
1. Install the audit command
# in Claude Code
/plugin marketplace add contextosai/skills
/plugin install harness-audit@contextosai-skills
2. Run it against the target harness
# from the root of your agent repository
/harness-audit

# or scope the review to one workflow plus traces
/harness-audit ./agents/refund --traces ./logs/runs

Manual install also works: clone github.com/contextosai/skills and copy skills/harness-audit into your .claude/skills/ directory.

Report shape

The output is a scorecard and a work queue.

The report is meant to be used by the team that owns the agent: platform engineers, applied AI engineers, PMs, reviewers, and anyone accountable for launch readiness.

# Harness Audit Scorecard - refund-agent
Target: ./agents/refund
Framework: LangGraph
Evidence date: 2026-05-31

Evidenced maturity band: Controlled beta
Claimed band: Production
Launch decision: BLOCKED - 4 P0 failures

Outcome rollup
  Policy-governed     pass       Tool-controlled   partial
  Validated           partial    Reversible        fail
  Observable          pass       Measurable        partial

Blocking failures
  #16 Tool Gateway     FAIL  tools dispatch straight from model output
                             agents/refund/graph.py:142
  #24 Idempotency      FAIL  retry can double-refund
                             tools/refund.py:31
  #36 Replayability    FAIL  no pinned inputs; runs cannot be reconstructed
  #37 Rollback         FAIL  refund write has no compensation path

Fix queue
  1. Validate tool args and approval_mode before dispatch
  2. Add an idempotency key to the refund write
  3. Persist one replay bundle per production run

Maturity band

A grounded launch read: prototype, controlled beta, production-ready, or blocked.

P0 blockers

Failures that can create unsafe writes, unreplayable decisions, hidden policy bypasses, or unbounded costs.

Evidence ledger

Every pass and fail cites the file, line, trace, or missing artifact used to make the call.

Rerun loop

The report is shaped for iteration: fix one load-bearing gap, rerun, and compare the result.
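
Comparing two runs can be as simple as diffing the blocking failures. The sketch below assumes the scorecard text follows the layout of the sample report on this page (lines such as `#16 Tool Gateway     FAIL  ...`); the real output format may differ, and `failing_controls` and `cleared` are hypothetical helpers.

```python
import re

# Matches lines like "  #16 Tool Gateway     FAIL  tools dispatch ..."
FAIL_LINE = re.compile(r"^[ \t]*#(\d+)[ \t]+(.*?)[ \t]+FAIL\b", re.MULTILINE)


def failing_controls(scorecard: str) -> dict[int, str]:
    """Map control number to control name for every FAIL line."""
    return {int(num): name for num, name in FAIL_LINE.findall(scorecard)}


def cleared(before: str, after: str) -> list[str]:
    """Names of controls that failed in `before` but not in `after`."""
    old = failing_controls(before)
    new = failing_controls(after)
    return [old[num] for num in sorted(old.keys() - new.keys())]
```

Given one scorecard that lists #16 and #24 as failing and a rerun that lists only #24, `cleared(before, after)` returns `["Tool Gateway"]`.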

Turn harness risk into a concrete fix queue.

Start with the audit, fix the highest-risk runtime gap, then rerun until your launch claim is backed by artifacts.