Reviewer agents are the part of the harness that most teams build last and then wish they had built first. They are also the part that benefits the most from copying — once the first one is wired up, the others fall in behind it.
This post walks one reviewer end-to-end. I am picking the compliance reviewer because the rubric is small, the false-positive cost is low, and the inputs it needs already exist in any team that has read Harness Engineering. The full taxonomy of reviewer roles is in Reviewer Agents; this post is the build-along.
2026 update: reviewers are control evidence
The useful upgrade is to stop treating reviewer agents as helpful code reviewers and start treating them as control evidence. A compliance reviewer is valuable only if its verdict can be shown to an approver, attached to a release tuple, replayed against a golden set, and queried during audit.
That changes the bar. The reviewer does not need broad judgment. It needs stable predicates, typed findings, rule ids, evidence refs, and a deterministic recommendation. Small, boring checks beat eloquent review prose because they can block a release without interpretation.
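To make "attached to a release tuple" concrete, here is a minimal sketch of one way that binding could look. Every field name below is an assumption for illustration, not a schema this post defines.

```ts
// Hypothetical release-tuple shape; adjust field names to whatever your releases already record.
type ReleaseTuple = {
  release_id: string
  pack_version: string                       // the pack that produced the run
  policy_bundle_version: string              // the policy bundle the reviewers cross-checked
  reviewer_versions: Record<string, string>  // e.g. { "compliance.v1": "2026.05.07" }
  verdict_refs: string[]                     // pointers to the stored reviewer verdict envelopes
  approved_by?: string                       // set when a needs_human path was signed off
}
```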
What a reviewer agent actually is
It is not a chat bot. It is a deterministic skill that takes a typed input — a proposal, a DecisionRecord, a pack diff — and emits a typed envelope:
```ts
type ReviewerVerdict = {
  reviewer_id: string        // "compliance.v1"
  reviewer_version: string   // "2026.05.07"
  status: "pass" | "warn" | "fail"
  findings: Array<{
    severity: "info" | "warn" | "error"
    rule_id: string          // e.g. "C-001"
    message: string          // human-readable
    evidence_ref?: string    // pointer into the run / proposal
    recommendation?: string  // what to change
  }>
  policy_id_refs: string[]   // policies the reviewer cross-checked
  recommendation: "merge" | "block" | "needs_human"
  trace_id: string
}
```

The shape of the output matters more than the cleverness of the rubric. The envelope lands in the same change-control queue as Improvement Loop proposals, which means any human approver and any automation downstream can read every reviewer’s output identically. That is what makes the reviewer composable.
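For concreteness, a filled-in envelope for a proposal missing a reversal token might look like the following; every id in it is invented.

```ts
import type { ReviewerVerdict } from "@/types"

// Illustrative only: the proposal, run, call, and trace ids below are made up.
const example: ReviewerVerdict = {
  reviewer_id: "compliance.v1",
  reviewer_version: "2026.05.07",
  status: "fail",
  findings: [
    {
      severity: "error",
      rule_id: "C-003",
      message: "destructive call tc_8841 missing reversal_token",
      evidence_ref: "proposal:prop_01H9X#runs.r_2207.tool_calls.tc_8841",
      recommendation: "issue a reversal_token at the gateway and persist it",
    },
  ],
  policy_id_refs: ["POLICY_APPROVAL_TIERS"],
  recommendation: "block",
  trace_id: "tr_5f2c",
}
```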
The compliance rubric
A useful first compliance reviewer covers four checks. Each is a small predicate over the proposal or run; the rubric is small enough to fit on a notecard, which is the point.
| Rule | What it checks | What pass looks like |
|---|---|---|
| C-001 | Every adapter capability with side effects has an approval_mode declared | approval_mode ∈ five-tier set; destructive requires requires_approver: true |
| C-002 | Regulated data classes (PII, PHI, PAN, RESTRICTED) carry a classification at ingestion | every evidence_ref into the proposal has a classification field |
| C-003 | Destructive runs in the proposal carry a reversal_token and an idempotency_key | both fields present on every tool_call flagged destructive |
| C-004 | Every policy decision in the run has a policy_decision_id that resolves to an active rule | resolver returns a rule whose effective_window includes the run’s timestamp |
Four rules. Each one fails loudly. Each one points at a fix. None of them require the reviewer to “understand” the proposal — they are mechanical predicates over the typed artifacts.
The skill file
Here is the compliance reviewer skill. It is deliberately short — a long reviewer is a sign the rubric drifted into preferences instead of constraints.
```ts
import type { Proposal, ReviewerVerdict, EvidenceRef, ToolCall } from "@/types"
import { resolvePolicyRule } from "@/policy"

const REGULATED = new Set(["PII", "PHI", "PAN", "RESTRICTED"])
const FIVE_TIER = new Set([
  "read_only",
  "local_write",
  "network",
  "delegated",
  "destructive",
])

export function reviewCompliance(p: Proposal): ReviewerVerdict {
  const findings: ReviewerVerdict["findings"] = []

  // C-001 — approval_mode declared, destructive requires approver
  for (const cap of p.tooling_layer.capabilities) {
    if (!FIVE_TIER.has(cap.approval_mode)) {
      findings.push({
        severity: "error",
        rule_id: "C-001",
        message: `capability ${cap.id} has unknown approval_mode "${cap.approval_mode}"`,
        evidence_ref: `proposal:${p.id}#tooling.${cap.id}`,
        recommendation: "set approval_mode to one of the five canonical tiers",
      })
    }
    if (cap.approval_mode === "destructive" && !cap.requires_approver) {
      findings.push({
        severity: "error",
        rule_id: "C-001",
        message: `destructive capability ${cap.id} missing requires_approver`,
        evidence_ref: `proposal:${p.id}#tooling.${cap.id}`,
        recommendation: "set requires_approver: true and bind an approval gate",
      })
    }
  }

  // C-002 — regulated data classes carry classification
  for (const ref of p.evidence_refs) {
    if (REGULATED.has(ref.data_class) && !ref.classification) {
      findings.push({
        severity: "error",
        rule_id: "C-002",
        message: `regulated evidence ${ref.id} has no classification`,
        evidence_ref: ref.id,
        recommendation: "set classification at ingestion in the KG pipeline",
      })
    }
  }

  // C-003 — destructive calls have reversal + idempotency
  for (const tc of toolCallsOf(p)) {
    if (tc.approval_mode === "destructive") {
      if (!tc.reversal_token) {
        findings.push({
          severity: "error",
          rule_id: "C-003",
          message: `destructive call ${tc.id} missing reversal_token`,
          evidence_ref: tc.id,
          recommendation: "issue a reversal_token at the gateway and persist it",
        })
      }
      if (!tc.idempotency_key) {
        findings.push({
          severity: "error",
          rule_id: "C-003",
          message: `destructive call ${tc.id} missing idempotency_key`,
          evidence_ref: tc.id,
          recommendation: "include idempotency_key in the toolCall envelope",
        })
      }
    }
  }

  // C-004 — every policy_decision resolves to an active rule
  for (const pd of p.policy_decisions ?? []) {
    const rule = resolvePolicyRule(pd.policy_decision_id, p.run_at)
    if (!rule) {
      findings.push({
        severity: "warn",
        rule_id: "C-004",
        message: `policy_decision ${pd.policy_decision_id} resolves to no active rule`,
        evidence_ref: pd.policy_decision_id,
        recommendation: "verify rule effective_window covers run_at",
      })
    }
  }

  const status: ReviewerVerdict["status"] =
    findings.some((f) => f.severity === "error") ? "fail" :
    findings.some((f) => f.severity === "warn") ? "warn" : "pass"

  return {
    reviewer_id: "compliance.v1",
    reviewer_version: "2026.05.07",
    status,
    findings,
    policy_id_refs: ["POLICY_REGULATED_DATA", "POLICY_APPROVAL_TIERS"],
    recommendation: status === "fail" ? "block"
      : status === "warn" ? "needs_human" : "merge",
    trace_id: p.trace_id,
  }
}

function toolCallsOf(p: Proposal): ToolCall[] {
  return p.runs.flatMap((r) => r.tool_calls)
}
```

That is the whole reviewer. The rules are explicit, the envelope is typed, and the recommendations point the proposer at the fix. There is nothing in here a teammate cannot read in three minutes.
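Wiring it into the queue is only a few more lines. The sketch below assumes a `loadProposal` helper, a made-up proposal id, and the `@/reviewers/compliance` module path; all three are placeholders for whatever your harness already uses, and the directory layout matches the convention described below.

```ts
import { mkdirSync, readFileSync, writeFileSync } from "node:fs"
import type { Proposal } from "@/types"
import { reviewCompliance } from "@/reviewers/compliance" // assumed module path

// Hypothetical loader; use however your harness actually reads a queued proposal.
function loadProposal(proposalId: string): Proposal {
  const raw = readFileSync(`queue/proposals/${proposalId}/proposal.json`, "utf8")
  return JSON.parse(raw) as Proposal
}

const proposal = loadProposal("prop_01H9X") // made-up id
const verdict = reviewCompliance(proposal)

// Persist the envelope exactly as emitted, beside any other reviewers' verdicts.
const dir = `queue/proposals/${proposal.id}/reviewers`
mkdirSync(dir, { recursive: true })
writeFileSync(`${dir}/${verdict.reviewer_id}.json`, JSON.stringify(verdict, null, 2))
```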
The golden set
A reviewer without a golden set is a guess. Five rows is enough to start; you will add to them as soon as you ship. The format is plain and replay-friendly:
```yaml
- name: clean_refund_proposal
  input_path: ./fixtures/proposal_refund_clean.json
  expected:
    status: pass
    finding_rule_ids: []

- name: missing_reversal_token
  input_path: ./fixtures/proposal_refund_no_reversal.json
  expected:
    status: fail
    finding_rule_ids: [C-003]
    recommendation: block

- name: pii_without_classification
  input_path: ./fixtures/proposal_kg_pii_unclassified.json
  expected:
    status: fail
    finding_rule_ids: [C-002]
    recommendation: block

- name: stale_policy_reference
  input_path: ./fixtures/proposal_policy_expired.json
  expected:
    status: warn
    finding_rule_ids: [C-004]
    recommendation: needs_human

- name: destructive_no_approver
  input_path: ./fixtures/proposal_dest_no_approver.json
  expected:
    status: fail
    finding_rule_ids: [C-001]
    recommendation: block
```

The harness/evals runner replays this set on every reviewer change. If the reviewer flags or unflags differently than the goldens say, the change does not merge. This is what keeps the reviewer’s judgment pinned: you can update the rubric, but you have to update (or add) goldens at the same PR.
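The runner itself can be equally plain. A minimal sketch, assuming the YAML format above, the `yaml` npm package for parsing, an assumed `goldens/compliance.yaml` path, and Node's built-in asserts in place of a test framework:

```ts
import { readFileSync } from "node:fs"
import { strict as assert } from "node:assert"
import { parse } from "yaml" // assumes the "yaml" npm package
import type { Proposal } from "@/types"
import { reviewCompliance } from "@/reviewers/compliance" // assumed module path

type Golden = {
  name: string
  input_path: string
  expected: { status: string; finding_rule_ids: string[]; recommendation?: string }
}

const goldens = parse(readFileSync("goldens/compliance.yaml", "utf8")) as Golden[]

for (const g of goldens) {
  const proposal = JSON.parse(readFileSync(g.input_path, "utf8")) as Proposal
  const verdict = reviewCompliance(proposal)

  assert.equal(verdict.status, g.expected.status, `${g.name}: status drifted`)

  // Compare the set of rule ids that fired, ignoring order and duplicates.
  const fired = [...new Set(verdict.findings.map((f) => f.rule_id))].sort()
  assert.deepEqual(fired, [...g.expected.finding_rule_ids].sort(), `${g.name}: findings drifted`)

  if (g.expected.recommendation) {
    assert.equal(verdict.recommendation, g.expected.recommendation, `${g.name}: recommendation drifted`)
  }
}
```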
Where the verdict goes
The reviewer’s envelope lands in the change-control queue alongside Improvement Loop proposals. A simple convention:
```
queue/
  proposals/{proposal_id}/
    proposal.json            # the input
    reviewers/
      compliance.v1.json     # the verdict, exactly as emitted
      security.v1.json       # peer reviewers as they come online
      reliability.v1.json
    decisions/
      human_approver.json    # final human verdict
```

A proposal merges when:

- Every reviewer’s `recommendation` is `merge`, or
- Any reviewer says `block` — in which case the proposal does not merge, period, or
- Any reviewer says `needs_human` and a named approver signs off, citing the relevant `rule_id` they accepted as known-and-tolerated.
That third path is the one that earns its keep. `needs_human` is the reviewer saying *I have a judgment and I want a human to apply it*. A human approves, the harness records the approver and `rule_id`, and the next run that hits the same rule includes that decision in its lineage. Nothing is silently waved through.
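Those three paths reduce to one small, deterministic function over the verdicts in the queue. A sketch, assuming the verdicts are loaded as an array and the human decision (if any) is passed alongside; the `HumanDecision` shape is illustrative, not part of the envelope above.

```ts
import type { ReviewerVerdict } from "@/types"

// Hypothetical shape for the human approver's record; match it to your decisions/ files.
type HumanDecision = { approver: string; accepted_rule_ids: string[] }

function mergeDecision(verdicts: ReviewerVerdict[], human?: HumanDecision): "merge" | "hold" {
  // Any block is final; there is no human override.
  if (verdicts.some((v) => v.recommendation === "block")) return "hold"

  // needs_human requires a named approver who accepted every flagged rule id.
  const needsHuman = verdicts.filter((v) => v.recommendation === "needs_human")
  if (needsHuman.length > 0) {
    if (!human) return "hold"
    const flagged = needsHuman.flatMap((v) => v.findings.map((f) => f.rule_id))
    const accepted = new Set(human.accepted_rule_ids)
    return flagged.every((id) => accepted.has(id)) ? "merge" : "hold"
  }

  // Otherwise every reviewer recommended merge.
  return "merge"
}
```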
Drift: how the reviewer itself stays honest
Reviewer agents are versioned just like packs and policies. The cadence:
- Pin the reviewer version alongside pack and policy in every release tuple.
- Replay the golden set on every reviewer change before merge.
- Add a golden every time a finding pattern shows up in production that the reviewer missed (or was wrong about).
- Sample produced verdicts quarterly — if the reviewer has flagged the same issue 1,000 times and no human has ever taken action on it, the rule is probably miscalibrated. Either tighten it or drop it.
Reviewers age. The policy bundle they cross-check shifts. New regulated classes appear. The discipline is the same as for any other harness component: version the artifact, replay against goldens, evaluate the verdicts.
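The quarterly sample can start as a plain aggregation. A sketch, assuming you can list a quarter's verdicts and the set of rule ids a human ever acted on; the helper name and the threshold are made up.

```ts
import type { ReviewerVerdict } from "@/types"

// Assumed inputs: every verdict emitted this quarter, plus the set of rule ids a human
// ever cited in an approval or a fix. Both would come from your queue storage.
function noisyRules(verdicts: ReviewerVerdict[], actedOnRuleIds: Set<string>) {
  const fired = new Map<string, number>()
  for (const v of verdicts) {
    for (const f of v.findings) {
      fired.set(f.rule_id, (fired.get(f.rule_id) ?? 0) + 1)
    }
  }
  // The 1,000-finding threshold is arbitrary; tune it to your volume.
  return [...fired.entries()]
    .filter(([ruleId, count]) => count >= 1000 && !actedOnRuleIds.has(ruleId))
    .map(([ruleId, count]) => ({ ruleId, count }))
}
```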
Ship checklist
| Item | Done when |
|---|---|
| Typed input | The reviewer accepts a proposal, pack diff, run bundle, or Decision Record envelope, never raw prose. |
| Rule ids | Every finding maps to a durable C-NNN rule with owner and policy refs. |
| Evidence refs | Every block/warn finding points to the exact artifact that failed. |
| Golden set | Clean, fail, warn, stale-policy, and destructive-action cases replay in CI. |
| Change control | Reviewer verdicts land beside human approvals and are retained with the release tuple. |
| Drift review | False positives, missed findings, and ignored warnings get reviewed on a fixed cadence. |
What to build next
Once the compliance reviewer is in place, the cheapest follow-ons are usually:
- Reliability reviewer — checks that timeouts, retries, and rollbacks are declared on every tool (a sketch of one such predicate follows this list). The rubric is similar to compliance: a handful of mechanical predicates over the proposal.
- Security reviewer — PII paths, secret leakage, sandbox profile, injection-resistant prompts. This one borrows directly from your existing security-review patterns; it just types the verdict.
- Cost reviewer — token budgets, retrieval cost, run-budget headroom. Easy to write once you have per-pack-version cost dashboards.
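To show how little changes between reviewers, here is what a single reliability predicate could look like. The `timeout_ms` and `retry_policy` fields are assumptions about your tool declarations, and `R-001` is a made-up rule id.

```ts
import type { ReviewerVerdict } from "@/types"

// Hypothetical first rule of a reliability reviewer: every capability declares a timeout
// and a retry policy. The Capability fields here are assumptions about your tool schema.
type Capability = { id: string; timeout_ms?: number; retry_policy?: string }

function checkTimeoutsDeclared(proposalId: string, capabilities: Capability[]): ReviewerVerdict["findings"] {
  const findings: ReviewerVerdict["findings"] = []
  for (const cap of capabilities) {
    if (cap.timeout_ms == null || !cap.retry_policy) {
      findings.push({
        severity: "error",
        rule_id: "R-001",
        message: `capability ${cap.id} missing timeout_ms or retry_policy`,
        evidence_ref: `proposal:${proposalId}#tooling.${cap.id}`,
        recommendation: "declare timeout_ms and retry_policy on the capability",
      })
    }
  }
  return findings
}
```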
The pattern is the same each time: small rubric, typed envelope, golden set, change-control queue. The first reviewer is the hard one because you are also building the queue and the envelope. Numbers two through seven cost a day each.
The team that ships the first reviewer this quarter usually ships three more by the end of the next one. The cost curve is generous.
Closing
The compliance reviewer is roughly a hundred lines, four rules, and five goldens. It blocks four classes of incident that would otherwise reach a human approver as a vague “this looks risky” feeling. It produces a typed envelope that the rest of the harness can read.
That is the whole shape of the work. You can copy this skeleton today, fill in the policies your platform actually has, and have a reviewer running by Friday. The hard part is not writing the reviewer; it is deciding to make every reviewer a typed verdict instead of a chat thread. Once the typed verdict is the convention, the rest is small.