March 15, 2026 · by Piyush · 6 min read

Building a Compliance Reviewer Agent in 60 Lines and a Golden Set

Tags: ContextOS · Harness Engineering · Reviewer Agents · Compliance · Change Control

Reviewer agents are the part of the harness that most teams build last and then wish they had built first. They are also the part that benefits the most from copying — once the first one is wired up, the others fall in behind it.

This post walks one reviewer end-to-end. I am picking the compliance reviewer because the rubric is small, the false-positive cost is low, and the inputs it needs already exist in any team that has read Harness Engineering. The full taxonomy of reviewer roles is in Reviewer Agents; this post is the build-along.

2026 update: reviewers are control evidence

The useful upgrade is to stop treating reviewer agents as helpful code reviewers and start treating them as control evidence. A compliance reviewer is valuable only if its verdict can be shown to an approver, attached to a release tuple, replayed against a golden set, and queried during audit.

That changes the bar. The reviewer does not need broad judgment. It needs stable predicates, typed findings, rule ids, evidence refs, and a deterministic recommendation. Small, boring checks beat eloquent review prose because they can block a release without interpretation.

What a reviewer agent actually is

It is not a chatbot. It is a deterministic skill that takes a typed input — a proposal, a DecisionRecord, a pack diff — and emits a typed envelope:

type ReviewerVerdict = {
  reviewer_id: string             // "compliance.v1"
  reviewer_version: string        // "2026.05.07"
  status: "pass" | "warn" | "fail"
  findings: Array<{
    severity: "info" | "warn" | "error"
    rule_id: string               // e.g. "C-001"
    message: string               // human-readable
    evidence_ref?: string         // pointer into the run / proposal
    recommendation?: string       // what to change
  }>
  policy_id_refs: string[]        // policies the reviewer cross-checked
  recommendation: "merge" | "block" | "needs_human"
  trace_id: string
}

The shape of the output matters more than the cleverness of the rubric. The envelope lands in the same change-control queue as Improvement Loop proposals, which means any human approver and any automation downstream can read every reviewer’s output identically. That is what makes the reviewer composable.
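To make that composability concrete, here is a sketch (not from the post's codebase) of a structural guard the change-control queue could run on any reviewer's output before accepting it. The field checks mirror the envelope above, but isReviewerVerdict itself is a hypothetical helper name.

```typescript
// Hypothetical structural check the queue could apply to any reviewer's
// output before filing it. Field names mirror the ReviewerVerdict envelope.
function isReviewerVerdict(v: unknown): boolean {
  if (typeof v !== "object" || v === null) return false
  const o = v as Record<string, unknown>
  return (
    typeof o.reviewer_id === "string" &&
    typeof o.reviewer_version === "string" &&
    ["pass", "warn", "fail"].includes(o.status as string) &&
    Array.isArray(o.findings) &&
    Array.isArray(o.policy_id_refs) &&
    ["merge", "block", "needs_human"].includes(o.recommendation as string) &&
    typeof o.trace_id === "string"
  )
}
```

Because every reviewer emits the same shape, one guard covers compliance, security, reliability, and whatever comes next.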

The compliance rubric

A useful first compliance reviewer covers four checks. Each is a small predicate over the proposal or run; the rubric is small enough to fit on a notecard, which is the point.

| Rule | What it checks | What pass looks like |
| --- | --- | --- |
| C-001 | Every adapter capability with side effects has an approval_mode declared | approval_mode ∈ five-tier set; destructive requires requires_approver: true |
| C-002 | Regulated data classes (PII, PHI, PAN, RESTRICTED) carry a classification at ingestion | every evidence_ref into the proposal has a classification field |
| C-003 | Destructive runs in the proposal carry a reversal_token and an idempotency_key | both fields present on every tool_call flagged destructive |
| C-004 | Every policy decision in the run has a policy_decision_id that resolves to an active rule | resolver returns a rule whose effective_window includes the run's timestamp |

Four rules. Each one fails loudly. Each one points at a fix. None of them require the reviewer to “understand” the proposal — they are mechanical predicates over the typed artifacts.

The skill file

Here is the compliance reviewer skill. It is deliberately short — a long reviewer is a sign the rubric drifted into preferences instead of constraints.

harness/reviewers/compliance.v1.ts
import type { Proposal, ReviewerVerdict, EvidenceRef, ToolCall } from "@/types"
import { resolvePolicyRule } from "@/policy"
 
const REGULATED = new Set(["PII", "PHI", "PAN", "RESTRICTED"])
const FIVE_TIER = new Set([
  "read_only",
  "local_write",
  "network",
  "delegated",
  "destructive",
])
 
export function reviewCompliance(p: Proposal): ReviewerVerdict {
  const findings: ReviewerVerdict["findings"] = []
 
  // C-001 — approval_mode declared, destructive requires approver
  for (const cap of p.tooling_layer.capabilities) {
    if (!FIVE_TIER.has(cap.approval_mode)) {
      findings.push({
        severity: "error",
        rule_id: "C-001",
        message: `capability ${cap.id} has unknown approval_mode "${cap.approval_mode}"`,
        evidence_ref: `proposal:${p.id}#tooling.${cap.id}`,
        recommendation: "set approval_mode to one of the five canonical tiers",
      })
    }
    if (cap.approval_mode === "destructive" && !cap.requires_approver) {
      findings.push({
        severity: "error",
        rule_id: "C-001",
        message: `destructive capability ${cap.id} missing requires_approver`,
        evidence_ref: `proposal:${p.id}#tooling.${cap.id}`,
        recommendation: "set requires_approver: true and bind an approval gate",
      })
    }
  }
 
  // C-002 — regulated data classes carry classification
  for (const ref of p.evidence_refs) {
    if (REGULATED.has(ref.data_class) && !ref.classification) {
      findings.push({
        severity: "error",
        rule_id: "C-002",
        message: `regulated evidence ${ref.id} has no classification`,
        evidence_ref: ref.id,
        recommendation: "set classification at ingestion in the KG pipeline",
      })
    }
  }
 
  // C-003 — destructive calls have reversal + idempotency
  for (const tc of toolCallsOf(p)) {
    if (tc.approval_mode === "destructive") {
      if (!tc.reversal_token) {
        findings.push({
          severity: "error",
          rule_id: "C-003",
          message: `destructive call ${tc.id} missing reversal_token`,
          evidence_ref: tc.id,
          recommendation: "issue a reversal_token at the gateway and persist it",
        })
      }
      if (!tc.idempotency_key) {
        findings.push({
          severity: "error",
          rule_id: "C-003",
          message: `destructive call ${tc.id} missing idempotency_key`,
          evidence_ref: tc.id,
          recommendation: "include idempotency_key in the toolCall envelope",
        })
      }
    }
  }
 
  // C-004 — every policy_decision resolves to an active rule
  for (const pd of p.policy_decisions ?? []) {
    const rule = resolvePolicyRule(pd.policy_decision_id, p.run_at)
    if (!rule) {
      findings.push({
        severity: "warn",
        rule_id: "C-004",
        message: `policy_decision ${pd.policy_decision_id} resolves to no active rule`,
        evidence_ref: pd.policy_decision_id,
        recommendation: "verify rule effective_window covers run_at",
      })
    }
  }
 
  const status: ReviewerVerdict["status"] =
    findings.some((f) => f.severity === "error") ? "fail" :
    findings.some((f) => f.severity === "warn") ? "warn" : "pass"
 
  return {
    reviewer_id: "compliance.v1",
    reviewer_version: "2026.05.07",
    status,
    findings,
    policy_id_refs: ["POLICY_REGULATED_DATA", "POLICY_APPROVAL_TIERS"],
    recommendation: status === "fail" ? "block"
                  : status === "warn" ? "needs_human" : "merge",
    trace_id: p.trace_id,
  }
}
 
function toolCallsOf(p: Proposal): ToolCall[] {
  return p.runs.flatMap((r) => r.tool_calls)
}

That is the whole reviewer. The rules are explicit, the envelope is typed, and the recommendations point the proposer at the fix. There is nothing in here a teammate cannot read in three minutes.

The golden set

A reviewer without a golden set is a guess. Five rows is enough to start; you will add to them as soon as you ship. The format is plain and replay-friendly:

harness/reviewers/compliance.v1.golden.yaml
- name: clean_refund_proposal
  input_path: ./fixtures/proposal_refund_clean.json
  expected:
    status: pass
    finding_rule_ids: []
 
- name: missing_reversal_token
  input_path: ./fixtures/proposal_refund_no_reversal.json
  expected:
    status: fail
    finding_rule_ids: [C-003]
    recommendation: block
 
- name: pii_without_classification
  input_path: ./fixtures/proposal_kg_pii_unclassified.json
  expected:
    status: fail
    finding_rule_ids: [C-002]
    recommendation: block
 
- name: stale_policy_reference
  input_path: ./fixtures/proposal_policy_expired.json
  expected:
    status: warn
    finding_rule_ids: [C-004]
    recommendation: needs_human
 
- name: destructive_no_approver
  input_path: ./fixtures/proposal_dest_no_approver.json
  expected:
    status: fail
    finding_rule_ids: [C-001]
    recommendation: block

The harness/evals runner replays this set on every reviewer change. If the reviewer flags or unflags differently than the goldens say, the change does not merge. This is what keeps the reviewer's judgment pinned: you can update the rubric, but you have to update (or add) goldens in the same PR.
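The comparison the runner applies per golden row is small enough to sketch. This is illustrative, not the actual harness/evals code: fixture loading and YAML parsing are elided, and matchesGolden is an assumed name. It checks status, the distinct set of finding rule ids, and, when the golden pins one, the recommendation.

```typescript
// Sketch of the per-row golden comparison a replay runner could apply.
// Fixture loading and YAML parsing are elided; names are illustrative.
type GoldenExpectation = {
  status: "pass" | "warn" | "fail"
  finding_rule_ids: string[]
  recommendation?: "merge" | "block" | "needs_human"
}

type VerdictLike = {
  status: "pass" | "warn" | "fail"
  findings: Array<{ rule_id: string }>
  recommendation: "merge" | "block" | "needs_human"
}

function matchesGolden(expected: GoldenExpectation, actual: VerdictLike): boolean {
  if (actual.status !== expected.status) return false
  // Compare the distinct sets of rule ids, order-insensitively.
  const got = [...new Set(actual.findings.map((f) => f.rule_id))].sort()
  const want = [...new Set(expected.finding_rule_ids)].sort()
  if (got.length !== want.length || got.some((id, i) => id !== want[i])) return false
  if (expected.recommendation && actual.recommendation !== expected.recommendation) return false
  return true
}
```

Comparing rule-id sets rather than full messages keeps goldens stable while you reword findings.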

Where the verdict goes

The reviewer’s envelope lands in the change-control queue alongside Improvement Loop proposals. A simple convention:

queue/
  proposals/{proposal_id}/
    proposal.json                    # the input
    reviewers/
      compliance.v1.json             # the verdict, exactly as emitted
      security.v1.json               # peer reviewers as they come online
      reliability.v1.json
    decisions/
      human_approver.json            # final human verdict

A proposal merges when:

  1. Every reviewer’s recommendation is merge, or
  2. Any reviewer says block — in which case the proposal does not merge, period, or
  3. Any reviewer says needs_human and a named approver signs off, citing the relevant rule_id they accepted as known-and-tolerated.

That third path is the one that earns its keep. needs_human is the reviewer saying "I have a judgment and I want a human to apply it." A human approves, the harness records the approver and the rule_id, and the next run that hits the same rule includes that decision in its lineage. Nothing is silently waved through.
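Those three paths reduce to a deterministic fold over the reviewer recommendations. A minimal sketch, assuming the recommendation values from the envelope and treating the named approver's sign-off as a boolean input; queueDecision and the returned state names are assumptions:

```typescript
type Recommendation = "merge" | "block" | "needs_human"

// Deterministic fold over reviewer recommendations: any block wins outright,
// then needs_human (which waits for a named approver), then merge.
function queueDecision(
  recommendations: Recommendation[],
  humanApproved: boolean,
): "merge" | "blocked" | "awaiting_human" {
  if (recommendations.some((r) => r === "block")) return "blocked"
  if (recommendations.some((r) => r === "needs_human")) {
    return humanApproved ? "merge" : "awaiting_human"
  }
  return "merge"
}
```

The order of the checks encodes the policy: a block from any reviewer cannot be overridden by a human approval in this sketch.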

Drift: how the reviewer itself stays honest

Reviewer agents are versioned just like packs and policies. The cadence:

  • Pin the reviewer version alongside pack and policy in every release tuple.
  • Replay the golden set on every reviewer change before merge.
  • Add a golden every time a finding pattern shows up in production that the reviewer missed (or was wrong about).
  • Sample produced verdicts quarterly — if the reviewer has flagged the same issue 1,000 times and no human has ever taken action on it, the rule is probably overcalibrated. Either tighten it or drop it.

Reviewers age. The policy bundle they cross-check shifts. New regulated classes appear. The discipline is the same as for any other harness component: version the artifact, replay against goldens, evaluate the verdicts.
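The quarterly sample can be mechanical too. A sketch under assumed field names: given a log of (rule_id, human_acted) pairs drawn from produced verdicts, surface rules that fire often but have never drawn a human action.

```typescript
// Sketch: surface rules that fire frequently but have never prompted a human
// action, i.e. candidates for tightening or retirement. Names are assumed.
type VerdictSample = { rule_id: string; human_acted: boolean }

function overcalibratedRules(samples: VerdictSample[], minFlags = 100): string[] {
  const byRule = new Map<string, { count: number; acted: boolean }>()
  for (const s of samples) {
    const entry = byRule.get(s.rule_id) ?? { count: 0, acted: false }
    entry.count += 1
    entry.acted = entry.acted || s.human_acted
    byRule.set(s.rule_id, entry)
  }
  return [...byRule.entries()]
    .filter(([, e]) => e.count >= minFlags && !e.acted)
    .map(([id]) => id)
    .sort()
}
```

The threshold is a judgment call; the point is that "flagged 1,000 times, acted on zero times" is a query, not a vibe.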

Ship checklist

| Item | Done when |
| --- | --- |
| Typed input | The reviewer accepts a proposal, pack diff, run bundle, or DecisionRecord envelope, never raw prose. |
| Rule ids | Every finding maps to a durable C-NNN rule with owner and policy refs. |
| Evidence refs | Every block/warn finding points to the exact artifact that failed. |
| Golden set | Clean, fail, warn, stale-policy, and destructive-action cases replay in CI. |
| Change control | Reviewer verdicts land beside human approvals and are retained with the release tuple. |
| Drift review | False positives, missed findings, and ignored warnings get reviewed on a fixed cadence. |

What to build next

Once the compliance reviewer is in place, the cheapest follow-ons are usually:

  1. Reliability reviewer — checks that timeouts, retries, and rollbacks are declared on every tool. The rubric is similar to compliance: a handful of mechanical predicates over the proposal.
  2. Security reviewer — PII paths, secret leakage, sandbox profile, injection-resistant prompts. This one borrows directly from your existing security-review patterns; it just types the verdict.
  3. Cost reviewer — token budgets, retrieval cost, run-budget headroom. Easy to write once you have per-pack-version cost dashboards.

The pattern is the same each time: small rubric, typed envelope, golden set, change-control queue. The first reviewer is the hard one because you are also building the queue and the envelope. Numbers two through seven cost a day each.

The team that ships the first reviewer this quarter usually ships three more by the end of the next one. The cost curve is generous.

Closing

The compliance reviewer is sixty lines, four rules, and five goldens. It blocks four classes of incident that would otherwise reach a human approver as a vague “this looks risky” feeling. It produces a typed envelope that the rest of the harness can read.

That is the whole shape of the work. You can copy this skeleton today, fill in the policies your platform actually has, and have a reviewer running by Friday. The hard part is not writing the reviewer; it is deciding to make every reviewer a typed verdict instead of a chat thread. Once the typed verdict is the convention, the rest is small.
