Reviewers & improvement
March 18, 2026
by Piyush · 4 min read

Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One

ContextOS
Harness Engineering
Reviewer Agents
Reliability
Change Control

The compliance reviewer was the cheapest reviewer to ship. The reliability reviewer is the second cheapest, and the reason is that it inherits most of the scaffolding the compliance one built. Same ReviewerVerdict envelope. Same change-control queue. Same golden-set replay loop. Different rubric.

This post is the build-along for the second reviewer in the series. It is shorter on purpose: by post two you have already seen the pattern in Building a Compliance Reviewer Agent; the value here is the rubric and the four mechanical predicates that catch a different class of bug.

2026 update: reliability is a release contract

The reliability reviewer should be wired before broad rollout, not after the first incident. Timeouts, idempotency, rollback, and retry policy are not operational niceties; they are what keep a bounded Decision loop from becoming an unbounded production event.

The strongest reviewer here is intentionally mechanical. It does not ask whether the workflow is elegant. It asks whether every side-effecting path has a time limit, duplicate protection, failure classification, retry budget, reversal path, and observable trace.

What the reliability reviewer owns

While compliance flags policy and data-class violations, reliability flags operational omissions. The four checks:

| Rule | What it checks | What pass looks like |
| --- | --- | --- |
| R-001 | Every adapter capability declares a default_timeout_ms | non-null integer between 100 and 60_000 |
| R-002 | Every non-read_only capability declares an idempotency_header | header name present; default class is required |
| R-003 | Every destructive capability declares a reversal_op | present and registered as an operation on the same adapter |
| R-004 | Tool calls that hit declared retryable errors include a backoff schedule | backoff_ms[] with ≥ 2 entries for the failure class |

Four checks, mechanical, no LLM judgment. Same shape as compliance.
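For concreteness, here is a hypothetical registry entry that passes all four checks. The adapter id and capability names are invented for illustration; the field names mirror what the reviewer below reads from the registry:

```typescript
// Hypothetical adapter declaration that passes R-001 through R-004.
// "payments" and the refund capabilities are invented examples.
type Capability = {
  id: string
  approval_mode: "read_only" | "local_write" | "destructive"
  timeout_ms?: number
  idempotency_header?: string
  default_idempotency?: "required" | "optional"
  reversal_op?: string
  failure_handling?: Record<string, { backoff_ms: number[] }>
}

export const paymentsAdapter: {
  adapter_id: string
  default_timeout_ms: number
  idempotency_header: string
  default_idempotency: "required" | "optional"
  capabilities: Capability[]
} = {
  adapter_id: "payments",
  default_timeout_ms: 5_000,             // R-001 fallback for capabilities
  idempotency_header: "Idempotency-Key", // R-002 fallback
  default_idempotency: "required",
  capabilities: [
    {
      id: "refund.create",
      approval_mode: "destructive",
      timeout_ms: 10_000,                // R-001: within [100, 60_000]
      default_idempotency: "required",   // R-002: duplicates rejected
      reversal_op: "refund.cancel",      // R-003: registered below
      failure_handling: {
        transient_timeout: { backoff_ms: [200, 600, 1800] }, // R-004
        server_error_5xx: { backoff_ms: [500, 2_000] },
      },
    },
    // The reversal operation itself, a plain non-destructive write.
    { id: "refund.cancel", approval_mode: "local_write", timeout_ms: 10_000 },
  ],
}
```

Note that R-002 passes here via the adapter-level idempotency_header fallback, which is exactly the `cap.idempotency_header ?? adp.idempotency_header` lookup the reviewer performs.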

The skill file

harness/reviewers/reliability.v1.ts
import type { Proposal, ReviewerVerdict } from "@/types"
import { listAdapterRegistry, listToolEnvelopes } from "@/store"
 
export async function reviewReliability(p: Proposal): Promise<ReviewerVerdict> {
  const findings: ReviewerVerdict["findings"] = []
  const registry = await listAdapterRegistry()
 
  // R-001 — default_timeout_ms declared and within bounds
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      const t = cap.timeout_ms ?? adp.default_timeout_ms
      if (typeof t !== "number" || t < 100 || t > 60_000) {
        findings.push({
          severity: "error",
          rule_id: "R-001",
          message: `${adp.adapter_id}.${cap.id} missing default_timeout_ms (or out of bounds)`,
          evidence_ref: `adapter:${adp.adapter_id}.${cap.id}`,
          recommendation: "set default_timeout_ms in [100, 60000]",
        })
      }
    }
  }
 
  // R-002 — idempotency header declared for non-read_only capabilities
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      if (cap.approval_mode === "read_only") continue
      const header = cap.idempotency_header ?? adp.idempotency_header
      const def = cap.default_idempotency ?? adp.default_idempotency
      if (!header) {
        findings.push({
          severity: "error",
          rule_id: "R-002",
          message: `${adp.adapter_id}.${cap.id} missing idempotency_header`,
          evidence_ref: `adapter:${adp.adapter_id}.${cap.id}`,
          recommendation: 'declare idempotency_header (typically "Idempotency-Key")',
        })
      }
      if (def !== "required") {
        findings.push({
          severity: "warn",
          rule_id: "R-002",
          message: `${adp.adapter_id}.${cap.id} idempotency default is "${def ?? "unset"}", expected "required"`,
        })
      }
    }
  }
 
  // R-003 — destructive capabilities have reversal_op + reachable endpoint
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      if (cap.approval_mode !== "destructive") continue
      if (!cap.reversal_op) {
        findings.push({
          severity: "error",
          rule_id: "R-003",
          message: `destructive ${adp.adapter_id}.${cap.id} missing reversal_op`,
          recommendation: "declare a reversal endpoint paired with this capability",
        })
        continue
      }
      // reversal_op must be a registered operation on the same adapter
      const reverse = adp.capabilities.find((c) => c.id === cap.reversal_op)
      if (!reverse) {
        findings.push({
          severity: "error",
          rule_id: "R-003",
          message: `${adp.adapter_id}.${cap.id} declares reversal_op ${cap.reversal_op}, not registered on adapter`,
        })
      }
    }
  }
 
  // R-004 — runs that hit retryable errors had backoff_ms[] declared
  const calls = await listToolEnvelopes(p)
  for (const tc of calls) {
    if (tc.failure_classified !== "transient_timeout" && tc.failure_classified !== "server_error_5xx") continue
    const cap = registry
      .find((a) => a.adapter_id === tc.adapter_id)
      ?.capabilities.find((c) => c.id === tc.capability_id)
    const schedule = cap?.failure_handling?.[tc.failure_classified]?.backoff_ms ?? []
    if (!Array.isArray(schedule) || schedule.length < 2) {
      findings.push({
        severity: "warn",
        rule_id: "R-004",
        message: `${tc.adapter_id}.${tc.capability_id} hit ${tc.failure_classified} but has < 2 backoff_ms entries`,
        evidence_ref: tc.id,
        recommendation: "set backoff_ms: [200, 600, 1800] or similar",
      })
    }
  }
 
  const errors = findings.filter((f) => f.severity === "error").length
  const status: ReviewerVerdict["status"] =
    errors > 0 ? "fail" : findings.length > 0 ? "warn" : "pass"
 
  return {
    reviewer_id: "reliability.v1",
    reviewer_version: "2026.03.18",
    status,
    findings,
    policy_id_refs: ["POLICY_TIMEOUTS", "POLICY_IDEMPOTENCY", "POLICY_REVERSAL"],
    recommendation: status === "fail" ? "block"
                  : status === "warn" ? "needs_human" : "merge",
    trace_id: p.trace_id,
  }
}

The shape mirrors the compliance reviewer exactly. That is the reuse the second-reviewer post earns: ReviewerVerdict envelope, findings[] with severities, recommendation mapping to the change-control queue. Pick the rubric, write the predicates, ship.
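For reference, here is a plausible shape for that shared envelope, inferred from how the reviewer constructs it above; the real @/types definition may carry more fields:

```typescript
// Plausible ReviewerVerdict shape, inferred from usage in reliability.v1.ts.
type Severity = "error" | "warn" | "info"

interface Finding {
  severity: Severity
  rule_id: string       // "R-001" ... "R-004" for this reviewer
  message: string
  evidence_ref?: string // pointer into the registry or a tool envelope
  recommendation?: string
}

interface ReviewerVerdict {
  reviewer_id: string
  reviewer_version: string
  status: "pass" | "warn" | "fail"
  findings: Finding[]
  policy_id_refs: string[]
  recommendation: "merge" | "needs_human" | "block"
  trace_id: string
}

// A minimal passing verdict, for illustration:
export const example: ReviewerVerdict = {
  reviewer_id: "reliability.v1",
  reviewer_version: "2026.03.18",
  status: "pass",
  findings: [],
  policy_id_refs: ["POLICY_TIMEOUTS"],
  recommendation: "merge",
  trace_id: "trace-demo",
}
```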

The golden set

Five rows, kept in the same harness/reviewers/ directory:

harness/reviewers/reliability.v1.golden.yaml
- name: clean_adapter
  input_path: ./fixtures/proposal_clean.json
  expected:
    status: pass
    finding_rule_ids: []
 
- name: missing_timeout
  input_path: ./fixtures/proposal_missing_timeout.json
  expected:
    status: fail
    finding_rule_ids: [R-001]
    recommendation: block
 
- name: idempotency_not_required
  input_path: ./fixtures/proposal_idempotency_optional.json
  expected:
    status: warn
    finding_rule_ids: [R-002]
    recommendation: needs_human
 
- name: destructive_without_reversal
  input_path: ./fixtures/proposal_no_reversal.json
  expected:
    status: fail
    finding_rule_ids: [R-003]
    recommendation: block
 
- name: timeout_no_backoff
  input_path: ./fixtures/proposal_no_backoff_after_timeout.json
  expected:
    status: warn
    finding_rule_ids: [R-004]
    recommendation: needs_human

The five rows cover all four rules plus the clean baseline. The golden runner replays this set on every change to reliability.v1.ts; failed assertions block the PR.
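A minimal runner for that replay loop might look like the sketch below. The harness names here are assumptions (the post does not show the runner itself); the contract is only: load fixture, run reviewer, diff verdict against `expected`, return failures that block the PR.

```typescript
// Minimal golden-set runner sketch (assumed harness; real CI wiring differs).
import { readFileSync } from "node:fs"

interface GoldenRow {
  name: string
  input_path: string
  expected: { status: string; finding_rule_ids: string[]; recommendation?: string }
}

interface VerdictLike {
  status: string
  findings: { rule_id: string }[]
  recommendation: string
}

export async function runGoldenSet(
  rows: GoldenRow[],
  review: (proposal: unknown) => Promise<VerdictLike>,
  // Loader is injectable so tests can bypass the filesystem.
  load: (path: string) => unknown = (p) => JSON.parse(readFileSync(p, "utf8")),
): Promise<string[]> {
  const failures: string[] = []
  for (const row of rows) {
    const verdict = await review(load(row.input_path))
    const got = [...new Set(verdict.findings.map((f) => f.rule_id))].sort()
    const want = [...row.expected.finding_rule_ids].sort()
    if (verdict.status !== row.expected.status)
      failures.push(`${row.name}: status ${verdict.status} != ${row.expected.status}`)
    if (got.join(",") !== want.join(","))
      failures.push(`${row.name}: rules [${got}] != [${want}]`)
    if (row.expected.recommendation && verdict.recommendation !== row.expected.recommendation)
      failures.push(`${row.name}: recommendation ${verdict.recommendation}`)
  }
  return failures // non-empty means the PR is blocked
}
```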

How the verdicts compose

A proposal merges only when every active reviewer says merge. With reliability and compliance both wired:

queue/proposals/{proposal_id}/
  proposal.json
  reviewers/
    compliance.v1.json     # status: pass | warn | fail
    reliability.v1.json    # status: pass | warn | fail
  decisions/
    human_approver.json    # only if any reviewer says needs_human

If compliance says merge and reliability says block, the proposal does not merge — block always wins. If both say needs_human, the human approver sees both sets of findings in the same UI and signs once. The audit trail records every reviewer’s verdict.
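The composition rule is small enough to state as code. This is a sketch of the precedence just described, not the gate's actual implementation:

```typescript
// block beats needs_human beats merge: a proposal merges only when
// every active reviewer recommends merge.
type Recommendation = "merge" | "needs_human" | "block"

export function composeRecommendations(recs: Recommendation[]): Recommendation {
  if (recs.includes("block")) return "block"             // any single block wins
  if (recs.includes("needs_human")) return "needs_human" // one human sign-off covers all
  return "merge"                                         // unanimous merge
}
```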

What reliability catches that compliance does not

Three failure patterns that show up in production reliably and that the compliance rubric never flags:

Adapter without timeout. A new adapter ships, gets called from a production intent, and one slow upstream means the run hangs for the runtime’s hard ceiling (often 60 seconds). Latency scorecard tanks; users wait. Reliability reviewer catches this at proposal time, not at first incident.

local_write without idempotency. A retry of a write that already succeeded duplicates the row. The compliance reviewer does not look at idempotency. The reliability reviewer does, and R-002 blocks every non-read_only capability that did not declare a header.

Destructive without reversal. A destructive capability ships with no declared reversal_op. The first time something goes wrong, there is no rollback path; the team writes one in a panic at 3 a.m. Reliability rejects the proposal at change-control time.

Production checklist

| Control | Minimum bar |
| --- | --- |
| Timeout | Every adapter and capability has a bounded default_timeout_ms, with intent-specific latency budgets. |
| Idempotency | Every non-read_only call requires an idempotency key and records duplicate handling. |
| Retry | Retryable errors map to bounded backoff schedules and loop-guard budgets. |
| Reversal | destructive capabilities declare a tested reversal operation or a documented irreversible class. |
| Trace | Tool envelopes carry trace_id, attempt number, failure class, and recovery action. |
| Release gate | Reliability findings block promotion unless a named approver accepts the residual risk. |

What to wire next

After reliability, the third-cheapest reviewer is security — PII paths, secret leakage, sandbox profile, injection-resistant prompt construction. The shape stays the same; the rubric is different again. By the third reviewer, the whole pattern is muscle memory:

1. Fork the prior reviewer's skill file.
2. Replace the rubric (the `R-NNN` predicates).
3. Add 5 golden-set rows.
4. Wire the reviewer's id into runReviewerRecommendations() in the gate request builder.
5. Ship.
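Step 4 is the only wiring step. The post names runReviewerRecommendations() without showing it, so the registry shape below is a hypothetical sketch of what it might consume, not the real API:

```typescript
// Hypothetical reviewer registry for the gate request builder.
// All names here are assumptions for illustration.
type Proposal = { trace_id: string }
type Verdict = { reviewer_id: string; recommendation: "merge" | "needs_human" | "block" }
type Reviewer = { id: string; run: (p: Proposal) => Promise<Verdict> }

const activeReviewers: Reviewer[] = []

export function registerReviewer(r: Reviewer): void {
  activeReviewers.push(r) // step 4: one line per new reviewer
}

export async function runReviewerRecommendations(p: Proposal): Promise<Verdict[]> {
  // Fan out to every active reviewer; the gate builder composes the results.
  return Promise.all(activeReviewers.map((r) => r.run(p)))
}
```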

A day per reviewer. Three reviewers cover compliance, reliability, and security — the three concerns that block 90% of destructive-action incidents on the agent stacks I have seen. Cost is low, leverage is high.

The compliance reviewer was the cheapest first one. The reliability reviewer was the cheapest second one. Numbers three through seven are all cheaper than the first.

