The compliance reviewer was the cheapest reviewer to ship. The reliability reviewer is the second cheapest, and the reason is that it inherits most of the scaffolding the compliance one built. Same ReviewerVerdict envelope. Same change-control queue. Same golden-set replay loop. Different rubric.
This post is the build-along for the second reviewer in the series. It is shorter on purpose: by post two you have already seen the pattern in Building a Compliance Reviewer Agent; the value here is the rubric and the four mechanical predicates that catch a different class of bug.
2026 update: reliability is a release contract
The reliability reviewer should be wired before broad rollout, not after the first incident. Timeouts, idempotency, rollback, and retry policy are not operational niceties; they are what keep a bounded Decision loop from becoming an unbounded production event.
The strongest reviewer here is intentionally mechanical. It does not ask whether the workflow is elegant. It asks whether every side-effecting path has a time limit, duplicate protection, failure classification, retry budget, reversal path, and observable trace.
What the reliability reviewer owns
While compliance flags policy and data-class violations, reliability flags operational omissions. The four checks:
| Rule | What it checks | What pass looks like |
|---|---|---|
| R-001 | Every adapter capability declares a default_timeout_ms | non-null integer between 100 and 60_000 |
| R-002 | Every non-read_only capability declares an idempotency_header | header name present; default class is required |
| R-003 | Every destructive capability declares a reversal_op and reversal_endpoint | both present; reversal endpoint reachable in registry |
| R-004 | Tool calls in the proposal that hit declared retryable errors include a backoff schedule | retry_with_backoff step has backoff_ms[] ≥ 2 entries |
Four checks, mechanical, no LLM judgment. Same shape as compliance.
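The four rules read only a handful of registry fields. Here is a minimal sketch of those shapes, inferred from the field names in the table. The real definitions in `@/types` are not shown in this post, so the type names (`Capability`, `AdapterEntry`), the exact set of approval modes, and the helper `timeoutInBounds` are all assumptions for illustration:

```typescript
// Hypothetical shapes inferred from the rule table; the real @/types may differ.
type ApprovalMode = "read_only" | "local_write" | "destructive"

interface Capability {
  id: string
  approval_mode: ApprovalMode
  timeout_ms?: number          // optional per-capability override
  idempotency_header?: string
  reversal_op?: string         // id of the capability that undoes this one
}

interface AdapterEntry {
  adapter_id: string
  default_timeout_ms?: number
  idempotency_header?: string
  capabilities: Capability[]
}

// R-001 as a pure predicate: the effective timeout must be an integer in [100, 60000].
function timeoutInBounds(adp: AdapterEntry, cap: Capability): boolean {
  const t = cap.timeout_ms ?? adp.default_timeout_ms
  return typeof t === "number" && Math.floor(t) === t && t >= 100 && t <= 60_000
}

// Example registry entry (made-up adapter and capability names).
const example: AdapterEntry = {
  adapter_id: "crm",
  default_timeout_ms: 5_000,
  capabilities: [
    { id: "get_contact", approval_mode: "read_only" },
    { id: "delete_contact", approval_mode: "destructive", timeout_ms: 90_000 },
  ],
}
```

In this sketch `get_contact` inherits the adapter default and passes, while `delete_contact` overrides it with a value above the 60-second ceiling and fails.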
The skill file
```typescript
import type { Proposal, ReviewerVerdict } from "@/types"
import { listAdapterRegistry, listToolEnvelopes } from "@/store"

export async function reviewReliability(p: Proposal): Promise<ReviewerVerdict> {
  const findings: ReviewerVerdict["findings"] = []
  const registry = await listAdapterRegistry()

  // R-001: default_timeout_ms declared, an integer, and within bounds
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      // A per-capability timeout_ms overrides the adapter default.
      const t = cap.timeout_ms ?? adp.default_timeout_ms
      if (typeof t !== "number" || !Number.isInteger(t) || t < 100 || t > 60_000) {
        findings.push({
          severity: "error",
          rule_id: "R-001",
          message: `${adp.adapter_id}.${cap.id} missing default_timeout_ms (or out of bounds)`,
          evidence_ref: `adapter:${adp.adapter_id}.${cap.id}`,
          recommendation: "set default_timeout_ms in [100, 60000]",
        })
      }
    }
  }

  // R-002: idempotency header declared for non-read_only capabilities
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      if (cap.approval_mode === "read_only") continue
      const header = cap.idempotency_header ?? adp.idempotency_header
      const def = cap.default_idempotency ?? adp.default_idempotency
      // A missing header is an error; a non-required default is only a warning.
      if (!header) {
        findings.push({
          severity: "error",
          rule_id: "R-002",
          message: `${adp.adapter_id}.${cap.id} missing idempotency_header`,
          evidence_ref: `adapter:${adp.adapter_id}.${cap.id}`,
          recommendation: 'declare idempotency_header (typically "Idempotency-Key")',
        })
      }
      if (def !== "required") {
        findings.push({
          severity: "warn",
          rule_id: "R-002",
          message: `${adp.adapter_id}.${cap.id} idempotency default is "${def ?? "unset"}", expected "required"`,
        })
      }
    }
  }

  // R-003: destructive capabilities have reversal_op + reachable endpoint
  for (const adp of registry) {
    for (const cap of adp.capabilities) {
      if (cap.approval_mode !== "destructive") continue
      if (!cap.reversal_op) {
        findings.push({
          severity: "error",
          rule_id: "R-003",
          message: `destructive ${adp.adapter_id}.${cap.id} missing reversal_op`,
          recommendation: "declare a reversal endpoint paired with this capability",
        })
        continue
      }
      // reversal_op must be a registered operation on the same adapter
      const reverse = adp.capabilities.find((c) => c.id === cap.reversal_op)
      if (!reverse) {
        findings.push({
          severity: "error",
          rule_id: "R-003",
          message: `${adp.adapter_id}.${cap.id} declares reversal_op ${cap.reversal_op}, not registered on adapter`,
        })
      }
    }
  }

  // R-004: runs that hit retryable errors had backoff_ms[] declared
  const calls = await listToolEnvelopes(p)
  for (const tc of calls) {
    if (tc.failure_classified !== "transient_timeout" && tc.failure_classified !== "server_error_5xx") continue
    // Resolve the capability on the adapter that made the call;
    // capability ids may repeat across adapters.
    const adp = registry.find((a) => a.adapter_id === tc.adapter_id)
    const cap = adp?.capabilities.find((c) => c.id === tc.capability_id)
    const schedule = cap?.failure_handling?.[tc.failure_classified]?.backoff_ms ?? []
    if (!Array.isArray(schedule) || schedule.length < 2) {
      findings.push({
        severity: "warn",
        rule_id: "R-004",
        message: `${tc.adapter_id}.${tc.capability_id} hit ${tc.failure_classified} but has < 2 backoff_ms entries`,
        evidence_ref: tc.id,
        recommendation: "set backoff_ms: [200, 600, 1800] or similar",
      })
    }
  }

  // Any error fails the review; warnings alone route to a human.
  const errors = findings.filter((f) => f.severity === "error").length
  const status: ReviewerVerdict["status"] =
    errors > 0 ? "fail" : findings.length > 0 ? "warn" : "pass"

  return {
    reviewer_id: "reliability.v1",
    reviewer_version: "2026.03.18",
    status,
    findings,
    policy_id_refs: ["POLICY_TIMEOUTS", "POLICY_IDEMPOTENCY", "POLICY_REVERSAL"],
    recommendation: status === "fail" ? "block"
      : status === "warn" ? "needs_human" : "merge",
    trace_id: p.trace_id,
  }
}
```

The shape mirrors the compliance reviewer exactly. That is the reuse the second-reviewer post earns: ReviewerVerdict envelope, findings[] with severities, recommendation mapping to the change-control queue. Pick the rubric, write the predicates, ship.
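The part every reviewer shares is the last step: turning findings[] into a status and a queue action. Here is a sketch of that step as two pure functions, which makes it easy to unit-test apart from any rubric. The type names and the two-level severity set are assumptions based on what the skill file uses, not the actual `@/types` definitions:

```typescript
// Hypothetical minimal types; only "error" and "warn" severities appear in this post.
type Severity = "error" | "warn"
type Status = "pass" | "warn" | "fail"
type Recommendation = "merge" | "needs_human" | "block"

interface Finding {
  severity: Severity
  rule_id: string
  message: string
}

// Any error fails the review; warnings alone downgrade it to "warn".
function statusOf(findings: Finding[]): Status {
  if (findings.some((f) => f.severity === "error")) return "fail"
  return findings.length > 0 ? "warn" : "pass"
}

// Fixed mapping from status to the change-control queue action.
const ACTION: Record<Status, Recommendation> = {
  fail: "block",
  warn: "needs_human",
  pass: "merge",
}

function recommendationOf(status: Status): Recommendation {
  return ACTION[status]
}
```

Keeping the mapping in one table means a third reviewer cannot drift from the first two by accident.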
The golden set
Five rows, kept in the same harness/reviewers/ directory:
```yaml
- name: clean_adapter
  input_path: ./fixtures/proposal_clean.json
  expected:
    status: pass
    finding_rule_ids: []
- name: missing_timeout
  input_path: ./fixtures/proposal_missing_timeout.json
  expected:
    status: fail
    finding_rule_ids: [R-001]
    recommendation: block
- name: idempotency_not_required
  input_path: ./fixtures/proposal_idempotency_optional.json
  expected:
    status: warn
    finding_rule_ids: [R-002]
    recommendation: needs_human
- name: destructive_without_reversal
  input_path: ./fixtures/proposal_no_reversal.json
  expected:
    status: fail
    finding_rule_ids: [R-003]
    recommendation: block
- name: timeout_no_backoff
  input_path: ./fixtures/proposal_no_backoff_after_timeout.json
  expected:
    status: warn
    finding_rule_ids: [R-004]
    recommendation: needs_human
```

The five rows cover all four rules plus the clean baseline. The golden runner replays this set on every change to reliability.v1.ts; failed assertions block the PR.
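The golden runner itself is not shown in this series entry. As a rough sketch, the per-row assertion could look like the function below. `checkGoldenRow` and its types are hypothetical names, and comparing rule ids as a de-duplicated set is an assumption (a fixture with two capabilities missing a timeout would otherwise report R-001 twice):

```typescript
// Hypothetical shapes for one golden row's expectation and a reviewer's output.
interface Expected {
  status: "pass" | "warn" | "fail"
  finding_rule_ids: string[]
  recommendation?: "merge" | "needs_human" | "block"
}

interface VerdictLike {
  status: "pass" | "warn" | "fail"
  findings: { rule_id: string }[]
  recommendation: "merge" | "needs_human" | "block"
}

// Returns human-readable mismatches; an empty list means the row passes.
function checkGoldenRow(rowName: string, expected: Expected, actual: VerdictLike): string[] {
  const problems: string[] = []
  if (actual.status !== expected.status) {
    problems.push(`${rowName}: status ${actual.status}, expected ${expected.status}`)
  }
  // Compare rule ids as sorted, de-duplicated lists so order and repeats do not matter.
  const uniqSorted = (ids: string[]): string[] =>
    ids.filter((id, i) => ids.indexOf(id) === i).sort()
  const got = uniqSorted(actual.findings.map((f) => f.rule_id)).join(",")
  const want = uniqSorted(expected.finding_rule_ids).join(",")
  if (got !== want) {
    problems.push(`${rowName}: rule ids [${got}], expected [${want}]`)
  }
  // The clean row declares no recommendation, so only check it when present.
  if (expected.recommendation !== undefined && actual.recommendation !== expected.recommendation) {
    problems.push(`${rowName}: recommendation ${actual.recommendation}, expected ${expected.recommendation}`)
  }
  return problems
}
```

A runner would load each row's fixture, call the reviewer, pass the result through this check, and fail the PR if any row returns a non-empty list.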
How the verdicts compose
A proposal merges without human sign-off only when every active reviewer says merge. With reliability and compliance both wired:
```
queue/proposals/{proposal_id}/
  proposal.json
  reviewers/
    compliance.v1.json     # status: pass | warn | fail
    reliability.v1.json    # status: pass | warn | fail
  decisions/
    human_approver.json    # only if any reviewer says needs_human
```

If compliance says merge and reliability says block, the proposal does not merge: block always wins. If both say needs_human, the human approver sees both findings sets in the same UI and signs once. The audit trail records every reviewer’s verdict.
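The composition rule is a strictest-wins fold over the reviewer verdicts. Here is a minimal sketch; `composeRecommendations` and `ReviewerResult` are hypothetical names, and treating an empty verdict list as needs_human (fail closed) is an assumption the post does not state:

```typescript
type Recommendation = "merge" | "needs_human" | "block"

interface ReviewerResult {
  reviewer_id: string
  recommendation: Recommendation
}

// Strictest recommendation wins: block > needs_human > merge.
function composeRecommendations(results: ReviewerResult[]): Recommendation {
  // Assumption: with no verdicts on file, fail closed and ask a human.
  if (results.length === 0) return "needs_human"
  if (results.some((r) => r.recommendation === "block")) return "block"
  if (results.some((r) => r.recommendation === "needs_human")) return "needs_human"
  return "merge"
}

// Compliance passes, reliability blocks: the proposal is blocked.
const outcome = composeRecommendations([
  { reviewer_id: "compliance.v1", recommendation: "merge" },
  { reviewer_id: "reliability.v1", recommendation: "block" },
])
```

Because the fold does not care how many reviewers there are, adding a third reviewer changes the input list and nothing else.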
What reliability catches that compliance does not
Three failure patterns that show up in production reliably and that the compliance rubric never flags:
Adapter without timeout. A new adapter ships, gets called from a production intent, and one slow upstream means the run hangs for the runtime’s hard ceiling (often 60 seconds). Latency scorecard tanks; users wait. Reliability reviewer catches this at proposal time, not at first incident.
local_write without idempotency. A retry of a write that already succeeded duplicates the row. The compliance reviewer does not look at idempotency. The reliability reviewer does, and R-002 blocks every non-read_only capability that did not declare a header.
Destructive without reversal. A destructive capability ships with no declared reversal_op. The first time something goes wrong, there is no rollback path; the team writes one in a panic at 3 a.m. Reliability rejects the proposal at change-control time.
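To make the duplicate-write pattern concrete, here is a toy sketch of what an idempotency key buys on retry. `RowStore` is a made-up in-memory stand-in, not part of the adapter registry or any real store; a real adapter would send the key in the declared idempotency_header and the upstream service would do the de-duplication:

```typescript
// Hypothetical in-memory write target used only to illustrate retry behaviour.
class RowStore {
  private rows: string[] = []
  private seen: Record<string, number> = {} // idempotency key -> row index

  // Returns the row index. A repeated key returns the original index without writing again.
  insert(value: string, idempotencyKey?: string): number {
    if (
      idempotencyKey !== undefined &&
      Object.prototype.hasOwnProperty.call(this.seen, idempotencyKey)
    ) {
      return this.seen[idempotencyKey]
    }
    this.rows.push(value)
    const index = this.rows.length - 1
    if (idempotencyKey !== undefined) this.seen[idempotencyKey] = index
    return index
  }

  count(): number {
    return this.rows.length
  }
}

// Without a key, a retry of a write that already succeeded duplicates the row.
const unkeyed = new RowStore()
unkeyed.insert("invoice-42")
unkeyed.insert("invoice-42") // retry

// With a key, the retry is recognised and the row is written once.
const keyed = new RowStore()
keyed.insert("invoice-42", "key-1")
keyed.insert("invoice-42", "key-1") // retry
```

After these calls `unkeyed.count()` is 2 and `keyed.count()` is 1, which is the difference R-002 exists to enforce.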
Production checklist
| Control | Minimum bar |
|---|---|
| Timeout | Every adapter and capability has bounded default_timeout_ms, with intent-specific latency budgets. |
| Idempotency | Every non-read_only call requires an idempotency key and records duplicate handling. |
| Retry | Retryable errors map to bounded backoff schedules and loop-guard budgets. |
| Reversal | destructive capabilities declare a tested reversal operation or a documented irreversible class. |
| Trace | Tool envelopes carry trace_id, attempt number, failure class, and recovery action. |
| Release gate | Reliability findings block promotion unless a named approver accepts the residual risk. |
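For the Retry row, "bounded" can be made checkable. The sketch below builds an exponential schedule and validates it. Both function names, the `capMs` ceiling, and the `maxEntries` retry budget are assumptions for illustration; only the two-entry minimum comes from R-004:

```typescript
// Builds an exponential schedule: baseMs, baseMs*factor, ... with each delay capped at capMs.
function backoffSchedule(baseMs: number, factor: number, attempts: number, capMs: number): number[] {
  const schedule: number[] = []
  let delay = baseMs
  for (let i = 0; i < attempts; i++) {
    schedule.push(Math.min(delay, capMs))
    delay *= factor
  }
  return schedule
}

// R-004's minimum of two entries, plus a hypothetical upper bound on retries and delay.
function isBoundedSchedule(schedule: number[], maxEntries: number, capMs: number): boolean {
  return (
    schedule.length >= 2 &&
    schedule.length <= maxEntries &&
    schedule.every((ms) => ms > 0 && ms <= capMs)
  )
}

// Three attempts from 200 ms with factor 3 gives [200, 600, 1800].
const schedule = backoffSchedule(200, 3, 3, 60_000)
```

A single-entry schedule such as `[500]` fails the check, as does one with more entries than the retry budget allows.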
What to wire next
After reliability, the third-cheapest reviewer is security — PII paths, secret leakage, sandbox profile, injection-resistant prompt construction. The shape stays the same; the rubric is different again. By the third reviewer, the whole pattern is muscle memory:
1. Fork the prior reviewer's skill file.
2. Replace the rubric (the `R-NNN` predicates).
3. Add 5 golden-set rows.
4. Wire the reviewer's id into runReviewerRecommendations() in the gate request builder.
5. Ship.

A day per reviewer. Three reviewers cover compliance, reliability, and security, the three concerns that block 90% of destructive-action incidents on the agent stacks I have seen. Cost is low, leverage is high.
The compliance reviewer was the cheapest first one. The reliability reviewer was the cheapest second one. Numbers three through seven are all cheaper than the first.