The whole point of a harness is that failures upgrade the harness, not just the output. If an operator overrides a refund decision today and the same class of decision goes wrong tomorrow, the harness has not improved — it has been used. The Improvement Loop is the discipline of making every correction land somewhere durable.
I am going to walk one correction end-to-end — from the moment the operator clicks override to the day the resulting StrategyRule replaces the prompt advice that allowed the mistake. Code is concrete. Schemas are real. The full canonical version of this loop is in Improvement Loop; this post is the build-along.
2026 update: corrections need a release path
The biggest mistake teams make with feedback is stopping at capture. They collect thumbs-downs, comments, overrides, and support notes, then leave the harness unchanged. That creates a product analytics loop, not an improvement loop.
The ContextOS bar is stricter: a correction should either become a typed change candidate or be intentionally closed with a reason. The durable path is FeedbackEntry -> InsightCluster -> StrategyProposal -> ReplayVerdict -> StrategyRule, and the proposal does not ship until replay proves it fixes the intended traces without regressing the golden set.
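The "intentionally closed" branch deserves a shape of its own, so a dropped correction still leaves a trail. A minimal sketch, assuming a FeedbackDisposition type and a writeDisposition store helper that the rest of this post does not define:

```ts
import { writeDisposition } from "@/store" // assumed helper, analogous to writeFeedbackEntry

// Hypothetical: the typed "closed with a reason" record for corrections that do
// not become change candidates.
export type FeedbackDisposition = {
  feedback_id: string
  disposition: "promoted_to_insight" | "closed_no_action"
  reason?: string // required in practice when closed_no_action
  decided_by: string // reviewer user_id
  decided_at: string
}

export async function closeWithoutAction(
  feedback_id: string,
  reason: string,
  decided_by: string,
): Promise<FeedbackDisposition> {
  return writeDisposition({
    feedback_id,
    disposition: "closed_no_action",
    reason,
    decided_by,
    decided_at: new Date().toISOString(),
  })
}
```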
The shape of the chain
Five typed artifacts hand off, each one feeding the next:

```
operator override
→ FeedbackEntry (immutable, signed)
→ InsightCluster (deduped pattern)
→ StrategyProposal (typed candidate change)
→ ReplayVerdict (against goldens)
→ StrategyRule (released, pinned, versioned)
```

Each artifact has a fixed schema. Each transition has a typed function. The chain is auditable end-to-end because every transition writes a row to the experience store with the prior artifact’s hash.
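That last sentence is worth making concrete. A minimal sketch of the transition row; hashArtifact is an assumed helper, and the field names are illustrative rather than the canonical store schema:

```ts
import { createHash } from "node:crypto"

// One row per transition in the experience store. Carrying the prior artifact's
// hash is what makes the chain auditable end-to-end.
export type TransitionRow = {
  from_artifact: "FeedbackEntry" | "InsightCluster" | "StrategyProposal" | "ReplayVerdict"
  to_artifact: "InsightCluster" | "StrategyProposal" | "ReplayVerdict" | "StrategyRule"
  from_id: string
  to_id: string
  from_hash: string // hash of the prior artifact, pinned at transition time
  at: string
}

// Assumed helper: hash of the serialized artifact (a canonical serialization
// that fixes key order is better in practice).
export function hashArtifact(artifact: unknown): string {
  return createHash("sha256").update(JSON.stringify(artifact)).digest("hex")
}
```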
Step 1 — capturing the correction
The operator UI is a thin wrapper over a typed write to the FeedbackStore. Free-text comments are fine; they go into a notes field. The structured fields are what the loop runs on.

```ts
export type FeedbackEntry = {
id: string // fb_2026_05_09_a17
trace_id: string // run that prompted the correction
decision_record_id: string
decision_key: string // "support.refund.execute"
override_kind: "approve" | "deny" | "modify"
override_reason_class: string // "refund_window_misjudged"
override_reason_text?: string // free text from the operator
expected_outcome: {
terminal_state?: string
field_overrides?: Record<string, unknown>
}
evidence_refs_seen: string[] // what the operator saw
evidence_refs_missing?: string[] // what they wished they had
signed_by: string // operator user_id
signed_at: string
}
```

Two design choices the schema enforces.
The override_reason_class is a typed enum, not free text. The list lives in harness/feedback/reason-classes.json and grows under change control. Free text is captured separately, in override_reason_text, where it cannot accidentally become a clustering key. The enum is what makes deduplication tractable.
The entry is signed and immutable. Once written, it cannot be edited. Mistakes get a new entry that supersedes the prior one with a supersedes: link. This matters when the audit asks “what did the operator believe at 09:31?”, and the answer needs to be the original feedback, not a later edit of it.
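The enum is also easy to enforce at the capture boundary. A minimal sketch, assuming reason-classes.json is a flat array of class names; this helper is new, not part of the capture code below:

```ts
import { readFileSync } from "node:fs"

// Assumption: harness/feedback/reason-classes.json is a flat array of strings,
// e.g. ["refund_window_misjudged", "missing_idv_check", "approval_threshold_too_low"].
const REASON_CLASSES: string[] = JSON.parse(
  readFileSync("harness/feedback/reason-classes.json", "utf8"),
)

// Reject captures whose reason class is not under change control.
export function assertKnownReasonClass(reason_class: string): void {
  if (!REASON_CLASSES.includes(reason_class)) {
    throw new Error(
      `unknown override_reason_class "${reason_class}"; add it via change control first`,
    )
  }
}
```

captureOverride below can call this before writing, so a typo in the UI never becomes a new clustering key.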
The write itself is one function:

```ts
import type { FeedbackEntry } from "./types"
import { writeFeedbackEntry, signEntry } from "@/store"
import { resolveDecisionRecord } from "@/runtime"
export async function captureOverride(
trace_id: string,
override: Omit<FeedbackEntry, "id" | "trace_id" | "decision_record_id" | "decision_key"
| "evidence_refs_seen" | "signed_at">,
): Promise<FeedbackEntry> {
const dr = await resolveDecisionRecord(trace_id)
const entry: FeedbackEntry = {
id: `fb_${new Date().toISOString().slice(0, 10).replaceAll("-", "_")}_${shortid()}`,
trace_id,
decision_record_id: dr.record_id,
decision_key: dr.decision_key,
evidence_refs_seen: dr.evidence_refs,
signed_at: new Date().toISOString(),
...override,
}
return writeFeedbackEntry(signEntry(entry))
}
```

That is the whole capture surface. Operators interact with the UI; the UI calls this function. The function refuses entries that lack a resolvable trace_id or whose signing identity is not authorized to override. The store accepts the result append-only.
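A call-site sketch, for shape only; the trace id, operator id, and field values are hypothetical:

```ts
// Hypothetical values throughout; only the shape matters.
const entry = await captureOverride("trace_9f3a", {
  override_kind: "deny",
  override_reason_class: "refund_window_misjudged",
  override_reason_text: "Order is outside the return window; the run approved the refund anyway.",
  expected_outcome: { terminal_state: "refund_denied" },
  evidence_refs_missing: ["refund_window_evidence"],
  signed_by: "op_mchen",
})
```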
Step 2 — clustering corrections into Insights
Single corrections rarely justify a harness change. Patterns of corrections do. The InsightSynthesizer’s job is to find those patterns:

```ts
import type { FeedbackEntry } from "./types"
import { listFeedbackSince, writeInsight } from "@/store"
export type InsightCluster = {
insight_id: string // ins_a17
decision_key: string
override_reason_class: string
count: number
feedback_ids: string[] // pointers back to entries
earliest: string
latest: string
representative_evidence_refs: string[]
representative_field_overrides: Record<string, unknown>
}
const MIN_CLUSTER_SIZE = 5
export async function synthesizeInsights(since: Date): Promise<InsightCluster[]> {
const entries = await listFeedbackSince(since.toISOString())
// group by (decision_key, override_reason_class)
const groups = new Map<string, FeedbackEntry[]>()
for (const e of entries) {
const k = `${e.decision_key}::${e.override_reason_class}`
if (!groups.has(k)) groups.set(k, [])
groups.get(k)!.push(e)
}
const insights: InsightCluster[] = []
for (const [key, group] of groups) {
if (group.length < MIN_CLUSTER_SIZE) continue
const [decision_key, override_reason_class] = key.split("::")
insights.push({
insight_id: `ins_${shortid()}`,
decision_key,
override_reason_class,
count: group.length,
feedback_ids: group.map((e) => e.id),
earliest: group.reduce((a, e) => e.signed_at < a ? e.signed_at : a, group[0].signed_at),
latest: group.reduce((a, e) => e.signed_at > a ? e.signed_at : a, group[0].signed_at),
representative_evidence_refs: pickModeRefs(group),
representative_field_overrides: pickModeOverrides(group),
})
}
for (const i of insights) await writeInsight(i)
return insights
}
```

The clustering predicate is intentionally narrow: same decision key, same reason class. That gives a tight cluster — high precision, low recall. Fancier embedding-based clustering can come later, but the typed enum is doing 80% of the work for free.
MIN_CLUSTER_SIZE = 5 is the only knob. Five corrections is enough to claim a pattern; below that, you are calling individual judgment calls “patterns” and proposing changes the team will reject. Tune up, never down.
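One loose end from the synthesizer above: pickModeRefs and pickModeOverrides are referenced but never shown. One plausible reading is "take the most frequent values across the cluster"; a sketch under that assumption:

```ts
import type { FeedbackEntry } from "./types"

// Most frequent evidence refs across the cluster, as a cheap representative summary.
export function pickModeRefs(group: FeedbackEntry[], top = 5): string[] {
  const counts = new Map<string, number>()
  for (const e of group) {
    for (const ref of e.evidence_refs_seen) counts.set(ref, (counts.get(ref) ?? 0) + 1)
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, top)
    .map(([ref]) => ref)
}

// Most frequent value per overridden field, same idea applied to expected_outcome.field_overrides.
export function pickModeOverrides(group: FeedbackEntry[]): Record<string, unknown> {
  const byField = new Map<string, Map<string, { value: unknown; count: number }>>()
  for (const e of group) {
    for (const [field, value] of Object.entries(e.expected_outcome.field_overrides ?? {})) {
      const key = JSON.stringify(value)
      const perField = byField.get(field) ?? new Map<string, { value: unknown; count: number }>()
      const prev = perField.get(key)
      perField.set(key, { value, count: (prev?.count ?? 0) + 1 })
      byField.set(field, perField)
    }
  }
  const out: Record<string, unknown> = {}
  for (const [field, values] of byField) {
    out[field] = [...values.values()].sort((a, b) => b.count - a.count)[0].value
  }
  return out
}
```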
Step 3 — typed proposals
An Insight is a fact about the past. A StrategyProposal is a typed candidate for changing the future. The compiler is the part of the loop where opinion enters; it is also the part the team cares about most, so it gets the most code.

```ts
import type { InsightCluster } from "./insights"
export type StrategyProposal = {
proposal_id: string // sp_b03
insight_id: string
proposal_class:
| "tighten_evidence_requirement"
| "lower_approval_threshold"
| "raise_approval_threshold"
| "adjust_intent_rubric"
| "add_redaction_rule"
| "add_policy_rule"
patch: {
target: "policy_bundle" | "context_pack" | "intent_rubric"
target_id: string // "POLICY_RETURNS_V4"
op: "add" | "modify" | "deprecate"
body: Record<string, unknown> // typed per target
}
rationale: string // human-readable, derived from cluster
expected_impact: {
runs_affected_per_day: number
expected_decision_change_rate: number // 0..1
}
status: "proposed" | "in_replay" | "ready_for_review" | "released" | "rejected"
}
export function proposeFromInsight(i: InsightCluster): StrategyProposal {
// class-specific compilers; the ladder below catches the four shapes we see most
switch (i.override_reason_class) {
case "refund_window_misjudged":
return {
proposal_id: `sp_${shortid()}`,
insight_id: i.insight_id,
proposal_class: "tighten_evidence_requirement",
patch: {
target: "policy_bundle",
target_id: "POLICY_RETURNS_V4",
op: "modify",
body: {
rule_id: "R_REFUND_REQUIRES_WINDOW_PROOF",
applies_to: { intent: "support.refund.execute" },
then: {
allow: true,
requires: ["refund_window_evidence", "order_lookup"],
},
rationale: `Refund window must be evidenced; ${i.count} prior corrections.`,
},
},
rationale: `Cluster ${i.insight_id}: ${i.count} operator overrides for misjudged refund window. Tightening evidence requirement.`,
expected_impact: {
runs_affected_per_day: estimateRuns("support.refund.execute"),
expected_decision_change_rate: 0.012,
},
status: "proposed",
}
case "missing_idv_check":
return /* analogous proposal targeting IDV evidence */ ({} as StrategyProposal)
case "approval_threshold_too_low":
return /* analogous proposal raising the gate */ ({} as StrategyProposal)
default:
throw new Error(`no proposal compiler for reason_class=${i.override_reason_class}`)
}
}
```

The proposal compiler is the layer that earns its keep when it stays small. Three rules.
A proposal class names a specific kind of harness change — tightening an evidence requirement, raising an approval threshold, adding a redaction rule. A class without a clear shape ends up as a free-form patch that the reviewer cannot evaluate.
Each compiler is a switch arm, not a chain of conditionals. Adding a new reason class is a code change with an ADR. The discipline that prevents the compiler from becoming a 600-line monolith is enforced by code review on this file specifically.
The compiler refuses to compile what it does not know. The default arm throws, surfacing a known pattern that has not yet been wired up. Better to fail loudly than to emit a malformed proposal.
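One way to keep that refusal honest is a CI check that walks every reason class under change control and reports the ones with no compiler arm, so a class added to reason-classes.json without a matching switch case fails the build instead of failing at runtime. A sketch, assuming the JSON file is a flat array of class names and using a synthetic probe cluster:

```ts
import { readFileSync } from "node:fs"
import type { InsightCluster } from "./insights"
import { proposeFromInsight } from "./proposals"

// Assumption: reason-classes.json is a flat array of class names.
const REASON_CLASSES: string[] = JSON.parse(
  readFileSync("harness/feedback/reason-classes.json", "utf8"),
)

// Classes the team has explicitly decided not to wire up yet.
const KNOWN_UNWIRED = new Set<string>([])

// Returns the reason classes that neither compile nor appear in KNOWN_UNWIRED.
// Note: compiler arms that call estimateRuns may need a stub when this runs in CI.
export function reportUnwiredReasonClasses(): string[] {
  const unwired: string[] = []
  for (const override_reason_class of REASON_CLASSES) {
    const probe: InsightCluster = {
      insight_id: "ins_probe",
      decision_key: "probe.decision",
      override_reason_class,
      count: 0,
      feedback_ids: [],
      earliest: "",
      latest: "",
      representative_evidence_refs: [],
      representative_field_overrides: {},
    }
    try {
      proposeFromInsight(probe)
    } catch {
      if (!KNOWN_UNWIRED.has(override_reason_class)) unwired.push(override_reason_class)
    }
  }
  return unwired // fail the build if this is non-empty
}
```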
Step 4 — replay against goldens
A proposal is not credible until it has run against the historical traces it claims to fix. Replay is the gate.

```ts
import type { StrategyProposal } from "./proposals"
import { listGoldensFor, applyPatch, replayDecisionRecord } from "@/runtime"
import { writeReplayVerdict } from "@/store"
export type ReplayVerdict = {
proposal_id: string
goldens_total: number
goldens_changed: number
unchanged_baseline: number
changed_to_expected: number
changed_unexpected: number
scorecards_delta: {
policy: number // mean delta across goldens
safety: number
utility: number
latency: number
cost: number
}
status: "replay_clean" | "replay_regression" | "replay_partial"
details: Array<{
trace_id: string
classification: "unchanged_baseline" | "changed_to_expected" | "changed_unexpected"
note?: string
}>
}
export async function runReplayGate(p: StrategyProposal): Promise<ReplayVerdict> {
const goldens = await listGoldensFor(p.patch.target_id)
const details: ReplayVerdict["details"] = []
const sumDelta = { policy: 0, safety: 0, utility: 0, latency: 0, cost: 0 }
for (const g of goldens) {
const baseline = await replayDecisionRecord(g.trace_id, /* current pinned tuple */)
const candidate = await replayDecisionRecord(g.trace_id, applyPatch(p.patch))
const changed = baseline.outputs_hash !== candidate.outputs_hash
let classification: ReplayVerdict["details"][number]["classification"]
if (!changed) classification = "unchanged_baseline"
else if (g.expected_post_patch_outcome &&
candidate.outputs_hash === g.expected_post_patch_outcome) {
classification = "changed_to_expected"
} else {
classification = "changed_unexpected"
}
details.push({ trace_id: g.trace_id, classification })
for (const k of Object.keys(sumDelta) as Array<keyof typeof sumDelta>) {
sumDelta[k] += candidate.scorecard.scores[k].score - baseline.scorecard.scores[k].score
}
}
const n = Math.max(1, goldens.length)
const verdict: ReplayVerdict = {
proposal_id: p.proposal_id,
goldens_total: goldens.length,
goldens_changed: details.filter((d) => d.classification !== "unchanged_baseline").length,
unchanged_baseline: details.filter((d) => d.classification === "unchanged_baseline").length,
changed_to_expected: details.filter((d) => d.classification === "changed_to_expected").length,
changed_unexpected: details.filter((d) => d.classification === "changed_unexpected").length,
scorecards_delta: {
policy: sumDelta.policy / n,
safety: sumDelta.safety / n,
utility: sumDelta.utility / n,
latency: sumDelta.latency / n,
cost: sumDelta.cost / n,
},
status:
details.some((d) => d.classification === "changed_unexpected") ? "replay_regression"
: details.every((d) => d.classification === "unchanged_baseline") ? "replay_partial"
: "replay_clean",
details,
}
await writeReplayVerdict(verdict)
return verdict
}
```

Three categories matter, and the gate distinguishes them deliberately.
unchanged_baseline — the patch did not change the verdict. Useful as a sanity check; if every golden is unchanged, the patch has no effect, which usually means the proposal is wrong.
changed_to_expected — the patch changed the verdict, and the golden’s expected_post_patch_outcome matches. This is the win case. The proposal does what it claims.
changed_unexpected — the patch changed the verdict in a way the golden did not expect. This is the danger case. Even one of these blocks the proposal until a human investigates the diff.
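The classification leans on one property of the golden set: goldens curated from the traces a proposal claims to fix carry an expected outcome hash, and everything else does not. listGoldensFor's return shape is not spelled out in this post; a sketch under that assumption:

```ts
// Assumed shape of what listGoldensFor(target_id) returns.
export type GoldenTrace = {
  trace_id: string
  target_id: string // the bundle, pack, or rubric this golden guards, e.g. "POLICY_RETURNS_V4"
  // Present only on goldens curated from the traces the proposal claims to fix:
  // the outputs_hash the candidate replay must reproduce to count as changed_to_expected.
  expected_post_patch_outcome?: string
}
```

Every golden without that field is pure regression surface: if the patch changes it at all, the run is classified changed_unexpected and the proposal is blocked.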
The output is a typed ReplayVerdict that the release gate reads.
Step 5 — release as StrategyRule
If the replay verdict is replay_clean and the scorecard delta passes the release gate from Wiring the Five Evaluators, the proposal becomes a versioned StrategyRule.

```ts
import type { StrategyProposal } from "./proposals"
import type { ReplayVerdict } from "./replay-gate"
import { writeStrategyRule, applyPatchToBundle } from "@/store"
export type StrategyRule = {
rule_id: string // str_b03
proposal_id: string
patch: StrategyProposal["patch"]
released_at: string
released_by: string
pinned_baseline: string // prior bundle version, kept for replay
effective_window: { from: string; to?: string }
status: "active" | "deprecated"
lineage: {
feedback_ids: string[] // every feedback that contributed
insight_id: string
replay_verdict_id: string
}
}
export async function releaseProposal(
p: StrategyProposal,
v: ReplayVerdict,
approver_id: string,
): Promise<StrategyRule> {
if (v.status !== "replay_clean") {
throw new Error(`refusing to release ${p.proposal_id}: replay status ${v.status}`)
}
const prior = await applyPatchToBundle(p.patch) // returns prior bundle version
return writeStrategyRule({
rule_id: `str_${shortid()}`,
proposal_id: p.proposal_id,
patch: p.patch,
released_at: new Date().toISOString(),
released_by: approver_id,
pinned_baseline: prior,
effective_window: { from: new Date().toISOString() },
status: "active",
lineage: {
feedback_ids: /* from the insight */ [],
insight_id: /* upstream insight id */ "",
replay_verdict_id: /* verdict id */ "",
},
})
}
```

Two properties of the StrategyRule envelope earn their keep.
It pins the prior bundle version in pinned_baseline. This is what makes rollback meaningful: if the new rule misbehaves in production, the kill switch from Pack Rollout in Five Stages reverts to that exact prior version. Replay against pre-release trace_ids reproduces the prior DecisionRecord byte-for-byte.
It carries full lineage — every contributing feedback id, the insight id, the replay verdict id. When a regulator asks “why did this rule exist?”, the answer is a query: walk the lineage back through the experience store and produce the operator corrections that justified it. The audit story is built, not assembled.
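Rollback is then mechanical. A sketch, assuming restoreBundleVersion and deprecateStrategyRule helpers that this post does not define:

```ts
import type { StrategyRule } from "./release"
import { restoreBundleVersion, deprecateStrategyRule } from "@/store" // assumed helpers

// Kill switch: re-pin the exact prior bundle version, then retire the rule.
export async function rollbackRule(rule: StrategyRule, reason: string): Promise<void> {
  await restoreBundleVersion(rule.patch.target_id, rule.pinned_baseline)
  await deprecateStrategyRule(rule.rule_id, {
    reason,
    effective_to: new Date().toISOString(), // closes the rule's effective_window
  })
}
```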
Release-readiness checklist
| Check | Required before release |
|---|---|
| Feedback | Every correction has trace_id, decision_record_id, reason class, expected outcome, and signer. |
| Insight | Cluster size, time window, representative evidence, and affected decision key are explicit. |
| Proposal | Patch target, operation, expected impact, and rollback target are typed. |
| Replay | Candidate changes expected traces and does not introduce hard-gate scorecard regressions. |
| Reviewers | Compliance, reliability, and security reviewer verdicts are attached to the proposal. |
| Release | StrategyRule records approver, effective window, baseline version, and full lineage. |
What this gets you, on day 30
Three things that change the moment the loop is wired and a few cycles complete.
Operator effort compounds. A correction that takes 30 seconds to capture becomes a permanent change to harness behavior. The same operator does not see the same misjudgment a month later, because the rule already shipped. The effort the team sinks into prompt-edits-for-prompt-edits drops to zero, in a good way.
The harness has a memory the team can navigate. Every active rule traces back to the corrections that produced it. New engineers do not ask “why is this policy here?”; they query the lineage. The harness becomes a self-documenting artifact.
Regressions become a different shape. A regression on a StrategyRule produces an attributable trail: which proposal, which replay verdict, which approver, which feedback cluster. The on-call engineer asks “did we recently release a rule on this code path?” and gets a precise answer in one query.
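"One query" is literal because the lineage is typed. A sketch of the on-call lookup, assuming listActiveRulesFor and getFeedbackEntries store helpers that are not shown above:

```ts
import type { StrategyRule } from "./release"
import type { FeedbackEntry } from "./types"
import { listActiveRulesFor, getFeedbackEntries } from "@/store" // assumed helpers

// "Did we recently release a rule on this code path?" Returns the rules, plus the
// operator corrections that justified each one.
export async function explainRecentRules(
  decision_key: string,
  since: Date,
): Promise<Array<{ rule: StrategyRule; corrections: FeedbackEntry[] }>> {
  const rules = await listActiveRulesFor(decision_key)
  const recent = rules.filter((r) => r.released_at >= since.toISOString())
  return Promise.all(
    recent.map(async (rule) => ({
      rule,
      corrections: await getFeedbackEntries(rule.lineage.feedback_ids),
    })),
  )
}
```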
The loop has six files, including types: feedback/write.ts, feedback/insights.ts, feedback/proposals.ts, feedback/replay-gate.ts, feedback/release.ts, plus the schemas. The reviewer-agent layer (compliance, security, reliability) sits between Step 4 and Step 5, with each reviewer producing a typed envelope as documented in Building a Compliance Reviewer Agent.
That is the whole loop. Each artifact small and named. Each transition typed and replayable. The harness improves under change control; nothing changes silently.
Wire it up after the scorecard. Pick the most-overridden decision key in your fleet; that is your highest-leverage starting point. The first cycle takes a few weeks; the second one runs while you sleep.