Improvement Loop
Trust-plane primitives that turn failed and corrected runs into typed proposals — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune.
Six primitives that turn observed behavior into typed, release-gated proposals — never auto-applied.
Inputs
- Trace bundles + DecisionRecords from Observability
- Replayable environment episodes and quality replay cases
- Scorecards from the Evaluation Engine
- Operator corrections and approval-gate verdicts
- Memory promotion records (especially correction-class)
- Open loops + queue state from the operator surface
- Post-release scorecard deltas, rollback triggers, and regression reports
Outputs
- Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
- Replay obligations, source-closure verdicts, and evidence-bound benchmark candidates
- Bundle of proposals for change-control review
- Audit records bound to trace_id and decision_record_id
- Monitor verdicts that keep, roll back, supersede, or retire released proposals
Stages
- observe
- propose
- lint
- review
- gate
- release
- monitor
Record types
- Insight
- StrategyRule
- FeedbackRecord
- Note
- ResearchTask
- TuningProposal
The Improvement Loop is the Trust-plane discipline that turns observed runtime behavior into governed change proposals. It is what closes the loop between evaluation and the next pack version.
Definition
A coordinated set of typed primitives that consume traces, decision records, operator corrections, and approval-gate outcomes; surface recurring patterns; convert them into reusable strategy rules; queue autonomous research where the gap is knowledge; and generate proactive operational notes — all as proposals that land under the same change control as packs and policies.
Why it exists
LLM-driven systems drift. Without a closed loop that turns failures into typed proposals, every regression is rediscovered the hard way and every operator correction evaporates after the conversation ends. This plane formalizes the loop: every signal is captured, scored, and routed to the primitive that owns its kind of improvement.
The same discipline applies after release. A shipped proposal is not “done” until production monitoring shows that the intended scorecard improved without a hidden policy, safety, latency, cost, or trust regression.
How it works
- Observe — every completed run produces a RunScore, a DecisionRecord, and a trace bundle.
- Capture corrections — every operator override or correction enters the FeedbackStore with provenance.
- Pin source evidence — if the signal came from a replay suite, task fixture, or bounded environment episode, the loop records the episode id, replay case id, baseline tuple, required pack, and stable state signature before mutation.
- Classify — the signal is routed as an evidence gap, strategy gap, policy gap, workflow gap, scorecard regression, environment/task failure, or operational bottleneck before anyone reaches for a prompt edit.
- Synthesize — the InsightSynthesizer scans episodic memory and the trace store for recurring patterns (failure clusters, blocked decisions, common detours).
- Compile strategies — the StrategyCompiler converts feedback + insights into reusable strategy rules adjusting prompts, plans, or tool selection.
- Queue research — when a pattern signals a knowledge gap (not a strategy gap), the ResearchQueue enqueues an autonomous research task.
- Surface operational notes — the ChiefOfStaff scans due tasks, open loops, and queue backlog to propose proactive operator notes.
- Tune — Autotune proposes prompt, retrieval, or budget changes against a target metric.
- Gate and release — every proposal lands in change control; nothing auto-applies without a human review (except where policy explicitly permits auto-promotion of a specific class).
- Close source pressure — the replay case, benchmark candidate, environment episode group, or correction cluster is marked fixed, still_failing, invalid, or superseded with evidence.
- Monitor — released proposals are compared against production scorecards, correction clusters, approval latency, and rollback triggers until the operator can keep, supersede, retire, or roll back the change with evidence.
Signal routing
The Improvement Loop does not assume that every failure is a prompt problem. Classification decides which surface owns the fix.
| Runtime signal | Route first | Typical proposal |
|---|---|---|
| Missing fact | Context Pack evidence source or ResearchQueue | ResearchTask or knowledge patch for review |
| Wrong tool choice | Tool catalog, planner rule, or strategy layer | StrategyRule against tool_selection or planner |
| Bad policy behavior | Governance policy and approval boundary | Policy change proposal with replay evidence |
| Confusing user output | Prompt examples, receipt language, or response rubric | StrategyRule against prompt plus scorecard update |
| Repeated escalation | Authority boundary, workflow design, or queue owner | Insight plus operational Note or governance proposal |
| Slow run | Retrieval path, tool path, or context shape | TuningProposal with latency guardrails |
| Expensive run | Budget allocation and context size | TuningProposal with economics target |
| Recurring operator correction | FeedbackStore and StrategyCompiler | FeedbackRecord leading to StrategyRule |
| User distrust | Receipts, approval copy, and explanation quality | Insight plus product-surface or rubric proposal |
| New risk pattern | Release gate, rollback trigger, or refusal policy | Policy gate proposal with production monitor criteria |
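The routing table can be read as a lookup keyed by the classified signal. The sketch below is illustrative only: `Route`, `ROUTES`, `route_signal`, and the snake_case signal keys are hypothetical names, and only four rows of the table are copied in.

```python
from typing import NamedTuple


class Route(NamedTuple):
    owner: str     # surface that owns the fix
    proposal: str  # typical typed proposal


# Hypothetical signal keys; owners and proposals copy four rows of the table.
ROUTES = {
    "missing_fact": Route("Context Pack evidence source or ResearchQueue", "ResearchTask"),
    "wrong_tool_choice": Route("Tool catalog, planner rule, or strategy layer", "StrategyRule"),
    "slow_run": Route("Retrieval path, tool path, or context shape", "TuningProposal"),
    "recurring_operator_correction": Route("FeedbackStore and StrategyCompiler", "FeedbackRecord"),
}


def route_signal(signal: str) -> Route:
    """Return the owning surface for an already classified signal."""
    if signal not in ROUTES:
        # Unknown signals are surfaced, not silently treated as prompt problems.
        raise ValueError(f"unclassified signal: {signal!r}")
    return ROUTES[signal]


print(route_signal("slow_run").proposal)  # TuningProposal
```

Raising on an unknown key keeps the default path from collapsing into "edit the prompt", which is the failure this section warns against.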
Environment-backed evidence
The SecondBrain improvement-loop runbook makes one operational point explicit: the best improvement signal is not a complaint, a dashboard screenshot, or a prompt diff. It is a replayable episode. A failed task should be captured with reset inputs, structured actions, observations, reward components, terminal flags, and a stable state signature so the same failure can be rerun before and after a candidate change.
ContextOS treats this as a harness-side evidence pattern, not as a new production tool plane. Production tools still execute through the Tool Gateway. Replay environments and task fixtures live in the measurement plane: they preserve trajectory and reward evidence so the Evaluation Engine can decide whether a proposal fixed the source pressure.
| SecondBrain pattern | ContextOS interpretation |
|---|---|
| Environment episode | A bounded replay fixture under harness/fixtures or harness/evals, with reset state, actions, observations, reward components, terminal status, and a stable state signature. |
| Quality replay case | Executable evidence derived from a failed run, environment episode, or operator correction. It must be rerun before promotion. |
| Benchmark candidate | A pressure source with source_type, source_ref, expected properties, severity, and resolution status. It is not trusted until it has executable expectations. |
| Target-bound pack | The required replay pack follows the source evidence. A hard failure cannot be closed by passing an unrelated smoke set. |
| Scientific promotion gate | Baseline and candidate are compared on paired cases; kept-but-underpowered runs remain review candidates, not closed improvements. |
| Source closure | After promotion, the original pressure is explicitly marked fixed, still_failing, invalid, or superseded; failed attempts add recurrence metadata instead of disappearing. |
This turns the loop into a falsifiable control system:
source pressure
-> replayable evidence
-> target-bound baseline measurement
-> bounded candidate mutation
-> treatment measurement on the same evidence
-> promotion gate
-> source closure
-> production monitor window
The invariant is simple: if a source pressure cannot be replayed or otherwise converted into executable expectations, the loop may record an Insight or ResearchTask, but it cannot claim a closed improvement.
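One way to picture the promotion gate on paired cases is an exact one-sided sign test. The source does not name a statistical test, so the sign test, the `alpha` threshold, and the function names here are assumptions chosen for a short, dependency-free sketch.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """One-sided exact sign test: P(X >= wins) under p = 0.5, ties dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def promotion_gate(baseline: dict, candidate: dict, alpha: float = 0.05) -> str:
    """Compare per-case scores (case_id -> score, higher is better)."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    wins = sum(candidate[c] > baseline[c] for c in baseline)
    losses = sum(candidate[c] < baseline[c] for c in baseline)
    if wins <= losses:
        return "reject"
    # Better on balance but too few pairs to be convincing: stays a review candidate.
    return "promote" if sign_test_p(wins, losses) <= alpha else "review_candidate"
```

With three paired cases, even a clean sweep has a p-value of 0.125, so the run is kept for review and does not count as a closed improvement.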
The portable contract for this evidence is the ReplayPacket. A quality replay case, environment episode, operator correction, or golden case may each produce a packet, but all packets must name the source pressure, pinned context pack, compiled-context hash, side-effect policy, and expected outcome before the loop can claim closure.
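A minimal sketch of that contract as a completeness check, assuming hypothetical field names (the text lists what a packet must name, not how the fields are spelled) and a hypothetical `close_source` helper:

```python
from dataclasses import dataclass, fields
from typing import List, Optional

CLOSURE_STATUSES = {"fixed", "still_failing", "invalid", "superseded"}


@dataclass
class ReplayPacket:
    # Illustrative field names for the five items a packet must name.
    source_pressure: Optional[str] = None
    pinned_context_pack: Optional[str] = None
    compiled_context_hash: Optional[str] = None
    side_effect_policy: Optional[str] = None
    expected_outcome: Optional[str] = None


def missing_fields(packet: ReplayPacket) -> List[str]:
    """List the required fields that are still unset or empty."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name)]


def close_source(packet: ReplayPacket, status: str) -> str:
    """Accept a closure status only for a complete packet."""
    if status not in CLOSURE_STATUSES:
        raise ValueError(f"unknown closure status: {status}")
    gaps = missing_fields(packet)
    if gaps:
        raise ValueError(f"cannot claim closure, packet is missing: {gaps}")
    return status
```

An incomplete packet can still back an Insight or ResearchTask; it just cannot carry a closure verdict.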
The six primitives
InsightSynthesizer
Scans episodic memory and the trace store for recurring patterns and emits typed Insight records.
{
"insight_id": "ins_2026_05_04_a17",
"kind": "failure_cluster",
"intent": "support.refund",
"pattern": "policy.eval denial after order_lookup when refund_amount > 800 INR",
"occurrences": 23,
"first_seen": "2026-04-21T09:14:00Z",
"last_seen": "2026-05-04T08:01:00Z",
"evidence_refs": ["dr_2026_04_21_x12", "dr_2026_05_03_q88"],
"confidence": 0.91,
"status": "proposed"
}
Insight kinds: failure_cluster, gap_detected (recurring evidence misses), blocked_decision, common_detour, over-escalation.
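As a rough illustration of pattern detection with a minimum-occurrence threshold, the sketch below groups decision records by intent and pattern. The input record shape (`id`, `intent`, `pattern`), the `synthesize` name, and the default threshold of 3 are assumptions; the emitted fields are a subset of the Insight example above.

```python
from collections import defaultdict


def synthesize(records, min_occurrences=3):
    """Group records by (intent, pattern) and emit one Insight per recurring group."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["intent"], record["pattern"])].append(record["id"])

    insights = []
    for (intent, pattern), refs in groups.items():
        if len(refs) < min_occurrences:
            continue  # below the threshold: treated as noise, not as a pattern
        insights.append({
            "kind": "failure_cluster",
            "intent": intent,
            "pattern": pattern,
            "occurrences": len(refs),
            "evidence_refs": refs,
            "status": "proposed",
        })
    return insights
```

Every emitted Insight starts in `proposed`; nothing in this step changes runtime behavior.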
StrategyCompiler
Converts validated feedback + insights into reusable strategy rules the runtime can apply at the right layer.
{
"strategy_rule_id": "str_2026_05_04_b03",
"applies_to": { "intent": "support.refund" },
"trigger": { "from_insight": "ins_2026_05_04_a17" },
"adjustment": {
"layer": "planner",
"type": "tool_preference",
"value": { "prefer_tool": "adp_policy.eval_with_promotion", "over": "adp_policy.eval" }
},
"release_gate_target": "support.refund",
"status": "proposed"
}
Adjustment layers: prompt, planner, retrieval, tool_selection, memory_recall, budget_allocation. Status flow: proposed → reviewed → approved → released → retired.
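A lint pass over a candidate rule might look like the sketch below. The conflict definition used here (same `applies_to`, same layer and type, different value) is an assumption made for illustration, as is the `lint_rule` name.

```python
ADJUSTMENT_LAYERS = {
    "prompt", "planner", "retrieval",
    "tool_selection", "memory_recall", "budget_allocation",
}


def lint_rule(candidate: dict, active_rules: list) -> list:
    """Return a list of problems; an empty list means the rule passes lint."""
    problems = []
    adj = candidate["adjustment"]
    if adj["layer"] not in ADJUSTMENT_LAYERS:
        problems.append(f"unknown layer: {adj['layer']}")
    for rule in active_rules:
        other = rule["adjustment"]
        same_target = (
            rule["applies_to"] == candidate["applies_to"]
            and other["layer"] == adj["layer"]
            and other["type"] == adj["type"]
        )
        # Assumed conflict rule: same target, different value.
        if same_target and other["value"] != adj["value"]:
            problems.append(f"conflicts with {rule['strategy_rule_id']}")
    return problems
```

Returning problems as data lets the review step show every issue at once instead of stopping at the first one.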
FeedbackStore
Captures every operator correction and tip with provenance. Cited by future improvements; auditable.
{
"feedback_id": "fb_2026_05_04_c19",
"kind": "correction",
"context": { "decision_record_id": "dr_2026_05_04_a17", "intent": "support.refund" },
"operator": "user_finance_lead_77",
"correction": "Refund eligibility should consider prior corrections within 90 days",
"applied_to_record": "dr_2026_05_04_a17",
"evidence_refs": ["dr_2026_05_04_a17"],
"captured_at": "2026-05-04T09:32:00Z"
}
Feedback kinds: correction, tip, escalation_rationale, gate_rationale. Every correction emits a learning signal to the InsightSynthesizer and writes a correction-class entry into Memory.
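A minimal in-memory sketch of an append-only store that refuses records without provenance and forwards corrections as a learning signal. The class layout, the `on_correction` callback, and the choice of which fields count as provenance are assumptions; a real store would be durable.

```python
class FeedbackStore:
    KINDS = {"correction", "tip", "escalation_rationale", "gate_rationale"}
    # Assumed provenance fields, taken from the example record above.
    PROVENANCE = ("operator", "evidence_refs", "captured_at")

    def __init__(self, on_correction=None):
        self._records = []                   # append-only: no update or delete
        self._on_correction = on_correction  # e.g. a hook into the InsightSynthesizer

    def append(self, record: dict) -> None:
        if record.get("kind") not in self.KINDS:
            raise ValueError(f"unknown feedback kind: {record.get('kind')!r}")
        missing = [key for key in self.PROVENANCE if not record.get(key)]
        if missing:
            raise ValueError(f"missing provenance: {missing}")
        self._records.append(dict(record))   # store a copy, not the caller's dict
        if record["kind"] == "correction" and self._on_correction:
            self._on_correction(record)

    def records(self) -> tuple:
        """Read-only view of everything captured so far."""
        return tuple(self._records)
```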
ChiefOfStaff
Scans due tasks, open loops, queue backlog, and approval-gate latency to propose proactive operational notes.
{
"note_id": "cs_2026_05_04_d04",
"kind": "open_loop_aging",
"subject": "GATE_FINANCE_APPROVAL pending > 24h on session sess_42f1",
"recommended_action": "page on-call finance approver or auto-deny per policy",
"evidence_refs": ["session:sess_42f1", "audit:audit_inv_22_GATE_FINANCE_APPROVAL"],
"status": "open"
}
Note kinds: open_loop_aging, queue_backlog, repeated_escalation, eval_target_drift, recurring_correction. Output goes to the operator surface; never to the planner directly.
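The open_loop_aging case can be sketched as a scan over pending approval gates. The input shape (`gate`, `session_id`, `pending_since`) and the `open_loop_notes` helper are hypothetical; the 24-hour default mirrors the example note above.

```python
from datetime import datetime, timedelta, timezone


def open_loop_notes(pending_gates, now, max_age=timedelta(hours=24)):
    """Emit one open_loop_aging note per gate pending longer than max_age."""
    hours = int(max_age.total_seconds() // 3600)
    notes = []
    for gate in pending_gates:
        if now - gate["pending_since"] > max_age:
            notes.append({
                "kind": "open_loop_aging",
                "subject": f"{gate['gate']} pending > {hours}h on session {gate['session_id']}",
                "evidence_refs": [f"session:{gate['session_id']}"],
                "status": "open",
            })
    return notes


now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
gates = [
    {"gate": "GATE_FINANCE_APPROVAL", "session_id": "sess_42f1",
     "pending_since": now - timedelta(hours=30)},
]
print(open_loop_notes(gates, now)[0]["subject"])
```

The function returns notes as data for the operator surface and has no path into the planner.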
ResearchQueue
When an insight indicates a knowledge gap rather than a strategy gap, an autonomous research task is queued. Output is a typed Knowledge Patch the operator can review and promote.
{
"research_task_id": "rq_2026_05_04_e02",
"trigger_insight_id": "ins_2026_05_04_a17",
"question": "What are the supplier-side rules that override our default 90-day refund window?",
"scope": { "domain": "support.refund", "max_external_calls": 0 },
"status": "queued",
"produces": "knowledge_patch"
}
Research tasks are budgeted (no live tool calls beyond the explicit scope); produced patches are evidence-bound knowledge-graph deltas, not free-form text.
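Scope enforcement could be sketched as a wrapper that counts external calls against `scope.max_external_calls` and refuses the call once the budget is spent. `ScopedRunner`, `ScopeExceeded`, and the shape of the recorded security event are hypothetical.

```python
class ScopeExceeded(Exception):
    """Raised when a research task tries to go past its declared scope."""


class ScopedRunner:
    def __init__(self, task: dict):
        self.task = task
        self.external_calls = 0
        self.security_events = []

    def external_call(self, fn, *args):
        limit = self.task["scope"]["max_external_calls"]
        if self.external_calls >= limit:
            # Refuse before executing and leave an auditable trail.
            self.security_events.append({
                "research_task_id": self.task["research_task_id"],
                "event": "scope_exceeded",
                "limit": limit,
            })
            raise ScopeExceeded(f"external call budget of {limit} exhausted")
        self.external_calls += 1
        return fn(*args)
```

With `max_external_calls` set to 0, as in the example task, the first external call is refused.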
Autotune
Proposes prompt, retrieval, or budget changes against a declared scorecard target. Bounded by the same release-gate process as a pack change.
{
"tuning_proposal_id": "at_2026_05_04_f01",
"target": { "intent": "support.refund", "metric": "economics_cents_per_decision", "direction": "decrease", "guardrails": ["safety>=1.0", "policy>=1.0"] },
"candidate_change": {
"layer": "retrieval",
"diff": { "max_hops": { "from": 3, "to": 2 }, "top_k": { "from": 10, "to": 6 } }
},
"expected_delta": { "economics_cents_per_decision": -0.18, "utility": -0.01 },
"evaluation_run_id": "eval_2026_05_04_a01",
"status": "proposed"
}
Autotune never auto-applies; every proposal goes through the Evaluation Engine on a golden replay before review.
Before an autotune run starts, it must declare:
| Required field | Purpose |
|---|---|
| target.intent | Keeps the search local to one business workflow. |
| target.metric | Names the single primary metric being improved. |
| guardrails | Floors for Policy and Safety, plus accepted Utility / Latency / Economics deltas. |
| baseline_tuple | Pins the pack, policy, tool manifest, evaluator suite, and model profile being compared. |
| tunable_surfaces | Enumerates exactly which fields the optimizer may change. |
| search_set / heldout_test_set | Prevents the proposer from iterating on the release gate data. |
| rollback_target | Names the prior released tuple before any rollout can start. |
Autotune candidates are ranked on the evaluator vector, not a single blended score. Policy and Safety are floor constraints; surviving candidates are compared by target-metric improvement, blast radius, and explainability. The output is a TuningProposal, not a production patch.
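A sketch of floor-then-rank selection, assuming guardrails arrive as `metric>=value` strings as in the example proposal and that each candidate carries a hypothetical `scores` dict. Only target-metric improvement is ranked here; blast radius and explainability are left to review.

```python
def parse_floor(guardrail: str):
    """Split a 'metric>=value' guardrail into (metric, floor)."""
    metric, value = guardrail.split(">=")
    return metric.strip(), float(value)


def rank_candidates(candidates, baseline, target):
    """Drop candidates below any floor, then sort by target-metric improvement."""
    floors = dict(parse_floor(g) for g in target["guardrails"])
    sign = -1.0 if target["direction"] == "decrease" else 1.0
    metric = target["metric"]

    survivors = [
        c for c in candidates
        if all(c["scores"][m] >= floor for m, floor in floors.items())
    ]

    def improvement(candidate):
        # Positive means the candidate moved the metric in the declared direction.
        return sign * (candidate["scores"][metric] - baseline[metric])

    return sorted(survivors, key=improvement, reverse=True)
```

Floors are applied as a filter before ranking, so a cheaper candidate that dips below the safety floor never competes on cost.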
Proposal lifecycle
All six primitives produce typed records. The lifecycle is uniform:
proposed → reviewed → approved → released → (retired | superseded)
↘ rejected
- proposed — primitive emitted the record; no human has looked at it.
- reviewed — operator triaged with a verdict and rationale.
- approved — change-control approver signed the promotion.
- released — applied to the relevant pack / policy / catalog version; pinned for replay.
- rejected / superseded / retired — terminal states with rationale recorded.
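The lifecycle can be sketched as a small transition table. Two assumptions are baked in: the text does not pin down where the rejected branch starts, so this sketch places it at review, and a rationale is required only for the states the list above says carry one.

```python
TRANSITIONS = {
    "proposed": {"reviewed"},
    "reviewed": {"approved", "rejected"},  # assumption: rejection happens at review
    "approved": {"released"},
    "released": {"retired", "superseded"},
    "rejected": set(),
    "retired": set(),
    "superseded": set(),
}
NEEDS_RATIONALE = {"reviewed", "rejected", "retired", "superseded"}


def transition(record: dict, new_status: str, rationale: str = "") -> dict:
    """Return a new record in new_status, or raise if the move is not allowed."""
    current = record["status"]
    if new_status not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    if new_status in NEEDS_RATIONALE and not rationale:
        raise ValueError(f"{new_status} requires a rationale")
    step = {"from": current, "to": new_status, "rationale": rationale}
    # Copy instead of mutating so earlier versions stay intact for audit.
    return {**record, "status": new_status, "history": record.get("history", []) + [step]}
```

There is no edge from proposed to released, so a primitive cannot skip review by construction.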
Interfaces
Inputs
- Trace bundles + Decision Records from Observability
- Scorecards from the Evaluation Engine
- Operator corrections and approval-gate verdicts
- Memory promotion records (especially correction-class)
- Open loops and queue state from the operator surface
Outputs
- Typed Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
- Bundle of proposals for change-control review
- Audit records bound to trace_id and decision_record_id
- Monitor verdicts bound to released proposal id and baseline tuple
Failure modes
- Primitive auto-applies a proposal without review — invariant violation; release gate must always require an approver.
- InsightSynthesizer emits noise (low-precision patterns) — mitigated by minimum-occurrence thresholds and operator-tuned filters.
- StrategyCompiler produces conflicting rules across primitives — release-gate lint detects conflicts with active rules.
- FeedbackStore collects corrections that are themselves wrong — every correction is auditable and reversible by a later correction with higher precedence.
- ChiefOfStaff floods the operator surface — note count budgeted per operator per day; aggregations preferred to per-note alerts.
- ResearchQueue task exceeds its declared scope — refused at execution; emits a security event.
- Autotune regresses a guardrail metric — release-gate blocks the proposal.
- Runtime-affecting proposal is treated as finished after release — regression can persist silently; every runtime-affecting release needs a monitor window and rollback target.
Operational concerns
- Sampling stratified by intent and risk class so rare-but-high-value insights surface.
- Per-primitive cost budgets separate from production Run Budget.
- Quarterly rebaselining: prune retired/superseded records; refresh thresholds and rubrics.
- All proposals discoverable from the operator surface; nothing happens out-of-band.
- Append-only proposal store with full provenance; replay must reproduce the verdict.
- Recently released runtime-affecting proposals enter a post-release monitor window. The window compares production scorecard slices against the baseline tuple and records one of four verdicts: keep, roll back, supersede, or retire.
- New and recently modified workflows should have a regular operations review cadence. The review must decide which repeated corrections become proposals, which approval delays need owner changes, which tests are missing, and whether rollout should advance, pause, or roll back.
- Operating reviews produce artifacts, not meeting notes: accepted or rejected proposals, new replay cases, assigned owners, and explicit rollout decisions.
- The loop should satisfy lifecycle-monitoring and continual-improvement expectations, but the spec remains change-control first: standards alignment never replaces traces, scorecards, or rollback evidence.
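The post-release comparison against the baseline tuple might be sketched as below. This is a deliberate simplification under stated assumptions: one tolerance is shared by all metrics, metric names such as `latency_ms` are hypothetical, and the helper only suggests keep or roll back, leaving supersede and retire to the operator.

```python
def regressions(baseline, production, lower_is_better=frozenset(), tolerance=0.0):
    """Map each regressed metric to its production-minus-baseline delta."""
    regressed = {}
    for metric, base in baseline.items():
        delta = production[metric] - base
        if metric in lower_is_better:
            worse = delta > tolerance    # e.g. latency or cost went up
        else:
            worse = delta < -tolerance   # e.g. utility or safety went down
        if worse:
            regressed[metric] = delta
    return regressed


def suggest_verdict(baseline, production, **kwargs):
    """Suggest 'roll back' on any regression, otherwise 'keep'."""
    return "roll back" if regressions(baseline, production, **kwargs) else "keep"
```

The output is a suggestion with the regressed metrics attached as evidence; the recorded verdict still belongs to the operator.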
Evaluation metrics
- Adoption rate — fraction of proposals that ship after review (target band: 30–60% — too low means noise, too high means rubber-stamping).
- Detection lead time — regression introduced → proposal emitted.
- Correction-to-proposal latency — operator correction → derived StrategyRule.
- Time to close repeated issue — first repeated signal → released and monitored fix with no active rollback trigger.
- Post-release rollback rate — fraction of released proposals rolled back or superseded during the monitor window.
- Retire rate — fraction of released proposals retired within 90 days (high retire rate signals over-aggressive shipping).
- Notes acted on — fraction of Chief-of-Staff notes the operator acts on (low rate signals notification fatigue).
- Research patch yield — fraction of research tasks that produce a knowledge patch the operator promotes.
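As a worked example of the first metric, the sketch below computes adoption rate and checks it against the 30–60% band. It assumes the denominator is proposals with a final review outcome (shipped or rejected) and that retired or superseded proposals count as having shipped; the function names are hypothetical.

```python
SHIPPED = {"released", "retired", "superseded"}  # assumed: each of these was released at some point


def adoption_rate(proposals):
    """Fraction of decided proposals that shipped, or None if none are decided."""
    decided = [p for p in proposals if p["status"] in SHIPPED or p["status"] == "rejected"]
    if not decided:
        return None
    return sum(p["status"] in SHIPPED for p in decided) / len(decided)


def adoption_signal(rate, low=0.30, high=0.60):
    """Interpret the rate against the target band."""
    if rate < low:
        return "too low: likely noise"
    if rate > high:
        return "too high: likely rubber-stamping"
    return "in band"
```

For example, 4 released and 6 rejected proposals give a rate of 0.4, which sits inside the band.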
Example
A complete proposal chain spawned by a single failed run:
Run sess_42f1 / dr_2026_05_04_a17 (status=ESCALATED, policy=1.0 but utility=0.62)
→ InsightSynthesizer detects 23 similar escalations
→ emits Insight ins_2026_05_04_a17
→ StrategyCompiler proposes StrategyRule str_2026_05_04_b03 (planner tool preference)
→ ResearchQueue enqueues rq_2026_05_04_e02 (supplier rule clarification)
→ ChiefOfStaff emits Note cs_2026_05_04_d04 (high open-loop count on this gate)
→ Autotune proposes at_2026_05_04_f01 (reduce hop budget on read path)
→ all six records land in change control with the same trace_id
The harness as a search target
The Improvement Loop is the entry point for an idea ContextOS treats as foundational: the harness itself is a versioned artifact that can be improved by search, not just by hand. Recent work — most directly the Stanford / MIT / KRAFTON Meta-Harness paper (Lee et al., 2026) — shows that an outer-loop optimizer with full access to prior code, scores, and execution traces can discover better harnesses than hand-engineering, on tasks ranging from text classification to long-horizon agentic coding (TerminalBench-2). The result generalizes the principle behind every primitive in this doc.
ContextOS does not ship an autonomous outer-loop optimizer in the spec. It ships the substrate one needs:
| Requirement | ContextOS primitive |
|---|---|
| Experience store of prior runs (code + scores + traces) | OTEL trace bundles + DecisionRecords + pinned pack/policy/tool versions |
| Multi-objective scoring (Pareto, not scalar) | Five evaluators (Policy / Utility / Latency / Safety / Economics) |
| Lightweight validation before expensive evaluation | Interface tests on packs, policies, and tools before any golden-set replay |
| Disjoint search-set and test-set | Golden sets and held-out replay sets; the search-set is what the proposer sees, the test-set is what gates release |
| Filesystem-shaped layout that grep + cat can navigate | The harness/ repo layout in Harness Engineering |
| Initialization from strong baselines | Existing pack versions are the search prior; proposals start from a published baseline, not a blank slate |
Patterns the substrate enables
Pareto-frontier proposals. Autotune already accepts a target with direction and guardrails. The Pareto framing is wider: the same proposal can be evaluated against multiple operating points (e.g., low-context-cost variant vs. high-utility variant), and the operator picks a frontier point rather than a single scalar winner. This is how Meta-Harness produces a family of harnesses that trade context against accuracy.
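The frontier idea can be made concrete with a standard non-dominated filter. The metric names and candidate values below are made up for illustration, and `pareto_frontier` is a hypothetical helper, not part of the spec.

```python
def dominates(a, b, maximize, minimize):
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (all(a[m] >= b[m] for m in maximize)
                and all(a[m] <= b[m] for m in minimize))
    better = (any(a[m] > b[m] for m in maximize)
              or any(a[m] < b[m] for m in minimize))
    return no_worse and better


def pareto_frontier(candidates, maximize, minimize):
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c, maximize, minimize) for other in candidates)
    ]


candidates = [
    {"name": "high_utility", "utility": 0.9, "cost": 10.0},
    {"name": "low_cost", "utility": 0.8, "cost": 6.0},
    {"name": "dominated", "utility": 0.7, "cost": 8.0},
]
frontier = pareto_frontier(candidates, maximize=["utility"], minimize=["cost"])
print([c["name"] for c in frontier])  # ['high_utility', 'low_cost']
```

The operator then chooses between the two surviving points; the filter itself never collapses them into one score.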
Causal reasoning over prior failures. When the InsightSynthesizer flags a regression, the proposer (human or automated) reads not the summary but the raw traces and code of the failing candidates. The Meta-Harness ablation makes this stark: scores-only proposers reach 34.6 median accuracy; scores-plus-summary reaches 34.9; full execution traces reach 50.0. Compressed feedback loses the diagnostic signal needed to identify confounds. ContextOS retains full traces by default for exactly this reason — replay against pinned snapshots is the contract that keeps prior experience interpretable.
Additive-change preference. Empirically, control-flow edits to the planner / executor / critic loop carry higher regression risk than purely-additive changes to retrieval, evidence injection, or environment bootstrap. A discovered Meta-Harness improvement on TerminalBench-2 was a single additive change — injecting an environment snapshot before the agent loop begins — that gained on 7 of 89 tasks without regressing the rest. StrategyCompiler proposals that sit in the additive layers (retrieval, memory_recall, prompt) carry less risk than ones in planner or tool_selection; the release gate weights this when ranking competing proposals.
Cross-run transfer. The proposer can read prior runs across domains, not just within one. A pattern discovered on one intent (e.g., evidence-injection helps the cold-start) may transfer to another. The Improvement Loop’s append-only proposal store, indexed by trace_id and intent, is what makes this transfer mechanically possible.
What the Improvement Loop does not do
- It does not replace human approval. Every proposal — automated or hand-authored — passes through the same release gate.
- It does not learn during inference. All discovery happens offline against pinned snapshots.
- It does not optimize for a single scalar. Every proposal is evaluated on the full evaluator vector.
- It does not search-set its way past safety guardrails. safety and policy are floor constraints, not optimization targets; any proposal that would regress them is rejected before it reaches the test-set.
- It does not stop at release. Production monitoring is the evidence that a released proposal should be kept, rolled back, superseded, or retired.
Common misconceptions
- Improvement is not auto-pilot. Every primitive produces proposals; humans approve.
- The Improvement Loop is not separate from change control. Its outputs land under the same release-gate process as packs, policies, and catalogs.
- Release is not the end of the loop. A shipped proposal enters a monitor window; post-release regressions produce rollback, supersede, or retire decisions.
- Insights are not log lines. They are typed records with occurrence counts, evidence_refs, and a status lifecycle.
- Autotune is not learning during inference. It is offline tuning against a golden replay, gated by scorecard deltas.
- Compressed feedback is not equivalent to traces. Summaries collapse the diagnostic signal a proposer needs to identify confounds; the experience store keeps full traces and pinned versions for this reason.
- The harness is the optimization target, not the model. Changing the harness around a fixed model can produce 6× swings on the same benchmark; the Improvement Loop is the seam through which those changes are proposed, evaluated, and shipped.