Improvement Loop

Trust-plane primitives that turn failed and corrected runs into typed proposals — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune.

Foundational Spec · Last reviewed: 2026-07-25

At a glance

Trust plane: control over the other four

Six primitives that turn observed behavior into typed, release-gated proposals — never auto-applied.

Inputs

Trace bundles + DecisionRecords from Observability
Replayable environment episodes and quality replay cases
Scorecards from the Evaluation Engine
Operator corrections and approval-gate verdicts
Memory promotion records (especially correction-class)
Open loops + queue state from the operator surface
Post-release scorecard deltas, rollback triggers, and regression reports

Outputs

Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
Replay obligations, source-closure verdicts, and evidence-bound benchmark candidates
Bundle of proposals for change-control review
Audit records bound to trace_id and decision_record_id
Monitor verdicts that keep, roll back, supersede, or retire released proposals

Lifecycle

observe
propose
lint
review
gate
release
monitor

Canonical types

Insight
StrategyRule
FeedbackRecord
Note
ResearchTask
TuningProposal

The Improvement Loop is the Trust-plane discipline that turns observed runtime behavior into governed change proposals. It is what closes the loop between evaluation and the next pack version.

Definition

A coordinated set of typed primitives that consume traces, decision records, operator corrections, and approval-gate outcomes; surface recurring patterns; convert them into reusable strategy rules; queue autonomous research where the gap is knowledge; and generate proactive operational notes — all as proposals that land under the same change control as packs and policies.

Why it exists

LLM-driven systems drift. Without a closed loop that turns failures into typed proposals, every regression is rediscovered the hard way and every operator correction evaporates after the conversation ends. This plane formalizes the loop: every signal is captured, scored, and routed to the primitive that owns its kind of improvement.

The same discipline applies after release. A shipped proposal is not “done” until production monitoring shows that the intended scorecard improved without a hidden policy, safety, latency, cost, or trust regression.

How it works

Observe — every completed run produces a RunScore, a DecisionRecord, and a trace bundle.
Capture corrections — every operator override or correction enters the FeedbackStore with provenance.
Pin source evidence — if the signal came from a replay suite, task fixture, or bounded environment episode, the loop records the episode id, replay case id, baseline tuple, required pack, and stable state signature before mutation.
Classify — the signal is routed as an evidence gap, strategy gap, policy gap, workflow gap, scorecard regression, environment/task failure, or operational bottleneck before anyone reaches for a prompt edit.
Synthesize — the InsightSynthesizer scans episodic memory and the trace store for recurring patterns (failure clusters, blocked decisions, common detours).
Compile strategies — the StrategyCompiler converts feedback + insights into reusable strategy rules adjusting prompts, plans, or tool selection.
Queue research — when a pattern signals a knowledge gap (not a strategy gap), the ResearchQueue enqueues an autonomous research task.
Surface operational notes — the ChiefOfStaff scans due tasks, open loops, and queue backlog to propose proactive operator notes.
Tune — Autotune proposes prompt, retrieval, or budget changes against a target metric.
Gate and release — every proposal lands in change control; nothing auto-applies without a human review (except where policy explicitly permits auto-promotion of a specific class).
Close source pressure — the replay case, benchmark candidate, environment episode group, or correction cluster is marked fixed, still_failing, invalid, or superseded with evidence.
Monitor — released proposals are compared against production scorecards, correction clusters, approval latency, and rollback triggers until the operator can keep, supersede, retire, or roll back the change with evidence.

Signal routing

The Improvement Loop does not assume that every failure is a prompt problem. Classification decides which surface owns the fix.

Runtime signal	Route first	Typical proposal
Missing fact	Context Pack evidence source or ResearchQueue	`ResearchTask` or knowledge patch for review
Wrong tool choice	Tool catalog, planner rule, or strategy layer	`StrategyRule` against `tool_selection` or `planner`
Bad policy behavior	Governance policy and approval boundary	Policy change proposal with replay evidence
Confusing user output	Prompt examples, receipt language, or response rubric	`StrategyRule` against `prompt` plus scorecard update
Repeated escalation	Authority boundary, workflow design, or queue owner	`Insight` plus operational `Note` or governance proposal
Slow run	Retrieval path, tool path, or context shape	`TuningProposal` with latency guardrails
Expensive run	Budget allocation and context size	`TuningProposal` with economics target
Recurring operator correction	FeedbackStore and StrategyCompiler	`FeedbackRecord` leading to `StrategyRule`
User distrust	Receipts, approval copy, and explanation quality	`Insight` plus product-surface or rubric proposal
New risk pattern	Release gate, rollback trigger, or refusal policy	Policy gate proposal with production monitor criteria
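The routing table can be read as a lookup from a classified signal to the surface that owns the fix. The sketch below is illustrative only: the signal keys, the `Route` tuple, and the `route` function are hypothetical names, and only a few rows of the table are reproduced.

```python
from typing import NamedTuple


class Route(NamedTuple):
    route_first: str        # surface that owns the fix
    typical_proposal: str   # proposal type usually emitted


# Hypothetical signal keys; the values copy rows of the routing table above.
ROUTES = {
    "missing_fact": Route("Context Pack evidence source or ResearchQueue", "ResearchTask"),
    "wrong_tool_choice": Route("Tool catalog, planner rule, or strategy layer", "StrategyRule"),
    "slow_run": Route("Retrieval path, tool path, or context shape", "TuningProposal"),
    "expensive_run": Route("Budget allocation and context size", "TuningProposal"),
    "recurring_operator_correction": Route("FeedbackStore and StrategyCompiler", "FeedbackRecord"),
}


def route(signal: str) -> Route:
    """Return the owning surface for a classified signal, or fail loudly."""
    if signal not in ROUTES:
        # An unclassified signal must not silently default to a prompt edit.
        raise ValueError(f"unclassified signal: {signal!r}")
    return ROUTES[signal]


print(route("missing_fact").typical_proposal)
```

Raising on an unknown signal, rather than falling back to a default route, mirrors the point that classification happens before anyone reaches for a prompt edit.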

Correction promotion ladder

Agent reflection is a useful signal, but it is not trusted learning by itself. A model saying “I should do this differently next time” creates a candidate observation bound to the current trace. The Improvement Loop promotes the correction only after it chooses the smallest durable surface and proves that the change generalizes.

Evidence pattern	Promotion target	What must be true
One-off user or task correction	Run Context, session state, or active execution plan	Scope is intentionally limited to the current objective
Repeated navigation or repository mistake	Scoped agent guidance	Same failure recurs in the same owned subtree
Repeated multi-step method	Versioned skill	Inputs, outputs, verification, and failure handling are stable
Deterministic invariant	Validator, hook, policy, type, or CI gate	The rule can be checked without model judgment
Stable recurring operation	Scheduled runner around a pinned skill	Replay passes; run is idempotent, isolated, owned, budgeted, and reversible

Every promotion retains the originating trace_id, correction or reflection reference, recurrence count, target surface, and replay obligation. A rejected promotion remains evidence; it does not disappear or become memory by repetition alone.

This also separates method from cadence. Skills and workflow specs define how work is performed. Schedulers decide when a proven method runs. If a recurring job still needs frequent steering, its method is not stable enough to automate; the loop should emit a workflow proposal instead of a schedule.
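One way to picture the ladder is as a function that picks the smallest durable surface for a correction. This is a sketch under stated assumptions: the `Evidence` fields, the returned labels, and the order in which the rungs are checked are all hypothetical and are not defined by the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    recurrence_count: int
    checkable_without_model: bool = False  # deterministic invariant
    recurring_operation: bool = False      # candidate for a schedule
    method_stable: bool = False            # replay passes, no frequent steering
    multi_step_method: bool = False
    same_subtree: bool = False


def promotion_target(e: Evidence) -> str:
    """Pick a promotion target (hypothetical labels, assumed check order)."""
    if e.recurrence_count <= 1:
        return "run_context"               # one-off: keep scope local
    if e.checkable_without_model:
        return "validator_or_gate"
    if e.recurring_operation:
        # Method and cadence are separate: only a proven method gets a schedule.
        return "scheduled_runner" if e.method_stable else "workflow_proposal"
    if e.multi_step_method:
        return "versioned_skill"
    if e.same_subtree:
        return "scoped_agent_guidance"
    return "run_context"


print(promotion_target(Evidence(recurrence_count=5, recurring_operation=True)))
```

In this sketch a recurring job whose method is not yet stable yields a workflow proposal instead of a schedule, which is the separation of method from cadence described above.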

Environment-backed evidence

The SecondBrain improvement-loop runbook makes one operational point explicit: the best improvement signal is not a complaint, a dashboard screenshot, or a prompt diff. It is a replayable episode. A failed task should be captured with reset inputs, structured actions, observations, reward components, terminal flags, and a stable state signature so the same failure can be rerun before and after a candidate change.

ContextOS treats this as a harness-side evidence pattern, not as a new production tool plane. Production tools still execute through the Tool Gateway. Replay environments and task fixtures live in the measurement plane: they preserve trajectory and reward evidence so the Evaluation Engine can decide whether a proposal fixed the source pressure.

SecondBrain pattern	ContextOS interpretation
Environment episode	A bounded replay fixture under `harness/fixtures` or `harness/evals`, with reset state, actions, observations, reward components, terminal status, and a stable state signature.
Quality replay case	Executable evidence derived from a failed run, environment episode, or operator correction. It must be rerun before promotion.
Benchmark candidate	A pressure source with `source_type`, `source_ref`, expected properties, severity, and resolution status. It is not trusted until it has executable expectations.
Target-bound pack	The required replay pack follows the source evidence. A hard failure cannot be closed by passing an unrelated smoke set.
Scientific promotion gate	Baseline and candidate are compared on paired cases; kept-but-underpowered runs remain review candidates, not closed improvements.
Source closure	After promotion, the original pressure is explicitly marked `fixed`, `still_failing`, `invalid`, or `superseded`; failed attempts add recurrence metadata instead of disappearing.

This turns the loop into a falsifiable control system:

source pressure
  -> replayable evidence
  -> target-bound baseline measurement
  -> bounded candidate mutation
  -> treatment measurement on the same evidence
  -> promotion gate
  -> source closure
  -> production monitor window

The invariant is simple: if a source pressure cannot be replayed or otherwise converted into executable expectations, the loop may record an Insight or ResearchTask, but it cannot claim a closed improvement.

The portable contract for this evidence is the ReplayPacket. A quality replay case, environment episode, operator correction, or golden case may each produce a packet, but all packets must name the source pressure, pinned context pack, compiled-context hash, side-effect policy, and expected outcome before the loop can claim closure.
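As a minimal sketch, the closure precondition can be expressed as a completeness check over the five things a packet must name. The field names below are assumptions derived from the prose; the spec does not fix a ReplayPacket schema in this section.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReplayPacket:
    # Hypothetical field names for the five required items.
    source_pressure: Optional[str]
    pinned_context_pack: Optional[str]
    compiled_context_hash: Optional[str]
    side_effect_policy: Optional[str]
    expected_outcome: Optional[str]


def missing_fields(packet: ReplayPacket) -> list:
    """List required fields that are empty or unset."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name)]


def can_claim_closure(packet: ReplayPacket) -> bool:
    """A packet supports a closure claim only when every field is named."""
    return not missing_fields(packet)


packet = ReplayPacket(
    source_pressure="replay_case:rc_001",      # illustrative values only
    pinned_context_pack="support.refund@v12",
    compiled_context_hash=None,
    side_effect_policy="no_side_effects",
    expected_outcome="refund decision matches policy",
)
print(missing_fields(packet))
```

A packet that fails this check can still back an Insight or ResearchTask; it just cannot back a closed improvement.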

The six primitives

InsightSynthesizer

Scans episodic memory and the trace store for recurring patterns and emits typed Insight records.

{
  "insight_id": "ins_2026_05_04_a17",
  "kind": "failure_cluster",
  "intent": "support.refund",
  "pattern": "policy.eval denial after order_lookup when refund_amount > 800 INR",
  "occurrences": 23,
  "first_seen": "2026-04-21T09:14:00Z",
  "last_seen": "2026-05-04T08:01:00Z",
  "evidence_refs": ["dr_2026_04_21_x12", "dr_2026_05_03_q88"],
  "confidence": 0.91,
  "status": "proposed"
}

Insight kinds: failure_cluster, gap_detected (recurring evidence misses), blocked_decision, common_detour, over_escalation.
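A rough sketch of how a failure_cluster Insight might be derived: group decision records by intent and pattern, then keep only groups that clear a minimum-occurrence threshold. The input record shape, the `failure_clusters` function, and the default threshold are assumptions for illustration, not the synthesizer's actual algorithm.

```python
from collections import Counter


def failure_clusters(records, min_occurrences=3):
    """Emit one proposed Insight per (intent, pattern) group above threshold."""
    counts = Counter((r["intent"], r["pattern"]) for r in records)
    return [
        {
            "kind": "failure_cluster",
            "intent": intent,
            "pattern": pattern,
            "occurrences": n,
            "status": "proposed",   # Insights start as proposals
        }
        for (intent, pattern), n in counts.most_common()
        if n >= min_occurrences
    ]


# Illustrative input: three matching denials and one unrelated timeout.
records = [
    {"intent": "support.refund", "pattern": "policy.eval denial"},
    {"intent": "support.refund", "pattern": "policy.eval denial"},
    {"intent": "support.refund", "pattern": "policy.eval denial"},
    {"intent": "support.refund", "pattern": "tool timeout"},
]
print(failure_clusters(records))
```

The threshold is what keeps a single odd run from becoming an Insight; tuning it trades recall against operator noise.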

StrategyCompiler

Converts validated feedback + insights into reusable strategy rules the runtime can apply at the right layer.

{
  "strategy_rule_id": "str_2026_05_04_b03",
  "applies_to": { "intent": "support.refund" },
  "trigger": { "from_insight": "ins_2026_05_04_a17" },
  "adjustment": {
    "layer": "planner",
    "type": "tool_preference",
    "value": { "prefer_tool": "adp_policy.eval_with_promotion", "over": "adp_policy.eval" }
  },
  "release_gate_target": "support.refund",
  "status": "proposed"
}

Adjustment layers: prompt, planner, retrieval, tool_selection, memory_recall, budget_allocation. Status flow: proposed → reviewed → approved → released → retired.
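Because several rules can target the same intent and layer, a lint step before review can flag collisions. The sketch below assumes one possible definition of a conflict (same intent, layer, and adjustment type as a released rule, but a different value); the spec does not define the lint's actual criteria, and `conflicts` is a hypothetical helper.

```python
def conflicts(candidate, active_rules):
    """Return ids of released rules that collide with the candidate."""

    def key(rule):
        adj = rule["adjustment"]
        return (rule["applies_to"]["intent"], adj["layer"], adj["type"])

    return [
        r["strategy_rule_id"]
        for r in active_rules
        if r["status"] == "released"
        and key(r) == key(candidate)
        and r["adjustment"]["value"] != candidate["adjustment"]["value"]
    ]


candidate = {
    "strategy_rule_id": "str_2026_05_04_b03",
    "applies_to": {"intent": "support.refund"},
    "adjustment": {
        "layer": "planner",
        "type": "tool_preference",
        "value": {"prefer_tool": "adp_policy.eval_with_promotion", "over": "adp_policy.eval"},
    },
    "status": "proposed",
}
# Hypothetical released rule that prefers the opposite tool.
older = {
    "strategy_rule_id": "str_old_01",
    "applies_to": {"intent": "support.refund"},
    "adjustment": {
        "layer": "planner",
        "type": "tool_preference",
        "value": {"prefer_tool": "adp_policy.eval", "over": "adp_policy.eval_with_promotion"},
    },
    "status": "released",
}
print(conflicts(candidate, [older]))
```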

FeedbackStore

Captures every operator correction and tip with provenance. Records are auditable and can be cited as evidence by later proposals.

{
  "feedback_id": "fb_2026_05_04_c19",
  "kind": "correction",
  "context": { "decision_record_id": "dr_2026_05_04_a17", "intent": "support.refund" },
  "operator": "user_finance_lead_77",
  "correction": "Refund eligibility should consider prior corrections within 90 days",
  "applied_to_record": "dr_2026_05_04_a17",
  "evidence_refs": ["dr_2026_05_04_a17"],
  "captured_at": "2026-05-04T09:32:00Z"
}

Feedback kinds: correction, tip, escalation_rationale, gate_rationale. Every correction emits a learning signal to the InsightSynthesizer and writes a correction-class entry into Memory.

ChiefOfStaff

Scans due tasks, open loops, queue backlog, and approval-gate latency to propose proactive operational notes.

{
  "note_id": "cs_2026_05_04_d04",
  "kind": "open_loop_aging",
  "subject": "GATE_FINANCE_APPROVAL pending > 24h on session sess_42f1",
  "recommended_action": "page on-call finance approver or auto-deny per policy",
  "evidence_refs": ["session:sess_42f1", "audit:audit_inv_22_GATE_FINANCE_APPROVAL"],
  "status": "open"
}

Note kinds: open_loop_aging, queue_backlog, repeated_escalation, eval_target_drift, recurring_correction. Output goes to the operator surface; never to the planner directly.

ResearchQueue

When an insight indicates a knowledge gap rather than a strategy gap, an autonomous research task is queued. Output is a typed Knowledge Patch the operator can review and promote.

{
  "research_task_id": "rq_2026_05_04_e02",
  "trigger_insight_id": "ins_2026_05_04_a17",
  "question": "What are the supplier-side rules that override our default 90-day refund window?",
  "scope": { "domain": "support.refund", "max_external_calls": 0 },
  "status": "queued",
  "produces": "knowledge_patch"
}

Research tasks are budgeted (no live tool calls beyond the explicit scope); produced patches are evidence-bound knowledge-graph deltas, not free-form text.
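A minimal sketch of scope enforcement, assuming the `max_external_calls` value from the task's scope is checked before each external call. The `ResearchBudget` class, the `ScopeExceeded` exception, and the event shape are hypothetical.

```python
class ScopeExceeded(Exception):
    """Raised when a research task tries to go beyond its declared scope."""


class ResearchBudget:
    def __init__(self, max_external_calls):
        self.max_external_calls = max_external_calls
        self.used = 0
        self.security_events = []   # stand-in for a real event sink

    def charge_external_call(self, target):
        """Refuse the call and record an event if the budget is exhausted."""
        if self.used >= self.max_external_calls:
            self.security_events.append(
                {"event": "research_scope_exceeded", "target": target}
            )
            raise ScopeExceeded(f"external call to {target!r} is out of scope")
        self.used += 1


# The example task above declares max_external_calls = 0.
budget = ResearchBudget(max_external_calls=0)
try:
    budget.charge_external_call("supplier_portal")  # hypothetical target
except ScopeExceeded as exc:
    print("refused:", exc)
```

Checking before the call, not after, is what makes the refusal happen at execution time instead of showing up later as an audit finding.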

Autotune

Proposes prompt, retrieval, or budget changes against a declared scorecard target. Bounded by the same release-gate process as a pack change.

{
  "tuning_proposal_id": "at_2026_05_04_f01",
  "target": { "intent": "support.refund", "metric": "economics_cents_per_decision", "direction": "decrease", "guardrails": ["safety>=1.0", "policy>=1.0"] },
  "candidate_change": {
    "layer": "retrieval",
    "diff": { "max_hops": { "from": 3, "to": 2 }, "top_k": { "from": 10, "to": 6 } }
  },
  "expected_delta": { "economics_cents_per_decision": -0.18, "utility": -0.01 },
  "evaluation_run_id": "eval_2026_05_04_a01",
  "status": "proposed"
}

Autotune never auto-applies; every proposal goes through the Evaluation Engine on a golden replay before review.

Before an autotune run starts, it must declare:

Required field	Purpose
`target.intent`	Keeps the search local to one business workflow.
`target.metric`	Names the single primary metric being improved.
`guardrails`	Floors for Policy and Safety, plus accepted Utility / Latency / Economics deltas.
`baseline_tuple`	Pins the pack, policy, tool manifest, evaluator suite, and model profile being compared.
`tunable_surfaces`	Enumerates exactly which fields the optimizer may change.
`search_set` / `heldout_test_set`	Prevents the proposer from iterating on the release gate data.
`rollback_target`	Names the prior released tuple before any rollout can start.

Autotune candidates are ranked on the evaluator vector, not a single blended score. Policy and Safety are floor constraints; surviving candidates are compared by target-metric improvement, blast radius, and explainability. The output is a TuningProposal, not a production patch.
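The ranking rule can be sketched as a two-stage filter: drop any candidate below a Policy or Safety floor, then order the survivors. In this sketch the tie-break on a numeric `blast_radius` is an assumption, explainability is left out because the spec gives no measure for it, and all names are hypothetical.

```python
def rank_candidates(candidates, baseline, metric, direction, floors):
    """Filter by floor constraints, then rank by target-metric improvement."""
    sign = -1.0 if direction == "decrease" else 1.0

    # Stage 1: floors are constraints, not terms in a blended score.
    survivors = [
        c for c in candidates
        if all(c["scores"][name] >= floor for name, floor in floors.items())
    ]

    def improvement(c):
        return sign * (c["scores"][metric] - baseline[metric])

    # Stage 2: larger improvement first; smaller blast radius breaks ties.
    return sorted(survivors, key=lambda c: (-improvement(c), c["blast_radius"]))


baseline = {"economics": 1.20}
candidates = [
    {"id": "A", "blast_radius": 2, "scores": {"economics": 1.00, "policy": 1.0, "safety": 1.0}},
    {"id": "B", "blast_radius": 1, "scores": {"economics": 0.80, "policy": 0.99, "safety": 1.0}},
    {"id": "C", "blast_radius": 1, "scores": {"economics": 0.90, "policy": 1.0, "safety": 1.0}},
]
ranked = rank_candidates(
    candidates, baseline, metric="economics", direction="decrease",
    floors={"policy": 1.0, "safety": 1.0},
)
print([c["id"] for c in ranked])
```

Candidate B has the best economics figure but never reaches the ranking stage, because it falls below the Policy floor.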

Proposal lifecycle

All six primitives produce typed records. The lifecycle is uniform:

proposed → reviewed → approved → released → (retired | superseded)
                  ↘ rejected

proposed — primitive emitted the record; no human has looked at it.
reviewed — operator triaged with a verdict and rationale.
approved — change-control approver signed the promotion.
released — applied to the relevant pack / policy / catalog version; pinned for replay.
rejected / superseded / retired — terminal states with rationale recorded.
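The lifecycle can be modelled as a small state machine. The transition table below is one reading of the diagram (rejection branches from review) and the `advance` helper is hypothetical; the rule that terminal states carry a rationale comes from the list above.

```python
# Allowed transitions, as read from the lifecycle diagram (an interpretation).
TRANSITIONS = {
    "proposed": {"reviewed"},
    "reviewed": {"approved", "rejected"},
    "approved": {"released"},
    "released": {"retired", "superseded"},
    "rejected": set(),
    "retired": set(),
    "superseded": set(),
}
TERMINAL = {"rejected", "retired", "superseded"}


def advance(record, new_status, rationale=""):
    """Return a new record in new_status, or raise on an illegal move."""
    current = record["status"]
    if new_status not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    if new_status in TERMINAL and not rationale:
        raise ValueError("terminal states require a recorded rationale")
    # Build a new dict so earlier versions of the record stay intact.
    history = record.get("history", []) + [(current, new_status, rationale)]
    return {**record, "status": new_status, "history": history}


rule = {"strategy_rule_id": "str_2026_05_04_b03", "status": "proposed"}
rule = advance(rule, "reviewed", "triaged by operator")
print(rule["status"])
```

Because there is no edge from proposed to released, a primitive cannot skip review in this model even by mistake.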

Interfaces

Inputs

Trace bundles + DecisionRecords from Observability
Scorecards from the Evaluation Engine
Operator corrections and approval-gate verdicts
Memory promotion records (especially correction-class)
Open loops and queue state from the operator surface

Outputs

Typed Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
Bundle of proposals for change-control review
Audit records bound to trace_id and decision_record_id
Monitor verdicts bound to released proposal id and baseline tuple

Failure modes

Primitive auto-applies a proposal without review — invariant violation; release gate must always require an approver.
InsightSynthesizer emits noise (low-precision patterns) — mitigated by minimum-occurrence thresholds and operator-tuned filters.
StrategyCompiler produces conflicting rules across primitives — release-gate lint detects conflicts with active rules.
FeedbackStore collects corrections that are themselves wrong — every correction is auditable and reversible by a later correction with higher precedence.
ChiefOfStaff floods the operator surface — note count budgeted per operator per day; aggregations preferred to per-note alerts.
ResearchQueue task exceeds its declared scope — refused at execution; emits a security event.
Autotune regresses a guardrail metric — release-gate blocks the proposal.
Runtime-affecting proposal is treated as finished after release — regression can persist silently; every runtime-affecting release needs a monitor window and rollback target.

Operational concerns

Sampling stratified by intent and risk class so rare-but-high-value insights surface.
Per-primitive cost budgets separate from production Run Budget.
Quarterly rebaselining: prune retired/superseded records; refresh thresholds and rubrics.
All proposals discoverable from the operator surface; nothing happens out-of-band.
Append-only proposal store with full provenance; replay must reproduce the verdict.
Recently released runtime-affecting proposals enter a post-release monitor window. The window compares production scorecard slices against the baseline tuple and records one of four verdicts: keep, roll back, supersede, or retire.
New and recently modified workflows should have a regular operations review cadence. The review must decide which repeated corrections become proposals, which approval delays need owner changes, which tests are missing, and whether rollout should advance, pause, or roll back.
Operating reviews produce artifacts, not meeting notes: accepted or rejected proposals, new replay cases, assigned owners, and explicit rollout decisions.
The loop should satisfy lifecycle-monitoring and continual-improvement expectations, but the spec remains change-control first: standards alignment never replaces traces, scorecards, or rollback evidence.
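The post-release monitor window described in this list can be sketched as a comparison of a production scorecard slice against the baseline tuple. The sketch covers only the clear keep and roll back cases and hands everything else to the operator; the function name, argument shapes, and the `escalate_to_operator` label are assumptions, and the supersede and retire verdicts are deliberately left to human judgment.

```python
def monitor_verdict(baseline, production, target_metric, direction,
                    floors, rollback_triggered=False):
    """Suggest a verdict for a released proposal during its monitor window."""
    if rollback_triggered:
        return "roll_back"
    # Any guardrail below its floor is treated as a regression.
    if any(production[name] < floor for name, floor in floors.items()):
        return "roll_back"
    delta = production[target_metric] - baseline[target_metric]
    improved = delta < 0 if direction == "decrease" else delta > 0
    # No clear improvement: leave the decision with the operator.
    return "keep" if improved else "escalate_to_operator"


# Illustrative scorecards; the numbers are made up for the example.
baseline = {"economics": 1.20, "policy": 1.0, "safety": 1.0}
production = {"economics": 1.02, "policy": 1.0, "safety": 1.0}
print(monitor_verdict(baseline, production, "economics", "decrease",
                      floors={"policy": 1.0, "safety": 1.0}))
```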

Evaluation metrics

Adoption rate — fraction of proposals that ship after review (target band: 30–60% — too low means noise, too high means rubber-stamping).
Detection lead time — regression introduced → proposal emitted.
Correction-to-proposal latency — operator correction → derived StrategyRule.
Time to close repeated issue — first repeated signal → released and monitored fix with no active rollback trigger.
Post-release rollback rate — fraction of released proposals rolled back or superseded during the monitor window.
Retire rate — fraction of released proposals retired within 90 days (high retire rate signals over-aggressive shipping).
Notes acted on — fraction of Chief-of-Staff notes the operator acts on (low rate signals notification fatigue).
Research patch yield — fraction of research tasks that produce a knowledge patch the operator promotes.
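As a worked example of the first metric, adoption rate can be computed from proposal statuses and checked against the 30–60% band. This sketch assumes the denominator is every proposal that has left the proposed state and that retired or superseded proposals count as shipped; the function names and band labels are hypothetical.

```python
SHIPPED = {"released", "retired", "superseded"}  # all passed through release


def adoption_rate(proposals):
    """Fraction of reviewed proposals that shipped (0.0 if none reviewed)."""
    reviewed = [p for p in proposals if p["status"] != "proposed"]
    if not reviewed:
        return 0.0
    shipped = [p for p in reviewed if p["status"] in SHIPPED]
    return len(shipped) / len(reviewed)


def adoption_band(rate, low=0.30, high=0.60):
    """Interpret the rate against the target band from the metric definition."""
    if rate < low:
        return "likely_noise"
    if rate > high:
        return "possible_rubber_stamping"
    return "in_band"


statuses = ["proposed"] * 2 + ["released"] * 4 + ["rejected"] * 4
proposals = [{"status": s} for s in statuses]
rate = adoption_rate(proposals)
print(rate, adoption_band(rate))
```

Here 4 of the 8 reviewed proposals shipped, so the rate is 0.5 and falls inside the band.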

Example

A complete proposal chain spawned by a single failed run:

Run sess_42f1 / dr_2026_05_04_a17 (status=ESCALATED, policy=1.0 but utility=0.62)
  → InsightSynthesizer detects 23 similar escalations
  → emits Insight ins_2026_05_04_a17
  → StrategyCompiler proposes StrategyRule str_2026_05_04_b03 (planner tool preference)
  → ResearchQueue enqueues rq_2026_05_04_e02 (supplier rule clarification)
  → ChiefOfStaff emits Note cs_2026_05_04_d04 (approval pending > 24h on this gate)
  → Autotune proposes at_2026_05_04_f01 (reduce hop budget on read path)
  → all five records land in change control with the same trace_id

The harness as a search target

The Improvement Loop is the entry point for an idea ContextOS treats as foundational: the harness itself is a versioned artifact that can be improved by search, not just by hand. Recent work — most directly the Stanford / MIT / KRAFTON Meta-Harness paper (Lee et al., 2026) — shows that an outer-loop optimizer with full access to prior code, scores, and execution traces can discover better harnesses than hand-engineering, on tasks ranging from text classification to long-horizon agentic coding (TerminalBench-2). The result generalizes the principle behind every primitive in this doc.

ContextOS does not ship an autonomous outer-loop optimizer in the spec. It ships the substrate one needs:

Requirement	ContextOS primitive
Experience store of prior runs (code + scores + traces)	OTEL trace bundles + DecisionRecords + pinned pack/policy/tool versions
Multi-objective scoring (Pareto, not scalar)	Five evaluators (Policy / Utility / Latency / Safety / Economics)
Lightweight validation before expensive evaluation	Interface tests on packs, policies, and tools before any golden-set replay
Disjoint search-set and test-set	Golden sets and held-out replay sets; the search-set is what the proposer sees, the test-set is what gates release
Filesystem-shaped layout that grep + cat can navigate	The `harness/` repo layout in Harness Engineering
Initialization from strong baselines	Existing pack versions are the search prior; proposals start from a published baseline, not a blank slate

Patterns the substrate enables

Pareto-frontier proposals. Autotune already accepts a target with direction and guardrails. The Pareto framing is wider: the same proposal can be evaluated against multiple operating points (e.g., low-context-cost variant vs. high-utility variant), and the operator picks a frontier point rather than a single scalar winner. This is how Meta-Harness produces a family of harnesses that trade context against accuracy.
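A minimal sketch of the frontier computation: a candidate stays on the frontier if no other candidate is at least as good on every objective and strictly better on at least one. The objective names and candidate values are made up for illustration, and `pareto_frontier` is a hypothetical helper, not part of Autotune.

```python
def pareto_frontier(candidates, objectives):
    """Return candidates not dominated on the given objectives.

    objectives maps a metric name to "increase" or "decrease".
    """

    def oriented(c):
        # Flip minimized metrics so that larger is always better.
        return [c[m] if d == "increase" else -c[m] for m, d in objectives.items()]

    def dominates(a, b):
        oa, ob = oriented(a), oriented(b)
        return all(x >= y for x, y in zip(oa, ob)) and any(
            x > y for x, y in zip(oa, ob)
        )

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    {"id": "high_utility", "utility": 0.90, "context_cost": 10},
    {"id": "low_cost", "utility": 0.80, "context_cost": 5},
    {"id": "dominated", "utility": 0.70, "context_cost": 8},
]
frontier = pareto_frontier(
    candidates, {"utility": "increase", "context_cost": "decrease"}
)
print([c["id"] for c in frontier])
```

The operator then chooses between `high_utility` and `low_cost`; the third variant drops out because `low_cost` beats it on both objectives.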

Causal reasoning over prior failures. When the InsightSynthesizer flags a regression, the proposer (human or automated) reads not the summary but the raw traces and code of the failing candidates. The Meta-Harness ablation makes this stark: scores-only proposers reach 34.6 median accuracy; scores-plus-summary reaches 34.9; full execution traces reach 50.0. Compressed feedback loses the diagnostic signal needed to identify confounds. ContextOS retains full traces by default for exactly this reason — replay against pinned snapshots is the contract that keeps prior experience interpretable.

Additive-change preference. Empirically, control-flow edits to the planner / executor / critic loop carry higher regression risk than purely-additive changes to retrieval, evidence injection, or environment bootstrap. A discovered Meta-Harness improvement on TerminalBench-2 was a single additive change — injecting an environment snapshot before the agent loop begins — that gained on 7 of 89 tasks without regressing the rest. StrategyCompiler proposals that sit in the additive layers (retrieval, memory_recall, prompt) carry less risk than ones in planner or tool_selection; the release gate weights this when ranking competing proposals.

Cross-run transfer. The proposer can read prior runs across domains, not just within one. A pattern discovered on one intent (e.g., evidence-injection helps the cold-start) may transfer to another. The Improvement Loop’s append-only proposal store, indexed by trace_id and intent, is what makes this transfer mechanically possible.

What the Improvement Loop does not do

It does not replace human approval. Every proposal — automated or hand-authored — passes through the same release gate.
It does not learn during inference. All discovery happens offline against pinned snapshots.
It does not optimize for a single scalar. Every proposal is evaluated on the full evaluator vector.
It does not search its way past safety guardrails. Safety and Policy are floor constraints, not optimization targets; any proposal that would regress them is rejected before it reaches the test-set.
It does not stop at release. Production monitoring is the evidence that a released proposal should be kept, rolled back, superseded, or retired.

Common misconceptions

Improvement is not auto-pilot. Every primitive produces proposals; humans approve.
The Improvement Loop is not separate from change control. Its outputs land under the same release-gate process as packs, policies, and catalogs.
Release is not the end of the loop. A shipped proposal enters a monitor window; post-release regressions produce rollback, supersede, or retire decisions.
Insights are not log lines. They are typed records with occurrence counts, evidence_refs, and a status lifecycle.
Autotune is not learning during inference. It is offline tuning against a golden replay, gated by scorecard deltas.
Compressed feedback is not equivalent to traces. Summaries collapse the diagnostic signal a proposer needs to identify confounds; the experience store keeps full traces and pinned versions for this reason.
The harness is the optimization target, not the model. Changing the harness around a fixed model can produce 6× swings on the same benchmark; the Improvement Loop is the seam through which those changes are proposed, evaluated, and shipped.