Improvement Loop

Trust-plane primitives that turn failed and corrected runs into typed proposals — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune.

Foundational Spec
At a glance
Trust plane: control over the other four

Six primitives that turn observed behavior into typed, release-gated proposals — never auto-applied.

Inputs
  • Trace bundles + DecisionRecords from Observability
  • Replayable environment episodes and quality replay cases
  • Scorecards from the Evaluation Engine
  • Operator corrections and approval-gate verdicts
  • Memory promotion records (especially correction-class)
  • Open loops + queue state from the operator surface
  • Post-release scorecard deltas, rollback triggers, and regression reports
Outputs
  • Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
  • Replay obligations, source-closure verdicts, and evidence-bound benchmark candidates
  • Bundle of proposals for change-control review
  • Audit records bound to trace_id and decision_record_id
  • Monitor verdicts that keep, roll back, supersede, or retire released proposals
Lifecycle
  1. observe
  2. propose
  3. lint
  4. review
  5. gate
  6. release
  7. monitor
Canonical types
  • Insight
  • StrategyRule
  • FeedbackRecord
  • Note
  • ResearchTask
  • TuningProposal

The Improvement Loop is the Trust-plane discipline that turns observed runtime behavior into governed change proposals. It is what closes the loop between evaluation and the next pack version.

Definition

A coordinated set of typed primitives that consume traces, decision records, operator corrections, and approval-gate outcomes; surface recurring patterns; convert them into reusable strategy rules; queue autonomous research where the gap is knowledge; and generate proactive operational notes — all as proposals that land under the same change control as packs and policies.

Why it exists

LLM-driven systems drift. Without a closed loop that turns failures into typed proposals, every regression is rediscovered the hard way and every operator correction evaporates after the conversation ends. This plane formalizes the loop: every signal is captured, scored, and routed to the primitive that owns its kind of improvement.

The same discipline applies after release. A shipped proposal is not “done” until production monitoring shows that the intended scorecard improved without a hidden policy, safety, latency, cost, or trust regression.

How it works

  1. Observe — every completed run produces a RunScore, a DecisionRecord, and a trace bundle.
  2. Capture corrections — every operator override or correction enters the FeedbackStore with provenance.
  3. Pin source evidence — if the signal came from a replay suite, task fixture, or bounded environment episode, the loop records the episode id, replay case id, baseline tuple, required pack, and stable state signature before mutation.
  4. Classify — the signal is routed as an evidence gap, strategy gap, policy gap, workflow gap, scorecard regression, environment/task failure, or operational bottleneck before anyone reaches for a prompt edit.
  5. Synthesize — the InsightSynthesizer scans episodic memory and the trace store for recurring patterns (failure clusters, blocked decisions, common detours).
  6. Compile strategies — the StrategyCompiler converts feedback + insights into reusable strategy rules adjusting prompts, plans, or tool selection.
  7. Queue research — when a pattern signals a knowledge gap (not a strategy gap), the ResearchQueue enqueues an autonomous research task.
  8. Surface operational notes — the ChiefOfStaff scans due tasks, open loops, and queue backlog to propose proactive operator notes.
  9. Tune — Autotune proposes prompt, retrieval, or budget changes against a target metric.
  10. Gate and release — every proposal lands in change control; nothing auto-applies without a human review (except where policy explicitly permits auto-promotion of a specific class).
  11. Close source pressure — the replay case, benchmark candidate, environment episode group, or correction cluster is marked fixed, still_failing, invalid, or superseded with evidence.
  12. Monitor — released proposals are compared against production scorecards, correction clusters, approval latency, and rollback triggers until the operator can keep, supersede, retire, or roll back the change with evidence.

Signal routing

The Improvement Loop does not assume that every failure is a prompt problem. Classification decides which surface owns the fix.

| Runtime signal | Route first | Typical proposal |
| --- | --- | --- |
| Missing fact | Context Pack evidence source or ResearchQueue | ResearchTask or knowledge patch for review |
| Wrong tool choice | Tool catalog, planner rule, or strategy layer | StrategyRule against tool_selection or planner |
| Bad policy behavior | Governance policy and approval boundary | Policy change proposal with replay evidence |
| Confusing user output | Prompt examples, receipt language, or response rubric | StrategyRule against prompt plus scorecard update |
| Repeated escalation | Authority boundary, workflow design, or queue owner | Insight plus operational Note or governance proposal |
| Slow run | Retrieval path, tool path, or context shape | TuningProposal with latency guardrails |
| Expensive run | Budget allocation and context size | TuningProposal with economics target |
| Recurring operator correction | FeedbackStore and StrategyCompiler | FeedbackRecord leading to StrategyRule |
| User distrust | Receipts, approval copy, and explanation quality | Insight plus product-surface or rubric proposal |
| New risk pattern | Release gate, rollback trigger, or refusal policy | Policy gate proposal with production monitor criteria |
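
As an illustration only, the routing table can be read as a lookup from a classified signal to the surface that owns the fix. The sketch below covers a subset of rows; the `Route` type, the snake_case signal keys, and `route_signal` are hypothetical names introduced here, not part of the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    route_first: str        # surface that owns the fix
    typical_proposal: str   # proposal type that usually results


# A subset of the routing table; the snake_case keys are hypothetical labels.
ROUTES = {
    "missing_fact": Route(
        "Context Pack evidence source or ResearchQueue",
        "ResearchTask or knowledge patch for review",
    ),
    "wrong_tool_choice": Route(
        "Tool catalog, planner rule, or strategy layer",
        "StrategyRule against tool_selection or planner",
    ),
    "slow_run": Route(
        "Retrieval path, tool path, or context shape",
        "TuningProposal with latency guardrails",
    ),
    "recurring_operator_correction": Route(
        "FeedbackStore and StrategyCompiler",
        "FeedbackRecord leading to StrategyRule",
    ),
}


def route_signal(signal: str) -> Route:
    """Look up the owning surface; unknown signals fail loudly."""
    if signal not in ROUTES:
        raise ValueError(f"unclassified signal: {signal!r}")
    return ROUTES[signal]


print(route_signal("slow_run").typical_proposal)
```

Raising on an unknown signal, rather than falling back to a default route, mirrors the point that classification happens before anyone reaches for a prompt edit.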

Environment-backed evidence

The SecondBrain improvement-loop runbook makes one operational point explicit: the best improvement signal is not a complaint, a dashboard screenshot, or a prompt diff. It is a replayable episode. A failed task should be captured with reset inputs, structured actions, observations, reward components, terminal flags, and a stable state signature so the same failure can be rerun before and after a candidate change.

ContextOS treats this as a harness-side evidence pattern, not as a new production tool plane. Production tools still execute through the Tool Gateway. Replay environments and task fixtures live in the measurement plane: they preserve trajectory and reward evidence so the Evaluation Engine can decide whether a proposal fixed the source pressure.

| SecondBrain pattern | ContextOS interpretation |
| --- | --- |
| Environment episode | A bounded replay fixture under harness/fixtures or harness/evals, with reset state, actions, observations, reward components, terminal status, and a stable state signature. |
| Quality replay case | Executable evidence derived from a failed run, environment episode, or operator correction. It must be rerun before promotion. |
| Benchmark candidate | A pressure source with source_type, source_ref, expected properties, severity, and resolution status. It is not trusted until it has executable expectations. |
| Target-bound pack | The required replay pack follows the source evidence. A hard failure cannot be closed by passing an unrelated smoke set. |
| Scientific promotion gate | Baseline and candidate are compared on paired cases; kept-but-underpowered runs remain review candidates, not closed improvements. |
| Source closure | After promotion, the original pressure is explicitly marked fixed, still_failing, invalid, or superseded; failed attempts add recurrence metadata instead of disappearing. |

This turns the loop into a falsifiable control system:

source pressure
  -> replayable evidence
  -> target-bound baseline measurement
  -> bounded candidate mutation
  -> treatment measurement on the same evidence
  -> promotion gate
  -> source closure
  -> production monitor window

The invariant is simple: if a source pressure cannot be replayed or otherwise converted into executable expectations, the loop may record an Insight or ResearchTask, but it cannot claim a closed improvement.
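
The promotion-gate step in the chain above compares baseline and treatment on the same evidence. A minimal sketch of that paired comparison follows, assuming each run is a mapping from replay case id to a score where higher is better. The verdict labels, the `min_pairs` threshold, and the use of a plain mean difference are assumptions made for illustration; a real gate would likely apply a proper statistical test.

```python
def promotion_verdict(baseline, candidate, min_pairs=20):
    """Compare two runs on the replay cases they share.

    baseline / candidate: dict mapping case_id -> score (higher is better).
    min_pairs is a hypothetical power threshold, not a spec value.
    """
    paired = sorted(set(baseline) & set(candidate))
    if not paired:
        return "invalid"            # nothing comparable was measured
    mean_delta = sum(candidate[c] - baseline[c] for c in paired) / len(paired)
    if mean_delta <= 0:
        return "rejected"           # no improvement on the paired cases
    if len(paired) < min_pairs:
        return "review_candidate"   # kept but underpowered: not a closed improvement
    return "promote"


baseline = {"case_a": 0.5, "case_b": 0.5}
candidate = {"case_a": 0.7, "case_b": 0.6}
print(promotion_verdict(baseline, candidate, min_pairs=2))
```

Only cases present in both runs count, so a candidate cannot look better by being measured on an easier or unrelated set.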

The portable contract for this evidence is the ReplayPacket. A quality replay case, environment episode, operator correction, or golden case may each produce a packet, but all packets must name the source pressure, pinned context pack, compiled-context hash, side-effect policy, and expected outcome before the loop can claim closure.
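
One way to picture the packet contract is a record whose five named parts must all be present before closure is accepted. The field names below are derived from the sentence above, but the exact schema, the `close_source` helper, and its return shape are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Closure states named in the source-closure step.
CLOSURE_STATES = {"fixed", "still_failing", "invalid", "superseded"}


@dataclass
class ReplayPacket:
    # Hypothetical field names; the spec lists the concepts, not a schema.
    source_pressure: Optional[str] = None
    pinned_context_pack: Optional[str] = None
    compiled_context_hash: Optional[str] = None
    side_effect_policy: Optional[str] = None
    expected_outcome: Optional[str] = None

    def missing_fields(self):
        """Names of required fields that are empty or unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def close_source(packet: ReplayPacket, verdict: str) -> dict:
    """Record a closure verdict only for a fully specified packet."""
    if verdict not in CLOSURE_STATES:
        raise ValueError(f"unknown closure state: {verdict!r}")
    missing = packet.missing_fields()
    if missing:
        raise ValueError(f"cannot claim closure, missing: {missing}")
    return {"source_pressure": packet.source_pressure, "closure": verdict}
```

An incomplete packet can still back an Insight or ResearchTask; it just cannot be used to mark the source pressure closed.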

The six primitives

InsightSynthesizer

Scans episodic memory and the trace store for recurring patterns and emits typed Insight records.

{
  "insight_id": "ins_2026_05_04_a17",
  "kind": "failure_cluster",
  "intent": "support.refund",
  "pattern": "policy.eval denial after order_lookup when refund_amount > 800 INR",
  "occurrences": 23,
  "first_seen": "2026-04-21T09:14:00Z",
  "last_seen": "2026-05-04T08:01:00Z",
  "evidence_refs": ["dr_2026_04_21_x12", "dr_2026_05_03_q88"],
  "confidence": 0.91,
  "status": "proposed"
}

Insight kinds: failure_cluster, gap_detected (recurring evidence misses), blocked_decision, common_detour, over_escalation.
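
A rough sketch of how recurring patterns might be grouped into failure_cluster Insights, with a minimum-occurrence threshold to suppress noise. The input record shape (`intent`, `pattern`, `decision_record_id`, `seen_at`) and the default threshold are assumptions; the output keys follow the Insight example above.

```python
from collections import defaultdict


def synthesize_failure_clusters(records, min_occurrences=5):
    """Group failing records by (intent, pattern) and emit proposed Insights.

    records: iterable of dicts with hypothetical keys
    intent, pattern, decision_record_id, seen_at (ISO 8601 UTC strings).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["intent"], rec["pattern"])].append(rec)

    insights = []
    for (intent, pattern), recs in sorted(groups.items()):
        if len(recs) < min_occurrences:
            continue  # below threshold: treat as noise, emit nothing
        # ISO 8601 timestamps in one format sort correctly as strings.
        seen = sorted(r["seen_at"] for r in recs)
        insights.append({
            "kind": "failure_cluster",
            "intent": intent,
            "pattern": pattern,
            "occurrences": len(recs),
            "first_seen": seen[0],
            "last_seen": seen[-1],
            "evidence_refs": [r["decision_record_id"] for r in recs],
            "status": "proposed",
        })
    return insights
```

Every emitted record starts in the proposed state and carries its evidence_refs, so a reviewer can trace the cluster back to the underlying decision records.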

StrategyCompiler

Converts validated feedback + insights into reusable strategy rules the runtime can apply at the right layer.

{
  "strategy_rule_id": "str_2026_05_04_b03",
  "applies_to": { "intent": "support.refund" },
  "trigger": { "from_insight": "ins_2026_05_04_a17" },
  "adjustment": {
    "layer": "planner",
    "type": "tool_preference",
    "value": { "prefer_tool": "adp_policy.eval_with_promotion", "over": "adp_policy.eval" }
  },
  "release_gate_target": "support.refund",
  "status": "proposed"
}

Adjustment layers: prompt, planner, retrieval, tool_selection, memory_recall, budget_allocation. Status flow: proposed → reviewed → approved → released → retired.

FeedbackStore

Captures every operator correction and tip with provenance. Cited by future improvements; auditable.

{
  "feedback_id": "fb_2026_05_04_c19",
  "kind": "correction",
  "context": { "decision_record_id": "dr_2026_05_04_a17", "intent": "support.refund" },
  "operator": "user_finance_lead_77",
  "correction": "Refund eligibility should consider prior corrections within 90 days",
  "applied_to_record": "dr_2026_05_04_a17",
  "evidence_refs": ["dr_2026_05_04_a17"],
  "captured_at": "2026-05-04T09:32:00Z"
}

Feedback kinds: correction, tip, escalation_rationale, gate_rationale. Every correction emits a learning signal to the InsightSynthesizer and writes a correction-class entry into Memory.

ChiefOfStaff

Scans due tasks, open loops, queue backlog, and approval-gate latency to propose proactive operational notes.

{
  "note_id": "cs_2026_05_04_d04",
  "kind": "open_loop_aging",
  "subject": "GATE_FINANCE_APPROVAL pending > 24h on session sess_42f1",
  "recommended_action": "page on-call finance approver or auto-deny per policy",
  "evidence_refs": ["session:sess_42f1", "audit:audit_inv_22_GATE_FINANCE_APPROVAL"],
  "status": "open"
}

Note kinds: open_loop_aging, queue_backlog, repeated_escalation, eval_target_drift, recurring_correction. Output goes to the operator surface; never to the planner directly.
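
The open_loop_aging scan could look roughly like the following. The open-loop record shape (`gate`, `session_id`, `opened_at`), the 24-hour default, and the per-call `daily_budget` cap are assumptions for illustration; the note keys follow the example above.

```python
from datetime import datetime, timedelta, timezone


def open_loop_notes(open_loops, now, max_age=timedelta(hours=24), daily_budget=3):
    """Propose notes for gates pending longer than max_age, oldest first.

    open_loops: list of dicts with hypothetical keys gate, session_id, opened_at.
    daily_budget caps how many notes are emitted so the operator is not flooded.
    """
    aged = [loop for loop in open_loops if now - loop["opened_at"] > max_age]
    aged.sort(key=lambda loop: loop["opened_at"])
    hours = int(max_age.total_seconds() // 3600)
    return [
        {
            "kind": "open_loop_aging",
            "subject": f"{loop['gate']} pending > {hours}h on session {loop['session_id']}",
            "evidence_refs": [f"session:{loop['session_id']}"],
            "status": "open",
        }
        for loop in aged[:daily_budget]
    ]


now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
loops = [
    {"gate": "GATE_FINANCE_APPROVAL", "session_id": "sess_42f1",
     "opened_at": now - timedelta(hours=30)},
    {"gate": "GATE_FINANCE_APPROVAL", "session_id": "sess_0001",
     "opened_at": now - timedelta(hours=2)},
]
for note in open_loop_notes(loops, now):
    print(note["subject"])
```

The function only returns notes for the operator surface; nothing here feeds the planner.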

ResearchQueue

When an insight indicates a knowledge gap rather than a strategy gap, an autonomous research task is queued. Output is a typed Knowledge Patch the operator can review and promote.

{
  "research_task_id": "rq_2026_05_04_e02",
  "trigger_insight_id": "ins_2026_05_04_a17",
  "question": "What are the supplier-side rules that override our default 90-day refund window?",
  "scope": { "domain": "support.refund", "max_external_calls": 0 },
  "status": "queued",
  "produces": "knowledge_patch"
}

Research tasks are budgeted (no live tool calls beyond the explicit scope); produced patches are evidence-bound knowledge-graph deltas, not free-form text.
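
Scope enforcement can be sketched as a wrapper that counts external calls against the task's declared `max_external_calls` and refuses anything beyond it. The `ScopedResearchRun` class, `ScopeExceeded` exception, and security-event shape are hypothetical names for illustration.

```python
class ScopeExceeded(Exception):
    """Raised when a research task tries to go beyond its declared scope."""


class ScopedResearchRun:
    def __init__(self, task):
        self.task = task
        self.max_external_calls = task["scope"]["max_external_calls"]
        self.calls_made = 0
        self.security_events = []  # hypothetical in-memory event sink

    def external_call(self, fn, *args):
        """Run fn(*args) only if the call budget allows it."""
        if self.calls_made >= self.max_external_calls:
            self.security_events.append({
                "kind": "research_scope_exceeded",
                "research_task_id": self.task["research_task_id"],
            })
            raise ScopeExceeded(self.task["research_task_id"])
        self.calls_made += 1
        return fn(*args)


task = {
    "research_task_id": "rq_2026_05_04_e02",
    "scope": {"domain": "support.refund", "max_external_calls": 0},
}
run = ScopedResearchRun(task)
try:
    run.external_call(len, "any external lookup")
except ScopeExceeded:
    print(len(run.security_events))
```

With `max_external_calls` set to 0, as in the example task, the first attempted call is refused and recorded.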

Autotune

Proposes prompt, retrieval, or budget changes against a declared scorecard target. Bounded by the same release-gate process as a pack change.

{
  "tuning_proposal_id": "at_2026_05_04_f01",
  "target": { "intent": "support.refund", "metric": "economics_cents_per_decision", "direction": "decrease", "guardrails": ["safety>=1.0", "policy>=1.0"] },
  "candidate_change": {
    "layer": "retrieval",
    "diff": { "max_hops": { "from": 3, "to": 2 }, "top_k": { "from": 10, "to": 6 } }
  },
  "expected_delta": { "economics_cents_per_decision": -0.18, "utility": -0.01 },
  "evaluation_run_id": "eval_2026_05_04_a01",
  "status": "proposed"
}

Autotune never auto-applies; every proposal goes through the Evaluation Engine on a golden replay before review.

Before an autotune run starts, it must declare:

| Required field | Purpose |
| --- | --- |
| target.intent | Keeps the search local to one business workflow. |
| target.metric | Names the single primary metric being improved. |
| guardrails | Floors for Policy and Safety, plus accepted Utility / Latency / Economics deltas. |
| baseline_tuple | Pins the pack, policy, tool manifest, evaluator suite, and model profile being compared. |
| tunable_surfaces | Enumerates exactly which fields the optimizer may change. |
| search_set / heldout_test_set | Prevents the proposer from iterating on the release gate data. |
| rollback_target | Names the prior released tuple before any rollout can start. |

Autotune candidates are ranked on the evaluator vector, not a single blended score. Policy and Safety are floor constraints; surviving candidates are compared by target-metric improvement, blast radius, and explainability. The output is a TuningProposal, not a production patch.
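
A simplified sketch of floor-then-rank selection: candidates that breach a Policy or Safety floor are dropped first, and survivors are ordered by improvement on the target metric. The candidate shape (`scores`, `delta`) is an assumption, and this sketch ranks on the target metric only; blast radius and explainability are left out.

```python
def rank_candidates(candidates, metric, direction, floors):
    """Filter by floor constraints, then order by target-metric improvement.

    candidates: list of dicts with hypothetical keys
      scores: evaluator name -> value measured for the candidate
      delta:  metric name -> change relative to the baseline tuple
    floors: evaluator name -> minimum acceptable value
    """
    survivors = [
        c for c in candidates
        if all(c["scores"][name] >= floor for name, floor in floors.items())
    ]
    # For a "decrease" target, a more negative delta is the larger improvement.
    sign = -1 if direction == "decrease" else 1
    return sorted(survivors, key=lambda c: sign * c["delta"][metric], reverse=True)


candidates = [
    {"id": "c1", "scores": {"safety": 1.0, "policy": 1.0},
     "delta": {"economics_cents_per_decision": -0.18}},
    {"id": "c2", "scores": {"safety": 0.98, "policy": 1.0},
     "delta": {"economics_cents_per_decision": -0.30}},
    {"id": "c3", "scores": {"safety": 1.0, "policy": 1.0},
     "delta": {"economics_cents_per_decision": -0.05}},
]
ranked = rank_candidates(
    candidates, "economics_cents_per_decision", "decrease",
    floors={"safety": 1.0, "policy": 1.0},
)
print([c["id"] for c in ranked])
```

Here `c2` has the largest cost saving but is excluded because it falls below the safety floor, which is the difference between a floor constraint and a term in a blended score.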

Proposal lifecycle

All six primitives produce typed records. The lifecycle is uniform:

proposed → reviewed → approved → released → (retired | superseded)
                  ↘ rejected
  • proposed — primitive emitted the record; no human has looked at it.
  • reviewed — operator triaged with a verdict and rationale.
  • approved — change-control approver signed the promotion.
  • released — applied to the relevant pack / policy / catalog version; pinned for replay.
  • rejected / superseded / retired — terminal states with rationale recorded.
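
The lifecycle can be read as a small state machine. The sketch below assumes the rejected branch leaves from reviewed, which is one reading of the diagram above; the `advance` helper and the `rationale` key are hypothetical.

```python
# Allowed transitions; terminal states map to an empty set.
ALLOWED = {
    "proposed": {"reviewed"},
    "reviewed": {"approved", "rejected"},   # assumption: rejection happens at review
    "approved": {"released"},
    "released": {"retired", "superseded"},
    "rejected": set(),
    "retired": set(),
    "superseded": set(),
}


def advance(record, new_status, rationale=None):
    """Return a copy of record moved to new_status, or raise if illegal."""
    current = record["status"]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    if not ALLOWED[new_status] and not rationale:
        raise ValueError("terminal states require a recorded rationale")
    updated = dict(record, status=new_status)
    if rationale:
        updated["rationale"] = rationale
    return updated


rule = {"strategy_rule_id": "str_2026_05_04_b03", "status": "proposed"}
rule = advance(rule, "reviewed", rationale="triaged: matches insight evidence")
print(rule["status"])
```

Returning a new record instead of mutating the old one fits an append-only proposal store, where each status change is its own entry.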

Interfaces

Inputs

  • Trace bundles + DecisionRecords from Observability
  • Scorecards from the Evaluation Engine
  • Operator corrections and approval-gate verdicts
  • Memory promotion records (especially correction-class)
  • Open loops and queue state from the operator surface

Outputs

  • Typed Insight, StrategyRule, FeedbackRecord, Note, ResearchTask, TuningProposal envelopes
  • Bundle of proposals for change-control review
  • Audit records bound to trace_id and decision_record_id
  • Monitor verdicts bound to released proposal id and baseline tuple

Failure modes

  • Primitive auto-applies a proposal without review — invariant violation; release gate must always require an approver.
  • InsightSynthesizer emits noise (low-precision patterns) — mitigated by minimum-occurrence thresholds and operator-tuned filters.
  • StrategyCompiler produces conflicting rules across primitives — release-gate lint detects conflicts with active rules.
  • FeedbackStore collects corrections that are themselves wrong — every correction is auditable and reversible by a later correction with higher precedence.
  • ChiefOfStaff floods the operator surface — note count budgeted per operator per day; aggregations preferred to per-note alerts.
  • ResearchQueue task exceeds its declared scope — refused at execution; emits a security event.
  • Autotune regresses a guardrail metric — release-gate blocks the proposal.
  • Runtime-affecting proposal is treated as finished after release — regression can persist silently; every runtime-affecting release needs a monitor window and rollback target.

Operational concerns

  • Sampling stratified by intent and risk class so rare-but-high-value insights surface.
  • Per-primitive cost budgets separate from production Run Budget.
  • Quarterly rebaselining: prune retired/superseded records; refresh thresholds and rubrics.
  • All proposals discoverable from the operator surface; nothing happens out-of-band.
  • Append-only proposal store with full provenance; replay must reproduce the verdict.
  • Recently released runtime-affecting proposals enter a post-release monitor window. The window compares production scorecard slices against the baseline tuple and records one of four verdicts: keep, roll back, supersede, or retire.
  • New and recently modified workflows should have a regular operations review cadence. The review must decide which repeated corrections become proposals, which approval delays need owner changes, which tests are missing, and whether rollout should advance, pause, or roll back.
  • Operating reviews produce artifacts, not meeting notes: accepted or rejected proposals, new replay cases, assigned owners, and explicit rollout decisions.
  • The loop should satisfy lifecycle-monitoring and continual-improvement expectations, but the spec remains change-control first: standards alignment never replaces traces, scorecards, or rollback evidence.
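
The post-release monitor window can be sketched as a comparison between a production scorecard slice and the baseline tuple's scorecard. This sketch only automates two outcomes, keep and roll back. The `undecided` label is a hypothetical placeholder for cases left to the operator (who may supersede or retire), and the scorecard shape is assumed.

```python
def monitor_verdict(baseline, production, target_metric, direction, floors):
    """Suggest a verdict for a released proposal from scorecard slices.

    baseline / production: dict of metric name -> value (hypothetical shape).
    floors: guardrail metric -> minimum acceptable production value.
    """
    for name, floor in floors.items():
        if production[name] < floor:
            return "roll_back"   # a guardrail breach acts as a rollback trigger
    delta = production[target_metric] - baseline[target_metric]
    improved = delta < 0 if direction == "decrease" else delta > 0
    # "undecided" is not a spec verdict; it hands the decision to the operator.
    return "keep" if improved else "undecided"


baseline = {"economics_cents_per_decision": 1.20, "safety": 1.0, "policy": 1.0}
production = {"economics_cents_per_decision": 1.02, "safety": 1.0, "policy": 1.0}
print(monitor_verdict(
    baseline, production, "economics_cents_per_decision", "decrease",
    floors={"safety": 1.0, "policy": 1.0},
))
```

Guardrails are checked before the target metric, so an improvement in cost never outweighs a safety or policy regression.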

Evaluation metrics

  • Adoption rate — fraction of proposals that ship after review (target band: 30–60% — too low means noise, too high means rubber-stamping).
  • Detection lead time — regression introduced → proposal emitted.
  • Correction-to-proposal latency — operator correction → derived StrategyRule.
  • Time to close repeated issue — first repeated signal → released and monitored fix with no active rollback trigger.
  • Post-release rollback rate — fraction of released proposals rolled back or superseded during the monitor window.
  • Retire rate — fraction of released proposals retired within 90 days (high retire rate signals over-aggressive shipping).
  • Notes acted on — fraction of Chief-of-Staff notes the operator acts on (low rate signals notification fatigue).
  • Research patch yield — fraction of research tasks that produce a knowledge patch the operator promotes.
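
As a worked example, the adoption rate and its target band can be computed from proposal statuses. Two assumptions are made here: the denominator is every proposal that has left the proposed state, and a proposal counts as shipped if it is released, retired, or superseded. The signal labels are hypothetical.

```python
# Statuses that imply the proposal was released at some point.
SHIPPED = {"released", "retired", "superseded"}


def adoption_rate(statuses):
    """Fraction of reviewed proposals that shipped, or None if none were reviewed."""
    reviewed = [s for s in statuses if s != "proposed"]
    if not reviewed:
        return None
    return sum(s in SHIPPED for s in reviewed) / len(reviewed)


def band_signal(rate, low=0.30, high=0.60):
    """Interpret the rate against the 30-60% target band."""
    if rate < low:
        return "possible_noise"            # too few proposals survive review
    if rate > high:
        return "possible_rubber_stamping"  # too many proposals pass review
    return "in_band"


statuses = ["released", "rejected", "rejected", "retired", "proposed"]
rate = adoption_rate(statuses)
print(rate, band_signal(rate))
```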

Example

A complete proposal chain spawned by a single failed run:

Run sess_42f1 / dr_2026_05_04_a17 (status=ESCALATED, policy=1.0 but utility=0.62)
  → FeedbackStore records operator correction fb_2026_05_04_c19 against the same decision record
  → InsightSynthesizer detects 23 similar escalations
  → emits Insight ins_2026_05_04_a17
  → StrategyCompiler proposes StrategyRule str_2026_05_04_b03 (planner tool preference)
  → ResearchQueue enqueues rq_2026_05_04_e02 (supplier rule clarification)
  → ChiefOfStaff emits Note cs_2026_05_04_d04 (high open-loop count on this gate)
  → Autotune proposes at_2026_05_04_f01 (reduce hop budget on read path)
  → all six records land in change control with the same trace_id

The harness as a search target

The Improvement Loop is the entry point for an idea ContextOS treats as foundational: the harness itself is a versioned artifact that can be improved by search, not just by hand. Recent work — most directly the Stanford / MIT / KRAFTON Meta-Harness paper (Lee et al., 2026) — shows that an outer-loop optimizer with full access to prior code, scores, and execution traces can discover better harnesses than hand-engineering, on tasks ranging from text classification to long-horizon agentic coding (TerminalBench-2). The result generalizes the principle behind every primitive in this doc.

ContextOS does not ship an autonomous outer-loop optimizer in the spec. It ships the substrate one needs:

| Requirement | ContextOS primitive |
| --- | --- |
| Experience store of prior runs (code + scores + traces) | OTEL trace bundles + DecisionRecords + pinned pack/policy/tool versions |
| Multi-objective scoring (Pareto, not scalar) | Five evaluators (Policy / Utility / Latency / Safety / Economics) |
| Lightweight validation before expensive evaluation | Interface tests on packs, policies, and tools before any golden-set replay |
| Disjoint search-set and test-set | Golden sets and held-out replay sets; the search-set is what the proposer sees, the test-set is what gates release |
| Filesystem-shaped layout that grep + cat can navigate | The harness/ repo layout in Harness Engineering |
| Initialization from strong baselines | Existing pack versions are the search prior; proposals start from a published baseline, not a blank slate |

Patterns the substrate enables

Pareto-frontier proposals. Autotune already accepts a target with direction and guardrails. The Pareto framing is wider: the same proposal can be evaluated against multiple operating points (e.g., low-context-cost variant vs. high-utility variant), and the operator picks a frontier point rather than a single scalar winner. This is how Meta-Harness produces a family of harnesses that trade context against accuracy.
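
For readers unfamiliar with the term, a Pareto frontier is the set of candidates that no other candidate beats on every objective at once. A minimal sketch follows; the candidate shape and the two objectives used are illustrative assumptions.

```python
def dominates(a, b, maximize):
    """True if scores a are no worse than b everywhere and better somewhere.

    maximize: objective name -> True if higher is better, False if lower is better.
    """
    no_worse = all(
        (a[k] >= b[k]) if up else (a[k] <= b[k]) for k, up in maximize.items()
    )
    better = any(
        (a[k] > b[k]) if up else (a[k] < b[k]) for k, up in maximize.items()
    )
    return no_worse and better


def pareto_frontier(candidates, maximize):
    """Keep candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(
            dominates(other["scores"], c["scores"], maximize)
            for other in candidates if other is not c
        )
    ]


candidates = [
    {"id": "high_utility", "scores": {"utility": 0.90, "context_cost": 5.0}},
    {"id": "low_cost", "scores": {"utility": 0.80, "context_cost": 2.0}},
    {"id": "dominated", "scores": {"utility": 0.70, "context_cost": 3.0}},
]
frontier = pareto_frontier(candidates, {"utility": True, "context_cost": False})
print([c["id"] for c in frontier])
```

The operator then chooses between `high_utility` and `low_cost` as operating points; `dominated` is dropped because `low_cost` is both cheaper and more useful.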

Causal reasoning over prior failures. When the InsightSynthesizer flags a regression, the proposer (human or automated) reads not the summary but the raw traces and code of the failing candidates. The Meta-Harness ablation makes this stark: scores-only proposers reach 34.6 median accuracy; scores-plus-summary reaches 34.9; full execution traces reach 50.0. Compressed feedback loses the diagnostic signal needed to identify confounds. ContextOS retains full traces by default for exactly this reason — replay against pinned snapshots is the contract that keeps prior experience interpretable.

Additive-change preference. Empirically, control-flow edits to the planner / executor / critic loop carry higher regression risk than purely-additive changes to retrieval, evidence injection, or environment bootstrap. A discovered Meta-Harness improvement on TerminalBench-2 was a single additive change — injecting an environment snapshot before the agent loop begins — that gained on 7 of 89 tasks without regressing the rest. StrategyCompiler proposals that sit in the additive layers (retrieval, memory_recall, prompt) carry less risk than ones in planner or tool_selection; the release gate weights this when ranking competing proposals.

Cross-run transfer. The proposer can read prior runs across domains, not just within one. A pattern discovered on one intent (e.g., evidence-injection helps the cold-start) may transfer to another. The Improvement Loop’s append-only proposal store, indexed by trace_id and intent, is what makes this transfer mechanically possible.

What the Improvement Loop does not do

  • It does not replace human approval. Every proposal — automated or hand-authored — passes through the same release gate.
  • It does not learn during inference. All discovery happens offline against pinned snapshots.
  • It does not optimize for a single scalar. Every proposal is evaluated on the full evaluator vector.
  • It does not search its way past safety guardrails. Safety and Policy are floor constraints, not optimization targets; any proposal that would regress them is rejected before it reaches the test-set.
  • It does not stop at release. Production monitoring is the evidence that a released proposal should be kept, rolled back, superseded, or retired.

Common misconceptions

  • Improvement is not auto-pilot. Every primitive produces proposals; humans approve.
  • The Improvement Loop is not separate from change control. Its outputs land under the same release-gate process as packs, policies, and catalogs.
  • Release is not the end of the loop. A shipped proposal enters a monitor window; post-release regressions produce rollback, supersede, or retire decisions.
  • Insights are not log lines. They are typed records with occurrence counts, evidence_refs, and a status lifecycle.
  • Autotune is not learning during inference. It is offline tuning against a golden replay, gated by scorecard deltas.
  • Compressed feedback is not equivalent to traces. Summaries collapse the diagnostic signal a proposer needs to identify confounds; the experience store keeps full traces and pinned versions for this reason.
  • The harness is the optimization target, not the model. Changing the harness around a fixed model can produce 6× swings on the same benchmark; the Improvement Loop is the seam through which those changes are proposed, evaluated, and shipped.