Most teams say they have an AI feedback loop when they really have a list of complaints.
A support lead corrects an answer. An engineer edits a prompt. A dashboard moves. Two weeks later the same class of failure reappears and nobody can prove whether the change fixed the original problem, moved it somewhere else, or only helped the demo case.
That is not an improvement loop. It is change without experimental control.
The useful version is stricter: turn the failure into replayable evidence before you mutate the harness. That is the main lesson from SecondBrain’s Environment and Improvement Loop guide, and it sharpens how ContextOS should explain the Improvement Loop to builders.
The missing object: the episode
The guide frames SecondBrain’s loop across four planes:
| Plane | What it contributes |
|---|---|
| Environments | Replayable task episodes with reset state, structured actions, observations, rewards, persistence, export, and replay. |
| Quality | The control plane that turns weak runs into replay cases, benchmark pressure, and health summaries. |
| Autotune | Bounded lane-specific mutation in worktrees, scored against benchmark packs and promotion gates. |
| Antahkarana | The cognitive loop that records goals, regrets, strategy priors, cycle outcomes, and closure. |
For ContextOS readers, the important translation is this:

    DecisionRecord      = what happened in production
    environment episode = what can be rerun under controlled measurement
    replay case         = the executable expectation derived from that evidence
    benchmark candidate = the unresolved pressure to fix
    TuningProposal      = a bounded harness change that claims it can fix it

The episode is the missing object in many agent programs. Without it, the team cannot separate “we changed the prompt” from “we fixed the source pressure.”
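One way to picture that translation is as a chain of linked records that can be walked back from a proposal to the production evidence behind it. The sketch below is illustrative only: `DecisionRecord` and `TuningProposal` are names from the text, but every field, the other class names, and the `lineage` helper are assumptions made for this example, not the ContextOS or SecondBrain schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """What happened in production (hypothetical fields)."""
    record_id: str
    intent: str
    outcome: str


@dataclass(frozen=True)
class EnvironmentEpisode:
    """A controlled rerun derived from one production record."""
    episode_id: str
    source_record_id: str
    reset_inputs: dict


@dataclass(frozen=True)
class ReplayCase:
    """The executable expectation derived from an episode."""
    case_id: str
    episode_id: str
    expected: str


@dataclass
class BenchmarkCandidate:
    """Unresolved pressure; stays open until its replay cases are rerun."""
    pressure_id: str
    replay_case_ids: list
    status: str = "open"


@dataclass(frozen=True)
class TuningProposal:
    """A bounded harness change that claims to fix one pressure."""
    proposal_id: str
    pressure_id: str
    changed_surfaces: tuple


def lineage(proposal, pressure, cases, episodes):
    """Walk back from a proposal to the production records behind it."""
    if proposal.pressure_id != pressure.pressure_id:
        raise ValueError("proposal does not target this pressure")
    case_by_id = {c.case_id: c for c in cases}
    episode_by_id = {e.episode_id: e for e in episodes}
    return [
        episode_by_id[case_by_id[cid].episode_id].source_record_id
        for cid in pressure.replay_case_ids
    ]
```

If `lineage` cannot resolve a replay case to an episode, it raises a `KeyError`, which is the point: a proposal with no path back to an episode has nothing to be measured against.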
Why environments matter
OpenEnv’s current docs describe an environment as an isolated execution context where an agent takes structured actions and receives observations. Its core loop is familiar: reset, observe the state, choose an action, step, and receive an observation plus reward and termination metadata.
That shape matters outside RL training too.
In a production agent harness, an environment does not have to mean “train a model online.” It can be a local measurement fixture:
| Environment field | Why the improvement loop needs it |
|---|---|
| reset inputs | Recreate the same starting condition before and after a candidate change. |
| action schema | Keep the agent’s possible moves structured and comparable. |
| observation schema | Preserve what the agent could see at each step. |
| reward components | Explain why a trajectory scored well or poorly. |
| terminal / truncated flags | Distinguish solved, failed, timed-out, and cut-short episodes. |
| stable state signature | Detect environment drift between baseline and treatment. |
| persisted episode record | Let reviewers inspect the exact trajectory later. |
A measurement fixture is not the same as the production tool plane. ContextOS still routes real external effects through the Tool Gateway, with policy, approvals, idempotency, and audit. Environments sit in the measurement plane. Their job is to preserve enough state that the failure can become executable evidence.
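To make the table concrete, here is a toy fixture that carries each field. It is a sketch under stated assumptions, not OpenEnv's API or a ContextOS component: the `RefundFixture` class, its refund-limit rule, the action types, and the `state_signature` helper are all invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field


def state_signature(state: dict) -> str:
    """Stable hash of a state dict, used to detect drift between runs."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class RefundFixture:
    """Hypothetical measurement fixture: decide on one refund request."""
    reset_inputs: dict          # expects "amount" and "limit" keys
    max_steps: int = 3
    state: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)

    def reset(self) -> dict:
        # Recreate the same starting condition from the reset inputs.
        self.state = dict(self.reset_inputs, decided=False)
        self.steps = []
        return self._observe()

    def _observe(self) -> dict:
        # Observation schema: only what the agent is allowed to see.
        return {"amount": self.state["amount"], "decided": self.state["decided"]}

    def step(self, action: dict) -> dict:
        # Action schema: {"type": "approve" | "escalate" | "wait"}
        over_limit = self.state["amount"] > self.state["limit"]
        correct = "escalate" if over_limit else "approve"
        reward = {"correct_decision": 0.0}
        if action["type"] in ("approve", "escalate"):
            self.state["decided"] = True
            reward["correct_decision"] = 1.0 if action["type"] == correct else 0.0
        terminal = self.state["decided"]
        truncated = not terminal and len(self.steps) + 1 >= self.max_steps
        record = {
            "action": action,
            "observation": self._observe(),
            "reward": reward,
            "terminal": terminal,
            "truncated": truncated,
        }
        self.steps.append(record)
        return record

    def export(self) -> dict:
        # Persisted episode record that a reviewer can inspect later.
        return {
            "reset_inputs": self.reset_inputs,
            "reset_signature": state_signature(dict(self.reset_inputs, decided=False)),
            "steps": self.steps,
        }
```

Comparing `reset_signature` between the baseline run and the treatment run is a cheap check that both were measured from the same starting state.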
The loop should be falsifiable
The SecondBrain runbook reads less like “AI learns from feedback” and more like a controlled experiment:
| Invariant | Question it answers |
|---|---|
| Hypothesis | What pressure are we trying to fix? |
| Source evidence | Which candidate, replay case, episode, trace, or correction created the pressure? |
| Control measurement | How did the current harness score before mutation? |
| Intervention | What bounded surface changed, in which worktree or candidate branch? |
| Treatment measurement | How did the candidate score on the same evidence? |
| Confirmation | Did paired-case evidence, replay obligations, and promotion gates agree? |
| Closure | Was the source marked fixed, still_failing, invalid, or superseded? |
That last row is easy to overlook. A kept candidate is not a closed loop. A candidate that improves one metric on too few paired cases is useful evidence, but it is still underpowered. A candidate that passes an unrelated smoke pack does not close a hard failure. The source pressure is only closed when the evidence that created it has been rerun and the promotion gate accepts the result.
In ContextOS terms:

    source pressure
      -> replayable evidence
      -> target-bound baseline measurement
      -> bounded candidate mutation
      -> treatment measurement on the same evidence
      -> promotion gate
      -> source closure
      -> production monitor window

The key phrase is “on the same evidence.”
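A minimal sketch of what “on the same evidence” can mean in code: the closure verdict is computed only from the replay cases that created the pressure, and anything less returns no verdict at all. The closure state names come from the invariant table; the `PairedResult` type, the `min_pairs` threshold, and the specific rule that maps evidence to each state are assumptions for illustration.

```python
from dataclasses import dataclass

# State names from the Closure row; "superseded" is assumed to be set by a
# reviewer or a later candidate, so this function never returns it.
CLOSURE_STATES = ("fixed", "still_failing", "invalid", "superseded")


@dataclass(frozen=True)
class PairedResult:
    case_id: str
    baseline_passed: bool
    treatment_passed: bool


def closure_verdict(source_case_ids, results, gate_passed, min_pairs=1):
    """Return a closure state, or None if the evidence cannot close the source."""
    by_id = {r.case_id: r for r in results}
    missing = [cid for cid in source_case_ids if cid not in by_id]
    if missing or len(source_case_ids) < min_pairs:
        # Underpowered, or not measured on the same evidence: stay open.
        return None
    paired = [by_id[cid] for cid in source_case_ids]
    if all(r.baseline_passed for r in paired):
        # Assumption: a baseline that never reproduces the failure
        # means the source evidence itself is invalid.
        return "invalid"
    if all(r.treatment_passed for r in paired) and gate_passed:
        return "fixed"
    return "still_failing"
```

Results for unrelated cases are ignored rather than counted, so passing a smoke pack cannot move the verdict.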
What the research is converging on
Recent harness-optimization work is converging on the same shape.
Meta-Harness treats the harness itself as the object of search. Its proposer reads prior source code, scores, and execution traces through a filesystem, then searches over harness variants. The useful lesson for builders is not the benchmark number. It is the data access pattern: compressed feedback is weaker than raw prior experience.
HARBOR frames automated harness optimization as constrained search over a bounded, mixed-variable flag space, with a reproducible task suite and safety checks. That is close to ContextOS’s TuningProposal: declare what can change, declare what must not regress, score candidates, and refuse unsafe promotion.
Agentic Harness Engineering makes observability the center of harness evolution. Editable components need file-level representation, prior trajectories need drill-down evidence, and every edit should carry a prediction that later gets checked against outcomes. Its reported gains localize more to tools, middleware, and memory than to the system prompt, which matches the ContextOS view that prompt tuning is only one surface of harness optimization.
OpenEnv adds the environment side: make the task state, actions, observations, rewards, and termination semantics explicit enough that episodes can be driven and compared by different evaluators and trainers.
Put those together and the pattern is clear:
| Requirement | ContextOS surface |
|---|---|
| Prior experience is inspectable | OTEL traces, DecisionRecords, scorecards, environment episodes, replay cases. |
| Search space is bounded | Context Pack tunable surfaces, lane specs, policy floors, rollout gates. |
| Evidence is paired | Baseline and candidate run on the same source cases. |
| Safety is constrained | Policy and Safety are floor constraints, not optimization targets. |
| Closure is explicit | Source pressure remains open until replay and promotion evidence close it. |
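The “floor constraints, not optimization targets” row can be read as an ordering rule inside the promotion gate: floors are checked first and no gain in the primary metric can buy them back. The following is a hedged sketch of that reading; the `promotion_gate` function, its return shape, and the metric names used with it are hypothetical.

```python
def promotion_gate(baseline, treatment, primary_metric, floors):
    """Accept only if no floor is violated and the primary metric improves.

    `floors` maps a metric name to its minimum acceptable value. Floors are
    checked first and are never traded against gains in the primary metric.
    A metric missing from `treatment` counts as a violation.
    """
    violations = sorted(
        name
        for name, minimum in floors.items()
        if treatment.get(name, float("-inf")) < minimum
    )
    if violations:
        return {"promote": False, "reason": "floor_violation", "violations": violations}
    delta = treatment[primary_metric] - baseline[primary_metric]
    if delta <= 0:
        return {"promote": False, "reason": "no_improvement", "violations": []}
    return {"promote": True, "reason": "improved", "violations": []}
```

Under this rule a candidate that lifts the primary metric sharply while dipping below a policy floor is rejected for the floor violation, and the size of the gain is never consulted.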
The ContextOS version
A ContextOS-aligned improvement loop should make environment evidence a first-class harness artifact without confusing it with production execution.

    harness/
      fixtures/
        support.refund/
          task-envs/
            supplier_exception_boundary.json
      evals/
        support.refund/
          replay-cases.jsonl
          release-test.jsonl
      experience/
        runs/
          trace.json
          decision-record.json
          scorecard.json
        episodes/
          env_episode_01.json
      candidates/
        hc_2026_05_14_refund_boundary/
          candidate.json
          baseline-scorecard.json
          treatment-scorecard.json
          promotion-gate.json
          closure.json

The exact directories matter less than the invariant: the proposer and reviewer can navigate from a production failure to the replayable episode, from the episode to the candidate, from the candidate to the scorecard, and from the scorecard to the closure verdict.
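The navigation invariant is easy to check mechanically. As a small sketch, the helper below reports which evidence files a candidate directory still lacks. The file names are taken from the layout above; treating all five as required, and the `missing_artifacts` function itself, are assumptions for this example.

```python
from pathlib import Path

# File names from the candidate directory in the layout above.
# Requiring all five before review is an assumption of this sketch.
REQUIRED_ARTIFACTS = (
    "candidate.json",
    "baseline-scorecard.json",
    "treatment-scorecard.json",
    "promotion-gate.json",
    "closure.json",
)


def missing_artifacts(candidate_dir: Path) -> list:
    """List the evidence files a candidate directory still lacks."""
    return [
        name
        for name in REQUIRED_ARTIFACTS
        if not (candidate_dir / name).is_file()
    ]
```

A reviewer tool could refuse to display a candidate as closed while this list is non-empty.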
That is also why “autotune” must stay boring. An autotune lane should name:
| Field | Purpose |
|---|---|
| target intent | Keep the loop local to one workflow. |
| primary metric | Prevent vague optimization. |
| guardrails | Stop cost or latency wins from eroding policy, safety, or approval boundaries. |
| baseline tuple | Pin pack, policy, tool manifest, model profile, and evaluator suite. |
| tunable surfaces | Declare exactly which fields may change. |
| search set and held-out set | Prevent the proposer from iterating on the release gate. |
| rollback target | Keep the previous tuple available if production monitoring disagrees. |
If the lane cannot name those fields, it is not ready to mutate anything.
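That readiness rule can be expressed as a plain check over a lane spec. This is a minimal sketch: the field names mirror the table, but the `AutotuneLane` dataclass, the tuple-based types, and the extra check that the search and held-out sets do not overlap are assumptions, not a published ContextOS schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AutotuneLane:
    """Hypothetical lane spec; empty values mean "not yet named"."""
    target_intent: str = ""
    primary_metric: str = ""
    guardrails: tuple = ()
    baseline_tuple: tuple = ()
    tunable_surfaces: tuple = ()
    search_set: tuple = ()
    held_out_set: tuple = ()
    rollback_target: str = ""


def unnamed_fields(lane: AutotuneLane) -> list:
    """Names of lane fields that are still empty."""
    return [f.name for f in fields(lane) if not getattr(lane, f.name)]


def ready_to_mutate(lane: AutotuneLane) -> bool:
    # Every field must be named, and the held-out set must not leak
    # into the search set the proposer iterates on.
    overlap = set(lane.search_set) & set(lane.held_out_set)
    return not unnamed_fields(lane) and not overlap
```

Calling `unnamed_fields` before any mutation gives the lane owner a concrete to-do list instead of a vague "not ready".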
What this changes for teams
The improvement loop becomes more mechanical and less mystical.
When a user correction arrives, do not start with “rewrite the prompt.” Start with:
- Can this correction become a replay case?
- Is there a bounded environment episode that reproduces the failure?
- Which source pressure owns it?
- What is the required benchmark pack?
- What candidate surfaces are allowed to change?
- What paired evidence would close the pressure?
- What monitor window decides whether production keeps it?
Sometimes the answer is “we do not have enough executable evidence yet.” That is a valid outcome. The loop can emit an Insight or ResearchTask. It should not claim a closed improvement.
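The questions above can be turned into a small triage step that refuses to proceed without executable evidence. The sketch below is one possible shape: the dictionary keys, the `triage_correction` function, and the choice to emit a `ResearchTask` rather than an `Insight` when evidence is missing are all assumptions for illustration.

```python
def triage_correction(correction: dict) -> dict:
    """Decide what a user correction may become (keys are hypothetical)."""
    required = (
        "replay_case",        # can the correction become a replay case?
        "episode_id",         # is there an episode that reproduces it?
        "pressure_id",        # which source pressure owns it?
        "benchmark_pack",     # which pack must the candidate be scored on?
        "allowed_surfaces",   # what is the candidate allowed to change?
    )
    missing = [key for key in required if not correction.get(key)]
    if missing:
        # Not enough executable evidence yet: record the gap and
        # do not claim a closed improvement.
        return {"next": "ResearchTask", "missing": missing}
    return {"next": "baseline_measurement", "missing": []}
```

The returned `missing` list doubles as the body of the research task: it names exactly which evidence has to exist before a candidate may be proposed.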
The bar
A harness improvement loop is credible when it can say:

    We saw this failure.
    We captured the source evidence.
    We replayed the baseline.
    We changed only these surfaces.
    We replayed the candidate on the same evidence.
    The promotion gate passed with guardrails intact.
    We closed the original pressure.
    We monitored production and kept, rolled back, superseded, or retired the change.

That is the difference between a feedback loop and an improvement loop.
The former collects reactions. The latter produces evidence-bound, replayable, reversible harness changes.