Autotune is a dangerous word because it sounds automatic.
Most teams hear it and picture a background job that rewrites prompts after a few bad answers. That is not autotune. That is an unreviewed deployment pipeline with a better name. If the system can change its own instructions, retrieval budgets, tool preferences, or approval thresholds without replay, scorecard guardrails, and a rollback path, it is not learning. It is mutating.
The useful version is stricter: autotune proposes bounded harness changes against a declared metric, proves them on replay, and sends the winner through the same release path as any other ContextOS artifact.
This post fills the gap between the earlier pieces:
| Prior post | What it covers | What this post adds |
|---|---|---|
| From Operator Correction to Released StrategyRule | One correction becoming one StrategyRule | Recurring, metric-driven candidate search |
| Replay Harness in Code | Reproducing a DecisionRecord from pinned inputs | Using replay as the autotune scoring engine |
| Pack Rollout in Five Stages | Shipping one candidate safely | How autotune emits release-ready candidates |
| The Eight-Property Harness Audit | Whether the harness is production-ready | Whether the harness can improve itself without losing control |
The short version: ContextOS already has the substrate. DecisionRecords give evidence. Scorecards give metrics. Replay gives cheap historical testing. The Improvement Loop gives typed proposals. Rollout gives blast-radius control. Autotune is the loop that connects those pieces without bypassing governance.
The research signal
The useful research direction is not “make the prompt longer.” It is “treat the harness as the optimization target.”
The Meta-Harness work from Stanford / MIT / KRAFTON, “Meta-Harness: End-to-End Optimization of Model Harnesses”, makes the point directly: changing the surrounding harness while holding the model fixed can produce large performance swings. Their loop stores prior code, scores, and traces, then searches over harness variants. That maps cleanly to ContextOS: the thing being tuned is not just text. It can be a context pack, retrieval rule, planner skill, evaluator rubric, approval threshold, or tool-selection strategy.
OpenAI’s evaluation stack is moving in the same practical direction. Trace grading treats agent traces as the object being evaluated, not just final strings. Evals and graders turn datasets and rubric-bound checks into release signals. Prompt optimization is useful, but it is still one surface of a broader system.
The ContextOS distinction is this: prompt optimization is a special case of harness optimization. If the best candidate is a prompt tweak, fine. If the best candidate is top_k: 10 -> 6, a stricter evidence requirement, a different tool preference, or a higher approval threshold, the loop should be able to discover and propose that too.
The autotune contract
An autotune run must answer seven questions before it is allowed to start.
| Question | Required answer |
|---|---|
| What is the target? | One primary metric, one intent, one active harness tuple |
| What must not regress? | Guardrails for Policy, Safety, Utility, Latency, and Economics |
| What can change? | A declared search space of tunable harness fields |
| What data can it see? | Search-set traces, never the held-out release test set |
| How is it scored? | Replay plus the pinned evaluator suite |
| How is the winner chosen? | Pareto rule, not a single blended score |
| How can it ship? | TuningProposal -> review -> rollout -> replayable release record |
Without those answers, the optimizer has too much freedom. Too much freedom is how a cost optimizer learns to skip evidence, a latency optimizer learns to avoid tools, and a utility optimizer learns to route around approval gates.
The target should be boring and local:
{
"target": {
"intent": "support.refund",
"metric": "economics_cents_per_decision",
"direction": "decrease",
"baseline_tuple": {
"pack": "ctxpack.support@5.2.0",
"policy": "policy.returns@4.1.0",
"tools": "tools.support@3.7.0",
"evals": "evals.refund@2.4.0"
},
"guardrails": {
"policy": { "min": 1.0 },
"safety": { "min": 1.0 },
"utility_delta": { "min": -0.02 },
"p95_latency_delta": { "max_ms": 250 },
"approval_bypass_rate": { "max": 0 }
}
}
}

Do not tune “the agent.” Tune one intent. Do not tune “quality.” Tune a named metric. Do not let a metric stand alone. Bind it to guardrails.
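To make that binding concrete, here is a minimal guardrail gate, assuming a `ScorecardDelta` shape whose fields mirror the JSON above. The names are an illustrative sketch, not an existing ContextOS API:

```ts
// Sketch of a hard guardrail gate over a candidate's scorecard delta.
// Field names mirror the guardrail JSON above; shapes are illustrative.
type Guardrails = {
  policy?: { min: number }
  safety?: { min: number }
  utility_delta?: { min: number }
  p95_latency_delta?: { max_ms: number }
  approval_bypass_rate?: { max: number }
}

type ScorecardDelta = {
  policy: number
  safety: number
  utility_delta: number
  p95_latency_delta_ms: number
  approval_bypass_rate: number
}

export function passesGuardrails(d: ScorecardDelta, g: Guardrails): boolean {
  // Every declared guardrail is a hard block; no weighted trade-offs.
  if (g.policy && d.policy < g.policy.min) return false
  if (g.safety && d.safety < g.safety.min) return false
  if (g.utility_delta && d.utility_delta < g.utility_delta.min) return false
  if (g.p95_latency_delta && d.p95_latency_delta_ms > g.p95_latency_delta.max_ms) return false
  if (g.approval_bypass_rate && d.approval_bypass_rate > g.approval_bypass_rate.max) return false
  return true
}
```

An undeclared guardrail gates nothing, which is why the contract requires all five dimensions up front.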
The experience store
Autotune needs an experience store, not a dashboard screenshot. The store is the filesystem or database surface where every prior run can be replayed, scored, grouped, and cited.
At minimum, each run needs:
| Field | Why autotune needs it |
|---|---|
| trace_id / run_id | To replay and cite the exact run |
| intent_id | To avoid mixing unrelated tasks |
| pack_version | To attribute behavior to a context artifact |
| policy_bundle_version | To know which rules were active |
| tool_manifest_version | To know which actions were reachable |
| evaluator_suite_version | To compare scores honestly |
| scorecard | To compute deltas |
| decision_record_id | To bind final outcome to evidence and controls |
| operator_corrections[] | To detect recurring human overrides |
| release_tuple_id | To reproduce the exact runtime configuration |
A repo-local version can be plain:
harness/
  experience/
    runs/
      2026-05-12/
        run_01HY9.../
          trace.json
          decision-record.json
          scorecard.json
          tool-transcripts.jsonl
          evidence-manifest.json
  goldens/
    support.refund/
      search.jsonl
      test.jsonl
  autotune/
    runs/
      at_2026_05_12_refund_cost/
        target.json
        candidates.jsonl
        replay-results.jsonl
        selected-proposal.json

That layout is intentionally not fancy. The important property is that autotune can read prior experience without asking a human to explain what happened.
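As one way to read that layout back, here is a sketch using Node's filesystem API. It assumes `decision-record.json` carries `intent_id` and `decision_record_id` fields, which is an illustrative shape rather than a fixed ContextOS schema:

```ts
import { readdirSync, readFileSync } from "node:fs"
import { join } from "node:path"

// Sketch: load prior runs from the repo-local layout above.
// Field names mirror the minimum experience-store table.
type ExperienceRun = {
  run_id: string
  intent_id: string
  scorecard: Record<string, number>
  decision_record_id: string
}

export function loadRuns(root = "harness/experience/runs"): ExperienceRun[] {
  const runs: ExperienceRun[] = []
  for (const day of readdirSync(root)) {
    for (const runDir of readdirSync(join(root, day))) {
      const base = join(root, day, runDir)
      const scorecard = JSON.parse(readFileSync(join(base, "scorecard.json"), "utf8"))
      const record = JSON.parse(readFileSync(join(base, "decision-record.json"), "utf8"))
      runs.push({
        run_id: runDir,
        intent_id: record.intent_id,
        scorecard,
        decision_record_id: record.decision_record_id,
      })
    }
  }
  return runs
}
```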
The search space
The optimizer can only change fields the harness declares tunable. This is the most important safety control in the design.
export type TunableSurface =
| {
layer: "retrieval"
field: "top_k" | "max_hops" | "source_priority"
bounds: { min?: number; max?: number; allowed?: string[] }
}
| {
layer: "context_budget"
field: "evidence_tokens" | "memory_tokens" | "tool_tokens"
bounds: { min: number; max: number; step: number }
}
| {
layer: "planner"
field: "tool_preference" | "decompose_before_tool"
bounds: { allowed: string[] }
}
| {
layer: "prompt"
field: "instruction_block"
bounds: { max_added_tokens: number; forbidden_terms: string[] }
}
| {
layer: "approval_gate"
field: "threshold"
bounds: { min: number; max: number; step: number }
}

The surface belongs in the Context Pack or harness manifest. If a field is not declared tunable, autotune does not touch it. This keeps the optimizer from discovering “creative” shortcuts such as disabling an evaluator, lowering a safety threshold, or removing expensive evidence.
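A minimal enforcement sketch, reusing the `TunableSurface` type above with a hypothetical `CandidatePatch` shape; the point is that an undeclared field is a hard error, not a warning:

```ts
// A hypothetical patch shape: one declared field moving to a new value.
type CandidatePatch = {
  layer: string
  field: string
  to: number | string
}

export function validatePatch(patch: CandidatePatch, surfaces: TunableSurface[]): void {
  const surface = surfaces.find(
    (s) => s.layer === patch.layer && s.field === patch.field,
  )
  if (!surface) {
    // Undeclared fields are a hard error, not a warning.
    throw new Error(`field not declared tunable: ${patch.layer}.${patch.field}`)
  }
  // This sketch checks numeric and enum bounds only.
  const b = surface.bounds as { min?: number; max?: number; allowed?: string[] }
  if (typeof patch.to === "number") {
    if (b.min !== undefined && patch.to < b.min) throw new Error("value below declared min")
    if (b.max !== undefined && patch.to > b.max) throw new Error("value above declared max")
  } else if (b.allowed && !b.allowed.includes(patch.to)) {
    throw new Error(`value not in allowed set: ${patch.to}`)
  }
}
```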
Start with three surfaces:
| Surface | Useful first candidates | Common failure caught by guardrails |
|---|---|---|
| Retrieval | top_k, max_hops, source priority | Lower cost by dropping required evidence |
| Budgeting | evidence / memory / tool bucket token allocation | Faster runs that hallucinate missing facts |
| Planner | tool preference, tool order, decompose-before-tool | Utility gains that increase risky calls |
Prompt tuning is fourth, not first. Prompts are easy to change and hard to reason about. Retrieval and budget candidates are often easier to replay, diff, and roll back.
The loop in code
The autotune runner is a pipeline. It should be deterministic enough that another engineer can reproduce why a proposal won.
type AutotuneRun = {
run_id: string
target: TuningTarget
search_set: ReplayCase[]
heldout_test_set: ReplayCase[]
candidates: CandidateChange[]
evaluator_suite: string
}
type CandidateChange = {
candidate_id: string
layer: "retrieval" | "context_budget" | "planner" | "prompt" | "approval_gate"
patch: Record<string, unknown>
rationale: string
}
type CandidateScore = {
candidate_id: string
search: ScorecardDelta
heldout: ScorecardDelta
guardrails_passed: boolean
changed_runs: number
regression_runs: string[]
}
export async function runAutotune(run: AutotuneRun): Promise<TuningProposal | null> {
const scores: CandidateScore[] = []
for (const candidate of run.candidates) {
const search = await replayAndScore(run.search_set, candidate, run.evaluator_suite)
if (!passesGuardrails(search, run.target.guardrails)) continue
// Only candidates that survive search-set guardrails see the held-out set.
const heldout = await replayAndScore(run.heldout_test_set, candidate, run.evaluator_suite)
scores.push({
candidate_id: candidate.candidate_id,
search,
heldout,
guardrails_passed: passesGuardrails(heldout, run.target.guardrails),
changed_runs: heldout.changed_runs,
regression_runs: heldout.regression_runs,
})
}
const winner = selectParetoWinner(scores, run.target)
if (!winner || !winner.guardrails_passed) return null
return buildProposal(run, winner)
}

A few details matter more than the code volume.
The held-out set is not used for iteration. The proposer can search against the search set. The release gate checks the held-out set. If you let autotune repeatedly see the held-out set, you have converted it into training data.
The selection function should be Pareto-aware. Do not collapse Policy, Safety, Utility, Latency, and Economics into one weighted score unless you want arguments about weights to become your release process. Hard guardrails block. Among surviving candidates, choose the one that improves the target with the smallest blast radius.
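Here is one way that selection rule can look, with trimmed local copies of the loop's types and a hypothetical `target_delta` field carrying the primary metric's held-out movement:

```ts
// Trimmed local copies of the loop's types, plus a hypothetical
// `target_delta` field for the primary metric's held-out movement.
type TuningTarget = { direction: "increase" | "decrease" }
type CandidateScore = {
  candidate_id: string
  heldout: { target_delta: number }
  guardrails_passed: boolean
  changed_runs: number
}

export function selectParetoWinner(
  scores: CandidateScore[],
  target: TuningTarget,
): CandidateScore | null {
  // Guardrails are hard blocks; only survivors compete.
  const sign = target.direction === "decrease" ? -1 : 1
  const improving = scores.filter(
    (s) => s.guardrails_passed && sign * s.heldout.target_delta > 0,
  )
  if (improving.length === 0) return null
  // No blended score across Policy / Safety / Utility / Latency / Economics:
  // rank by target improvement, break ties by blast radius (fewest changed runs).
  improving.sort((a, b) => {
    const gain = sign * (b.heldout.target_delta - a.heldout.target_delta)
    return gain !== 0 ? gain : a.changed_runs - b.changed_runs
  })
  return improving[0]
}
```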
The result is a proposal, not a patch applied to production.
A real proposal shape
Autotune output should look like every other Improvement Loop artifact: typed, reviewable, replay-backed, and bound to evidence.
{
"tuning_proposal_id": "at_2026_05_12_refund_cost_01",
"status": "proposed",
"target": {
"intent": "support.refund",
"metric": "economics_cents_per_decision",
"direction": "decrease"
},
"candidate_change": {
"layer": "retrieval",
"target_pack": "ctxpack.support@5.2.0",
"patch": {
"evidence.order_history.top_k": { "from": 10, "to": 6 },
"evidence.supplier_policy.max_hops": { "from": 3, "to": 2 }
}
},
"replay": {
"search_set": "goldens/support.refund/search@2026-05-12",
"heldout_test_set": "goldens/support.refund/test@2026-05-12",
"evaluation_run_id": "eval_2026_05_12_91f",
"baseline_tuple": "release.support.refund@2026-05-09"
},
"observed_delta": {
"economics_cents_per_decision": -0.21,
"utility": -0.006,
"policy": 0,
"safety": 0,
"p95_latency_ms": -180
},
"guardrails": {
"passed": true,
"regression_runs": []
},
"release_path": {
"requires_reviewers": ["reliability", "policy", "domain_owner"],
"rollout_stage": "0%_shadow",
"rollback_target": "ctxpack.support@5.2.0"
}
}

Notice what is not present: “autotune changed production.” The proposal still needs review, approval, and staged rollout.
What candidates should be allowed to win
Autotune should prefer candidates that are small, explainable, and reversible.
| Candidate | Usually safe to start with | Review question |
|---|---|---|
| Reduce top_k for one evidence bucket | Yes | Did evidence coverage stay complete? |
| Move source priority from wiki to policy KB | Yes | Is the policy KB owned and fresh? |
| Increase evidence bucket budget by 400 tokens | Yes | Did cost rise within the target band? |
| Add a planner tool preference | Yes | Did risky tool calls increase? |
| Rewrite the whole system prompt | No | Why is this not a smaller typed change? |
| Lower an approval threshold | No | What policy owner approved the risk? |
| Disable an evaluator | Never | This is not a candidate; it is an invariant violation |
The optimizer is not rewarded for cleverness. It is rewarded for producing candidates a reviewer can understand in five minutes.
The operating cadence
Autotune works best as a routine, not a special project.
Daily:
- Build intent-level slices from the last 24 hours of traces.
- Detect targets where one metric moved outside its band (see the sketch after this list).
- Queue at most one autotune run per intent and risk class.
- Emit proposals only when held-out replay passes guardrails.
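A minimal sketch of the detection step, assuming hypothetical per-intent daily aggregates and a declared band per metric:

```ts
// Sketch: flag intents whose daily metric left its declared band.
// Shapes are illustrative; real aggregates come from the experience store.
type MetricBand = { metric: string; min?: number; max?: number }
type DailySlice = { intent: string; metric: string; value: number }

export function detectTargets(slices: DailySlice[], bands: MetricBand[]): DailySlice[] {
  return slices.filter((s) => {
    const band = bands.find((b) => b.metric === s.metric)
    if (!band) return false // untracked metrics never trigger a run
    return (
      (band.min !== undefined && s.value < band.min) ||
      (band.max !== undefined && s.value > band.max)
    )
  })
}
```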
Weekly:
- Review proposal adoption rate.
- Add corrected production failures to the golden search set.
- Refresh the held-out set without letting the proposer inspect it.
- Retire tunable surfaces that keep producing rejected proposals.
- Promote approved candidates through the five-stage rollout path.
Quarterly:
- Rebaseline evaluator rubrics.
- Re-split search and test sets.
- Audit whether guardrails still match business risk.
- Remove stale candidate generators.
This cadence keeps autotune from becoming a novelty button. It becomes part of the harness maintenance cycle.
Where autotune lives in ContextOS
Autotune is not a separate service with special power. It is one producer inside the Improvement Loop.
traces + scorecards + corrections
->
InsightSynthesizer / StrategyCompiler / Autotune
->
typed proposals
->
replay gate
->
human review
->
pack / policy / tool / evaluator release
->
staged rollout
->
new traces + scorecards

That placement matters. The loop can improve retrieval, prompts, planner choices, and budgets, but every candidate still passes through Trust-plane controls. The optimizer does not get an escape hatch from the same harness it is trying to improve.
Failure modes
The common failures are predictable.
| Failure | What it looks like | Control |
|---|---|---|
| Metric monoculture | Cost improves because the agent stopped gathering evidence | Hard Policy / Safety / Utility guardrails |
| Test-set leakage | Every proposal looks great until production | Search/test split and immutable held-out runs |
| Prompt sprawl | Autotune keeps appending instructions | Token budget and prompt-change size limits |
| Eval overfitting | Candidate learns the judge rubric, not the task | Mixed evaluator suite: rule, replay, human, business |
| Unsafe auto-promotion | Winning candidate ships without review | Proposal-only invariant |
| Hidden blast radius | Candidate affects more intents than declared | Intent-scoped manifests and rollout routing |
| Rollback theater | Prior version cannot be replayed | Release tuple pinned before promotion |
The key is to treat each failure as a harness property, not a model behavior. The model did not overfit the eval by itself. The harness gave the optimizer an overfit path.
The first implementation
The smallest useful implementation is one metric, one intent, and one tunable surface.
Pick an intent with volume and low external risk, such as a read-mostly support triage flow. Pick a metric that matters but is not safety-critical, such as economics_cents_per_decision or p95_latency_ms. Declare exactly one tunable surface, such as retrieval top_k for one evidence bucket. Split the existing golden traces into search and held-out test sets. Run three candidates. Emit one proposal. Do not ship it automatically.
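The split can be deterministic so the held-out side never shifts between runs. A sketch hashing trace IDs, assuming each golden carries a `trace_id`:

```ts
import { createHash } from "node:crypto"

// Sketch: deterministic search/held-out split by trace ID hash.
// The assignment is stable across runs, so the proposer cannot
// iterate against the held-out set by re-splitting.
export function splitGoldens<T extends { trace_id: string }>(
  traces: T[],
  heldoutFraction = 0.3,
): { search: T[]; heldout: T[] } {
  const search: T[] = []
  const heldout: T[] = []
  for (const t of traces) {
    const digest = createHash("sha256").update(t.trace_id).digest()
    const u = digest[0] / 256 // first byte as a uniform value in [0, 1)
    if (u < heldoutFraction) heldout.push(t)
    else search.push(t)
  }
  return { search, heldout }
}
```

Hashing by ID keeps each trace's assignment stable as new goldens arrive, so refreshing the sets never silently moves held-out cases into the search side.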
The first version can fit in a week:
| Day | Deliverable |
|---|---|
| 1 | Add tunable_surfaces[] to one Context Pack manifest |
| 2 | Export 200 replayable traces into search/test sets |
| 3 | Implement candidate generation for one field |
| 4 | Run replay and scorecard deltas |
| 5 | Emit TuningProposal and send it through review |
If that loop works, add surfaces slowly. The goal is not to make the optimizer powerful. The goal is to make improvement routine.
Definition of done
An autotune loop is production-ready when the answer to each question is yes:
| Check | Pass line |
|---|---|
| Can every candidate be traced to a metric target? | target.metric and target.intent are required |
| Can every candidate be replayed? | It references pinned traces, pack, policy, tools, and evaluator suite |
| Can every candidate be reviewed? | The patch is typed, small, and has a reviewer owner |
| Can every candidate be rejected? | Rejection reason is recorded and future search avoids the same dead end |
| Can every candidate be rolled back? | Release tuple has a prior pinned target |
| Can every shipped candidate be measured? | Production scorecards compare baseline and candidate by intent |
| Can the loop itself be audited? | Autotune runs write target, candidates, scores, winner, and rationale |
That is the practical meaning of “improvement loop baked into the system.” The system is not smarter because it can change itself. It is trustworthy because every attempted change is typed, scored, replayed, reviewed, released, measured, and reversible.
Autotune should make the harness better without making the organization less certain about what changed. In ContextOS, that is the line: search is welcome, silent mutation is not.