Reviewers & improvement · May 12, 2026 · by Piyush · 11 min read

Autotune the Harness: Baking the Improvement Loop into ContextOS

Tags: ContextOS · Harness Engineering · Autotune · Improvement Loop · Evaluation

Autotune is a dangerous word because it sounds automatic.

Most teams hear it and picture a background job that rewrites prompts after a few bad answers. That is not autotune. That is an unreviewed deployment pipeline with a better name. If the system can change its own instructions, retrieval budgets, tool preferences, or approval thresholds without replay, scorecard guardrails, and a rollback path, it is not learning. It is mutating.

The useful version is stricter: autotune proposes bounded harness changes against a declared metric, proves them on replay, and sends the winner through the same release path as any other ContextOS artifact.

This post fills the gap between the earlier pieces:

| Prior post | What it covers | What this post adds |
| --- | --- | --- |
| From Operator Correction to Released StrategyRule | One correction becoming one StrategyRule | Recurring, metric-driven candidate search |
| Replay Harness in Code | Reproducing a DecisionRecord from pinned inputs | Using replay as the autotune scoring engine |
| Pack Rollout in Five Stages | Shipping one candidate safely | How autotune emits release-ready candidates |
| The Eight-Property Harness Audit | Whether the harness is production-ready | Whether the harness can improve itself without losing control |

The short version: ContextOS already has the substrate. DecisionRecords give evidence. Scorecards give metrics. Replay gives cheap historical testing. The Improvement Loop gives typed proposals. Rollout gives blast-radius control. Autotune is the loop that connects those pieces without bypassing governance.

The research signal

The useful research direction is not “make the prompt longer.” It is “treat the harness as the optimization target.”

The Meta-Harness work from Stanford / MIT / KRAFTON, “Meta-Harness: End-to-End Optimization of Model Harnesses”, makes the point directly: changing the surrounding harness while holding the model fixed can produce large performance swings. Their loop stores prior code, scores, and traces, then searches over harness variants. That maps cleanly to ContextOS: the thing being tuned is not just text. It can be a context pack, retrieval rule, planner skill, evaluator rubric, approval threshold, or tool-selection strategy.

OpenAI’s evaluation stack is moving in the same practical direction. Trace grading treats agent traces as the object being evaluated, not just final strings. Evals and graders turn datasets and rubric-bound checks into release signals. Prompt optimization is useful, but it is still one surface of a broader system.

The ContextOS distinction is this: prompt optimization is a special case of harness optimization. If the best candidate is a prompt tweak, fine. If the best candidate is top_k: 10 -> 6, a stricter evidence requirement, a different tool preference, or a higher approval threshold, the loop should be able to discover and propose that too.

Autotune invariant
The optimizer may propose. The release gate decides.
Autotune is allowed to search over bounded harness changes. It is not allowed to promote, deploy, or silently mutate the active runtime.
Input (traces + scorecards): Runs grouped by intent, pack version, evaluator suite, risk tier, and outcome.

Search (bounded candidates): Only declared surfaces are tunable: retrieval, budgets, prompts, planner rules, rubrics, gates.

Output (TuningProposal): A replay-backed proposal that enters human review and staged rollout.

The autotune contract

An autotune run must answer seven questions before it is allowed to start.

| Question | Required answer |
| --- | --- |
| What is the target? | One primary metric, one intent, one active harness tuple |
| What must not regress? | Guardrails for Policy, Safety, Utility, Latency, and Economics |
| What can change? | A declared search space of tunable harness fields |
| What data can it see? | Search-set traces, never the held-out release test set |
| How is it scored? | Replay plus the pinned evaluator suite |
| How is the winner chosen? | Pareto rule, not a single blended score |
| How can it ship? | TuningProposal -> review -> rollout -> replayable release record |

Without those answers, the optimizer has too much freedom. Too much freedom is how a cost optimizer learns to skip evidence, a latency optimizer learns to avoid tools, and a utility optimizer learns to route around approval gates.

The target should be boring and local:

{
  "target": {
    "intent": "support.refund",
    "metric": "economics_cents_per_decision",
    "direction": "decrease",
    "baseline_tuple": {
      "pack": "ctxpack.support@5.2.0",
      "policy": "policy.returns@4.1.0",
      "tools": "tools.support@3.7.0",
      "evals": "evals.refund@2.4.0"
    },
    "guardrails": {
      "policy": { "min": 1.0 },
      "safety": { "min": 1.0 },
      "utility_delta": { "min": -0.02 },
      "p95_latency_delta": { "max_ms": 250 },
      "approval_bypass_rate": { "max": 0 }
    }
  }
}

Do not tune “the agent.” Tune one intent. Do not tune “quality.” Tune a named metric. Do not let a metric stand alone. Bind it to guardrails.
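The guardrail block above can be checked mechanically. Here is a minimal sketch, assuming hypothetical `ScorecardDelta` field names that mirror the target JSON; `passesGuardrails` is illustrative, not the ContextOS implementation:

```typescript
// Hypothetical shapes mirroring the target JSON above; not ContextOS APIs.
type GuardrailBands = {
  policy: { min: number }
  safety: { min: number }
  utility_delta: { min: number }
  p95_latency_delta: { max_ms: number }
  approval_bypass_rate: { max: number }
}

type ScorecardDelta = {
  policy: number
  safety: number
  utility_delta: number
  p95_latency_delta_ms: number
  approval_bypass_rate: number
}

// A candidate fails on the first violated band; there is no blended score
// that could trade safety away for cost.
export function passesGuardrails(d: ScorecardDelta, g: GuardrailBands): boolean {
  return (
    d.policy >= g.policy.min &&
    d.safety >= g.safety.min &&
    d.utility_delta >= g.utility_delta.min &&
    d.p95_latency_delta_ms <= g.p95_latency_delta.max_ms &&
    d.approval_bypass_rate <= g.approval_bypass_rate.max
  )
}
```

The design choice is hard bands, not weights: any single violation disqualifies the candidate regardless of how good the target metric looks.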

The experience store

Autotune needs an experience store, not a dashboard screenshot. The store is the filesystem or database surface where every prior run can be replayed, scored, grouped, and cited.

At minimum, each run needs:

| Field | Why autotune needs it |
| --- | --- |
| trace_id / run_id | To replay and cite the exact run |
| intent_id | To avoid mixing unrelated tasks |
| pack_version | To attribute behavior to a context artifact |
| policy_bundle_version | To know which rules were active |
| tool_manifest_version | To know which actions were reachable |
| evaluator_suite_version | To compare scores honestly |
| scorecard | To compute deltas |
| decision_record_id | To bind final outcome to evidence and controls |
| operator_corrections[] | To detect recurring human overrides |
| release_tuple_id | To reproduce the exact runtime configuration |

A repo-local version can be plain:

harness/
  experience/
    runs/
      2026-05-12/
        run_01HY9.../
          trace.json
          decision-record.json
          scorecard.json
          tool-transcripts.jsonl
          evidence-manifest.json
    goldens/
      support.refund/
        search.jsonl
        test.jsonl
    autotune/
      runs/
        at_2026_05_12_refund_cost/
          target.json
          candidates.jsonl
          replay-results.jsonl
          selected-proposal.json

That layout is intentionally not fancy. The important property is that autotune can read prior experience without asking a human to explain what happened.
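Given that layout, grouping prior runs into comparable slices is a small amount of code. A sketch under assumed field names from the table above; `sliceKey` and `groupRuns` are hypothetical helpers:

```typescript
// Hypothetical run record using field names from the table above.
type RunRecord = {
  run_id: string
  intent_id: string
  pack_version: string
  evaluator_suite_version: string
}

// Runs are only comparable within the same intent, pack, and evaluator
// suite; the key encodes that tuple.
export function sliceKey(r: RunRecord): string {
  return [r.intent_id, r.pack_version, r.evaluator_suite_version].join("|")
}

export function groupRuns(runs: RunRecord[]): Map<string, RunRecord[]> {
  const slices = new Map<string, RunRecord[]>()
  for (const r of runs) {
    const key = sliceKey(r)
    const bucket = slices.get(key) ?? []
    bucket.push(r)
    slices.set(key, bucket)
  }
  return slices
}
```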

The search space

The optimizer can only change fields the harness declares tunable. This is the most important safety control in the design.

export type TunableSurface =
  | {
      layer: "retrieval"
      field: "top_k" | "max_hops" | "source_priority"
      bounds: { min?: number; max?: number; allowed?: string[] }
    }
  | {
      layer: "context_budget"
      field: "evidence_tokens" | "memory_tokens" | "tool_tokens"
      bounds: { min: number; max: number; step: number }
    }
  | {
      layer: "planner"
      field: "tool_preference" | "decompose_before_tool"
      bounds: { allowed: string[] }
    }
  | {
      layer: "prompt"
      field: "instruction_block"
      bounds: { max_added_tokens: number; forbidden_terms: string[] }
    }
  | {
      layer: "approval_gate"
      field: "threshold"
      bounds: { min: number; max: number; step: number }
    }

The surface belongs in the Context Pack or harness manifest. If a field is not declared tunable, autotune does not touch it. This keeps the optimizer from discovering “creative” shortcuts such as disabling an evaluator, lowering a safety threshold, or removing expensive evidence.

Start with three surfaces:

| Surface | Useful first candidates | Common failure caught by guardrails |
| --- | --- | --- |
| Retrieval | top_k, max_hops, source priority | Lower cost by dropping required evidence |
| Budgeting | evidence / memory / tool bucket token allocation | Faster runs that hallucinate missing facts |
| Planner | tool preference, tool order, decompose-before-tool | Utility gains that increase risky calls |

Prompt tuning is fourth, not first. Prompts are easy to change and hard to reason about. Retrieval and budget candidates are often easier to replay, diff, and rollback.
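A first candidate generator for a retrieval surface can be a plain bounded sweep. The sketch below uses hypothetical names (`topKCandidates` is not a ContextOS API); note that the baseline value is excluded, so the sweep only ever proposes changes:

```typescript
// Hypothetical bounded sweep over one declared numeric surface.
type NumericBounds = { min: number; max: number; step: number }

type Candidate = {
  candidate_id: string
  layer: "retrieval"
  patch: { field: string; from: number; to: number }
  rationale: string
}

export function topKCandidates(
  field: string,
  current: number,
  bounds: NumericBounds
): Candidate[] {
  const out: Candidate[] = []
  for (let v = bounds.min; v <= bounds.max; v += bounds.step) {
    if (v === current) continue // the baseline is not a candidate
    out.push({
      candidate_id: `retrieval.${field}.${v}`,
      layer: "retrieval",
      patch: { field, from: current, to: v },
      rationale: `Bounded sweep of ${field} within the declared surface`,
    })
  }
  return out
}
```

Because the generator can only read declared bounds, a "creative" value like `top_k: 0` is unrepresentable rather than merely discouraged.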

The loop in code

The autotune runner is a pipeline. It should be deterministic enough that another engineer can reproduce why a proposal won.

type AutotuneRun = {
  run_id: string
  target: TuningTarget
  search_set: ReplayCase[]
  heldout_test_set: ReplayCase[]
  candidates: CandidateChange[]
  evaluator_suite: string
}
 
type CandidateChange = {
  candidate_id: string
  layer: "retrieval" | "context_budget" | "planner" | "prompt" | "approval_gate"
  patch: Record<string, unknown>
  rationale: string
}
 
type CandidateScore = {
  candidate_id: string
  search: ScorecardDelta
  heldout: ScorecardDelta
  guardrails_passed: boolean
  changed_runs: number
  regression_runs: string[]
}
 
export async function runAutotune(run: AutotuneRun): Promise<TuningProposal | null> {
  const scores: CandidateScore[] = []
 
  for (const candidate of run.candidates) {
    const search = await replayAndScore(run.search_set, candidate, run.evaluator_suite)
    if (!passesGuardrails(search, run.target.guardrails)) continue
 
    // Only candidates that survive search-set guardrails see the held-out set.
    const heldout = await replayAndScore(run.heldout_test_set, candidate, run.evaluator_suite)
    scores.push({
      candidate_id: candidate.candidate_id,
      search,
      heldout,
      guardrails_passed: passesGuardrails(heldout, run.target.guardrails),
      changed_runs: heldout.changed_runs,
      regression_runs: heldout.regression_runs,
    })
  }
 
  const winner = selectParetoWinner(scores, run.target)
  if (!winner || !winner.guardrails_passed) return null
 
  return buildProposal(run, winner)
}

A few details matter more than the code volume.

The held-out set is not used for iteration. The proposer can search against the search set. The release gate checks the held-out set. If you let autotune repeatedly see the held-out set, you have converted it into training data.
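One way to keep the split honest is to make membership deterministic, so refreshing the sets never quietly moves a held-out run into the search set. A sketch using a simple string hash; `splitRuns` is a hypothetical helper, not part of ContextOS:

```typescript
// Hypothetical deterministic search/test split keyed on run_id, so the
// same run always lands in the same set across refreshes.
export function splitRuns(runIds: string[], testFraction = 0.2) {
  const hash = (s: string) => {
    let h = 0
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0
    return h
  }
  const search: string[] = []
  const test: string[] = []
  for (const id of runIds) {
    ;(hash(id) % 100 < testFraction * 100 ? test : search).push(id)
  }
  return { search, test }
}
```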

The selection function should be Pareto-aware. Do not collapse Policy, Safety, Utility, Latency, and Economics into one weighted score unless you want arguments about weights to become your release process. Hard guardrails block. Among surviving candidates, choose the one that improves the target with the smallest blast radius.
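The selection rule itself can stay small. Here is a sketch of the guardrails-then-target-then-blast-radius ordering, assuming a hypothetical `Scored` shape where a more negative `target_delta` is an improvement for a "decrease" metric:

```typescript
// Hypothetical scored candidate; field names are illustrative.
type Scored = {
  candidate_id: string
  guardrails_passed: boolean
  target_delta: number // negative is better for a "decrease" metric
  changed_runs: number // proxy for blast radius
}

// Hard guardrails filter first; among survivors, best target delta wins,
// with ties broken by the smallest number of changed runs.
export function selectParetoWinner(scores: Scored[]): Scored | null {
  const survivors = scores.filter((s) => s.guardrails_passed)
  if (survivors.length === 0) return null
  survivors.sort(
    (a, b) => a.target_delta - b.target_delta || a.changed_runs - b.changed_runs
  )
  return survivors[0]
}
```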

The result is a proposal, not a patch applied to production.

A real proposal shape

Autotune output should look like every other Improvement Loop artifact: typed, reviewable, replay-backed, and bound to evidence.

{
  "tuning_proposal_id": "at_2026_05_12_refund_cost_01",
  "status": "proposed",
  "target": {
    "intent": "support.refund",
    "metric": "economics_cents_per_decision",
    "direction": "decrease"
  },
  "candidate_change": {
    "layer": "retrieval",
    "target_pack": "ctxpack.support@5.2.0",
    "patch": {
      "evidence.order_history.top_k": { "from": 10, "to": 6 },
      "evidence.supplier_policy.max_hops": { "from": 3, "to": 2 }
    }
  },
  "replay": {
    "search_set": "goldens/support.refund/search@2026-05-12",
    "heldout_test_set": "goldens/support.refund/test@2026-05-12",
    "evaluation_run_id": "eval_2026_05_12_91f",
    "baseline_tuple": "release.support.refund@2026-05-09"
  },
  "observed_delta": {
    "economics_cents_per_decision": -0.21,
    "utility": -0.006,
    "policy": 0,
    "safety": 0,
    "p95_latency_ms": -180
  },
  "guardrails": {
    "passed": true,
    "regression_runs": []
  },
  "release_path": {
    "requires_reviewers": ["reliability", "policy", "domain_owner"],
    "rollout_stage": "0%_shadow",
    "rollback_target": "ctxpack.support@5.2.0"
  }
}

Notice what is not present: “autotune changed production.” The proposal still needs review, approval, and staged rollout.

What candidates should be allowed to win

Autotune should prefer candidates that are small, explainable, and reversible.

| Candidate | Usually safe to start with | Review question |
| --- | --- | --- |
| Reduce top_k for one evidence bucket | Yes | Did evidence coverage stay complete? |
| Move source priority from wiki to policy KB | Yes | Is the policy KB owned and fresh? |
| Increase evidence bucket budget by 400 tokens | Yes | Did cost rise within the target band? |
| Add a planner tool preference | Yes | Did risky tool calls increase? |
| Rewrite the whole system prompt | No | Why is this not a smaller typed change? |
| Lower an approval threshold | No | What policy owner approved the risk? |
| Disable an evaluator | Never | This is not a candidate; it is an invariant violation |

The optimizer is not rewarded for cleverness. It is rewarded for producing candidates a reviewer can understand in five minutes.

The operating cadence

Autotune works best as a routine, not a special project.

Daily:

  1. Build intent-level slices from the last 24 hours of traces.
  2. Detect targets where one metric moved outside its band.
  3. Queue at most one autotune run per intent and risk class.
  4. Emit proposals only when held-out replay passes guardrails.
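Step 2 of the daily sweep is a band check, not a model. A minimal sketch with hypothetical names (`MetricBand`, `outOfBand`):

```typescript
// Hypothetical declared band for one tracked metric.
type MetricBand = { metric: string; min: number; max: number }

// Returns the names of metrics whose latest observed value drifted
// outside the declared band; each hit becomes an autotune target.
export function outOfBand(
  observed: Record<string, number>,
  bands: MetricBand[]
): string[] {
  return bands
    .filter((b) => {
      const v = observed[b.metric]
      return v !== undefined && (v < b.min || v > b.max)
    })
    .map((b) => b.metric)
}
```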

Weekly:

  1. Review proposal adoption rate.
  2. Add corrected production failures to the golden search set.
  3. Refresh the held-out set without letting the proposer inspect it.
  4. Retire tunable surfaces that keep producing rejected proposals.
  5. Promote approved candidates through the five-stage rollout path.

Quarterly:

  1. Rebaseline evaluator rubrics.
  2. Re-split search and test sets.
  3. Audit whether guardrails still match business risk.
  4. Remove stale candidate generators.

This cadence keeps autotune from becoming a novelty button. It becomes part of the harness maintenance cycle.

Where autotune lives in ContextOS

Autotune is not a separate service with special power. It is one producer inside the Improvement Loop.

traces + scorecards + corrections
  -> InsightSynthesizer / StrategyCompiler / Autotune
  -> typed proposals
  -> replay gate
  -> human review
  -> pack / policy / tool / evaluator release
  -> staged rollout
  -> new traces + scorecards

That placement matters. The loop can improve retrieval, prompts, planner choices, and budgets, but every candidate still passes through Trust-plane controls. The optimizer does not get an escape hatch from the same harness it is trying to improve.

Failure modes

The common failures are predictable.

| Failure | What it looks like | Control |
| --- | --- | --- |
| Metric monoculture | Cost improves because the agent stopped gathering evidence | Hard Policy / Safety / Utility guardrails |
| Test-set leakage | Every proposal looks great until production | Search/test split and immutable held-out runs |
| Prompt sprawl | Autotune keeps appending instructions | Token budget and prompt-change size limits |
| Eval overfitting | Candidate learns the judge rubric, not the task | Mixed evaluator suite: rule, replay, human, business |
| Unsafe auto-promotion | Winning candidate ships without review | Proposal-only invariant |
| Hidden blast radius | Candidate affects more intents than declared | Intent-scoped manifests and rollout routing |
| Rollback theater | Prior version cannot be replayed | Release tuple pinned before promotion |

The key is to treat each failure as a harness property, not a model behavior. The model did not overfit the eval by itself. The harness gave the optimizer an overfit path.

The first implementation

The smallest useful implementation is one metric, one intent, and one tunable surface.

Pick an intent with volume and low external risk, such as a read-mostly support triage flow. Pick a metric that matters but is not safety-critical, such as economics_cents_per_decision or p95_latency_ms. Declare exactly one tunable surface, such as retrieval top_k for one evidence bucket. Split the existing golden traces into search and held-out test sets. Run three candidates. Emit one proposal. Do not ship it automatically.

The first version can fit in a week:

| Day | Deliverable |
| --- | --- |
| 1 | Add tunable_surfaces[] to one Context Pack manifest |
| 2 | Export 200 replayable traces into search/test sets |
| 3 | Implement candidate generation for one field |
| 4 | Run replay and scorecard deltas |
| 5 | Emit TuningProposal and send it through review |

If that loop works, add surfaces slowly. The goal is not to make the optimizer powerful. The goal is to make improvement routine.

Definition of done

An autotune loop is production-ready when the answer to each question is yes:

| Check | Pass line |
| --- | --- |
| Can every candidate be traced to a metric target? | target.metric and target.intent are required |
| Can every candidate be replayed? | It references pinned traces, pack, policy, tools, and evaluator suite |
| Can every candidate be reviewed? | The patch is typed, small, and has a reviewer owner |
| Can every candidate be rejected? | Rejection reason is recorded and future search avoids the same dead end |
| Can every candidate be rolled back? | Release tuple has a prior pinned target |
| Can every shipped candidate be measured? | Production scorecards compare baseline and candidate by intent |
| Can the loop itself be audited? | Autotune runs write target, candidates, scores, winner, and rationale |

That is the practical meaning of “improvement loop baked into the system.” The system is not smarter because it can change itself. It is trustworthy because every attempted change is typed, scored, replayed, reviewed, released, measured, and reversible.

Autotune should make the harness better without making the organization less certain about what changed. In ContextOS, that is the line: search is welcome, silent mutation is not.
