Reviewers & improvement · May 12, 2026 · by Piyush · 11 min read

Autotune the Harness: Baking the Improvement Loop into ContextOS

Tags: ContextOS · Harness Engineering · Autotune · Improvement Loop · Evaluation

Autotune is a dangerous word because it sounds automatic.

Most teams hear it and picture a background job that rewrites prompts after a few bad answers. That is not autotune. That is an unreviewed deployment pipeline with a better name. If the system can change its own instructions, retrieval budgets, tool preferences, or approval thresholds without replay, scorecard guardrails, and a rollback path, it is not learning. It is mutating.

The useful version is stricter: autotune proposes bounded harness changes against a declared metric, proves them on replay, and sends the winner through the same release path as any other ContextOS artifact.

This post fills the gap between the earlier pieces:

| Prior post | What it covers | What this post adds |
| --- | --- | --- |
| From Operator Correction to Released StrategyRule | One correction becoming one StrategyRule | Recurring, metric-driven candidate search |
| Replay Harness in Code | Reproducing a DecisionRecord from pinned inputs | Using replay as the autotune scoring engine |
| Pack Rollout in Five Stages | Shipping one candidate safely | How autotune emits release-ready candidates |
| The Eight-Property Harness Audit | Whether the harness is production-ready | Whether the harness can improve itself without losing control |

The short version: ContextOS already has the substrate. DecisionRecords give evidence. Scorecards give metrics. Replay gives cheap historical testing. The Improvement Loop gives typed proposals. Rollout gives blast-radius control. Autotune is the loop that connects those pieces without bypassing governance.

The research signal

The useful research direction is not “make the prompt longer.” It is “treat the harness as the optimization target.”

The Meta-Harness work from Stanford / MIT / KRAFTON, “Meta-Harness: End-to-End Optimization of Model Harnesses”, makes the point directly: changing the surrounding harness while holding the model fixed can produce large performance swings. Their loop stores prior code, scores, and traces, then searches over harness variants. That maps cleanly to ContextOS: the thing being tuned is not just text. It can be a context pack, retrieval rule, planner skill, evaluator rubric, approval threshold, or tool-selection strategy.

OpenAI’s evaluation stack is moving in the same practical direction. Trace grading treats agent traces as the object being evaluated, not just final strings. Evals and graders turn datasets and rubric-bound checks into release signals. Prompt optimization is useful, but it is still one surface of a broader system.

The ContextOS distinction is this: prompt optimization is a special case of harness optimization. If the best candidate is a prompt tweak, fine. If the best candidate is top_k: 10 -> 6, a stricter evidence requirement, a different tool preference, or a higher approval threshold, the loop should be able to discover and propose that too.

Autotune invariant
The optimizer may propose. The release gate decides.
Autotune is allowed to search over bounded harness changes. It is not allowed to promote, deploy, or silently mutate the active runtime.
Input (traces + scorecards): Runs grouped by intent, pack version, evaluator suite, risk tier, and outcome.

Search (bounded candidates): Only declared surfaces are tunable: retrieval, budgets, prompts, planner rules, rubrics, gates.

Output (TuningProposal): A replay-backed proposal that enters human review and staged rollout.

The autotune contract

An autotune run must answer seven questions before it is allowed to start.

| Question | Required answer |
| --- | --- |
| What is the target? | One primary metric, one intent, one active harness tuple |
| What must not regress? | Guardrails for Policy, Safety, Utility, Latency, and Economics |
| What can change? | A declared search space of tunable harness fields |
| What data can it see? | Search-set traces, never the held-out release test set |
| How is it scored? | Replay plus the pinned evaluator suite |
| How is the winner chosen? | Pareto rule, not a single blended score |
| How can it ship? | TuningProposal -> review -> rollout -> replayable release record |

Without those answers, the optimizer has too much freedom. Too much freedom is how a cost optimizer learns to skip evidence, a latency optimizer learns to avoid tools, and a utility optimizer learns to route around approval gates.

The target should be boring and local:

{
  "target": {
    "intent": "support.refund",
    "metric": "economics_cents_per_decision",
    "direction": "decrease",
    "baseline_tuple": {
      "pack": "ctxpack.support@5.2.0",
      "policy": "policy.returns@4.1.0",
      "tools": "tools.support@3.7.0",
      "evals": "evals.refund@2.4.0"
    },
    "guardrails": {
      "policy": { "min": 1.0 },
      "safety": { "min": 1.0 },
      "utility_delta": { "min": -0.02 },
      "p95_latency_delta": { "max_ms": 250 },
      "approval_bypass_rate": { "max": 0 }
    }
  }
}

Do not tune “the agent.” Tune one intent. Do not tune “quality.” Tune a named metric. Do not let a metric stand alone. Bind it to guardrails.
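The guardrail block above can be checked mechanically. Here is a minimal sketch, assuming hypothetical `ScorecardDelta` field names that mirror the target JSON; `passesGuardrails` is illustrative, not the ContextOS implementation:

```typescript
// Hypothetical shapes mirroring the target JSON above; not ContextOS APIs.
type GuardrailBands = {
  policy: { min: number }
  safety: { min: number }
  utility_delta: { min: number }
  p95_latency_delta: { max_ms: number }
  approval_bypass_rate: { max: number }
}

type ScorecardDelta = {
  policy: number
  safety: number
  utility_delta: number
  p95_latency_delta_ms: number
  approval_bypass_rate: number
}

// A candidate fails on the first violated band; there is no blended score
// that could trade safety away for cost.
export function passesGuardrails(d: ScorecardDelta, g: GuardrailBands): boolean {
  return (
    d.policy >= g.policy.min &&
    d.safety >= g.safety.min &&
    d.utility_delta >= g.utility_delta.min &&
    d.p95_latency_delta_ms <= g.p95_latency_delta.max_ms &&
    d.approval_bypass_rate <= g.approval_bypass_rate.max
  )
}
```

The design choice is hard bands, not weights: any single violation disqualifies the candidate regardless of how good the target metric looks.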

The experience store

Autotune needs an experience store, not a dashboard screenshot. The store is the filesystem or database surface where every prior run can be replayed, scored, grouped, and cited.

At minimum, each run needs:

| Field | Why autotune needs it |
| --- | --- |
| trace_id / run_id | To replay and cite the exact run |
| intent_id | To avoid mixing unrelated tasks |
| pack_version | To attribute behavior to a context artifact |
| policy_bundle_version | To know which rules were active |
| tool_manifest_version | To know which actions were reachable |
| evaluator_suite_version | To compare scores honestly |
| scorecard | To compute deltas |
| decision_record_id | To bind final outcome to evidence and controls |
| operator_corrections[] | To detect recurring human overrides |
| release_tuple_id | To reproduce the exact runtime configuration |

A repo-local version can be plain:

harness/
  experience/
    runs/
      2026-05-12/
        run_01HY9.../
          trace.json
          decision-record.json
          scorecard.json
          tool-transcripts.jsonl
          evidence-manifest.json
    goldens/
      support.refund/
        search.jsonl
        test.jsonl
    autotune/
      runs/
        at_2026_05_12_refund_cost/
          target.json
          candidates.jsonl
          replay-results.jsonl
          selected-proposal.json

That layout is intentionally not fancy. The important property is that autotune can read prior experience without asking a human to explain what happened.
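Given that layout, grouping prior runs into comparable slices is a small amount of code. A sketch under assumed field names from the table above; `sliceKey` and `groupRuns` are hypothetical helpers:

```typescript
// Hypothetical run record using field names from the table above.
type RunRecord = {
  run_id: string
  intent_id: string
  pack_version: string
  evaluator_suite_version: string
}

// Runs are only comparable within the same intent, pack, and evaluator
// suite; the key encodes that tuple.
export function sliceKey(r: RunRecord): string {
  return [r.intent_id, r.pack_version, r.evaluator_suite_version].join("|")
}

export function groupRuns(runs: RunRecord[]): Map<string, RunRecord[]> {
  const slices = new Map<string, RunRecord[]>()
  for (const r of runs) {
    const key = sliceKey(r)
    const bucket = slices.get(key) ?? []
    bucket.push(r)
    slices.set(key, bucket)
  }
  return slices
}
```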

The search space

The optimizer can only change fields the harness declares tunable. This is the most important safety control in the design.

export type TunableSurface =
  | {
      layer: "retrieval"
      field: "top_k" | "max_hops" | "source_priority"
      bounds: { min?: number; max?: number; allowed?: string[] }
    }
  | {
      layer: "context_budget"
      field: "evidence_tokens" | "memory_tokens" | "tool_tokens"
      bounds: { min: number; max: number; step: number }
    }
  | {
      layer: "planner"
      field: "tool_preference" | "decompose_before_tool"
      bounds: { allowed: string[] }
    }
  | {
      layer: "prompt"
      field: "instruction_block"
      bounds: { max_added_tokens: number; forbidden_terms: string[] }
    }
  | {
      layer: "approval_gate"
      field: "threshold"
      bounds: { min: number; max: number; step: number }
    }

The surface belongs in the Context Pack or harness manifest. If a field is not declared tunable, autotune does not touch it. This keeps the optimizer from discovering “creative” shortcuts such as disabling an evaluator, lowering a safety threshold, or removing expensive evidence.

Start with three surfaces:

| Surface | Useful first candidates | Common failure caught by guardrails |
| --- | --- | --- |
| Retrieval | top_k, max_hops, source priority | Lower cost by dropping required evidence |
| Budgeting | evidence / memory / tool bucket token allocation | Faster runs that hallucinate missing facts |
| Planner | tool preference, tool order, decompose-before-tool | Utility gains that increase risky calls |

Prompt tuning is fourth, not first. Prompts are easy to change and hard to reason about. Retrieval and budget candidates are often easier to replay, diff, and rollback.
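A first candidate generator for a retrieval surface can be a plain bounded sweep. The sketch below uses hypothetical names (`topKCandidates` is not a ContextOS API); note that the baseline value is excluded, so the sweep only ever proposes changes:

```typescript
// Hypothetical bounded sweep over one declared numeric surface.
type NumericBounds = { min: number; max: number; step: number }

type Candidate = {
  candidate_id: string
  layer: "retrieval"
  patch: { field: string; from: number; to: number }
  rationale: string
}

export function topKCandidates(
  field: string,
  current: number,
  bounds: NumericBounds
): Candidate[] {
  const out: Candidate[] = []
  for (let v = bounds.min; v <= bounds.max; v += bounds.step) {
    if (v === current) continue // the baseline is not a candidate
    out.push({
      candidate_id: `retrieval.${field}.${v}`,
      layer: "retrieval",
      patch: { field, from: current, to: v },
      rationale: `Bounded sweep of ${field} within the declared surface`,
    })
  }
  return out
}
```

Because the generator can only read declared bounds, a "creative" value like `top_k: 0` is unrepresentable rather than merely discouraged.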

The loop in code

The autotune runner is a pipeline. It should be deterministic enough that another engineer can reproduce why a proposal won.

type AutotuneRun = {
  run_id: string
  target: TuningTarget
  search_set: ReplayCase[]
  heldout_test_set: ReplayCase[]
  candidates: CandidateChange[]
  evaluator_suite: string
}
 
type CandidateChange = {
  candidate_id: string
  layer: "retrieval" | "context_budget" | "planner" | "prompt" | "approval_gate"
  patch: Record<string, unknown>
  rationale: string
}
 
type CandidateScore = {
  candidate_id: string
  search: ScorecardDelta
  heldout: ScorecardDelta
  guardrails_passed: boolean
  changed_runs: number
  regression_runs: string[]
}
 
export async function runAutotune(run: AutotuneRun): Promise<TuningProposal | null> {
  const scores: CandidateScore[] = []
 
  for (const candidate of run.candidates) {
    const search = await replayAndScore(run.search_set, candidate, run.evaluator_suite)
    if (!passesGuardrails(search, run.target.guardrails)) continue
 
    // Only candidates that survive search-set guardrails see the held-out set.
    const heldout = await replayAndScore(run.heldout_test_set, candidate, run.evaluator_suite)
    scores.push({
      candidate_id: candidate.candidate_id,
      search,
      heldout,
      guardrails_passed: passesGuardrails(heldout, run.target.guardrails),
      changed_runs: heldout.changed_runs,
      regression_runs: heldout.regression_runs,
    })
  }
 
  const winner = selectParetoWinner(scores, run.target)
  if (!winner || !winner.guardrails_passed) return null
 
  return buildProposal(run, winner)
}

A few details matter more than the code volume.

The held-out set is not used for iteration. The proposer can search against the search set. The release gate checks the held-out set. If you let autotune repeatedly see the held-out set, you have converted it into training data.
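One way to keep the split honest is to make membership deterministic, so refreshing the sets never quietly moves a held-out run into the search set. A sketch using a simple string hash; `splitRuns` is a hypothetical helper, not part of ContextOS:

```typescript
// Hypothetical deterministic search/test split keyed on run_id, so the
// same run always lands in the same set across refreshes.
export function splitRuns(runIds: string[], testFraction = 0.2) {
  const hash = (s: string) => {
    let h = 0
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0
    return h
  }
  const search: string[] = []
  const test: string[] = []
  for (const id of runIds) {
    ;(hash(id) % 100 < testFraction * 100 ? test : search).push(id)
  }
  return { search, test }
}
```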

The selection function should be Pareto-aware. Do not collapse Policy, Safety, Utility, Latency, and Economics into one weighted score unless you want arguments about weights to become your release process. Hard guardrails block. Among surviving candidates, choose the one that improves the target with the smallest blast radius.
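The selection rule itself can stay small. Here is a sketch of the guardrails-then-target-then-blast-radius ordering, assuming a hypothetical `Scored` shape where a more negative `target_delta` is an improvement for a "decrease" metric:

```typescript
// Hypothetical scored candidate; field names are illustrative.
type Scored = {
  candidate_id: string
  guardrails_passed: boolean
  target_delta: number // negative is better for a "decrease" metric
  changed_runs: number // proxy for blast radius
}

// Hard guardrails filter first; among survivors, best target delta wins,
// with ties broken by the smallest number of changed runs.
export function selectParetoWinner(scores: Scored[]): Scored | null {
  const survivors = scores.filter((s) => s.guardrails_passed)
  if (survivors.length === 0) return null
  survivors.sort(
    (a, b) => a.target_delta - b.target_delta || a.changed_runs - b.changed_runs
  )
  return survivors[0]
}
```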

The result is a proposal, not a patch applied to production.

A real proposal shape

Autotune output should look like every other Improvement Loop artifact: typed, reviewable, replay-backed, and bound to evidence.

{
  "tuning_proposal_id": "at_2026_05_12_refund_cost_01",
  "status": "proposed",
  "target": {
    "intent": "support.refund",
    "metric": "economics_cents_per_decision",
    "direction": "decrease"
  },
  "candidate_change": {
    "layer": "retrieval",
    "target_pack": "ctxpack.support@5.2.0",
    "patch": {
      "evidence.order_history.top_k": { "from": 10, "to": 6 },
      "evidence.supplier_policy.max_hops": { "from": 3, "to": 2 }
    }
  },
  "replay": {
    "search_set": "goldens/support.refund/search@2026-05-12",
    "heldout_test_set": "goldens/support.refund/test@2026-05-12",
    "evaluation_run_id": "eval_2026_05_12_91f",
    "baseline_tuple": "release.support.refund@2026-05-09"
  },
  "observed_delta": {
    "economics_cents_per_decision": -0.21,
    "utility": -0.006,
    "policy": 0,
    "safety": 0,
    "p95_latency_ms": -180
  },
  "guardrails": {
    "passed": true,
    "regression_runs": []
  },
  "release_path": {
    "requires_reviewers": ["reliability", "policy", "domain_owner"],
    "rollout_stage": "0%_shadow",
    "rollback_target": "ctxpack.support@5.2.0"
  }
}

Notice what is not present: “autotune changed production.” The proposal still needs review, approval, and staged rollout.

What candidates should be allowed to win

Autotune should prefer candidates that are small, explainable, and reversible.

| Candidate | Usually safe to start with | Review question |
| --- | --- | --- |
| Reduce top_k for one evidence bucket | Yes | Did evidence coverage stay complete? |
| Move source priority from wiki to policy KB | Yes | Is the policy KB owned and fresh? |
| Increase evidence bucket budget by 400 tokens | Yes | Did cost rise within the target band? |
| Add a planner tool preference | Yes | Did risky tool calls increase? |
| Rewrite the whole system prompt | No | Why is this not a smaller typed change? |
| Lower an approval threshold | No | What policy owner approved the risk? |
| Disable an evaluator | Never | This is not a candidate; it is an invariant violation |

The optimizer is not rewarded for cleverness. It is rewarded for producing candidates a reviewer can understand in five minutes.

The operating cadence

Autotune works best as a routine, not a special project.

Daily:

  1. Build intent-level slices from the last 24 hours of traces.
  2. Detect targets where one metric moved outside its band.
  3. Queue at most one autotune run per intent and risk class.
  4. Emit proposals only when held-out replay passes guardrails.
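Step 2 of the daily sweep is a band check, not a model. A minimal sketch with hypothetical names (`MetricBand`, `outOfBand`):

```typescript
// Hypothetical declared band for one tracked metric.
type MetricBand = { metric: string; min: number; max: number }

// Returns the names of metrics whose latest observed value drifted
// outside the declared band; each hit becomes an autotune target.
export function outOfBand(
  observed: Record<string, number>,
  bands: MetricBand[]
): string[] {
  return bands
    .filter((b) => {
      const v = observed[b.metric]
      return v !== undefined && (v < b.min || v > b.max)
    })
    .map((b) => b.metric)
}
```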

Weekly:

  1. Review proposal adoption rate.
  2. Add corrected production failures to the golden search set.
  3. Refresh the held-out set without letting the proposer inspect it.
  4. Retire tunable surfaces that keep producing rejected proposals.
  5. Promote approved candidates through the five-stage rollout path.

Quarterly:

  1. Rebaseline evaluator rubrics.
  2. Re-split search and test sets.
  3. Audit whether guardrails still match business risk.
  4. Remove stale candidate generators.

This cadence keeps autotune from becoming a novelty button. It becomes part of the harness maintenance cycle.

Where autotune lives in ContextOS

Autotune is not a separate service with special power. It is one producer inside the Improvement Loop.

traces + scorecards + corrections
  -> InsightSynthesizer / StrategyCompiler / Autotune
  -> typed proposals
  -> replay gate
  -> human review
  -> pack / policy / tool / evaluator release
  -> staged rollout
  -> new traces + scorecards

That placement matters. The loop can improve retrieval, prompts, planner choices, and budgets, but every candidate still passes through Trust-plane controls. The optimizer does not get an escape hatch from the same harness it is trying to improve.

Failure modes

The common failures are predictable.

| Failure | What it looks like | Control |
| --- | --- | --- |
| Metric monoculture | Cost improves because the agent stopped gathering evidence | Hard Policy / Safety / Utility guardrails |
| Test-set leakage | Every proposal looks great until production | Search/test split and immutable held-out runs |
| Prompt sprawl | Autotune keeps appending instructions | Token budget and prompt-change size limits |
| Eval overfitting | Candidate learns the judge rubric, not the task | Mixed evaluator suite: rule, replay, human, business |
| Unsafe auto-promotion | Winning candidate ships without review | Proposal-only invariant |
| Hidden blast radius | Candidate affects more intents than declared | Intent-scoped manifests and rollout routing |
| Rollback theater | Prior version cannot be replayed | Release tuple pinned before promotion |

The key is to treat each failure as a harness property, not a model behavior. The model did not overfit the eval by itself. The harness gave the optimizer an overfit path.

The first implementation

The smallest useful implementation is one metric, one intent, and one tunable surface.

Pick an intent with volume and low external risk, such as a read-mostly support triage flow. Pick a metric that matters but is not safety-critical, such as economics_cents_per_decision or p95_latency_ms. Declare exactly one tunable surface, such as retrieval top_k for one evidence bucket. Split the existing golden traces into search and held-out test sets. Run three candidates. Emit one proposal. Do not ship it automatically.

The first version can fit in a week:

| Day | Deliverable |
| --- | --- |
| 1 | Add tunable_surfaces[] to one Context Pack manifest |
| 2 | Export 200 replayable traces into search/test sets |
| 3 | Implement candidate generation for one field |
| 4 | Run replay and scorecard deltas |
| 5 | Emit TuningProposal and send it through review |

If that loop works, add surfaces slowly. The goal is not to make the optimizer powerful. The goal is to make improvement routine.

Definition of done

An autotune loop is production-ready when the answer to each question is yes:

| Check | Pass line |
| --- | --- |
| Can every candidate be traced to a metric target? | target.metric and target.intent are required |
| Can every candidate be replayed? | It references pinned traces, pack, policy, tools, and evaluator suite |
| Can every candidate be reviewed? | The patch is typed, small, and has a reviewer owner |
| Can every candidate be rejected? | Rejection reason is recorded and future search avoids the same dead end |
| Can every candidate be rolled back? | Release tuple has a prior pinned target |
| Can every shipped candidate be measured? | Production scorecards compare baseline and candidate by intent |
| Can the loop itself be audited? | Autotune runs write target, candidates, scores, winner, and rationale |

That is the practical meaning of “improvement loop baked into the system.” The system is not smarter because it can change itself. It is trustworthy because every attempted change is typed, scored, replayed, reviewed, released, measured, and reversible.

Autotune should make the harness better without making the organization less certain about what changed. In ContextOS, that is the line: search is welcome, silent mutation is not.
