Agent engineering series
May 12, 2026
By Piyush · 6 min read

Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation

ContextOS
Harness Engineering
AI Engineering
Autotune
Agents

When an ML team trains a model, nobody says “we changed production weights because one example got better.”

They create a candidate. They evaluate it. They compare it to a baseline. They check regressions. They review tradeoffs. They stage rollout. They keep rollback.

Agent harnesses deserve the same discipline.

A harness candidate is a proposed version of the system around the model: Context Pack, retrieval settings, tool schema, planner template, Critic rubric, policy threshold, evaluator suite, rollout gate, or memory promotion rule. The model may be unchanged. The agent behavior can still change dramatically.

That is why great AI engineers treat harness candidates like model checkpoints.

What counts as a harness candidate

Almost any agent behavior change is a harness candidate.

| Surface | Candidate examples |
| --- | --- |
| Context | change retrieval top_k, source priority, budget allocation, compression |
| Prompt | rewrite a small instruction block, add examples, remove ambiguous language |
| Tools | tighten schema, improve result shape, add retryability hints |
| Planner | add verification step, reorder tool calls, change re-plan budget |
| Critic | tighten evidence sufficiency, adjust escalate vs. retry rubric |
| Policy | add rule, modify threshold, require gate for a new condition |
| Memory | change recall eligibility, contradiction handling, promotion threshold |
| Evaluation | add grader, reweight threshold, update golden set |

Some of these are software diffs. Some are configuration. Some are data. The release system should not care. If it changes agent behavior, it is a candidate.

Candidate shape

A useful harness candidate has the same basic fields every time:

```json
{
  "candidate_id": "hc_2026_05_12_refund_004",
  "baseline_tuple": {
    "pack": "ctxpack.support@5.2.0",
    "policy": "policy.returns@4.1.0",
    "tools": "tools.support@3.7.0",
    "evals": "evals.refund@2.4.0",
    "model_profile": "model.fast@2026-05-01"
  },
  "target": {
    "intent": "support.refund",
    "primary_metric": "operator_corrected_rate",
    "direction": "decrease"
  },
  "patches": [
    {
      "surface": "context.retrieval",
      "field": "supplier_policy.max_hops",
      "from": 2,
      "to": 3
    },
    {
      "surface": "decision.planner",
      "field": "verify_supplier_exception_before_denial",
      "from": false,
      "to": true
    }
  ],
  "guardrails": {
    "policy": "== 1.0",
    "safety": "== 1.0",
    "approval_bypass_rate": "== 0"
  }
}
```

The candidate names the baseline, the target, the patch, and the guardrails. Without those, the team cannot tell whether the change helped or merely changed behavior.
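Those guardrail expressions are simple enough to enforce mechanically. A minimal sketch, assuming guardrails are stored as `"<op> <threshold>"` strings as in the JSON above; the function name and parsing rules are mine, not a ContextOS API:

```python
import operator

# Comparison operators a guardrail expression may use (an assumption).
OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}

def check_guardrails(guardrails: dict, scorecard: dict) -> list[str]:
    """Return the list of guardrail violations (empty means pass)."""
    violations = []
    for metric, expr in guardrails.items():
        op_token, threshold = expr.split()
        if not OPS[op_token](scorecard[metric], float(threshold)):
            violations.append(f"{metric}: {scorecard[metric]} fails '{expr}'")
    return violations

guardrails = {"policy": "== 1.0", "safety": "== 1.0", "approval_bypass_rate": "== 0"}
scorecard = {"policy": 1.0, "safety": 1.0, "approval_bypass_rate": 0.0}
assert check_guardrails(guardrails, scorecard) == []
```

A candidate whose guardrails are machine-checkable can be rejected before any human spends review time on it.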

Candidate generation

Candidates come from four sources.

| Source | Candidate type |
| --- | --- |
| Human diagnosis | targeted fix after reading traces |
| Operator correction cluster | StrategyRule, evidence requirement, policy rule |
| Autotune search | bounded changes to retrieval, budgets, planner rules, prompts |
| Incident review | failure playbook, evaluator, approval gate, rollback control |

The source should be recorded. A human-authored candidate and an autotune candidate go through the same gate, but they deserve different review questions.

| Candidate source | Reviewer asks |
| --- | --- |
| Human diagnosis | did the patch match the trace failure? |
| Correction cluster | are the corrections valid and representative? |
| Autotune | was the search space bounded and held-out data protected? |
| Incident review | does the fix prevent recurrence without hiding symptoms? |

Candidate scoring

A candidate should be scored before review, not after debate, using the same scorecard and replay machinery as other harness releases.

```
baseline:  release.support.refund@2026-05-09
candidate: hc_2026_05_12_refund_004
dataset:   support.refund/release_test@2026-05-12

policy:    1.000 -> 1.000
safety:    1.000 -> 1.000
utility:   0.901 -> 0.929
latency:   2180ms -> 2260ms
cost:      0.91c -> 0.97c
changed:   54 / 900
regressed: 0 policy, 0 safety, 3 utility review
verdict:   needs_human
```

This is a good outcome. It does not auto-ship. It gives reviewers a precise tradeoff: utility rose, latency and cost rose slightly, three utility regressions need inspection.
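The verdict line can be a pure function of the regression counts. A sketch of one plausible rule, with the verdict labels assumed rather than taken from any real scorecard schema:

```python
def verdict(policy_regressions: int, safety_regressions: int,
            utility_regressions: int) -> str:
    if policy_regressions or safety_regressions:
        return "blocked"        # hard floor broken: never ships
    if utility_regressions:
        return "needs_human"    # guardrails hold, but someone must look
    return "auto_advance"       # nothing regressed on replay

# The scorecard above: 0 policy, 0 safety, 3 utility reviews.
assert verdict(0, 0, 3) == "needs_human"
```

The point of encoding the rule is that "needs_human" is a computed outcome, not a debate.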

Trace diff

The most useful part of candidate review is the trace diff.

| Diff | Example |
| --- | --- |
| Context | supplier exception evidence now included in 42 runs |
| Plan | verification step added before denial |
| Tools | one extra policy lookup on boundary cases |
| Guardrails | approval gate unchanged |
| Verdict | 39 denials changed to approve-with-gate |
| Cost | plus 0.06 cents per decision |

Trace diff keeps review grounded. Instead of arguing over whether the prompt “seems better,” the team inspects the mechanism.
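A trace diff can start as nothing more than a per-surface count of paired replays whose behavior changed. A simplified sketch; real traces carry far more structure, and the surface keys here are illustrative:

```python
def trace_diff(baseline_runs: list[dict], candidate_runs: list[dict]) -> dict:
    """Count, per surface, how many paired runs changed behavior."""
    changed: dict[str, int] = {}
    for base, cand in zip(baseline_runs, candidate_runs):
        for surface, value in base.items():
            if cand.get(surface) != value:
                changed[surface] = changed.get(surface, 0) + 1
    return changed

baseline = [{"verdict": "deny", "plan": ["lookup", "decide"]}]
candidate = [{"verdict": "approve_with_gate", "plan": ["lookup", "verify", "decide"]}]
assert trace_diff(baseline, candidate) == {"verdict": 1, "plan": 1}
```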

Release stages

Harness candidates should roll out the way product changes do: through a staged, feature-flagged rollout.

| Stage | What happens | Gate |
| --- | --- | --- |
| replay | historical goldens replayed offline | scorecard passes hard floors |
| shadow | candidate runs beside baseline, no side effects | live deltas within band |
| internal | internal users or trusted operators | no safety/policy regression |
| low_risk | limited low-blast-radius intents | corrections trend down |
| monitored | broader canary with tail sampling | stable by tenant and risk |
| full | candidate becomes default | rollback target retained |

Shadow mode is especially important for agents. It gives the team live language and live retrieval conditions without letting the candidate execute destructive actions.
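In code, shadow mode reduces to one invariant: the candidate sees the live request, but only the baseline's answer ships. A minimal sketch, with the callables and log shape assumed:

```python
def shadow_run(request, baseline, candidate, log) -> str:
    live = baseline(request)       # baseline still serves the user
    proposed = candidate(request)  # candidate runs with side effects disabled
    if proposed != live:
        log({"request": request, "live": live, "shadow": proposed})
    return live                    # the candidate's output never ships

deltas = []
result = shadow_run(
    "refund order 123",
    baseline=lambda r: "deny",
    candidate=lambda r: "approve_with_gate",
    log=deltas.append,
)
assert result == "deny" and len(deltas) == 1
```

The logged deltas are what the "live deltas within band" gate inspects before the candidate advances a stage.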

Rollback

Rollback is not a button unless replay works.

Every release must pin:

| Pin | Why |
| --- | --- |
| pack version | reconstruct compiled context |
| policy bundle | reconstruct allow/deny/gate decisions |
| tool manifest | reconstruct reachable capabilities |
| evaluator suite | compare scores honestly |
| model profile | reconstruct model route and cost envelope |
| knowledge snapshot | reconstruct evidence |

The rollback target is the prior tuple. If the prior tuple cannot replay, the team does not have rollback. It has hope.
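A replay precondition check makes that concrete: before calling rollback available, verify the prior tuple carries every pin. The pin names below follow the table; the release dict shape is an assumption:

```python
REQUIRED_PINS = ("pack", "policy_bundle", "tool_manifest",
                 "evaluator_suite", "model_profile", "knowledge_snapshot")

def rollback_ready(prior_release: dict) -> list[str]:
    """Return the missing pins; an empty list means the tuple can replay."""
    pins = prior_release.get("pins", {})
    return [p for p in REQUIRED_PINS if p not in pins]

prior = {"pins": {"pack": "ctxpack.support@5.2.0",
                  "policy_bundle": "policy.returns@4.1.0"}}
assert rollback_ready(prior) == ["tool_manifest", "evaluator_suite",
                                 "model_profile", "knowledge_snapshot"]
```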

Candidate registry

Keep candidates in an append-only registry.

```
harness/candidates/
  hc_2026_05_12_refund_004/
    candidate.json
    patch.diff
    search-scorecard.json
    heldout-scorecard.json
    trace-diff-summary.json
    reviewer-verdicts.json
    rollout-record.json
```

Rejected candidates stay in the registry. They prevent repeat mistakes. If a retrieval budget reduction failed because evidence coverage dropped, the next autotune run should know not to rediscover that path.
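Append-only is enforceable at write time: a candidate directory that already exists is an error, never an overwrite. A sketch using the layout above; the helper is hypothetical:

```python
import json
import tempfile
from pathlib import Path

def record_candidate(root: Path, candidate: dict) -> Path:
    """Append-only write: a recorded candidate is never overwritten."""
    cdir = root / candidate["candidate_id"]
    if cdir.exists():
        raise FileExistsError(f"{cdir} is already recorded; registry is append-only")
    cdir.mkdir(parents=True)
    (cdir / "candidate.json").write_text(json.dumps(candidate, indent=2))
    return cdir

root = Path(tempfile.mkdtemp()) / "harness" / "candidates"
path = record_candidate(root, {"candidate_id": "hc_2026_05_12_refund_004"})
assert (path / "candidate.json").exists()
```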

What should never be tunable

Not every field should be part of search.

| Never auto-tune | Reason |
| --- | --- |
| approval-mode maximum | authority boundary, not optimization knob |
| safety floor | floor constraint |
| policy floor | floor constraint |
| credential scope | security boundary |
| tenant isolation | security boundary |
| evidence requirement for destructive action | audit boundary |
| idempotency on writes | recovery boundary |
| trace emission | observability boundary |

The optimizer may propose a governance change for humans to review. It may not silently relax these fields because a metric improved.
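A search-space guard turns this from convention into a hard check: any autotune patch touching a frozen field is rejected before it is ever scored. The dotted field names below are hypothetical stand-ins for the boundaries in the table:

```python
FROZEN_FIELDS = frozenset({
    "policy.approval_mode_max", "safety.floor", "policy.floor",
    "security.credential_scope", "security.tenant_isolation",
    "audit.destructive_evidence_required", "writes.idempotency",
    "observability.trace_emission",
})

def validate_search_space(patches: list[dict]) -> list[dict]:
    """Reject a patch set that touches any frozen boundary field."""
    blocked = [p["field"] for p in patches if p["field"] in FROZEN_FIELDS]
    if blocked:
        raise ValueError(f"frozen fields in search space: {blocked}")
    return patches

ok = validate_search_space([{"field": "supplier_policy.max_hops", "to": 3}])
assert len(ok) == 1
```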

The ContextOS version

ContextOS already has the right shape:

```
candidate harness tuple
  -> replay against search set
  -> replay against held-out set
  -> reviewer verdicts
  -> staged rollout
  -> DecisionRecords and scorecards
  -> Improvement Loop proposals
```

The Trust plane owns promotion. The other planes expose tunable surfaces and invariants. The candidate is the unit that moves between them.

The bar

A harness candidate is release-ready when:

| Check | Pass line |
| --- | --- |
| It has a baseline | prior tuple pinned |
| It has a target | one primary metric and intent |
| It has guardrails | Policy and Safety floors are explicit |
| It has replay evidence | search and held-out scorecards exist |
| It has trace diffs | changed behavior is inspectable |
| It has owners | domain, reliability, and policy reviewers named |
| It has rollout | shadow and canary plan defined |
| It has rollback | prior tuple is replayable |
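
The bar can be encoded as a checklist function that returns the failed checks. The candidate field names are assumptions; the eight checks mirror the table above:

```python
def release_ready(candidate: dict) -> list[str]:
    """Return the names of failed checks; an empty list means release-ready."""
    checks = {
        "baseline": bool(candidate.get("baseline_tuple")),
        "target": bool(candidate.get("target", {}).get("primary_metric")),
        "guardrails": {"policy", "safety"} <= set(candidate.get("guardrails", {})),
        "replay_evidence": bool(candidate.get("scorecards")),
        "trace_diff": bool(candidate.get("trace_diff_summary")),
        "owners": bool(candidate.get("reviewers")),
        "rollout": bool(candidate.get("rollout_plan")),
        "rollback": bool(candidate.get("rollback_replayable")),
    }
    return [name for name, ok in checks.items() if not ok]

# An empty candidate fails every check.
assert len(release_ready({})) == 8
```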

That is the standard that keeps agent improvement from becoming silent mutation.

The harness can improve like a model. It should also be governed like one.
