When an ML team trains a model, nobody says “we changed production weights because one example got better.”
They create a candidate. They evaluate it. They compare it to a baseline. They check regressions. They review tradeoffs. They stage rollout. They keep rollback.
Agent harnesses deserve the same discipline.
A harness candidate is a proposed version of the system around the model: Context Pack, retrieval settings, tool schema, planner template, Critic rubric, policy threshold, evaluator suite, rollout gate, or memory promotion rule. The model may be unchanged. The agent behavior can still change dramatically.
That is why great AI engineers treat harness candidates like model checkpoints.
What counts as a harness candidate
Almost any agent behavior change is a harness candidate.
| Surface | Candidate examples |
|---|---|
| Context | change retrieval top_k, source priority, budget allocation, compression |
| Prompt | rewrite a small instruction block, add examples, remove ambiguous language |
| Tools | tighten schema, improve result shape, add retryability hints |
| Planner | add verification step, reorder tool calls, change re-plan budget |
| Critic | tighten evidence sufficiency, adjust escalate vs. retry rubric |
| Policy | add rule, modify threshold, require gate for a new condition |
| Memory | change recall eligibility, contradiction handling, promotion threshold |
| Evaluation | add grader, reweight threshold, update golden set |
Some of these are software diffs. Some are configuration. Some are data. The release system should not care. If it changes agent behavior, it is a candidate.
Candidate shape
A useful harness candidate has the same basic fields every time:
```json
{
  "candidate_id": "hc_2026_05_12_refund_004",
  "baseline_tuple": {
    "pack": "ctxpack.support@5.2.0",
    "policy": "policy.returns@4.1.0",
    "tools": "tools.support@3.7.0",
    "evals": "evals.refund@2.4.0",
    "model_profile": "model.fast@2026-05-01"
  },
  "target": {
    "intent": "support.refund",
    "primary_metric": "operator_corrected_rate",
    "direction": "decrease"
  },
  "patches": [
    {
      "surface": "context.retrieval",
      "field": "supplier_policy.max_hops",
      "from": 2,
      "to": 3
    },
    {
      "surface": "decision.planner",
      "field": "verify_supplier_exception_before_denial",
      "from": false,
      "to": true
    }
  ],
  "guardrails": {
    "policy": "== 1.0",
    "safety": "== 1.0",
    "approval_bypass_rate": "== 0"
  }
}
```

The candidate names the baseline, the target, the patches, and the guardrails. Without those, the team cannot tell whether the change helped or merely changed behavior.
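Because the shape is stable, it is cheap to enforce. A minimal validation sketch, using the field names from the example above; `validate_candidate` and its rules are illustrative, not a fixed API:

```python
# Minimal candidate validation sketch. Field names follow the example
# above; the rules themselves are illustrative.
REQUIRED = ("candidate_id", "baseline_tuple", "target", "patches", "guardrails")

def validate_candidate(candidate: dict) -> list[str]:
    """Return a list of problems; empty means the candidate is well-formed."""
    problems = [f"missing field: {field}" for field in REQUIRED if field not in candidate]
    for patch in candidate.get("patches", []):
        # Every patch must state both sides of the change, or the diff
        # cannot be reviewed or reverted.
        if "from" not in patch or "to" not in patch:
            problems.append(f"patch on {patch.get('surface', '?')} lacks from/to")
    for floor in ("policy", "safety"):
        # A candidate without explicit hard floors cannot be gated.
        if floor not in candidate.get("guardrails", {}):
            problems.append(f"guardrails missing hard floor: {floor}")
    return problems
```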
Candidate generation
Candidates come from four sources.
| Source | Candidate type |
|---|---|
| Human diagnosis | targeted fix after reading traces |
| Operator correction cluster | StrategyRule, evidence requirement, policy rule |
| Autotune search | bounded changes to retrieval, budgets, planner rules, prompts |
| Incident review | failure playbook, evaluator, approval gate, rollback control |
The source should be recorded. A human-authored candidate and an autotune candidate go through the same gate, but they deserve different review questions.
| Candidate source | Reviewer asks |
|---|---|
| Human diagnosis | did the patch match the trace failure? |
| Correction cluster | are the corrections valid and representative? |
| Autotune | was the search space bounded and held-out data protected? |
| Incident review | does the fix prevent recurrence without hiding symptoms? |
Candidate scoring
A candidate should be scored before review, not after debate, using the same scorecard and replay machinery as other harness releases.
```
baseline:  release.support.refund@2026-05-09
candidate: hc_2026_05_12_refund_004
dataset:   support.refund/release_test@2026-05-12
policy:    1.000 -> 1.000
safety:    1.000 -> 1.000
utility:   0.901 -> 0.929
latency:   2180ms -> 2260ms
cost:      0.91c -> 0.97c
changed:   54 / 900
regressed: 0 policy, 0 safety, 3 utility (flagged for review)
verdict:   needs_human
```

This is a good outcome. It does not auto-ship. It gives reviewers a precise tradeoff: utility rose, latency and cost rose slightly, and three utility regressions need inspection.
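The verdict itself should be mechanical, not a matter of taste. A hedged sketch of how a comparison like this might resolve, with illustrative field names:

```python
# Resolve a scorecard comparison to a verdict. Floors come from the
# candidate's guardrails; field names are illustrative.
def scorecard_verdict(candidate_scores: dict, floors: dict) -> str:
    # Hard floors first: any policy or safety regression is an automatic reject.
    for metric in ("policy", "safety"):
        if candidate_scores[metric] < floors[metric]:
            return "reject"
    # A patch that changes no runs has no evidence it helped.
    if candidate_scores["changed_runs"] == 0:
        return "reject"
    # Individual utility regressions route to humans even when the
    # aggregate metric improved.
    if candidate_scores["utility_regressions"] > 0:
        return "needs_human"
    return "ship_eligible"
```

With the report above, it is the three utility regressions, not the aggregate numbers, that force `needs_human`.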
Trace diff
The most useful part of candidate review is the trace diff.
| Diff | Example |
|---|---|
| Context | supplier exception evidence now included in 42 runs |
| Plan | verification step added before denial |
| Tools | one extra policy lookup on boundary cases |
| Guardrails | approval gate unchanged |
| Verdict | 39 denials changed to approve-with-gate |
| Cost | plus 0.06 cents per decision |
Trace diff keeps review grounded. Instead of arguing over whether the prompt “seems better,” the team inspects the mechanism.
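A trace diff does not require heavy tooling. A sketch, assuming each run record carries a verdict and the evidence ids that reached the compiled context, and that the two run lists are aligned by golden-set run id:

```python
from collections import Counter

# Summarize how candidate runs diverge from baseline runs. Assumes the
# two lists are aligned by golden-set run id.
def trace_diff(baseline_runs: list[dict], candidate_runs: list[dict]) -> dict:
    verdict_changes: Counter = Counter()
    runs_with_new_evidence = 0
    for before, after in zip(baseline_runs, candidate_runs):
        if after["verdict"] != before["verdict"]:
            verdict_changes[(before["verdict"], after["verdict"])] += 1
        if set(after["evidence_ids"]) - set(before["evidence_ids"]):
            runs_with_new_evidence += 1
    return {
        "runs_with_new_evidence": runs_with_new_evidence,
        "verdict_transitions": dict(verdict_changes),
    }
```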
Release stages
Harness candidates should roll out like product changes through the feature-flagged harness rollout.
| Stage | What happens | Gate |
|---|---|---|
| replay | historical goldens replayed offline | scorecard passes hard floors |
| shadow | candidate runs beside baseline, no side effects | live deltas within band |
| internal | internal users or trusted operators | no safety/policy regression |
| low_risk | limited low-blast-radius intents | corrections trend down |
| monitored | broader canary with tail sampling | stable by tenant and risk |
| full | candidate becomes default | rollback target retained |
Shadow mode is especially important for agents. It gives the team live language and live retrieval conditions without letting the candidate execute destructive actions.
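The ladder is worth encoding as data so a candidate can only climb one rung at a time. A minimal sketch; the gate predicate is whatever check the table above names for that stage:

```python
# Stage ladder as data: a candidate advances one rung per gate pass and
# never skips stages. Stage names follow the table above.
STAGES = ["replay", "shadow", "internal", "low_risk", "monitored", "full"]

def next_stage(current: str, gate_passed: bool) -> str:
    if not gate_passed:
        return current  # hold at the current stage (or trigger rollback review)
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```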
Rollback
Rollback is not a button unless replay works.
Every release must pin:
| Pin | Why |
|---|---|
| pack version | reconstruct compiled context |
| policy bundle | reconstruct allow/deny/gate decisions |
| tool manifest | reconstruct reachable capabilities |
| evaluator suite | compare scores honestly |
| model profile | reconstruct model route and cost envelope |
| knowledge snapshot | reconstruct evidence |
The rollback target is the prior tuple. If the prior tuple cannot replay, the team does not have rollback. It has hope.
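That claim is testable before anything ships. A sketch of the precondition, where `resolve` and `replay` stand in for whatever pinning and replay machinery the team actually runs:

```python
# Rollback precondition: every pin in the prior tuple must still resolve
# to an immutable artifact, and the pinned tuple must actually replay.
PINS = ("pack", "policy", "tools", "evals", "model_profile", "knowledge_snapshot")

def rollback_ready(prior_tuple: dict, resolve, replay) -> bool:
    if any(resolve(prior_tuple.get(pin)) is None for pin in PINS):
        return False  # a missing pin means the old behavior cannot be rebuilt
    return replay(prior_tuple) is not None  # hope is not a rollback plan
```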
Candidate registry
Keep candidates in an append-only registry.
```
harness/candidates/
  hc_2026_05_12_refund_004/
    candidate.json
    patch.diff
    search-scorecard.json
    heldout-scorecard.json
    trace-diff-summary.json
    reviewer-verdicts.json
    rollout-record.json
```

Rejected candidates stay in the registry. They prevent repeat mistakes. If a retrieval budget reduction failed because evidence coverage dropped, the next autotune run should know not to rediscover that path.
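Write-once semantics are easy to enforce at the filesystem layer. A minimal sketch, assuming one directory per candidate as in the layout above:

```python
import json
import os

# Write-once registry entry. Re-registering an existing candidate id is an
# error by design: rejected candidates stay where they are.
def register(registry_root: str, candidate: dict) -> str:
    path = os.path.join(registry_root, candidate["candidate_id"])
    if os.path.exists(path):
        raise FileExistsError(f"candidate already registered: {path}")
    os.makedirs(path)
    with open(os.path.join(path, "candidate.json"), "w") as f:
        json.dump(candidate, f, indent=2)
    return path
```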
What should never be tunable
Not every field should be part of search.
| Never auto-tune | Reason |
|---|---|
| approval-mode maximum | authority boundary, not optimization knob |
| safety floor | floor constraint |
| policy floor | floor constraint |
| credential scope | security boundary |
| tenant isolation | security boundary |
| evidence requirement for destructive action | audit boundary |
| idempotency on writes | recovery boundary |
| trace emission | observability boundary |
The optimizer may propose a governance change for humans to review. It may not silently relax these fields because a metric improved.
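One way to hold that line is a frozen-field guard inside the autotune loop itself. A sketch; the field names below are illustrative stand-ins for the rows in the table above:

```python
# Fields the optimizer may never touch directly. Proposed changes to these
# are split out for human review instead of being applied.
FROZEN_FIELDS = {
    "approval_mode.maximum", "safety.floor", "policy.floor",
    "credential.scope", "tenant.isolation",
    "evidence.required_for_destructive_action",
    "writes.idempotency", "trace.emission",
}

def split_patches(patches: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (tunable patches, governance proposals for human review)."""
    tunable = [p for p in patches if p["field"] not in FROZEN_FIELDS]
    escalate = [p for p in patches if p["field"] in FROZEN_FIELDS]
    return tunable, escalate
```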
The ContextOS version
ContextOS already has the right shape:
```
candidate harness tuple
  -> replay against search set
  -> replay against held-out set
  -> reviewer verdicts
  -> staged rollout
  -> DecisionRecords and scorecards
  -> Improvement Loop proposals
```

The Trust plane owns promotion. The other planes expose tunable surfaces and invariants. The candidate is the unit that moves between them.
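Stitched together, the flow is a short driver. A hedged sketch, where `replay`, `review`, `rollout`, and `record` stand in for the ContextOS machinery named above rather than any real API:

```python
# Promotion pipeline driver. Each step is injected so the sketch stays
# independent of any particular harness implementation.
def promote(candidate, replay, review, rollout, record) -> str:
    for dataset in ("search_set", "held_out_set"):
        scorecard = replay(candidate, dataset)
        record(candidate, dataset, scorecard)  # scorecards are kept either way
        if not scorecard["passes_hard_floors"]:
            return "rejected"  # floors end the pipeline early
    if review(candidate) != "approve":
        return "needs_human"
    return rollout(candidate)  # staged, with the rollback tuple pinned
```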
The bar
A harness candidate is release-ready when:
| Check | Pass line |
|---|---|
| It has a baseline | prior tuple pinned |
| It has a target | one primary metric and intent |
| It has guardrails | Policy and Safety floors are explicit |
| It has replay evidence | search and held-out scorecards exist |
| It has trace diffs | changed behavior is inspectable |
| It has owners | domain, reliability, and policy reviewers named |
| It has rollout | shadow and canary plan defined |
| It has rollback | prior tuple is replayable |
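Encoded as a checklist, the bar is mechanical rather than aspirational. A sketch with illustrative field names; each predicate mirrors one row of the table above:

```python
# Release-readiness checks over the candidate record. Field names are
# illustrative; what matters is that every row has an executable check.
CHECKS = {
    "baseline":   lambda c: bool(c.get("baseline_tuple")),
    "target":     lambda c: bool(c.get("target", {}).get("primary_metric")),
    "guardrails": lambda c: {"policy", "safety"} <= set(c.get("guardrails", {})),
    "replay":     lambda c: bool(c.get("search_scorecard")) and bool(c.get("heldout_scorecard")),
    "trace_diff": lambda c: bool(c.get("trace_diff_summary")),
    "owners":     lambda c: {"domain", "reliability", "policy"} <= set(c.get("reviewers", [])),
    "rollout":    lambda c: bool(c.get("rollout_plan")),
    "rollback":   lambda c: c.get("prior_tuple_replayable") is True,
}

def release_ready(candidate: dict) -> list[str]:
    """Return the names of failed checks; empty means the bar is met."""
    return [name for name, check in CHECKS.items() if not check(candidate)]
```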
That is the standard that keeps agent improvement from becoming silent mutation.
The harness can improve like a model. It should also be governed like one.