When an ML team trains a model, nobody says “we changed production weights because one example got better.”
They create a candidate. They evaluate it. They compare it to a baseline. They check regressions. They review tradeoffs. They stage rollout. They keep rollback.
Agent harnesses deserve the same discipline.
A harness candidate is a proposed version of the system around the model: Context Pack, retrieval settings, tool schema, planner template, Critic rubric, policy threshold, evaluator suite, rollout gate, or memory promotion rule. The model may be unchanged. The agent behavior can still change dramatically.
That is why great AI engineers treat harness candidates like model checkpoints.
What counts as a harness candidate
Almost any agent behavior change is a harness candidate.
| Surface | Candidate examples |
|---|---|
| Context | change retrieval top_k, source priority, budget allocation, compression |
| Prompt | rewrite a small instruction block, add examples, remove ambiguous language |
| Tools | tighten schema, improve result shape, add retryability hints |
| Planner | add verification step, reorder tool calls, change re-plan budget |
| Critic | tighten evidence sufficiency, adjust escalate vs. retry rubric |
| Policy | add rule, modify threshold, require gate for a new condition |
| Memory | change recall eligibility, contradiction handling, promotion threshold |
| Evaluation | add grader, reweight threshold, update golden set |
Some of these are software diffs. Some are configuration. Some are data. The release system should not care. If it changes agent behavior, it is a candidate.
Candidate shape
A useful harness candidate has the same basic fields every time:
```json
{
  "candidate_id": "hc_2026_05_12_refund_004",
  "baseline_tuple": {
    "pack": "ctxpack.support@5.2.0",
    "policy": "policy.returns@4.1.0",
    "tools": "tools.support@3.7.0",
    "evals": "evals.refund@2.4.0",
    "model_profile": "model.fast@2026-05-01"
  },
  "target": {
    "intent": "support.refund",
    "primary_metric": "operator_corrected_rate",
    "direction": "decrease"
  },
  "patches": [
    {
      "surface": "context.retrieval",
      "field": "supplier_policy.max_hops",
      "from": 2,
      "to": 3
    },
    {
      "surface": "decision.planner",
      "field": "verify_supplier_exception_before_denial",
      "from": false,
      "to": true
    }
  ],
  "guardrails": {
    "policy": "== 1.0",
    "safety": "== 1.0",
    "approval_bypass_rate": "== 0"
  }
}
```

The candidate names the baseline, the target, the patches, and the guardrails. Without those, the team cannot tell whether the change helped or merely changed behavior.
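Because the shape is stable, it is cheap to enforce. A minimal validation sketch, using the field names from the example above; `validate_candidate` and its rules are illustrative, not a fixed API:

```python
# Minimal candidate validation sketch. Field names follow the example
# above; the rules themselves are illustrative.
REQUIRED = ("candidate_id", "baseline_tuple", "target", "patches", "guardrails")

def validate_candidate(candidate: dict) -> list[str]:
    """Return a list of problems; empty means the candidate is well-formed."""
    problems = [f"missing field: {field}" for field in REQUIRED if field not in candidate]
    for patch in candidate.get("patches", []):
        # Every patch must state both sides of the change, or the diff
        # cannot be reviewed or reverted.
        if "from" not in patch or "to" not in patch:
            problems.append(f"patch on {patch.get('surface', '?')} lacks from/to")
    for floor in ("policy", "safety"):
        # A candidate without explicit hard floors cannot be gated.
        if floor not in candidate.get("guardrails", {}):
            problems.append(f"guardrails missing hard floor: {floor}")
    return problems
```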
Candidate generation
Candidates come from four sources.
| Source | Candidate type |
|---|---|
| Human diagnosis | targeted fix after reading traces |
| Operator correction cluster | StrategyRule, evidence requirement, policy rule |
| Autotune search | bounded changes to retrieval, budgets, planner rules, prompts |
| Incident review | failure playbook, evaluator, approval gate, rollback control |
The source should be recorded. A human-authored candidate and an autotune candidate go through the same gate, but they deserve different review questions.
| Candidate source | Reviewer asks |
|---|---|
| Human diagnosis | did the patch match the trace failure? |
| Correction cluster | are the corrections valid and representative? |
| Autotune | was the search space bounded and held-out data protected? |
| Incident review | does the fix prevent recurrence without hiding symptoms? |
Candidate scoring
A candidate should be scored before review, not after debate, using the same scorecard and replay machinery as other harness releases.
```
baseline:  release.support.refund@2026-05-09
candidate: hc_2026_05_12_refund_004
dataset:   support.refund/release_test@2026-05-12
policy:    1.000 -> 1.000
safety:    1.000 -> 1.000
utility:   0.901 -> 0.929
latency:   2180ms -> 2260ms
cost:      0.91c -> 0.97c
changed:   54 / 900
regressed: 0 policy, 0 safety, 3 utility (flagged for review)
verdict:   needs_human
```

This is a good outcome. It does not auto-ship. It gives reviewers a precise tradeoff: utility rose, latency and cost rose slightly, and three utility regressions need inspection.
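The verdict itself should be mechanical, not a matter of taste. A hedged sketch of how a comparison like this might resolve, with illustrative field names:

```python
# Resolve a scorecard comparison to a verdict. Floors come from the
# candidate's guardrails; field names are illustrative.
def scorecard_verdict(candidate_scores: dict, floors: dict) -> str:
    # Hard floors first: any policy or safety regression is an automatic reject.
    for metric in ("policy", "safety"):
        if candidate_scores[metric] < floors[metric]:
            return "reject"
    # A patch that changes no runs has no evidence it helped.
    if candidate_scores["changed_runs"] == 0:
        return "reject"
    # Individual utility regressions route to humans even when the
    # aggregate metric improved.
    if candidate_scores["utility_regressions"] > 0:
        return "needs_human"
    return "ship_eligible"
```

With the report above, it is the three utility regressions, not the aggregate numbers, that force `needs_human`.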
Trace diff
The most useful part of candidate review is the trace diff.
| Diff | Example |
|---|---|
| Context | supplier exception evidence now included in 42 runs |
| Plan | verification step added before denial |
| Tools | one extra policy lookup on boundary cases |
| Guardrails | approval gate unchanged |
| Verdict | 39 denials changed to approve-with-gate |
| Cost | plus 0.06 cents per decision |
Trace diff keeps review grounded. Instead of arguing over whether the prompt “seems better,” the team inspects the mechanism.
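A trace diff does not require heavy tooling. A sketch, assuming each run record carries a verdict and the evidence ids that reached the compiled context, and that the two run lists are aligned by golden-set run id:

```python
from collections import Counter

# Summarize how candidate runs diverge from baseline runs. Assumes the
# two lists are aligned by golden-set run id.
def trace_diff(baseline_runs: list[dict], candidate_runs: list[dict]) -> dict:
    verdict_changes: Counter = Counter()
    runs_with_new_evidence = 0
    for before, after in zip(baseline_runs, candidate_runs):
        if after["verdict"] != before["verdict"]:
            verdict_changes[(before["verdict"], after["verdict"])] += 1
        if set(after["evidence_ids"]) - set(before["evidence_ids"]):
            runs_with_new_evidence += 1
    return {
        "runs_with_new_evidence": runs_with_new_evidence,
        "verdict_transitions": dict(verdict_changes),
    }
```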
Release stages
Harness candidates should roll out like product changes through the feature-flagged harness rollout.
| Stage | What happens | Gate |
|---|---|---|
| replay | historical goldens replayed offline | scorecard passes hard floors |
| shadow | candidate runs beside baseline, no side effects | live deltas within band |
| internal | internal users or trusted operators | no safety/policy regression |
| low_risk | limited low-blast-radius intents | corrections trend down |
| monitored | broader canary with tail sampling | stable by tenant and risk |
| full | candidate becomes default | rollback target retained |
Shadow mode is especially important for agents. It gives the team live language and live retrieval conditions without letting the candidate execute destructive actions.
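The ladder is worth encoding as data so a candidate can only climb one rung at a time. A minimal sketch; the gate predicate is whatever check the table above names for that stage:

```python
# Stage ladder as data: a candidate advances one rung per gate pass and
# never skips stages. Stage names follow the table above.
STAGES = ["replay", "shadow", "internal", "low_risk", "monitored", "full"]

def next_stage(current: str, gate_passed: bool) -> str:
    if not gate_passed:
        return current  # hold at the current stage (or trigger rollback review)
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```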
Rollback
Rollback is not a button unless replay works.
Every release must pin:
| Pin | Why |
|---|---|
| pack version | reconstruct compiled context |
| policy bundle | reconstruct allow/deny/gate decisions |
| tool manifest | reconstruct reachable capabilities |
| evaluator suite | compare scores honestly |
| model profile | reconstruct model route and cost envelope |
| knowledge snapshot | reconstruct evidence |
The rollback target is the prior tuple. If the prior tuple cannot replay, the team does not have rollback. It has hope.
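That claim is testable before anything ships. A sketch of the precondition, where `resolve` and `replay` stand in for whatever pinning and replay machinery the team actually runs:

```python
# Rollback precondition: every pin in the prior tuple must still resolve
# to an immutable artifact, and the pinned tuple must actually replay.
PINS = ("pack", "policy", "tools", "evals", "model_profile", "knowledge_snapshot")

def rollback_ready(prior_tuple: dict, resolve, replay) -> bool:
    if any(resolve(prior_tuple.get(pin)) is None for pin in PINS):
        return False  # a missing pin means the old behavior cannot be rebuilt
    return replay(prior_tuple) is not None  # hope is not a rollback plan
```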
Candidate registry
Keep candidates in an append-only registry.
```
harness/candidates/
  hc_2026_05_12_refund_004/
    candidate.json
    patch.diff
    search-scorecard.json
    heldout-scorecard.json
    trace-diff-summary.json
    reviewer-verdicts.json
    rollout-record.json
```

Rejected candidates stay in the registry. They prevent repeat mistakes. If a retrieval budget reduction failed because evidence coverage dropped, the next autotune run should know not to rediscover that path.
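Write-once semantics are easy to enforce at the filesystem layer. A minimal sketch, assuming one directory per candidate as in the layout above:

```python
import json
import os

# Write-once registry entry. Re-registering an existing candidate id is an
# error by design: rejected candidates stay where they are.
def register(registry_root: str, candidate: dict) -> str:
    path = os.path.join(registry_root, candidate["candidate_id"])
    if os.path.exists(path):
        raise FileExistsError(f"candidate already registered: {path}")
    os.makedirs(path)
    with open(os.path.join(path, "candidate.json"), "w") as f:
        json.dump(candidate, f, indent=2)
    return path
```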
What should never be tunable
Not every field should be part of search.
| Never auto-tune | Reason |
|---|---|
| approval-mode maximum | authority boundary, not optimization knob |
| safety floor | floor constraint |
| policy floor | floor constraint |
| credential scope | security boundary |
| tenant isolation | security boundary |
| evidence requirement for destructive action | audit boundary |
| idempotency on writes | recovery boundary |
| trace emission | observability boundary |
The optimizer may propose a governance change for humans to review. It may not silently relax these fields because a metric improved.
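One way to hold that line is a frozen-field guard inside the autotune loop itself. A sketch; the field names below are illustrative stand-ins for the rows in the table above:

```python
# Fields the optimizer may never touch directly. Proposed changes to these
# are split out for human review instead of being applied.
FROZEN_FIELDS = {
    "approval_mode.maximum", "safety.floor", "policy.floor",
    "credential.scope", "tenant.isolation",
    "evidence.required_for_destructive_action",
    "writes.idempotency", "trace.emission",
}

def split_patches(patches: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (tunable patches, governance proposals for human review)."""
    tunable = [p for p in patches if p["field"] not in FROZEN_FIELDS]
    escalate = [p for p in patches if p["field"] in FROZEN_FIELDS]
    return tunable, escalate
```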
The ContextOS version
ContextOS already has the right shape:
```
candidate harness tuple
  -> replay against search set
  -> replay against held-out set
  -> reviewer verdicts
  -> staged rollout
  -> DecisionRecords and scorecards
  -> Improvement Loop proposals
```

The Trust plane owns promotion. The other planes expose tunable surfaces and invariants. The candidate is the unit that moves between them.
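Stitched together, the flow is a short driver. A hedged sketch, where `replay`, `review`, `rollout`, and `record` stand in for the ContextOS machinery named above rather than any real API:

```python
# Promotion pipeline driver. Each step is injected so the sketch stays
# independent of any particular harness implementation.
def promote(candidate, replay, review, rollout, record) -> str:
    for dataset in ("search_set", "held_out_set"):
        scorecard = replay(candidate, dataset)
        record(candidate, dataset, scorecard)  # scorecards are kept either way
        if not scorecard["passes_hard_floors"]:
            return "rejected"  # floors end the pipeline early
    if review(candidate) != "approve":
        return "needs_human"
    return rollout(candidate)  # staged, with the rollback tuple pinned
```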
The bar
A harness candidate is release-ready when:
| Check | Pass line |
|---|---|
| It has a baseline | prior tuple pinned |
| It has a target | one primary metric and intent |
| It has guardrails | Policy and Safety floors are explicit |
| It has replay evidence | search and held-out scorecards exist |
| It has trace diffs | changed behavior is inspectable |
| It has owners | domain, reliability, and policy reviewers named |
| It has rollout | shadow and canary plan defined |
| It has rollback | prior tuple is replayable |
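Encoded as a checklist, the bar is mechanical rather than aspirational. A sketch with illustrative field names; each predicate mirrors one row of the table above:

```python
# Release-readiness checks over the candidate record. Field names are
# illustrative; what matters is that every row has an executable check.
CHECKS = {
    "baseline":   lambda c: bool(c.get("baseline_tuple")),
    "target":     lambda c: bool(c.get("target", {}).get("primary_metric")),
    "guardrails": lambda c: {"policy", "safety"} <= set(c.get("guardrails", {})),
    "replay":     lambda c: bool(c.get("search_scorecard")) and bool(c.get("heldout_scorecard")),
    "trace_diff": lambda c: bool(c.get("trace_diff_summary")),
    "owners":     lambda c: {"domain", "reliability", "policy"} <= set(c.get("reviewers", [])),
    "rollout":    lambda c: bool(c.get("rollout_plan")),
    "rollback":   lambda c: c.get("prior_tuple_replayable") is True,
}

def release_ready(candidate: dict) -> list[str]:
    """Return the names of failed checks; empty means the bar is met."""
    return [name for name, check in CHECKS.items() if not check(candidate)]
```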
That is the standard that keeps agent improvement from becoming silent mutation.
The harness can improve like a model. It should also be governed like one.