A pack version is not a release until traffic moves to it. The team that thinks otherwise eventually has the day where 100% of refund traffic hits a freshly-merged pack and the destructive-action rate doubles before the dashboards refresh. I have had that day. So has every team I have worked with that did not stage their rollouts.
Five stages is the number that buys you the most signal per dollar. Three is too coarse — there is no “internal” cohort to catch the obvious bugs. Seven is too fine — nobody has the patience to watch a dashboard for a week per stage. The five canonical stages from Harness Engineering are the ones I keep coming back to:
| Stage | What is true | Gate to advance |
|---|---|---|
| `0%_shadow` | Runs in parallel; emits scorecard but does not affect outcome | Scorecard delta within bounds for ≥ N runs |
| `1%_internal` | Internal users only; full telemetry | No regression on safety / policy guardrails |
| `5%_low_risk` | Limited cohort; low-blast-radius intents only | Adoption rate of corrections trending down |
| `25%_monitored` | Broader rollout; tail-based sampling on every escalation | Evaluator scorecards stable across cohorts |
| `100%` | Full rollout; previous version pinned for replay | Two-week canary period clean |
This post is the operator’s version: the routing config, the gating queries, and the kill switch, with a worked example of bumping `ctxpack.support@5.1.0` → `ctxpack.support@5.2.0`.
## 2026 update: rollout is part of replay
The routing rule is not deployment plumbing. It is part of the lineage. If a run cannot later prove why it received the baseline pack or the candidate pack, the rollout system has broken replay.
That is why the rollout tuple belongs in versioned configuration and why every stage transition needs a promotion record. Feature-flag services can still distribute decisions, but the source of truth for audit should be a signed rollout artifact with stage, cohort, thresholds, approver, and rollback target.
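A minimal sketch of what that promotion record could look like. The type name and any fields beyond the five listed above (stage, cohort, thresholds, approver, rollback target) are illustrative assumptions, not the harness’s actual schema:

```typescript
// Hypothetical promotion record; fields beyond stage, cohort, thresholds,
// approver, and rollback target are assumptions for illustration.
type PromotionRecord = {
  intent: string                      // "support.refund.execute"
  candidate: string                   // "ctxpack.support@5.2.0"
  from_stage: string                  // "0%_shadow"
  to_stage: string                    // "1%_internal"
  cohort?: { internal_only?: boolean; intent_classes?: string[] }
  thresholds: Record<string, number>  // the gate values the verdict was computed against
  approver: string                    // named human, required for needs_human verdicts
  rollback_target: string             // "ctxpack.support@5.1.0"
  promoted_at: string                 // ISO-8601 timestamp
  signature: string                   // detached signature over the canonical JSON
}
```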
## The routing config
Routing is decided by a small TypeScript module, not by a feature-flag UI. The reason is replay — the routing decision needs to be reproducible from the run inputs alone.
```typescript
import type { RunContext } from "@/types"
import { rollouts } from "./rollouts.json" // versioned in the repo; needs resolveJsonModule

type Stage = "0%_shadow" | "1%_internal" | "5%_low_risk" | "25%_monitored" | "100%"

type RolloutEntry = {
  intent: string
  baseline: string // "ctxpack.support@5.1.0"
  candidate: string // "ctxpack.support@5.2.0"
  stage: Stage
  cohort?: { internal_only?: boolean; intent_classes?: string[] }
  started_at: string
}

export function pickPackVersion(ctx: RunContext): string {
  const entry = rollouts.find(
    (r) => r.intent === ctx.intent || r.intent === "*",
  ) as RolloutEntry | undefined
  if (!entry) return defaultPackFor(ctx.intent) // fallback when no rollout is in flight

  // hash-bucket the user/tenant for stable routing across requests
  const bucket = hashBucket(ctx.tenant_id, ctx.user_id) // 0..99

  switch (entry.stage) {
    case "0%_shadow":
      return entry.baseline // candidate runs in parallel; see shadow.ts
    case "1%_internal":
      // every internal user, regardless of bucket
      return ctx.is_internal ? entry.candidate : entry.baseline
    case "5%_low_risk":
      return entry.cohort?.intent_classes?.includes(ctx.intent_class) && bucket < 5
        ? entry.candidate
        : entry.baseline
    case "25%_monitored":
      return bucket < 25 ? entry.candidate : entry.baseline
    case "100%":
      return entry.candidate
  }
}

function hashBucket(tenant: string, user: string): number {
  // FNV-1a over "tenant:user", reduced to 0..99; deterministic across processes
  let h = 2166136261
  for (const c of `${tenant}:${user}`) {
    h ^= c.charCodeAt(0)
    h = Math.imul(h, 16777619)
  }
  return (h >>> 0) % 100
}
```

Two design notes worth pausing on.
The bucket is a deterministic hash of `(tenant_id, user_id)`. The same user always lands in the same bucket. Without that, a user might see the candidate pack on one request and the baseline on the next, and you cannot diagnose anything from that traffic mix.

The candidate version lives in a JSON file (`rollouts.json`) that is in the repo, not in a runtime feature-flag service. The reason is replay: when the regulator asks what version was active for `trace_id=abc`, you produce a git blame of `rollouts.json`, not a screenshot of a flag UI. The history of the rollout is the history of the file.
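Concretely, the audit answer is a couple of commands against the repo (the `-L` range below assumes the entry spans about seven lines; adjust to taste):

```bash
# Every change to the rollout file, with author, date, and full diff:
# the audit trail for "what version was active when".
git log -p --follow -- harness/rollout/rollouts.json

# Who set the current stage for a given intent, and in which commit?
git blame -L /support.refund.execute/,+7 -- harness/rollout/rollouts.json
```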
## The shadow stage, in detail
The most useful and least understood stage is 0%_shadow. It runs both packs on every eligible request and compares scorecards. Production traffic gets the baseline result; the candidate result is recorded but not returned.
```typescript
import type { RunContext, InvokeRequest } from "@/types"
import { compileAndRun } from "@/runtime"
import { scoreRun } from "@/evals/scorecard"
import { writeShadowComparison } from "@/store"

export async function runWithShadow(
  ctx: RunContext,
  req: InvokeRequest,
  baseline: string,
  candidate: string,
) {
  // start both runs before awaiting either, so the shadow is genuinely parallel
  const livePromise = compileAndRun(baseline, ctx, req) // baseline returns to the user
  const shadowPromise = compileAndRun(candidate, ctx, req, {
    block_destructive: true, // dry-run only, no external effects
  }).catch((err) => ({ error: err.message }))

  const live = await livePromise
  // candidate result is recorded but never returned to the caller
  const shadow = await shadowPromise

  // record both scorecards side by side
  await writeShadowComparison({
    trace_id: live.trace_id,
    intent: req.intent,
    baseline: { pack: baseline, scorecard: scoreRun(live) },
    candidate: "error" in shadow
      ? { pack: candidate, error: shadow.error }
      : { pack: candidate, scorecard: scoreRun(shadow) },
    written_at: new Date().toISOString(),
  })

  return live
}
```

The `block_destructive: true` flag is non-negotiable in shadow. The candidate must not refund anything, must not write to durable stores, must not call out to side-effecting APIs. The shadow’s job is to predict; the live run’s job is to execute.
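The gating query in the next section reads from the `shadow_comparisons` table those writes land in. A minimal Postgres sketch, inferred from the columns the query uses; the exact types and the index are assumptions:

```sql
-- Inferred from the gating query below; column types and the index are
-- illustrative assumptions, not the harness's actual DDL.
CREATE TABLE shadow_comparisons (
  trace_id                 text PRIMARY KEY,
  intent                   text NOT NULL,
  baseline_pack            text NOT NULL,
  candidate_pack           text NOT NULL,
  baseline_score_policy    double precision,
  candidate_score_policy   double precision,
  baseline_score_safety    double precision,
  candidate_score_safety   double precision,
  baseline_score_utility   double precision,
  candidate_score_utility  double precision,
  baseline_score_latency   double precision,
  candidate_score_latency  double precision,
  baseline_score_cost      double precision,
  candidate_score_cost     double precision,
  candidate_error          text,        -- set when the shadow run threw
  written_at               timestamptz NOT NULL
);

CREATE INDEX ON shadow_comparisons (intent, candidate_pack, written_at);
```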
## The advance-stage check

Advancing from one stage to the next is an SQL query, not a meeting. The query lives in `harness/rollout/checks/scorecard_delta.sql` (the path the advance script below expects) and runs on the experience store:
```sql
-- Used to gate 0%_shadow → 1%_internal
-- :'intent', :'candidate', :'since' are psql variables passed in by advance.sh
WITH paired AS (
  SELECT
    s.intent,
    s.trace_id,
    s.candidate_pack,
    s.baseline_pack,
    s.candidate_score_policy,
    s.baseline_score_policy,
    s.candidate_score_safety,
    s.baseline_score_safety,
    s.candidate_score_utility,
    s.baseline_score_utility,
    s.candidate_score_latency,
    s.baseline_score_latency,
    s.candidate_score_cost,
    s.baseline_score_cost
  FROM shadow_comparisons s
  WHERE s.intent = :'intent'
    AND s.candidate_pack = :'candidate'
    AND s.written_at >= :'since'
),
agg AS (
  SELECT
    COUNT(*) AS n_runs,
    -- hard-fail evaluators: any candidate < baseline is a problem
    SUM(CASE WHEN candidate_score_policy < baseline_score_policy THEN 1 ELSE 0 END) AS n_policy_regressions,
    SUM(CASE WHEN candidate_score_safety < baseline_score_safety THEN 1 ELSE 0 END) AS n_safety_regressions,
    -- soft evaluators: mean delta
    AVG(candidate_score_utility - baseline_score_utility) AS d_utility,
    AVG(candidate_score_latency - baseline_score_latency) AS d_latency,
    AVG(candidate_score_cost - baseline_score_cost) AS d_cost
  FROM paired
)
SELECT
  n_runs,
  n_policy_regressions,
  n_safety_regressions,
  ROUND(d_utility::numeric, 4) AS d_utility,
  ROUND(d_latency::numeric, 4) AS d_latency,
  ROUND(d_cost::numeric, 4) AS d_cost,
  CASE
    WHEN n_runs < 1000 THEN 'block: insufficient sample'
    WHEN n_policy_regressions > 0 THEN 'block: policy regression'
    WHEN n_safety_regressions > 0 THEN 'block: safety regression'
    WHEN d_utility < -0.05 THEN 'needs_human: utility regression'
    WHEN d_latency < -0.10 THEN 'needs_human: latency regression'
    WHEN d_cost < -0.10 THEN 'needs_human: cost regression'
    ELSE 'advance'
  END AS verdict
FROM agg;
```

Three things this query does that an eyeball check does not.
It refuses to advance on insufficient data. A thousand runs is a small number for support traffic but a large one for low-volume intents; pick the threshold per intent class (a sketch of that config follows this list). The point is that it is a number, not a feeling.

It distinguishes hard fails (Policy, Safety) from soft fails (Utility, Latency, Cost). A hard fail blocks. A soft fail kicks the verdict into `needs_human`, which means a named approver has to sign off, citing the regression they accepted.

It returns one of three discrete verdicts: `advance`, `block`, `needs_human`. The advance script reads that verdict and acts accordingly — no interpretation, no edge cases.
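For the first point, the per-intent minimum can live next to the rollouts file. A hypothetical sketch; the intent-class names and values are illustrative:

```typescript
// Hypothetical per-intent-class sample minimums for the n_runs gate.
// Names and values are illustrative, not the harness's actual config.
export const MIN_SHADOW_RUNS: Record<string, number> = {
  support_low_risk: 1000, // high-volume: roughly a day of shadow traffic
  support_refunds: 500,
  admin_rare: 50, // low-volume: waiting for 1000 runs would take months
}
```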
## The advance script

A short shell script that wraps the query and either bumps the stage in `rollouts.json` or files a review request:
```bash
#!/usr/bin/env bash
# harness/rollout/advance.sh
# Usage: ./advance.sh <intent> <candidate_pack>
set -euo pipefail

INTENT="${1:?intent required}"
CANDIDATE="${2:?candidate pack required}"
SINCE="$(date -u -d '24 hours ago' +%FT%TZ)" # GNU date

# -At prints one unaligned, tuple-only row with '|' between fields, verdict
# last — so split on '|', not whitespace (the verdict itself contains spaces)
VERDICT=$(psql "$EXPSTORE_URL" -At \
  -v intent="$INTENT" -v candidate="$CANDIDATE" -v since="$SINCE" \
  -f harness/rollout/checks/scorecard_delta.sql | tail -1 | awk -F'|' '{print $NF}')

case "$VERDICT" in
  advance)
    echo "verdict: advance — bumping stage in rollouts.json"
    ./scripts/bump_stage.ts "$INTENT" "$CANDIDATE"
    git add harness/rollout/rollouts.json
    git commit -m "rollout($INTENT): advance stage for $CANDIDATE"
    ;;
  needs_human:*)
    echo "verdict: $VERDICT — opening review request"
    ./scripts/open_review.ts "$INTENT" "$CANDIDATE"
    ;;
  block:*)
    echo "verdict: $VERDICT — staying on current stage; investigate"
    exit 1
    ;;
  *)
    echo "unknown verdict: $VERDICT"
    exit 2
    ;;
esac
```

Wire this into a daily cron, alongside the scorecard rollup. Stage advances become a side effect of clean numbers, not a calendar event.
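A crontab line for that could look like the following; the schedule, working directory, and log path are assumptions:

```bash
# Hypothetical crontab entry: run the advance check daily at 06:10 UTC.
10 6 * * * cd /srv/harness && ./harness/rollout/advance.sh support.refund.execute ctxpack.support@5.2.0 >> /var/log/harness/advance.log 2>&1
```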
## The kill switch
The kill switch is a single command and a 30-second runbook. It pins the prior version of every artifact in the release tuple and re-routes traffic.
Before a candidate advances past any stage, the operator should be able to point at this table and show the evidence:
| Gate | Evidence required |
|---|---|
| Shadow clean | Paired scorecard deltas for baseline vs candidate across the required sample size |
| Internal clean | No hard-gate Policy/Safety failures and no unresolved reviewer blocks |
| Low-risk clean | Correction rate and escalation rate are stable or improving |
| Monitored clean | Tail samples include all destructive, denied, loop-guard, and scorecard-fail runs |
| Full rollout | Rollback tuple is pinned and replay verifies both old and new pack versions |
```bash
#!/usr/bin/env bash
# harness/rollout/kill_switch.sh
# Usage: ./kill_switch.sh <intent> [reason]
set -euo pipefail

INTENT="${1:?intent required}"
REASON="${2:-no reason given}"

# 1. drop the candidate from rollouts.json (atomic)
./scripts/pin_baseline.ts "$INTENT"

# 2. emit a kill_switch event into the experience store
#    (psql variables, not shell interpolation; the reason is free text)
psql "$EXPSTORE_URL" -v intent="$INTENT" -v reason="$REASON" <<'SQL'
INSERT INTO kill_switch_events (intent, reason, fired_at)
VALUES (:'intent', :'reason', NOW());
SQL

# 3. notify the on-call channel
curl -sS -X POST "$ONCALL_WEBHOOK" \
  -H 'content-type: application/json' \
  -d "$(jq -n --arg i "$INTENT" --arg r "$REASON" \
    '{ text: "kill switch fired for " + $i + " — reason: " + $r }')"

echo "kill switch fired for $INTENT — pinned baseline, traffic re-routing"
```

Two principles the kill switch must respect.
It pins the entire release tuple — pack, policy, tool registry, evaluator suite — to the prior known-good versions. A kill switch that only reverts one of the four is a kill switch that leaves stale combinations live. The tuple is the unit of release; the tuple is the unit of rollback.
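A sketch of what `pin_baseline.ts` might do under that principle. The file layout, the `pinned_tuple` field, and the `knownGoodTupleFor` lookup are assumptions for illustration:

```typescript
// Hypothetical sketch of scripts/pin_baseline.ts; field and helper names
// beyond the four artifacts named above are assumptions.
import { readFileSync, writeFileSync } from "node:fs"
import { knownGoodTupleFor } from "./release_tuples" // assumed lookup, not recomputed

const ROLLOUTS = "harness/rollout/rollouts.json"

export function pinBaseline(intent: string) {
  const file = JSON.parse(readFileSync(ROLLOUTS, "utf8"))
  const entry = file.rollouts.find((r: { intent: string }) => r.intent === intent)
  if (!entry) throw new Error(`no rollout entry for ${intent}`)

  // Revert the whole tuple, not just the pack: a new policy bundle running
  // against the old pack is exactly the stale combination a partial revert leaves live.
  entry.candidate = entry.baseline
  entry.stage = "100%"
  entry.pinned_tuple = knownGoodTupleFor(entry.baseline) // pack, policy, tools, evaluators

  writeFileSync(ROLLOUTS, JSON.stringify(file, null, 2) + "\n")
}
```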
It must be replay-equivalent. Pinning the baseline must reproduce the prior `DecisionRecord` byte-for-byte for any historical `trace_id`. If it does not, the rollback is theoretical; you reverted the surface but not the substrate. The contract is documented in Replay Is the Real Audit Log.
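A hedged sketch of that check; `replayRun`, `loadDecisionRecord`, and `sha256Canonical` are assumed helpers, not documented harness APIs:

```typescript
// Hypothetical post-rollback verification: replay historical traces under the
// pinned tuple and require byte-for-byte identical DecisionRecords.
import { replayRun } from "@/runtime/replay" // assumed API
import { loadDecisionRecord } from "@/store" // assumed API
import { sha256Canonical } from "@/lib/hash" // assumed canonical-JSON hash

export async function verifyRollback(traceIds: string[], pinnedPack: string) {
  for (const traceId of traceIds) {
    const original = await loadDecisionRecord(traceId) // what actually happened
    const replayed = await replayRun(traceId, { pack: pinnedPack })
    if (sha256Canonical(original) !== sha256Canonical(replayed.decision_record)) {
      throw new Error(`rollback is not replay-equivalent for ${traceId}`)
    }
  }
}
```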
## A worked example: `ctxpack.support@5.1.0` → `5.2.0`
The full sequence on a real-ish pack bump.
Day 0. Pack v5.2.0 lands in `harness/packs/`. The rollouts file gets a new entry:
```json
{
  "intent": "support.refund.execute",
  "baseline": "ctxpack.support@5.1.0",
  "candidate": "ctxpack.support@5.2.0",
  "stage": "0%_shadow",
  "started_at": "2026-05-09T09:00:00Z"
}
```

Day 0–3. Shadow traffic accumulates. After 1,200 runs, the daily check returns:
```
n_runs:                1247
n_policy_regressions:  0
n_safety_regressions:  0
d_utility:             +0.018
d_latency:             +0.022
d_cost:                -0.004
verdict:               advance
```

The advance script bumps to `1%_internal`.
Day 3–6. Internal traffic only. One person notices a slightly different refund-window phrasing in a corner case; that becomes a feedback entry and a regression test. The scorecard stays clean. Advance to `5%_low_risk` (`intent_class = "support_low_risk"`).
Day 6–11. Low-risk cohort. Latency holds. Utility lifts another 1.5% on net (the corner-case fix landed in v5.2.1, replayed clean). Advance to `25%_monitored`.
Day 11–14. A quarter of refund traffic. Tail-based sampling captures every escalation. Two days in, one finding from the security reviewer surfaces about a policy bundle the new pack assumes — a fix lands as v5.2.2. The kill switch is not fired because the scorecard never crossed a hard threshold; the fix is a normal pack bump.
Day 14. Advance to `100%`. v5.1.0 stays pinned for replay; the rollouts entry stays in the file with `stage: "100%"` so it serves as the authoritative record.
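Assuming the in-flight fixes were folded into the same entry (they could equally have run as their own rollouts), the final state of the entry looks like this:

```json
{
  "intent": "support.refund.execute",
  "baseline": "ctxpack.support@5.1.0",
  "candidate": "ctxpack.support@5.2.2",
  "stage": "100%",
  "started_at": "2026-05-09T09:00:00Z"
}
```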
If at any point the scorecard had crossed Policy or Safety, the kill switch would have fired and the entry would have been pinned back to baseline. That has happened; it does not happen often, and when it does, the gate catches the bad version at 5%, not at 100%.
## What this discipline costs you
A bit of patience. The full cycle takes ~14 days. Some teams chafe at this — the engineer who shipped the pack wants it out the door. The rebuttal is the day in production you would otherwise spend debugging a regression that the scorecard would have flagged at 0%. Every team I have worked with that adopted the five stages caught a near-miss regression in its first shadow cycle and never argued with the cadence again.
A small operational footprint. Five files: `router.ts`, `shadow.ts`, `checks/scorecard_delta.sql`, `advance.sh`, `kill_switch.sh`. Plus the rollouts JSON. Versioned in the repo. Reviewable in a PR.
A change in how the team thinks about “shipping”. Shipping a pack is not merging a PR. It is the merge plus 14 days of automated checks plus a kill switch that lives in cron. That mental model is the part you are paying for, and its dollar value shows up the first time a bad version dies at 5% instead of at 100%.
Wire the five stages this week. Bump one pack through them. By the third bump it is muscle memory.