April 11, 2026
by Piyush · 7 min read

Pack Rollout in Five Stages: Shipping a Context Pack Without Blowing Up Production

ContextOS
Harness Engineering
Rollout
Feature Flags
Kill Switch

A pack version is not a release until traffic moves to it. The team that thinks otherwise eventually has the day where 100% of refund traffic hits a freshly-merged pack and the destructive-action rate doubles before the dashboards refresh. I have had that day. So has every team I have worked with that did not stage their rollouts.

Five stages is the number that buys you the most signal per dollar. Three is too coarse — there is no “internal” cohort to catch the obvious bugs. Seven is too fine — nobody has the patience to watch a dashboard for a week per stage. The five canonical stages from Harness Engineering are the ones I keep coming back to:

| Stage | What is true | Gate to advance |
| --- | --- | --- |
| `0%_shadow` | Runs in parallel; emits scorecard but does not affect outcome | Scorecard delta within bounds for ≥ N runs |
| `1%_internal` | Internal users only; full telemetry | No regression on safety / policy guardrails |
| `5%_low_risk` | Limited cohort; low-blast-radius intents only | Adoption rate of corrections trending down |
| `25%_monitored` | Broader rollout; tail-based sampling on every escalation | Evaluator scorecards stable across cohorts |
| `100%` | Full rollout; previous version pinned for replay | Two-week canary period clean |

This post is the operator’s version: the routing config, the gating queries, and the kill switch, with a worked example of bumping ctxpack.support@5.1.0 → ctxpack.support@5.2.0.

2026 update: rollout is part of replay

The routing rule is not deployment plumbing. It is part of the lineage. If a run cannot later prove why it received the baseline pack or the candidate pack, the rollout system has broken replay.

That is why the rollout tuple belongs in versioned configuration and why every stage transition needs a promotion record. Feature-flag services can still distribute decisions, but the source of truth for audit should be a signed rollout artifact with stage, cohort, thresholds, approver, and rollback target.
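A minimal shape for that promotion record might look like the sketch below. The field names are illustrative, not a schema this system defines:

```typescript
// Hypothetical shape for a signed promotion record. Field names are
// illustrative; the post specifies only that stage, cohort, thresholds,
// approver, and rollback target must be captured.
type PromotionRecord = {
  intent: string
  candidate: string            // e.g. "ctxpack.support@5.2.0"
  from_stage: string
  to_stage: string
  cohort: string               // who was eligible at the new stage
  thresholds: Record<string, number>
  approver: string             // named human, or "auto" for clean verdicts
  rollback_target: string      // the baseline to pin if the kill switch fires
  signed_at: string            // ISO-8601 timestamp
  signature: string            // detached signature over the fields above
}

const record: PromotionRecord = {
  intent: "support.refund.execute",
  candidate: "ctxpack.support@5.2.0",
  from_stage: "0%_shadow",
  to_stage: "1%_internal",
  cohort: "internal_users",
  thresholds: { min_runs: 1000, max_policy_regressions: 0 },
  approver: "auto",
  rollback_target: "ctxpack.support@5.1.0",
  signed_at: "2026-05-12T09:00:00Z",
  signature: "<detached signature>",
}
```

The point of the shape is auditability: every stage transition leaves one of these, and the rollback target is recorded before the transition happens, not reconstructed after.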

The routing config

Routing is decided by a small TypeScript module, not by a feature-flag UI. The reason is replay — the routing decision needs to be reproducible from the run inputs alone.

harness/rollout/router.ts
import type { RunContext } from "@/types"
import { rollouts } from "./rollouts.json"          // versioned in the repo
import { defaultPackFor } from "./defaults"         // per-intent pinned default
 
type Stage = "0%_shadow" | "1%_internal" | "5%_low_risk" | "25%_monitored" | "100%"
type RolloutEntry = {
  intent: string
  baseline: string             // "ctxpack.support@5.1.0"
  candidate: string            // "ctxpack.support@5.2.0"
  stage: Stage
  cohort?: { internal_only?: boolean; intent_classes?: string[] }
  started_at: string
}
 
export function pickPackVersion(ctx: RunContext): string {
  const entry = rollouts.find(
    (r) => r.intent === ctx.intent || r.intent === "*",
  ) as RolloutEntry | undefined
 
  if (!entry) return defaultPackFor(ctx.intent)
 
  // hash-bucket the user/tenant for stable routing across requests
  const bucket = hashBucket(ctx.tenant_id, ctx.user_id)   // 0..99
 
  switch (entry.stage) {
    case "0%_shadow":
      return entry.baseline       // candidate runs in parallel; see shadow.ts
    case "1%_internal":
      // internal traffic is the cohort; it is ~1% of total volume
      return ctx.is_internal ? entry.candidate : entry.baseline
    case "5%_low_risk":
      return entry.cohort?.intent_classes?.includes(ctx.intent_class) && bucket < 5
        ? entry.candidate
        : entry.baseline
    case "25%_monitored":
      return bucket < 25 ? entry.candidate : entry.baseline
    case "100%":
      return entry.candidate
  }
}
 
function hashBucket(tenant: string, user: string): number {
  // FNV-1a → 0..99; deterministic across processes
  let h = 2166136261
  for (const c of `${tenant}:${user}`) {
    h ^= c.charCodeAt(0)
    h = Math.imul(h, 16777619)
  }
  return (h >>> 0) % 100   // unsigned cast: Math.imul can return negative values
}

Two design notes worth pausing on.

The bucket is a deterministic hash of (tenant_id, user_id). The same user always lands in the same bucket. Without that, a user might see the candidate pack on one request and the baseline on the next, and you cannot diagnose anything from that traffic mix.
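A quick standalone check of that property, using the same FNV-1a scheme (with an unsigned cast before the modulo, since Math.imul can produce negative values):

```typescript
// Standalone copy of the FNV-1a bucketing so the determinism property
// can be exercised in isolation from the router.
function hashBucket(tenant: string, user: string): number {
  let h = 2166136261
  for (const c of `${tenant}:${user}`) {
    h ^= c.charCodeAt(0)
    h = Math.imul(h, 16777619)
  }
  return (h >>> 0) % 100   // unsigned before modulo; always in 0..99
}

// The same (tenant, user) pair lands in the same bucket on every call,
// in every process, so a user never flips between baseline and candidate.
const first = hashBucket("acme", "u-123")
const second = hashBucket("acme", "u-123")
// first === second, always
```

Note that the bucket is a pure function of the run inputs, which is exactly what makes the routing decision replayable later.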

The candidate version lives in a JSON file (rollouts.json) that is in the repo, not in a runtime feature-flag service. The reason is replay: when the regulator asks what version was active for trace_id=abc, you produce a git blame of rollouts.json, not a screenshot of a flag UI. The history of the rollout is the history of the file.

The shadow stage, in detail

The most useful and least understood stage is 0%_shadow. It runs both packs on every eligible request and compares scorecards. Production traffic gets the baseline result; the candidate result is recorded but not returned.

harness/rollout/shadow.ts
import type { RunContext, InvokeRequest } from "@/types"
import { compileAndRun } from "@/runtime"
import { scoreRun } from "@/evals/scorecard"
import { writeShadowComparison } from "@/store"
 
export async function runWithShadow(
  ctx: RunContext,
  req: InvokeRequest,
  baseline: string,
  candidate: string,
) {
  // baseline returns to the user
  const live = await compileAndRun(baseline, ctx, req)
 
  // candidate runs in parallel — never returned, never executes destructive
  const shadow = await compileAndRun(candidate, ctx, req, {
    block_destructive: true,    // dry-run only, no external effects
  }).catch((err) => ({ error: err.message }))
 
  // record both scorecards side by side
  await writeShadowComparison({
    trace_id: live.trace_id,
    intent: req.intent,
    baseline: { pack: baseline, scorecard: scoreRun(live) },
    candidate: "error" in shadow
      ? { pack: candidate, error: shadow.error }
      : { pack: candidate, scorecard: scoreRun(shadow) },
    written_at: new Date().toISOString(),
  })
 
  return live
}

The block_destructive: true flag is non-negotiable in shadow. The candidate must not refund anything, must not write to durable stores, must not call out to side-effecting APIs. The shadow’s job is to predict; the live run’s job is to execute.
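One way such a guard can be enforced at the tool boundary is sketched below. The `ToolCall` and `RunOptions` shapes are assumptions for illustration, not the runtime's real interface:

```typescript
// Hypothetical tool-boundary guard: when block_destructive is set, any
// tool marked destructive raises instead of executing. The shapes here
// are illustrative, not the runtime's actual types.
type ToolCall = { name: string; destructive: boolean; execute: () => unknown }
type RunOptions = { block_destructive?: boolean }

function invokeTool(call: ToolCall, opts: RunOptions): unknown {
  if (opts.block_destructive && call.destructive) {
    // Shadow runs record the *intent* to call, never the effect.
    throw new Error(`blocked destructive tool in shadow: ${call.name}`)
  }
  return call.execute()
}
```

Read-only tools still run, so the shadow scorecard reflects real retrieval and reasoning; only side effects are fenced off.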

The advance-stage check

Advancing from one stage to the next is a SQL query, not a meeting. The queries live in harness/rollout/checks/ and run on the experience store:

harness/rollout/checks/scorecard_delta.sql
-- Used to gate 0%_shadow → 1%_internal
WITH paired AS (
  SELECT
    s.intent,
    s.trace_id,
    s.candidate_pack,
    s.baseline_pack,
    s.candidate_score_policy,
    s.baseline_score_policy,
    s.candidate_score_safety,
    s.baseline_score_safety,
    s.candidate_score_utility,
    s.baseline_score_utility,
    s.candidate_score_latency,
    s.baseline_score_latency,
    s.candidate_score_cost,
    s.baseline_score_cost
  FROM shadow_comparisons s
  WHERE s.intent = :'intent'
    AND s.candidate_pack = :'candidate'
    AND s.written_at >= :'since'
),
agg AS (
  SELECT
    COUNT(*) AS n_runs,
    -- hard-fail evaluators: any candidate < baseline is a problem
    SUM(CASE WHEN candidate_score_policy < baseline_score_policy THEN 1 ELSE 0 END) AS n_policy_regressions,
    SUM(CASE WHEN candidate_score_safety < baseline_score_safety THEN 1 ELSE 0 END) AS n_safety_regressions,
    -- soft evaluators: mean delta
    AVG(candidate_score_utility - baseline_score_utility) AS d_utility,
    AVG(candidate_score_latency - baseline_score_latency) AS d_latency,
    AVG(candidate_score_cost    - baseline_score_cost   ) AS d_cost
  FROM paired
)
SELECT
  n_runs,
  n_policy_regressions,
  n_safety_regressions,
  ROUND(d_utility::numeric, 4) AS d_utility,
  ROUND(d_latency::numeric, 4) AS d_latency,
  ROUND(d_cost::numeric,    4) AS d_cost,
  CASE
    WHEN n_runs < 1000                       THEN 'block: insufficient sample'
    WHEN n_policy_regressions > 0            THEN 'block: policy regression'
    WHEN n_safety_regressions > 0            THEN 'block: safety regression'
    WHEN d_utility < -0.05                   THEN 'needs_human: utility regression'
    WHEN d_latency < -0.10                   THEN 'needs_human: latency regression'
    WHEN d_cost    < -0.10                   THEN 'needs_human: cost regression'
    ELSE 'advance'
  END AS verdict
FROM agg;

Three things this query does that an eyeball check does not.

It refuses to advance on insufficient data. A thousand runs is a small number for support traffic but a large one for low-volume intents; pick the threshold per intent class. The point is that it is a number, not a feeling.

It distinguishes hard fails (Policy, Safety) from soft fails (Utility, Latency, Cost). A hard fail blocks. A soft fail kicks the verdict into needs_human, which means a named approver has to sign off citing the regression they accepted.

It returns one of three discrete verdicts: advance, block, needs_human. The advance script reads that verdict and acts accordingly — no interpretation, no edge cases.

The advance script

A short shell script that wraps the query and either bumps the stage in rollouts.json or files a review request:

#!/usr/bin/env bash
# harness/rollout/advance.sh
# Usage: ./advance.sh <intent> <candidate_pack>
set -euo pipefail
 
INTENT="${1:?intent required}"
CANDIDATE="${2:?candidate pack required}"
SINCE="$(date -u -d '24 hours ago' +%FT%TZ)"
 
# -At emits pipe-delimited rows; the verdict is the last column of the last row
VERDICT=$(psql "$EXPSTORE_URL" -At -v intent="$INTENT" -v candidate="$CANDIDATE" -v since="$SINCE" \
  -f harness/rollout/checks/scorecard_delta.sql | tail -1 | awk -F'|' '{print $NF}')
 
case "$VERDICT" in
  advance)
    echo "verdict: advance — bumping stage in rollouts.json"
    ./scripts/bump_stage.ts "$INTENT" "$CANDIDATE"
    git add harness/rollout/rollouts.json
    git commit -m "rollout($INTENT): advance stage for $CANDIDATE"
    ;;
  needs_human:*)
    echo "verdict: needs_human — opening review request"
    ./scripts/open_review.ts "$INTENT" "$CANDIDATE"
    ;;
  block:*)
    echo "verdict: $VERDICT — staying on current stage; investigate"
    exit 1
    ;;
  *)
    echo "unknown verdict: $VERDICT"
    exit 2
    ;;
esac

Wire this into a daily cron, alongside the scorecard rollup. Stage advances become a side effect of clean numbers, not a calendar event.
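A crontab entry for that wiring might look like this (the paths, schedule, and intent/pack arguments are all illustrative):

```
# Illustrative crontab entry: run the advance check daily at 06:00 UTC,
# after the overnight scorecard rollup has landed.
0 6 * * *  cd /srv/harness && ./harness/rollout/advance.sh support.refund.execute ctxpack.support@5.2.0 >> /var/log/rollout-advance.log 2>&1
```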

The kill switch

The kill switch is a single command and a 30-second runbook. It pins the prior version of every artifact in the release tuple and re-routes traffic.

Before a candidate leaves shadow, the operator should be able to point at this table and show the evidence:

| Gate | Evidence required |
| --- | --- |
| Shadow clean | Paired scorecard deltas for baseline vs candidate across the required sample size |
| Internal clean | No hard-gate Policy/Safety failures and no unresolved reviewer blocks |
| Low-risk clean | Correction rate and escalation rate are stable or improving |
| Monitored clean | Tail samples include all destructive, denied, loop-guard, and scorecard-fail runs |
| Full rollout | Rollback tuple is pinned and replay verifies both old and new pack versions |

#!/usr/bin/env bash
# harness/rollout/kill_switch.sh
# Usage: ./kill_switch.sh <intent> [reason]
set -euo pipefail
 
INTENT="${1:?intent required}"
REASON="${2:-no reason given}"
 
# 1. drop the candidate from rollouts.json (atomic)
./scripts/pin_baseline.ts "$INTENT"
 
# 2. emit a kill_switch event into the experience store
#    (psql variables avoid quoting/injection problems in REASON)
psql "$EXPSTORE_URL" -v i="$INTENT" -v r="$REASON" <<'SQL'
INSERT INTO kill_switch_events (intent, reason, fired_at)
VALUES (:'i', :'r', NOW());
SQL
 
# 3. notify the on-call channel
curl -sS -X POST "$ONCALL_WEBHOOK" \
  -H 'content-type: application/json' \
  -d "$(jq -n --arg i "$INTENT" --arg r "$REASON" \
    '{ text: "kill switch fired for " + $i + " — reason: " + $r }')"
 
echo "kill switch fired for $INTENT — pinned baseline, traffic re-routing"

Two principles the kill switch must respect.

It pins the entire release tuple — pack, policy, tool registry, evaluator suite — to the prior known-good versions. A kill switch that only reverts one of the four is a kill switch that leaves stale combinations live. The tuple is the unit of release; the tuple is the unit of rollback.

It must be replay-equivalent. Pinning the baseline must reproduce the prior DecisionRecord byte-for-byte for any historical trace_id. If it does not, the rollback is theoretical; you reverted the surface but not the substrate. The contract is documented in Replay Is the Real Audit Log.
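A replay-equivalence check can be as blunt as byte comparison of the canonically serialized records. A sketch, where `DecisionRecord` and its fields are assumed names rather than the system's real types:

```typescript
// Sketch of a replay-equivalence check. The DecisionRecord shape is an
// assumption for illustration; the real record carries more fields.
type DecisionRecord = { trace_id: string; pack: string; decision: string }

function assertReplayEquivalent(
  recorded: DecisionRecord,
  replayed: DecisionRecord,
): void {
  // Serialize both with a stable key order, then compare byte-for-byte.
  const canon = (r: DecisionRecord) =>
    JSON.stringify(r, Object.keys(r).sort())
  if (canon(recorded) !== canon(replayed)) {
    throw new Error(`replay mismatch for ${recorded.trace_id}`)
  }
}
```

Run this over a sample of historical trace_ids after pinning the baseline; if any trace fails, the rollback reverted the surface but not the substrate.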

A worked example: ctxpack.support@5.1.0 → 5.2.0

The full sequence on a real-ish pack bump.

Day 0. Pack v5.2.0 lands in harness/packs/. Rollouts file gets a new entry:

{
  "intent": "support.refund.execute",
  "baseline": "ctxpack.support@5.1.0",
  "candidate": "ctxpack.support@5.2.0",
  "stage": "0%_shadow",
  "started_at": "2026-05-09T09:00:00Z"
}

Day 0–3. Shadow traffic accumulates. After 1,200 runs, the daily check returns:

n_runs:                 1247
n_policy_regressions:      0
n_safety_regressions:      0
d_utility:           +0.018
d_latency:           +0.022
d_cost:              -0.004
verdict:             advance

The advance script bumps to 1%_internal.

Day 3–6. Internal traffic only. One person notices a slightly different refund-window phrasing in a corner case; that becomes a feedback entry and a regression test. Scorecard stays clean. Advance to 5%_low_risk (intent_class = “support_low_risk”).

Day 6–11. Low-risk cohort. Latency holds. Utility lifts another 1.5% on net (the corner-case fix landed in v5.2.1, replayed clean). Advance to 25%_monitored.

Day 11–14. A quarter of refund traffic. Tail-based sampling captures every escalation. Two days in, one finding from the security reviewer surfaces about a policy bundle the new pack assumes — a fix lands as v5.2.2. The kill switch is not fired because the scorecard never crossed a hard threshold; the fix is a normal pack bump.

Day 14. Advance to 100%. v5.1.0 stays pinned for replay; the rollouts entry stays in the file with stage: "100%" so it serves as the authoritative record.

If at any point the scorecard had crossed Policy or Safety, the kill switch would have fired and the entry would have been pinned back to baseline. That has happened; it does not happen often, because the gate caught the bad version while it was at 5%, not at 100%.

What this discipline costs you

A bit of patience. The full cycle takes ~14 days. Some teams chafe at this — the engineer who shipped the pack wants it out the door. The rebuttal is the day in production you spent debugging a regression that the scorecard would have flagged at 0%. Every team I have worked with that adopted the five stages caught a near-miss regression in its first shadow cycle, and after that nobody argued with the cadence again.

A small operational footprint. Five files: router.ts, shadow.ts, checks/scorecard_delta.sql, advance.sh, kill_switch.sh. Plus the rollouts JSON. Versioned in the repo. Reviewable in a PR.

A change in how the team thinks about “shipping”. Shipping a pack is not merging a PR. It is the merge plus 14 days of automated checks plus a kill switch that lives in cron. That mental model is the part you are paying for, and the dollar value of having it shows up the first time something almost regresses.

Wire the five stages this week. Bump one pack through them. By the third bump it is muscle memory.
