Most production agent bugs I have helped debug start with the same symptom: “the prompt was wrong, and we don’t know why.” The team has a git blame on the prompt template, a memory of who edited it last week, and a vague suspicion that retrieval has changed. Two days later they pin the cause to a wiki page that got moved, an embedding store that got rebuilt, or a policy version that got promoted in the wrong environment.
Context Packs exist so that this bug stops happening. The pack is the small, named contract that records, for a given run, which sources were in scope, which tools the model could see, which policy bundles applied, and which bar the run had to clear before it could ship. When something goes wrong, the recovery starts with a pack version, not with archaeology.
This post is a practitioner’s tour. The spec covers the schema; this is how it lands in practice.
2026 update: make the pack the release unit
The practical lesson since this post was first written is that Context Packs should be treated as release artifacts, not configuration blobs. A pack promotion should carry the same seriousness as a service deployment: lint, golden-set replay, reviewer verdicts, signature, rollout stage, and rollback pointer.
The pack is the only artifact that cleanly joins product intent to runtime behavior. It names the buckets, policy bundles, tools, evidence requirements, memory scope, evaluation gates, and rollout constraints. If a change cannot be expressed as a pack diff or a referenced policy/tool/skill version, it is probably bypassing the harness.
What a Context Pack is, and isn’t
A Context Pack is a versioned, signed artifact that declares everything the runtime needs to compile per-request context for a given intent — the buckets the prompt is composed from, the budgets each bucket gets, the policy bundles in scope, the tool permissions, the runtime controls (must-refuse, must-escalate, redaction, approval gates), and the evaluation gates that block release.
A pack is not a prompt template. The pack is the contract that the Context Pack Compiler consumes to produce a CompiledContext — typed, budgeted, and bound to a RunContext. The prompt is downstream of the pack; the pack is upstream of the prompt.
The reason this matters is debuggability. Change a pack and behavior changes? You can name the cause: a pack version. Change a prompt template and behavior changes? You’re back in archaeology.
What the Compiler actually does
When invokeAgent fires, the Compiler runs eight stages in order. They are short individually; they are unforgiving in aggregate.
1. Resolve the intent by matching request.intent to a pack binding, e.g. ctxpack.support@5.2.0.
2. Select policy bundles by evaluating RunContext and intent against active bundles, collecting rules and obligations.
3. Surface tools by computing the intersection of registry, permissions, and prohibitions, expanding only the surfaced schemas, indexed by capability and capability_class.
4. Resolve evidence by pulling from the KG and memory under tenant + classification scoping.
5. Select memory by recalling promoted entries scoped by tenant_id and subject.
6. Allocate the token budget across the named buckets per the pack's declared shape.
7. Assemble the buckets, applying redaction rules.
8. Emit the manifest: CompiledContext plus runtime_controls for the Decision plane to consume.
The output is deterministic given the same inputs. That determinism is what makes the pack replayable, which is what makes audit possible.
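The determinism claim can be made concrete with a sketch. This is a hypothetical illustration, not the spec's API: compile_context, the manifest fields, and the mini pack shape are all illustrative. What it shows is that if compilation is a pure function of (request, pack) and the manifest is serialized canonically, the content hash is reproducible, which is exactly what replay needs.

```python
import hashlib
import json

# Hypothetical sketch: names and fields are illustrative, not the spec's API.
# The point: same inputs, same manifest, same hash.
def compile_context(request, pack):
    tooling = pack["tooling_layer"]
    manifest = {
        "intent": request["intent"],                                        # 1. resolve intent
        "pack": f"{pack['context_pack_id']}@{pack['version']}",
        "policy_bundles": sorted(b["id"] for b in pack["policy_bundles"]),  # 2. select bundles
        "tools": sorted(                                                    # 3. surface tools:
            p["permission_id"] for p in tooling["permissions"]              #    permissions minus
            if f"{p['adapter_id']}.{p['capability']}"                       #    prohibitions
            not in tooling.get("prohibitions", [])
        ),
        "budgets": {k: v["max_tokens"] for k, v in pack["buckets"].items()},  # 6. allocate budget
    }
    # 8. emit the manifest with a stable content hash (canonical serialization).
    blob = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return manifest, hashlib.sha256(blob.encode()).hexdigest()
```

Hash the manifest twice with the same inputs and you get the same digest; change any referenced version and the digest moves, which is what makes a pack diff meaningful.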
Bucket budgets in practice
A pack declares per-bucket budgets and source rules:
```json
{
  "buckets": {
    "policy":   { "max_tokens": 1800, "sources": ["policy_engine"], "ttl_sec": 3600 },
    "tool":     { "max_tokens": 1500, "sources": ["tool_registry"], "ttl_sec": 3600 },
    "evidence": { "max_tokens": 3500, "sources": ["kg", "tool_results"], "ttl_sec": 60 },
    "memory":   { "max_tokens": 1500, "sources": ["promoted_memory"], "ttl_sec": 86400 },
    "business": { "max_tokens": 1500, "sources": ["catalog", "policy_kb"], "ttl_sec": 3600 },
    "session":  { "max_tokens": 2200, "sources": ["session_state"], "ttl_sec": 7200 }
  }
}
```

A few things worth internalizing.
Budgets are part of the contract, not a soft target. “We’ll just expand the window” is a non-answer; the work of context engineering is the work of choosing what fits. Without a budget you ship silent context bloat: irrelevant retrieval crowds out the critical detail and you don’t notice until utility quietly drops on a downstream eval.
TTLs vary by bucket because the truth conditions vary. Evidence is fresh-or-discard; a stale order status is worse than no order status. Memory is durable; a year-old promoted preference is still valid. Policy lives in between. The TTL is part of correctness, not performance tuning.
Source priority is explicit, never implicit. When two sources can produce the same fact, the pack declares which wins. Otherwise you are running on the silent drift between systems and the answer changes depending on retrieval ordering. (See Context Engineering in Production for more on this.)
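The three rules above compose into one small routine. This is a minimal sketch under assumed field names (tokens, fetched_at, priority are mine, not the spec's): drop stale items first, order by explicit priority, and treat the budget as a hard limit rather than a target.

```python
def fill_bucket(candidates, max_tokens, ttl_sec, now):
    """Sketch of bucket assembly. Field names are assumptions for illustration."""
    # TTL is correctness, not tuning: stale items are discarded, not demoted.
    fresh = [c for c in candidates if now - c["fetched_at"] <= ttl_sec]
    # Source priority is explicit: lower number wins when sources conflict.
    fresh.sort(key=lambda c: c["priority"])
    chosen, used = [], 0
    for c in fresh:
        if used + c["tokens"] > max_tokens:
            continue  # overflow is dropped, never spilled into another bucket
        chosen.append(c)
        used += c["tokens"]
    return chosen
```

The design choice worth noting is the `continue` rather than a truncation: an item either fits whole or is excluded, so the assembled bucket is reproducible from the same candidate set.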
Policy bundles by reference, not by inlining
Policy bundles are declared by reference. The Compiler evaluates them and emits the matching rules into the policy bucket; the model never sees the source.
```json
{
  "policy_bundles": [
    {
      "id": "POLICY_RETURNS_V4",
      "version": "4.2.1",
      "priority": 10,
      "must_refuse": ["refund_guaranteed"],
      "approval_gates": ["GATE_FINANCE_APPROVAL"]
    }
  ]
}
```

Two reasons this matters. First, policy text in the prompt is a prompt-injection target. Second, the model sees decisions, not source code: a must_refuse is an enforced control, not a suggestion the model has to interpret. The runtime — not the model — decides what is allowed.
Policy bundles can downgrade approval modes within their priority but cannot upgrade past the wire-time declaration on the capability. The downgrade-only invariant is the load-bearing rule of the entire governance model; see Approval-Mode Tiers for the long version.
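The invariant is small enough to state in code. A minimal sketch, assuming the tier ordering read_only < delegated < destructive from the post (the function name and dict are mine): a bundle's requested mode is honored only when it is a downgrade; anything else is clamped to the wire-time declaration.

```python
# Assumed tier ordering from the post: read_only < delegated < destructive.
TIERS = {"read_only": 0, "delegated": 1, "destructive": 2}

def effective_mode(wire_time_mode, bundle_mode):
    """Downgrade-only: a bundle may tighten the approval mode, never loosen it."""
    if TIERS[bundle_mode] < TIERS[wire_time_mode]:
        return bundle_mode       # downgrade: allowed
    return wire_time_mode        # attempted upgrade: clamped to wire-time mode
```

So a bundle can force a destructive capability down to read_only for its scope, but no bundle can turn a read_only capability destructive.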
Evaluation gates as release gates
The pack declares the bar the runtime has to clear before the version can ship:
```json
{
  "evaluation": {
    "trust_benchmarks": {
      "policy_compliance_rate": { "min": 0.995 },
      "evidence_backed_rate": { "min": 0.98 },
      "tool_success_recovery_rate": { "min": 0.97 },
      "redaction_failure_rate": { "max": 0.0 }
    },
    "golden_sets": ["goldenset://support/refund/v3"],
    "replay_determinism_required": true
  }
}
```

The gate is mechanical. A pack version that fails any of these against its golden set does not promote. Scorecard deltas (per the evaluation and observability plane) catch regressions against the previous pack version before promotion. If you do not have a golden set, you do not have a release gate; the pack will regress, and you will find out from a customer.
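"Mechanical" means the check is a few lines. A minimal sketch (the function is mine, the threshold shape is the one above): every min must be met, every max must not be exceeded, and a missing metric fails closed.

```python
def gate_passes(thresholds, metrics):
    """Release gate check against trust-benchmark thresholds. Fails closed."""
    for name, bound in thresholds.items():
        value = metrics.get(name)
        if value is None:
            return False  # unmeasured is not the same as passing
        if "min" in bound and value < bound["min"]:
            return False
        if "max" in bound and value > bound["max"]:
            return False
    return True
```

The fail-closed branch is the part teams get wrong: a benchmark that silently stopped reporting should block promotion, not wave it through.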
Lifecycle, in plain terms
Packs move through a small set of states under change control:

- Authored by an owner role.
- Validated by lint: unreachable rules, missing evidence requirements, budget overflow, redaction-rule conflicts.
- Tested against the golden set, with scorecard deltas computed against the previous version.
- Approved through security/governance review for high-risk domains.
- Promoted into the registry as pack_id@semver with a content hash and signature.
- Pinned by environments that resolve packs only by exact version.
- Retired, eventually. Retired packs remain queryable for replay but cannot be re-promoted under the same version number.
The runtime refuses unsigned references and unpinned references. This is the load-bearing rule for replay determinism: tag-based references like latest or stable are not pinned, and a runtime that accepts them cannot guarantee replay.
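A pinned-reference check is cheap to enforce at resolution time. This sketch assumes the pack_id@semver shape used throughout the post; the regex and function name are mine, and a real implementation would also verify the content hash and signature, which this omits.

```python
import re

# pack_id@semver, e.g. ctxpack.support@5.2.0. Tags like @latest do not match.
PINNED = re.compile(r"^[a-z][a-z0-9_.-]*@\d+\.\d+\.\d+$")

def require_pinned(ref):
    """Refuse any pack reference that is not an exact semver pin."""
    if not PINNED.match(ref):
        raise ValueError(f"unpinned pack reference: {ref!r}")
    return ref
```

The useful property is that the refusal happens before any context is compiled, so an unpinned environment fails loudly at startup instead of quietly breaking replay later.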
A worked example
Here is a small pack for a support refund intent. It is realistic enough to compile and minimal enough to fit on screen:
```json
{
  "context_pack_id": "ctxpack.support",
  "version": "5.2.0",
  "owner": "support_ops",
  "intent": "support.refund",
  "buckets": {
    "policy":   { "max_tokens": 1800, "sources": ["policy_engine"], "ttl_sec": 3600 },
    "tool":     { "max_tokens": 1500, "sources": ["tool_registry"], "ttl_sec": 3600 },
    "evidence": { "max_tokens": 3500, "sources": ["kg", "tool_results"], "ttl_sec": 60 },
    "memory":   { "max_tokens": 1500, "sources": ["promoted_memory"], "ttl_sec": 86400 },
    "business": { "max_tokens": 1500, "sources": ["catalog", "policy_kb"], "ttl_sec": 3600 },
    "session":  { "max_tokens": 2200, "sources": ["session_state"], "ttl_sec": 7200 }
  },
  "tooling_layer": {
    "permissions": [
      { "permission_id": "perm_orders_lookup", "adapter_id": "adp_orders", "capability": "lookup", "approval_mode": "read_only" },
      { "permission_id": "perm_idv_verify", "adapter_id": "adp_idv", "capability": "verify", "approval_mode": "read_only" },
      { "permission_id": "perm_payments_refund_capped", "adapter_id": "adp_payments", "capability": "issue_refund",
        "approval_mode": "destructive",
        "arg_constraints": { "amount_inr": { "min": 1, "max": 50000 } } }
    ],
    "prohibitions": ["adp_billing.export_csv"]
  },
  "policy_bundles": [
    { "id": "POLICY_RETURNS_V4", "version": "4.2.1", "priority": 10 }
  ],
  "runtime_controls": {
    "must_refuse": ["refund_guaranteed"],
    "redaction_rules": ["card_pan", "cvv"]
  },
  "evaluation": {
    "trust_benchmarks": {
      "policy_compliance_rate": { "min": 0.995 },
      "evidence_backed_rate": { "min": 0.98 },
      "redaction_failure_rate": { "max": 0.0 }
    },
    "golden_sets": ["goldenset://support/refund/v3"]
  }
}
```

What this guarantees at runtime is concrete. The model sees three tool schemas, not the entire catalog. The destructive issue_refund cannot be invoked above 50,000 INR — a higher value is denied at the Critic before execute, with the offending value recorded on the Decision Record. “Refund guaranteed” claims are refused. Card numbers and CVVs are redacted from any string entering the prompt or returning from a tool. A run with redaction failure on a CONFIDENTIAL field fails the release gate, so the pack does not promote.
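The arg_constraints check is the simplest of these guarantees to sketch. A hypothetical Critic-side helper, assuming the min/max constraint shape from the pack above (the function name and return shape are mine): any argument outside its declared bounds produces a deny verdict carrying the offending values, which is what gets recorded on the Decision Record.

```python
def check_args(arg_constraints, args):
    """Pre-execute check: deny arguments outside the pack's declared bounds."""
    violations = {}
    for name, bound in arg_constraints.items():
        value = args.get(name)
        # A missing argument is a violation too: constraints imply required.
        if value is None or not (bound["min"] <= value <= bound["max"]):
            violations[name] = value
    return ("deny", violations) if violations else ("allow", {})
```

Note that the check runs in the runtime, before the adapter is called; the model never gets a chance to negotiate the cap.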
Patterns worth stealing
Across mature pack libraries, a few patterns show up repeatedly.
Per-tenant overlays are common. The base pack at ctxpack.support@5.2.0 declares the shape; tenant overlays adjust budgets, prohibitions, and policy bundles for specific deployments. The runtime composes them deterministically. This avoids the anti-pattern of forking a pack per tenant.
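One plausible composition rule, sketched under stated assumptions (the merge semantics here are mine, not the spec's): nested objects merge recursively, while scalars and lists in the overlay replace the base value outright. Whatever rule a runtime picks, the essential property is that it is deterministic and leaves the base pack untouched.

```python
def compose(base, overlay):
    """Deterministic pack overlay: dicts merge recursively, everything else
    in the overlay replaces the base value. Assumed semantics, for illustration."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = compose(base[key], value)
        else:
            merged[key] = value
    return merged
```

List replacement (rather than list merging) is the conservative choice: a tenant that overrides policy_bundles states the full set, so the effective pack is readable from the overlay alone plus the base.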
Skill bundles are useful for recurring micro-tasks. Identity verification, address validation, and similar routines get packaged as Skills — versioned bundles of capabilities + prompt fragments + golden sets — and packs reference them. The Skill carries its own evaluation set and ownership.
Evidence-class budgets help when evidence quality varies. Fresh tool reads are not the same artifact as cached KG snapshots; budgeting by class lets the Compiler prefer fresh evidence when budgets tighten, instead of cutting both proportionally.
Conservative tool surfacing is worth the friction. Default to read_only tooling, and require an explicit promotion (and the corresponding golden-set entries) to surface delegated or destructive capabilities. Most intents need only the former.
Pack readiness checklist
| Area | Ready means |
|---|---|
| Identity | context_pack_id, semver, owner, content hash, and signature are present. |
| Buckets | Each bucket has source rules, max tokens, freshness, redaction, and overflow behavior. |
| Policy | Bundles are referenced by id/version; runtime controls are emitted, not copied as prose. |
| Tools | Permissions, prohibitions, approval modes, and argument constraints resolve deterministically. |
| Evidence | Required evidence classes map to retrievers with snapshot refs and payload hashes. |
| Memory | Recall reads only promoted memory scoped by tenant, subject, intent, and classification. |
| Evaluation | Golden sets and scorecard thresholds block promotion on regressions. |
| Replay | The release tuple is pinned enough to rebuild the CompiledContext for a past run. |
Mistakes to avoid
The most common is inlining policy text into a business or evidence bucket. This recreates the prompt-injection problem the policy bucket was designed to avoid. Reference policy by id; let the Compiler emit the rules that match.
The second is treating bucket budgets as guidelines. A pack that lets evidence spill into business to “make room” is a pack that cannot be replayed deterministically. Budgets are limits, not suggestions. If you regularly need more room, the pack design is wrong, not the budget.
The third is shipping a pack without a golden set. A pack without a golden set has no release gate. It will regress; you will not know when. The first time you find out will not be on your terms.
The fourth is tag-based references. ctxpack.support@latest is not a pinned reference. The runtime should refuse it; if it does not, your replay story is fiction.
A closing thought
Context Packs are not ceremony. They are the smallest contract that makes per-request behavior versionable, replayable, and auditable. They turn “the prompt was wrong, and we don’t know why” into “the pack is ctxpack.support@5.2.0; here is the diff to the previous version; here is the failing case from the golden set.”
If your runtime today produces prompts directly from code, the migration is straightforward: name the buckets, declare the budgets, register the tool permissions, sign the result, and pin the version in the request envelope. Then walk away from prompt-template diff-debugging for good.