A team I worked with last quarter shipped an internal copilot with great early reviews. Two months in, an engineer asked it about the on-call schedule. The agent confidently pointed her at a person who had left the company in October. The agent was not wrong about its sources — the wiki had not been updated. The agent was not wrong about retrieval — it had ranked the most-recently-edited document highest. The agent was wrong because nobody had decided who owned the wiki page, and “most recent edit” turned out to mean “an intern moved a heading.”
That is the texture of real production failure for agents. It looks like a model bug. It is almost always a context bug.
This post is about treating context as its own engineering discipline — not as something you stuff into a window at runtime, but as a contract with owners, schemas, freshness, and tests.
2026 update: context has a contract now
When this piece was first written, the useful framing was “context engineering.” That is still right, but it is not precise enough for a production team. The ContextOS version is sharper: context is a compiled artifact with a named input (ContextPack), a named output (CompiledContext), and a recorded lineage on every DecisionRecord.
That distinction matters because a production incident does not ask “what did the prompt say?” It asks which pack version was pinned, which source won the priority rule, which evidence refs were admitted, which tools were surfaced, which policy bundle was active, and whether replay can rebuild the same decision from the same inputs. If the answer is a prompt template diff, the context layer is still informal.
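To make that concrete, here is roughly what the lineage might look like as types. This is a sketch, not a published ContextOS schema; the field names beyond the ones listed above are mine.

```typescript
// Illustrative shape for the lineage a DecisionRecord carries. The questions
// it answers come from the paragraph above; the exact field names are an
// assumption, not a ContextOS-published schema.
interface DecisionLineage {
  packVersion: string;                 // which pack version was pinned
  policyBundleId: string;              // which policy bundle was active
  winningSources: string[];            // which sources won the priority rule
  evidenceRefs: string[];              // which evidence refs were admitted
  toolsSurfaced: string[];             // which tools were surfaced
  inputHashes: Record<string, string>; // so replay can rebuild the decision
}
```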
Where context actually fails
The temptation when something like the on-call story happens is to spend a sprint on retrieval quality. Tune the embedder. Adjust the chunk size. Add a reranker. None of those would have helped here, because the failure was not in retrieval. It was in what was allowed to be a source of truth.
The failure modes I see most often have the same flavor:

- Stale context: policies and prices change, but caches do not.
- Conflicting sources: the wiki and the policy KB say different things, with no priority rule.
- Missing provenance: a confident answer that no one can trace back to a source.
- Unbounded packing: the prompt swells to thirty thousand tokens and the critical detail gets bumped.
- No drift detection: a schema migration silently breaks a downstream prompt.
- Untrusted text reaching the prompt: an indirect prompt injection from a stray document.
The common pattern is that none of these are model bugs. They are gaps between the substrate (your knowledge), the per-request decision (what to ground on), and the regime that says which sources are trustworthy. That gap is what context engineering is for.
The Context plane, in one paragraph
ContextOS organizes the runtime as five planes — Intelligence, Context, Decision, Action, Trust. The Context plane sits between the slow-moving Intelligence substrate and the fast-moving Decision loop. Its job is to take a RunContext (intent, claims, budgets, the policy bundle in effect right now) and a pinned ContextPack, and produce a typed CompiledContext for the Decision plane to consume. That output is composed of named buckets — policy, tool, evidence, memory, business, session — each with its own budget, redaction rules, and source priority.
When you separate Context from its neighbors, four things become possible: the pack is versionable (you can ship ctxpack.support@5.2.0 and pin it per request), testable (a golden set runs against the pinned pack), replayable (the Compiler is deterministic given the same inputs), and enforceable (tool surfaces, redaction rules, and approval gates are baked into the manifest at compile time, not whispered into the prompt at runtime).
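As a shape, not a spec: a CompiledContext might look something like the following, with the six bucket names taken from above and everything else illustrative.

```typescript
// A minimal sketch of a CompiledContext. The six bucket names are from the
// text; the Bucket shape (entries, budget, redaction) is an assumption.
interface BucketEntry {
  content: string;
  sourceId: string;  // provenance: where this fact came from
  timestamp: string; // when the source last asserted it
}

interface Bucket {
  entries: BucketEntry[];
  tokenBudget: number;      // hard cap, enforced at compile time
  redactionRules: string[]; // applied before anything reaches the prompt
}

interface CompiledContext {
  packVersion: string; // pinned, e.g. "ctxpack.support@5.2.0"
  buckets: Record<
    "policy" | "tool" | "evidence" | "memory" | "business" | "session",
    Bucket
  >;
}
```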
Five practices that make context shippable
1. Treat context contracts like APIs
The team in the on-call story did not have a contract for their wiki. Anyone could edit it; nobody owned it; no schema enforced what counts as an “on-call entry.” When you treat context as a supply chain, every source is an API: it has an owner, a schema, a refresh cadence, a TTL, a sensitivity class, and a priority rule that says how it composes with other sources of the same fact. If your retrieval source is Confluence pages tagged “ops”, that is fine — but write it down as a contract, version it, and review it like any other API.
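Written down, the contract is small. Here is a hypothetical one for the on-call source; the field set matches the list above, and every concrete value is invented.

```typescript
// Hypothetical contract for the on-call wiki source. The required fields
// (owner, schema, cadence, TTL, sensitivity, priority) come from the text;
// the concrete values and the shape are illustrative.
const onCallContract = {
  sourceId: "wiki/ops/on-call",
  owner: "sre-team",              // a named owner, not "whoever edited last"
  schema: "on-call-entry@2",      // what counts as a valid entry
  refreshCadence: "per-rotation", // how often the source must be re-confirmed
  ttlSeconds: 24 * 3600,          // after this, the entry is stale, not true
  sensitivity: "internal",
  priority: 100,                  // wins over general wiki pages for this fact
} as const;
```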
2. Retrieve with intent, not by similarity
Retrieval is not a global similarity search. It is task decomposition. For a refund decision, the agent needs to ground on eligibility, on policy applicability, on latest order facts, on tool availability, on identity verification status. Each of these is a separate retrieval question with its own source priority. Smashing them all into one vector query is how the wrong document wins.
The Context plane resolves these as separate evidence requirements declared on the Decision Spec, which the Compiler turns into separate retrievals with separate scoping. The result is auditable: each evidence_ref traces back to the question it answered.
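A sketch of that decomposition, using the refund questions from above. The EvidenceRequirement shape and the retrieve signature are assumptions, not a ContextOS API.

```typescript
// One scoped retrieval per question, not one global similarity search.
interface EvidenceRequirement {
  question: string;       // the question this retrieval must answer
  sources: string[];      // allowed sources, highest priority first
  maxAgeSeconds?: number; // freshness constraint for this question
}

const refundRequirements: EvidenceRequirement[] = [
  { question: "refund eligibility", sources: ["policy-kb"], maxAgeSeconds: 86_400 },
  { question: "policy applicability", sources: ["policy-kb"] },
  { question: "latest order facts", sources: ["orders-api"], maxAgeSeconds: 300 },
  { question: "identity verification status", sources: ["identity-service"], maxAgeSeconds: 600 },
];

// Each hit keeps the question it answered, so every evidence_ref is auditable.
async function gatherEvidence(
  reqs: EvidenceRequirement[],
  retrieve: (req: EvidenceRequirement) => Promise<{ ref: string; content: string }[]>,
) {
  return Promise.all(
    reqs.map(async (req) => ({ question: req.question, hits: await retrieve(req) })),
  );
}
```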
3. Pack into typed buckets with budgets
Token budget is a scarce resource, and “we’ll just use a bigger context window” is not a strategy. The Context plane treats the budget as part of the contract, distributing it across the named buckets per the pack’s declared shape. Most regressions show up first as bucket-budget pressure — the evidence bucket gets crowded by stale memory, or the tool bucket grows because someone added five new schemas to the registry. Watching budget pressure per intent is the early-warning signal that I have not seen any other monitoring catch.
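A minimal sketch of what per-bucket packing and that signal could look like. The pressure ratio is my framing of the early-warning metric, not anything ContextOS defines.

```typescript
// Per-bucket packing under a hard cap. Entries are assumed pre-sorted by
// priority, so whatever gets dropped is the lowest-priority tail.
function packBucket(entries: { content: string; tokens: number }[], tokenBudget: number) {
  const packed: { content: string; tokens: number }[] = [];
  let used = 0;
  for (const entry of entries) {
    if (used + entry.tokens > tokenBudget) break; // budget is a hard cap
    packed.push(entry);
    used += entry.tokens;
  }
  const requested = entries.reduce((sum, e) => sum + e.tokens, 0);
  // pressure > 1 means the bucket dropped entries; alert on this per intent
  return { packed, pressure: requested / tokenBudget };
}
```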
4. Make every claim provenance-bound
Every fact entering a bucket carries a source and a timestamp. Every fact the run grounds on becomes an evidence_ref on the Decision Record. If a claim cannot be linked to evidence, the run does not assert it: it marks it uncertain, defers, or escalates. The on-call story would have ended with “I don’t have a confidently current source for the on-call schedule” instead of a confident wrong answer.
This is not a UX feature. It is a trust contract. The runtime owes evidence; the model proposes; the Decision Record records what was actually grounded.
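The admission rule fits in a dozen lines. A sketch, with the outcome names from the text and the rest invented:

```typescript
// The "no evidence, no assertion" rule. The outcome names (defer, escalate)
// follow the text; the function shape is an illustration, not an API.
type ClaimOutcome =
  | { kind: "assert"; evidenceRef: string }
  | { kind: "defer" | "escalate"; reason: string };

function admitClaim(
  claim: string,
  evidence: Map<string, { ref: string; timestampMs: number }>,
  maxAgeMs: number,
): ClaimOutcome {
  const found = evidence.get(claim);
  if (!found) return { kind: "defer", reason: `no source for: ${claim}` };
  if (Date.now() - found.timestampMs > maxAgeMs) {
    return { kind: "defer", reason: "no confidently current source" };
  }
  return { kind: "assert", evidenceRef: found.ref };
}
```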
5. Cache safely, invalidate explicitly
Caching at scale is non-negotiable, and caching wrong is worse than no cache at all. The pattern that works is per-bucket TTLs paired with event-based invalidation: stable data gets a long TTL, but a policy change or a price update fires an invalidation event that drops cached entries immediately. The cached read-only tool aliases that the Tool Gateway exposes are conservative by design — discovery only, allow-listed tools only, execution still routes through the Gateway. Conservative wins.
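The two invalidation paths compose cleanly. A minimal sketch, with invented names:

```typescript
// TTL plus event-based invalidation. The class and method names are
// assumptions; only the two invalidation paths come from the text.
class BucketCache<V> {
  private entries = new Map<string, { value: V; sourceId: string; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  set(key: string, value: V, sourceId: string): void {
    this.entries.set(key, { value, sourceId, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry || Date.now() > entry.expiresAt) return undefined; // TTL path
    return entry.value;
  }

  // Event path: a policy change or price update drops every cached entry
  // from that source immediately, without waiting for the TTL.
  invalidateSource(sourceId: string): void {
    for (const [key, entry] of this.entries) {
      if (entry.sourceId === sourceId) this.entries.delete(key);
    }
  }
}
```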
Observability you can actually operate
Context systems need the same visibility as runtime systems. The minimum surface to operate is small: pinned pack_version per run, retrieval set with scores, provenance for every critical claim, tool calls with policy_decision_id per call, and a scorecard at the end. Replay determinism — the same trace_id reproducing the same Decision Record — is the most useful single signal that the Context plane is healthy.
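As a record, that minimum surface is one small struct per run. Field names follow the list above; the exact shape is illustrative.

```typescript
// The minimum observable surface described above, one record per run.
interface RunTrace {
  traceId: string;                                  // replay key
  packVersion: string;                              // pinned pack for this run
  retrievals: { ref: string; score: number }[];     // retrieval set with scores
  provenance: Record<string, string>;               // critical claim -> source
  toolCalls: { tool: string; policyDecisionId: string }[];
  scorecard: Record<string, number>;                // end-of-run quality scores
}
```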
If you cannot measure context quality, you cannot improve it. If you cannot trace a confident wrong answer back to a source, you are still in demo mode.
A working checklist
A context system you can trust answers yes to all of these:
| Check | Production-grade answer |
|---|---|
| Can I reproduce the exact context for a past run? | Every run records pack_version, content hash, source snapshot refs, bucket budgets, and compiler version. |
| Do I know which policy bundles were active? | Policy decisions point to signed bundle ids and rule ids, not copied prompt text. |
| Can I prove where every numeric claim came from? | Numeric claims carry evidence_refs into the Decision Record, with source, timestamp, and payload hash. |
| Do I have a safe degradation mode? | Missing or stale evidence produces defer, ask_for_info, or escalate, never confident assertion. |
| Can I invalidate context when truth changes? | Each bucket has TTL plus event-based invalidation from its source owner. |
| Can I detect schema drift before it reaches the model? | Source contracts run schema checks and golden-set replays before a pack promotes. |
| Do I treat untrusted text as evidence, not instruction? | User documents, tool output, and memory recall stay in evidence/memory buckets; authority lives in policy and tool manifests. |
If the answer to any of these is “we’d have to look that up,” you are already carrying the incident. You just have not replayed it yet.
A closing note
The on-call story above ended better than it might have. The team named an owner for the wiki page, added a freshness check to the on-call source, and pushed the change through the agent’s pack as a new version. The recovery took an hour because the diagnosis took five minutes — not because the fix was clever, but because the runtime knew which version of which pack the agent had been running, and that single piece of information let everything else fall into place.
That is the practical case for context engineering. It does not make agents impressive. It makes them debuggable. In production, that is the only thing that matters.