Evaluation and Observability
Evaluators, OTEL traces, replay, and the continuous improvement loop.
Evaluators + OTEL tracing + replay — the trust-plane stack that scores every run and gates every release.
- trace
- score
- replay
- gate
- Scorecard
- TraceBundle
- ReplayDataset
- GoldenSet
- ReleaseVerdict
Evaluation and Observability is the Trust-plane capability that turns runtime behavior into measurable, replayable evidence. Without it, the other planes are unverifiable.
Definition
A coordinated stack of: OTEL-first tracing that records every plane and primitive that touched a request; evaluators that score each run on Policy / Utility / Latency / Safety / Economics; golden sets and replay harnesses that catch regressions before release; and continuous improvement primitives that turn observed failures into proposed fixes.
Why it exists
LLM-driven systems drift. Models change, evidence changes, policies change, tools change. A trustworthy runtime cannot be “tested at release and trusted forever” — it must be measured on every run, replayable for any past run, and improved on a closed loop. This plane is what makes that loop concrete.
How it works
- Instrument: every primitive (compiler, planner, executor, critic, gateway, memory) emits OTEL spans with W3C trace context propagated end-to-end.
- Score: each completed run is scored against evaluators with rubric-bound metrics.
- Sample online: a configurable fraction of production runs route to LLM-as-judge plus rule-based scorers.
- Replay offline: golden sets and recorded production runs are replayed on every change to the planner, prompt, pack, or model.
- Gate releases: a release is blocked if scorecard deltas exceed configured thresholds.
- Improve: failed and corrected runs feed the Insight Synthesizer, Strategy Compiler, and Feedback Store that propose changes back into the catalogs and packs.
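The gate step above can be sketched as a threshold comparison over scorecard deltas. This is a minimal sketch, assuming every dimension has been normalized to a "higher is better" score in [0, 1] and that a threshold is the largest tolerated regression. The names `gate_release` and `GateVerdict` and all numeric values are hypothetical, not part of ContextOS.

```python
from dataclasses import dataclass, field

# Dimension names mirror the five evaluator dimensions (assumed keys).
DIMENSIONS = ("policy", "utility", "latency", "safety", "economics")


@dataclass
class GateVerdict:
    passed: bool
    blocked_on: list = field(default_factory=list)


def gate_release(baseline: dict, candidate: dict, max_regression: dict) -> GateVerdict:
    """Block the release if any dimension regresses by more than its threshold.

    A regression is baseline minus candidate, so positive values are worse.
    """
    blocked = []
    for dim in DIMENSIONS:
        regression = baseline[dim] - candidate[dim]
        if regression > max_regression[dim]:
            blocked.append(dim)
    return GateVerdict(passed=not blocked, blocked_on=blocked)


# Illustrative values only.
baseline = {"policy": 1.0, "utility": 0.92, "latency": 0.90, "safety": 1.0, "economics": 0.80}
candidate = {"policy": 1.0, "utility": 0.85, "latency": 0.91, "safety": 1.0, "economics": 0.82}
# Zero tolerance on policy and safety; small tolerance elsewhere.
thresholds = {"policy": 0.0, "utility": 0.02, "latency": 0.05, "safety": 0.0, "economics": 0.05}

verdict = gate_release(baseline, candidate, thresholds)
print(verdict.passed, verdict.blocked_on)
```

In this example the candidate improves latency and economics but loses about 0.07 on utility, which exceeds the 0.02 tolerance, so the gate blocks on utility alone.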
Evaluators
The Critic scores every completed run on five dimensions:
| Dimension | Examples of metrics |
|---|---|
| Policy compliance | rule-violation rate, must-refuse coverage, approval-gate honored rate |
| Utility | task success rate, decision-correctness on golden set, user-corrected rate |
| Latency | p50/p95/p99 per stage, end-to-end Run Context wall-clock |
| Safety | redaction success, evidence-coverage, hallucination rate by intent |
| Economics | tokens per decision, tool-call count per decision, $/decision by tier |
Scorecard deltas are tracked per intent, per tenant, and per Context Pack version, so regressions are localizable to the cause.
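One way to localize a regression is to group scores by the three keys named above and compare two pack versions slice by slice. This is a minimal sketch, assuming each run is a plain dict with `intent`, `tenant`, `pack_version`, and `scores`. The function names, tenant names, and version strings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_key(runs, dimension):
    """Average one score dimension per (intent, tenant, pack_version) key."""
    buckets = defaultdict(list)
    for run in runs:
        key = (run["intent"], run["tenant"], run["pack_version"])
        buckets[key].append(run["scores"][dimension])
    return {key: mean(values) for key, values in buckets.items()}


def deltas_between_versions(runs, dimension, old_version, new_version):
    """Return {(intent, tenant): new_mean - old_mean} for slices seen in both versions."""
    means = mean_by_key(runs, dimension)
    deltas = {}
    for (intent, tenant, version), value in means.items():
        if version != new_version:
            continue
        old = means.get((intent, tenant, old_version))
        if old is not None:
            deltas[(intent, tenant)] = value - old
    return deltas


# Hypothetical runs: one tenant regresses on the new pack version, one does not.
runs = [
    {"intent": "support.refund", "tenant": "acme", "pack_version": "5.1.0", "scores": {"utility": 0.90}},
    {"intent": "support.refund", "tenant": "acme", "pack_version": "5.2.0", "scores": {"utility": 0.70}},
    {"intent": "support.refund", "tenant": "globex", "pack_version": "5.1.0", "scores": {"utility": 0.88}},
    {"intent": "support.refund", "tenant": "globex", "pack_version": "5.2.0", "scores": {"utility": 0.88}},
]
deltas = deltas_between_versions(runs, "utility", "5.1.0", "5.2.0")
worst = min(deltas, key=deltas.get)  # slice with the largest drop
print(worst, round(deltas[worst], 2))
```

An aggregate mean across both tenants would show a drop of about 0.10; the per-slice view shows that the whole drop comes from a single tenant.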
Eval-type taxonomy
The five evaluator dimensions above are the axes every run is scored on. Beneath them, a release-grade harness runs multiple eval types, each answering a different question about whether the change is safe to ship:
| Eval type | What it checks | Runs when |
|---|---|---|
| Unit | Function-level correctness of harness code (compiler stages, policy DSL, adapter logic) | On every commit; sub-second typical |
| Contract | Input/output schema compliance against API contracts — RunContext, ContextPack, ToolEnvelope, DecisionRecord | On every commit; pre-merge |
| Interface validation | Lightweight smoke test: import the candidate, instantiate, call public methods on a tiny fixture. Catches malformed candidates in seconds before any golden replay | First gate in the rollout pipeline |
| Policy compliance | Every JsonLogic rule fires correctly on its golden cases; no rule depends on an unbound variable | On policy-bundle change; nightly drift check |
| Tool-use | The right tool is selected with the right arguments; capability-class and approval-mode bindings are honored | Every pack change touching tools |
| Reasoning | The DecisionRecord is supported by its declared evidence_refs; no hallucinated citations | Every pack change touching evidence |
| Regression | Prior golden runs replay deterministically against the candidate harness | Every pack / policy / tool / skill change |
| Simulation | Synthetic edge cases (expired state, supplier outage, partial-failure tools, contradictory memory) exercise the long tail | Nightly + before any major rollout |
| UX | Response is clear, minimal, and useful at the user surface; no internal IDs or stack-shaped strings leak | Every change to user-facing skills |
| Business | Conversion, completion, deflection, SLA, cost-per-verified-success | Continuous on production traffic |
| Safety | No irreversible action without a binding gate; redaction holds; sandbox profile is honored | Continuous; never optional |
Unit, contract, and interface-validation tests run cheaply on every commit. Policy / tool-use / reasoning / regression / simulation / UX tests run on the relevant change classes. Business and safety evals run continuously against production traffic with stratified sampling. Search-set and test-set are kept disjoint: the proposer (human or automated, including the Improvement Loop) iterates against the search set; the release gate uses the held-out test set.
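Keeping the search set and the test set disjoint is easiest when the assignment is a pure function of the case identifier. This is a minimal sketch, assuming golden cases carry a stable string id; `assign_split`, the 20% test fraction, and the case-id format are hypothetical.

```python
import hashlib


def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a golden case to 'search' or 'test'.

    Hashing the case id keeps the assignment stable across runs and machines,
    so a case cannot drift from one set into the other.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "search"


case_ids = [f"support.refund/case-{i:04d}" for i in range(1000)]
search_set = {c for c in case_ids if assign_split(c) == "search"}
test_set = {c for c in case_ids if assign_split(c) == "test"}
```

Because each case maps to exactly one label, the two sets cannot overlap, and a proposer that only ever reads `search_set` never sees the cases the release gate scores against.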
Release gates and feature-flagged rollout
A passing scorecard is necessary but not sufficient. New harness components — pack versions, policy bundles, tool additions, planner skills — flow through staged rollout, with each stage carrying an explicit advance gate:
| Stage | What is true | Gate to advance |
|---|---|---|
| 0%_shadow | Runs in parallel with the pinned baseline; emits scorecard but does not affect outcome | Scorecard delta within bounds for ≥ N runs; no safety / policy regression |
| 1%_internal | Internal cohort only; full telemetry + tail-based sampling on every run | No regression on safety / policy guardrails over the canary window |
| 5%_low_risk | Low-blast-radius intents (read-mostly tiers); bounded by ApprovalMode | Adoption rate of operator corrections trending down; eval pass rate stable |
| 25%_monitored | Broader rollout; tail-based sampling stratified by intent and tenant | Evaluator scorecards stable across cohorts; no per-tenant cliff |
| 100% | Full rollout; previous version pinned for replay | Two-week clean canary; rollback rehearsal completed |
Every stage carries a kill switch: a single control-plane operation that re-pins the prior version and re-routes traffic to it. Replay against the pinned snapshot must reproduce the prior DecisionRecord byte-for-byte; this is the contract that makes rollback meaningful.
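The staged rollout and its kill switch can be modeled as a small state machine. This is a minimal sketch, assuming the gate evaluation happens elsewhere and is passed in as a boolean. The `Rollout` class, its method names, and the baseline version string are hypothetical.

```python
from dataclasses import dataclass

# Stage names taken from the rollout table, in order.
STAGES = ["0%_shadow", "1%_internal", "5%_low_risk", "25%_monitored", "100%"]


@dataclass
class Rollout:
    candidate: str
    pinned_baseline: str
    stage_index: int = 0
    killed: bool = False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def candidate_affects_outcomes(self) -> bool:
        """Shadow runs and killed rollouts never decide a real outcome."""
        return not self.killed and self.stage != "0%_shadow"

    def advance(self, gate_passed: bool) -> str:
        """Move to the next stage only when the current stage's gate passed."""
        if self.killed:
            raise RuntimeError("rollout was killed; start a new rollout")
        if gate_passed and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.stage

    def kill(self) -> str:
        """Kill switch: mark the rollout dead and return the version to re-pin."""
        self.killed = True
        return self.pinned_baseline


rollout = Rollout(candidate="ctxpack.support@5.2.0", pinned_baseline="ctxpack.support@5.1.0")
rollout.advance(gate_passed=True)    # shadow gate passed, moves to 1%_internal
rollout.advance(gate_passed=False)   # gate failed, stays at 1%_internal
restored = rollout.kill()            # traffic returns to the pinned baseline
```

The sketch keeps the kill switch as a single operation with no preconditions, which matches the requirement that it be available at every stage.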
OTEL-first tracing
The runtime treats OTEL as a primitive, not a sidecar:
- W3C Trace Context (traceparent, tracestate, baggage) propagates across compiler → planner → critic → gateway → adapter.
- Every plane writes spans; key attributes are namespaced (contextos.pack_version, contextos.policy_decision_id, contextos.approval_mode_effective, contextos.evidence_ref_count, contextos.budget_tokens_used).
- The Run Context’s trace_id and run_id appear on every span and on the resulting Decision Record.
- Tail-based sampling preserves all runs that crossed an approval gate, hit a loop guard, or failed scorecard thresholds.
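To make the propagation requirement concrete, the following sketch parses a version-00 W3C `traceparent` header and builds the header a component would send downstream. It uses only the standard library and handles only version 00; in practice an OpenTelemetry SDK propagator would normally do this work. The function names are hypothetical, and the example parent id is the one used in the W3C specification's examples.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent ids are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Build the header for a downstream call: same trace id, fresh span id."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        raise ValueError("missing or malformed traceparent")
    trace_id, _parent_id, flags = parsed
    new_span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    return f"00-{trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Raising on a malformed header, rather than silently starting a new trace, is a design choice for this sketch: it surfaces an adapter that dropped the header instead of hiding the gap behind an orphan trace.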
Replay
Replay is the contract that makes evaluation honest. Given:
- the Context Pack version,
- the Knowledge Graph snapshot,
- the recorded invokeAgent envelope,
- the recorded tool transcripts,
the Critic verdict and the DecisionRecord must be reproducible without re-executing tools. This lets evaluation re-score historical runs against new evaluators or against a candidate prompt change without paying the live-tool cost.
ContextOS distinguishes four replay modes:
| Replay mode | What it proves | Side-effect rule |
|---|---|---|
| Audit replay | A past record can be reconstructed from pinned context, policy, evidence, and transcripts. | Transcript-only; never re-execute side-effecting tools. |
| Regression replay | A candidate pack, policy, prompt, router, or evaluator change preserves required outcomes on prior cases. | Transcript-only or sandboxed. |
| Environment replay | A bounded task fixture can rerun reset state, actions, observations, rewards, and terminal status. | Sandbox only. |
| Source-pressure replay | A failed run, correction, or environment episode remains open until a replay case is fixed, invalid, superseded, or still failing with evidence. | Determined by the replay packet. |
The portable input is a ReplayPacket. It carries the source reference, pinned pack, compiled-context hash, transcript-chain hash, side-effect policy, and expected outcome.
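A transcript-only replay can be sketched with a packet that carries the fields listed above and a tool stub that answers only from recorded transcripts. This is an illustrative sketch, not the ContextOS schema: the field types, the hash-chaining scheme, the `decide` callback, and the tool name `orders.lookup` are all assumptions.

```python
import hashlib
import json
from dataclasses import dataclass


def chain_hash(transcripts: list) -> str:
    """Hash transcripts in order, chaining each entry onto the previous digest."""
    digest = b""
    for entry in transcripts:
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(digest + payload).digest()
    return digest.hex()


@dataclass(frozen=True)
class ReplayPacket:
    source_ref: str
    pinned_pack: str
    compiled_context_hash: str
    transcript_chain_hash: str
    side_effect_policy: str  # assumed values: "transcript_only" or "sandbox"
    expected_outcome: str


class TranscriptTools:
    """Answers tool calls from recorded transcripts; never executes a live tool."""

    def __init__(self, transcripts: list):
        self._remaining = list(transcripts)

    def call(self, tool: str, args: dict):
        if not self._remaining:
            raise LookupError(f"no recorded call left for {tool}")
        recorded = self._remaining.pop(0)
        if recorded["tool"] != tool or recorded["args"] != args:
            raise LookupError(f"replay diverged at {tool}")
        return recorded["result"]


def replay(packet: ReplayPacket, transcripts: list, decide) -> bool:
    """Re-run `decide` against recorded tool results and compare outcomes."""
    if chain_hash(transcripts) != packet.transcript_chain_hash:
        raise ValueError("transcript chain does not match the packet")
    return decide(TranscriptTools(transcripts)) == packet.expected_outcome


# Hypothetical recorded run and decision function.
transcripts = [{"tool": "orders.lookup", "args": {"order_id": "o-1"}, "result": {"refundable": True}}]


def decide(tools):
    order = tools.call("orders.lookup", {"order_id": "o-1"})
    return "support.refund.execute" if order["refundable"] else "support.refund.deny"


packet = ReplayPacket(
    source_ref="run_a1b2c3d4e5f60718",
    pinned_pack="ctxpack.support@5.2.0",
    compiled_context_hash="<compiled-context-hash>",
    transcript_chain_hash=chain_hash(transcripts),
    side_effect_policy="transcript_only",
    expected_outcome="support.refund.execute",
)
matches = replay(packet, transcripts, decide)
```

If a candidate change makes the decision function call a tool that is not in the transcript, the stub raises instead of reaching a live system, which turns a silent divergence into a visible replay failure.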
Continuous improvement loop
Failures and corrections are first-class signals, not log lines. The full set of improvement-loop primitives — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune — has its own foundation doc: see Improvement Loop.
Outputs of these primitives are proposals, not auto-applied changes. They land in the same change-control system as Context Packs, Decision Specs, and policies.
Interfaces
Inputs
- OTEL spans from every plane
- Tool transcripts and evidence manifests
- Operator corrections and approval-gate decisions
- Golden sets per intent
- Release-gate thresholds
Outputs
- Per-run scorecards
- Trace bundles consumable by audit and post-incident review
- Replay datasets
- Improvement proposals (insights, strategy rules, tuning suggestions)
- Release-gate verdicts
Failure modes
- Only the easy cases are sampled in: production sampling is not stratified by intent or risk, which hides regressions on rare but high-risk paths.
- LLM-as-judge drift between evaluator model versions; mitigated by pinning judge model and rubric, and re-baselining golden sets on judge upgrades.
- Replay non-determinism caused by an unpinned dependency (Context Pack, snapshot, or model).
- Trace gaps when a custom adapter forgets to propagate W3C headers — caught by a budget-coverage assertion.
- The improvement loop iterates on its own corrections without human review.
Operational concerns
- Sampling stratified by intent, risk tier, and tenant.
- Tail-based sampling forced for runs crossing destructive gates or failing scorecard thresholds.
- Trace retention bands by data classification.
- Cost budget per evaluation run separate from the production Run Budget.
- Quarterly rebaselining of golden sets and judge rubrics.
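The first two concerns can be combined into a single sampling decision: a forced keep for high-signal runs, then a rate chosen per stratum. This is a minimal sketch, assuming each run is a dict carrying `run_id`, `intent`, and `risk_tier` plus two boolean flags. The rates, the override table, and the field names are hypothetical, and a tenant-level rate could be added the same way.

```python
import hashlib

# Hypothetical per-risk-tier rates; real values would come from configuration.
TIER_RATES = {"low": 0.01, "medium": 0.10, "high": 1.00}
# Hypothetical per-intent floor so a sensitive intent is never under-sampled.
INTENT_FLOORS = {"support.refund": 0.05}


def sample_rate(run: dict) -> float:
    """Pick the rate for this run's stratum: tier rate, raised to any intent floor."""
    return max(TIER_RATES[run["risk_tier"]], INTENT_FLOORS.get(run["intent"], 0.0))


def should_evaluate(run: dict) -> bool:
    """Decide whether a completed run is routed to the online evaluators."""
    # Forced keep: destructive gates and scorecard failures are always sampled.
    if run.get("crossed_destructive_gate") or run.get("failed_scorecard"):
        return True
    # Hash the run id so the decision is deterministic and reproducible.
    digest = hashlib.sha256(run["run_id"].encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < sample_rate(run)
```

Deriving the decision from a hash of the run id, rather than from a random draw, means the same run is always either in or out of the sample, which keeps sampled datasets reproducible when they are rebuilt.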
Evaluation metrics
- Coverage: fraction of production runs with full trace, full scorecard, and resolvable evidence_refs.
- Detection lead time: time from regression introduction to release-gate block.
- Replay determinism: identical DecisionRecord across replay against pinned snapshot.
- Improvement adoption rate: fraction of proposals that ship after review.
- Mean time to safe completion per intent.
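Three of these metrics reduce to simple ratios once the per-run facts are recorded. This is a minimal sketch, assuming runs and proposals are plain dicts with the boolean fields shown and that DecisionRecords are compared as serialized bytes; all field and function names are hypothetical.

```python
def coverage(runs: list) -> float:
    """Fraction of runs with full trace, full scorecard, and resolvable evidence_refs."""
    if not runs:
        return 0.0
    covered = sum(
        1
        for r in runs
        if r["trace_complete"] and r["scorecard_complete"] and r["evidence_refs_resolved"]
    )
    return covered / len(runs)


def replay_determinism(pairs: list) -> float:
    """Fraction of (original, replayed) DecisionRecord byte strings that are identical."""
    if not pairs:
        return 0.0
    return sum(1 for original, replayed in pairs if original == replayed) / len(pairs)


def adoption_rate(proposals: list) -> float:
    """Fraction of reviewed proposals that shipped."""
    reviewed = [p for p in proposals if p["reviewed"]]
    if not reviewed:
        return 0.0
    return sum(1 for p in reviewed if p["shipped"]) / len(reviewed)
```

Returning 0.0 for an empty input is a conservative choice for this sketch: a plane that has measured nothing reports no coverage rather than perfect coverage.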
Example
A condensed scorecard for one run:
{
  "run_id": "run_a1b2c3d4e5f60718",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "decision_id": "support.refund.execute",
  "scores": {
    "policy": { "score": 1.00, "violations": [] },
    "utility": { "score": 0.92, "judge_rationale": "..." },
    "latency": { "wall_clock_ms": 1840, "p99_target_ms": 3000 },
    "safety": { "redaction_success": true, "evidence_coverage": 1.00, "hallucination": false },
    "economics": { "tokens_used": 4720, "tool_calls": 3, "cost_usd_cents": 0.91 }
  },
  "release_gate": "pass",
  "improvement_proposals": []
}
Common misconceptions
- Evaluation is not QA. QA runs once; evaluation runs continuously and gates every change.
- Observability is not logging. Logs are unstructured; this plane emits typed spans tied to a Decision Record.
- Replay is not optional. Without replay, post-incident analysis becomes guesswork and improvement proposals cannot be validated.
- Improvement proposals are not auto-applied. They land in the same change-control system as packs and policies.