Evaluation and Observability

Evaluators, OTEL traces, replay, and the continuous improvement loop.

Foundational Spec
Last reviewed: 2026-05-04

At a glance

Trust plane
Control over the other four

Evaluators + OTEL tracing + replay — the trust-plane stack that scores every run and gates every release.

Inputs

OTEL spans from every plane
Tool transcripts and evidence manifests
Operator corrections and approval-gate decisions
Golden sets per intent
Release-gate thresholds

Outputs

Per-run scorecards
Trace bundles for audit and post-incident review
Replay datasets
Improvement proposals (insights, strategy rules, tuning suggestions)
Release-gate verdicts

Lifecycle

trace
score
replay
gate

Canonical types

Scorecard
TraceBundle
ReplayDataset
GoldenSet
ReleaseVerdict

Evaluation and Observability is the Trust-plane capability that turns runtime behavior into measurable, replayable evidence. Without it, the other planes are unverifiable.

Definition

A coordinated stack of: OTEL-first tracing that records every plane and primitive that touched a request; evaluators that score each run on Policy / Utility / Latency / Safety / Economics; golden sets and replay harnesses that catch regressions before release; and continuous improvement primitives that turn observed failures into proposed fixes.

Why it exists

LLM-driven systems drift. Models change, evidence changes, policies change, tools change. A trustworthy runtime cannot be “tested at release and trusted forever” — it must be measured on every run, replayable for any past run, and improved on a closed loop. This plane is what makes that loop concrete.

How it works

Instrument: every primitive (compiler, planner, executor, critic, gateway, memory) emits OTEL spans with W3C trace context propagated end-to-end.
Score: each completed run is scored against evaluators with rubric-bound metrics.
Sample online: a configurable fraction of production runs is routed to LLM-as-judge plus rule-based scorers.
Replay offline: golden sets and recorded production runs are replayed on every change to the planner, prompt, pack, or model.
Gate releases: a release is blocked if scorecard deltas exceed configured thresholds.
Improve: failed and corrected runs feed the Insight Synthesizer, Strategy Compiler, and Feedback Store that propose changes back into the catalogs and packs.
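The gate step above can be sketched as a simple threshold check. This is a minimal illustration, not the runtime's gate: the `gate_release` function, the `GateVerdict` type, and the threshold values are hypothetical, and the sketch assumes every metric is higher-is-better, which a real gate would configure per metric.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateVerdict:
    passed: bool
    breaches: dict[str, float]  # metric name -> observed regression


def gate_release(baseline: dict[str, float],
                 candidate: dict[str, float],
                 max_regression: dict[str, float]) -> GateVerdict:
    """Block the release if any metric regresses beyond its threshold.

    Assumption of this sketch: higher is better for every metric.
    """
    breaches = {}
    for metric, limit in max_regression.items():
        regression = baseline[metric] - candidate[metric]
        if regression > limit:
            breaches[metric] = round(regression, 6)
    return GateVerdict(passed=not breaches, breaches=breaches)


# Hypothetical scorecard aggregates and thresholds.
baseline = {"policy": 1.00, "utility": 0.92, "safety": 1.00}
candidate = {"policy": 1.00, "utility": 0.85, "safety": 1.00}
thresholds = {"policy": 0.0, "utility": 0.02, "safety": 0.0}

verdict = gate_release(baseline, candidate, thresholds)
print(verdict.passed, sorted(verdict.breaches))
```

Here the candidate is blocked because utility drops by more than the allowed 0.02, while policy and safety are unchanged.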

Evaluators

The Critic scores every completed run on five dimensions:

Dimension	Examples of metrics
Policy compliance	rule-violation rate, must-refuse coverage, approval-gate honored rate
Utility	task success rate, decision-correctness on golden set, user-corrected rate
Latency	p50/p95/p99 per stage, end-to-end Run Context wall-clock
Safety	redaction success, evidence-coverage, hallucination rate by intent
Economics	tokens per decision, tool-call count per decision, $/decision by tier

Scorecard deltas are tracked per intent, per tenant, and per Context Pack version, so regressions are localizable to the cause.
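One way to localize a regression is to average a metric per (intent, tenant, pack version) and then compare two pack versions within each (intent, tenant) pair. The sketch below assumes each scorecard carries a `tenant` field alongside `intent`, `pack_version`, and `scores`; the function names and the sample data are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(runs, metric):
    """Mean metric value per (intent, tenant, pack_version)."""
    buckets = defaultdict(list)
    for run in runs:
        key = (run["intent"], run["tenant"], run["pack_version"])
        buckets[key].append(run["scores"][metric])
    return {key: mean(values) for key, values in buckets.items()}


def pack_deltas(runs, metric, old_version, new_version):
    """Delta (new minus old) per (intent, tenant) between two pack versions."""
    means = mean_scores(runs, metric)
    deltas = {}
    for (intent, tenant, version), score in means.items():
        if version != new_version:
            continue
        old = means.get((intent, tenant, old_version))
        if old is not None:
            deltas[(intent, tenant)] = round(score - old, 4)
    return deltas


def run(tenant, version, utility):
    return {"intent": "support.refund", "tenant": tenant,
            "pack_version": version, "scores": {"utility": utility}}


runs = [
    run("acme", "5.1.0", 0.90), run("acme", "5.2.0", 0.92),
    run("globex", "5.1.0", 0.90), run("globex", "5.2.0", 0.70),
]
deltas = pack_deltas(runs, "utility", "5.1.0", "5.2.0")
print(deltas)
```

In this made-up data the aggregate change looks mild, but the per-tenant view shows that one tenant carries the whole regression.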

Eval-type taxonomy

The five evaluator dimensions above are the axes every run is scored on. Beneath them, a release-grade harness runs multiple eval types, each answering a different question about whether the change is safe to ship:

Eval type	What it checks	Runs when
Unit	Function-level correctness of harness code (compiler stages, policy DSL, adapter logic)	On every commit; sub-second typical
Contract	Input/output schema compliance against API contracts — `RunContext`, `ContextPack`, `ToolEnvelope`, `DecisionRecord`	On every commit; pre-merge
Interface validation	Lightweight smoke test: import the candidate, instantiate, call public methods on a tiny fixture. Catches malformed candidates in seconds before any golden replay	First gate in the rollout pipeline
Policy compliance	Every JsonLogic rule fires correctly on its golden cases; no rule depends on an unbound variable	On policy-bundle change; nightly drift check
Tool-use	The right tool is selected with the right arguments; capability-class and approval-mode bindings are honored	Every pack change touching tools
Reasoning	The DecisionRecord is supported by its declared `evidence_refs`; no hallucinated citations	Every pack change touching evidence
Regression	Prior golden runs replay deterministically against the candidate harness	Every pack / policy / tool / skill change
Simulation	Synthetic edge cases (expired state, supplier outage, partial-failure tools, contradictory memory) exercise the long tail	Nightly + before any major rollout
UX	Response is clear, minimal, and useful at the user surface; no internal IDs or stack-shaped strings leak	Every change to user-facing skills
Business	Conversion, completion, deflection, SLA, cost-per-verified-success	Continuous on production traffic
Safety	No irreversible action without a binding gate; redaction holds; sandbox profile is honored	Continuous; never optional

Unit and contract tests run cheaply on every commit, and interface validation runs as the first gate of the rollout pipeline. Policy, tool-use, reasoning, regression, simulation, and UX tests run on the relevant change classes. Business and safety evals run continuously against production traffic with stratified sampling. The search set and the test set are kept disjoint: the proposer (human or automated, including the Improvement Loop) iterates against the search set, while the release gate uses the held-out test set.
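A simple way to keep the two sets disjoint is to assign each golden case to a split by hashing its identifier, so the assignment is stable across machines and over time. The sketch below is one possible approach under that assumption; `assign_split`, the `golden-NNNN` identifiers, and the 20% test fraction are hypothetical.

```python
import hashlib


def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a case ID to 'test' or 'search'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "search"


case_ids = [f"golden-{i:04d}" for i in range(1000)]
search_set = {c for c in case_ids if assign_split(c) == "search"}
test_set = {c for c in case_ids if assign_split(c) == "test"}

# The proposer only ever sees search_set; the release gate reads test_set.
print(len(search_set), len(test_set))
```

Because the split depends only on the case ID, adding new golden cases never moves an existing case from one set to the other.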

Release gates and feature-flagged rollout

A passing scorecard is necessary but not sufficient. New harness components — pack versions, policy bundles, tool additions, planner skills — flow through staged rollout, with each stage carrying an explicit advance gate:

Stage	What is true	Gate to advance
`0%_shadow`	Runs in parallel with the pinned baseline; emits scorecard but does not affect outcome	Scorecard delta within bounds for ≥ N runs; no safety / policy regression
`1%_internal`	Internal cohort only; full telemetry + tail-based sampling on every run	No regression on safety / policy guardrails over the canary window
`5%_low_risk`	Low-blast-radius intents (read-mostly tiers); bounded by ApprovalMode	Adoption rate of operator corrections trending down; eval pass rate stable
`25%_monitored`	Broader rollout; tail-based sampling stratified by intent and tenant	Evaluator scorecards stable across cohorts; no per-tenant cliff
`100%`	Full rollout; previous version pinned for replay	Two-week clean canary; rollback rehearsal completed

Every stage carries a kill switch: a single control-plane operation that re-pins the prior version and re-routes traffic to it. Replay against the pinned snapshot must reproduce the prior DecisionRecord byte-for-byte; this is the contract that makes rollback meaningful.
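The stage progression and kill switch can be modeled as a small state machine. The stage names below come from the table above; the `Rollout` class, its methods, and the prior version string `ctxpack.support@5.1.0` are hypothetical and stand in for whatever the control plane actually exposes.

```python
from dataclasses import dataclass

STAGES = ["0%_shadow", "1%_internal", "5%_low_risk", "25%_monitored", "100%"]


@dataclass
class Rollout:
    candidate: str
    pinned_prior: str
    stage_index: int = 0
    killed: bool = False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self, gate_passed: bool) -> str:
        """Move to the next stage only when the current stage's gate passed."""
        if self.killed:
            raise RuntimeError("rollout was killed; start a new rollout")
        if gate_passed and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.stage

    def kill(self) -> str:
        """Single operation: stop the rollout and return the version to re-pin."""
        self.killed = True
        self.stage_index = 0
        return self.pinned_prior


rollout = Rollout(candidate="ctxpack.support@5.2.0",
                  pinned_prior="ctxpack.support@5.1.0")
after_pass = rollout.advance(gate_passed=True)    # moves to 1%_internal
after_fail = rollout.advance(gate_passed=False)   # stays at 1%_internal
restored = rollout.kill()                         # traffic returns to the prior version
print(after_pass, after_fail, restored)
```

A failed gate holds the rollout at its current stage rather than rolling back; rollback is reserved for the explicit kill operation.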

OTEL-first tracing

The runtime treats OTEL as a primitive, not a sidecar:

W3C Trace Context (traceparent, tracestate, baggage) propagates across compiler → planner → critic → gateway → adapter.
Every plane writes spans; key attributes are namespaced (contextos.pack_version, contextos.policy_decision_id, contextos.approval_mode_effective, contextos.evidence_ref_count, contextos.budget_tokens_used).
The Run Context’s trace_id and run_id appear on every span and on the resulting Decision Record.
Tail-based sampling preserves all runs that crossed an approval gate, hit a loop guard, or failed scorecard thresholds.
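To make the propagation rule concrete, the sketch below parses a W3C `traceparent` header and derives the header a component would send downstream: the trace ID is preserved and only the span (parent) ID changes. It is a hand-rolled, standard-library illustration of the header format for version `00`; a real deployment would presumably rely on the OpenTelemetry SDK's propagators. The function names and the `span_attributes` helper are hypothetical.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or parent ID is invalid")
    return trace_id, parent_id, flags


def child_traceparent(header: str) -> str:
    """Header for the next hop: same trace ID, fresh span ID, same flags."""
    trace_id, _parent_id, flags = parse_traceparent(header)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def span_attributes(header: str, run_id: str, pack_version: str) -> dict:
    """Attributes every span would carry, per the list above."""
    trace_id, _, _ = parse_traceparent(header)
    return {"trace_id": trace_id, "run_id": run_id,
            "contextos.pack_version": pack_version}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
attrs = span_attributes(incoming, "run_a1b2c3d4e5f60718", "ctxpack.support@5.2.0")
print(outgoing)
```

An adapter that builds a fresh header instead of deriving it from the incoming one would start a new trace, which is exactly the trace gap described under failure modes.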

Replay

Replay is the contract that makes evaluation honest. Given:

the Context Pack version,
the Knowledge Graph snapshot,
the recorded invokeAgent envelope,
the recorded tool transcripts,

the Critic verdict and the DecisionRecord must be reproducible without re-executing tools. This lets evaluation re-score historical runs against new evaluators or against a candidate prompt change without paying the live-tool cost.

ContextOS distinguishes four replay modes:

Replay mode	What it proves	Side-effect rule
Audit replay	A past record can be reconstructed from pinned context, policy, evidence, and transcripts.	Transcript-only; never re-execute side-effecting tools.
Regression replay	A candidate pack, policy, prompt, router, or evaluator change preserves required outcomes on prior cases.	Transcript-only or sandboxed.
Environment replay	A bounded task fixture can rerun reset state, actions, observations, rewards, and terminal status.	Sandbox only.
Source-pressure replay	A failed run, correction, or environment episode remains open until its replay case is resolved as fixed, invalid, superseded, or still failing with evidence.	Determined by the replay packet.

The portable input is a ReplayPacket. It carries the source reference, pinned pack, compiled-context hash, transcript-chain hash, side-effect policy, and expected outcome.
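A transcript-only replay driven by such a packet might look like the sketch below. The field names mirror the list above but are not a confirmed schema; the hash-chain construction, the `replay` function, and the `orders.lookup` tool are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


def canonical(obj) -> bytes:
    """Stable byte encoding so equal values always hash and compare equal."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hash(transcripts) -> str:
    """Each link commits to the previous digest plus the next transcript entry."""
    digest = b""
    for entry in transcripts:
        digest = hashlib.sha256(digest + canonical(entry)).digest()
    return digest.hex()


@dataclass(frozen=True)
class ReplayPacket:
    source_ref: str
    pinned_pack: str
    compiled_context_hash: str
    transcript_chain_hash: str
    side_effect_policy: str  # e.g. "transcript_only"
    expected_outcome: dict


def replay(packet, transcripts, decide) -> bool:
    """Re-run the decision against recorded tool results; never call live tools."""
    if chain_hash(transcripts) != packet.transcript_chain_hash:
        raise ValueError("transcripts do not match the packet's chain hash")
    recorded = {(t["tool"], canonical(t["args"])): t["result"] for t in transcripts}

    def tool(name, args):
        return recorded[(name, canonical(args))]  # KeyError means an unrecorded call

    return canonical(decide(tool)) == canonical(packet.expected_outcome)


transcripts = [{"tool": "orders.lookup", "args": {"order_id": "o-1"},
                "result": {"refundable": True}}]


def decide(tool):
    order = tool("orders.lookup", {"order_id": "o-1"})
    return {"decision_id": "support.refund.execute", "approved": order["refundable"]}


packet = ReplayPacket(
    source_ref="run_a1b2c3d4e5f60718",
    pinned_pack="ctxpack.support@5.2.0",
    compiled_context_hash=hashlib.sha256(b"compiled-context").hexdigest(),
    transcript_chain_hash=chain_hash(transcripts),
    side_effect_policy="transcript_only",
    expected_outcome={"decision_id": "support.refund.execute", "approved": True},
)
reproduced = replay(packet, transcripts, decide)
print(reproduced)
```

Comparing canonical encodings rather than Python objects is what lets the check approximate a byte-for-byte match on the outcome.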

Continuous improvement loop

Failures and corrections are first-class signals, not log lines. The full set of improvement-loop primitives — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune — has its own foundation doc: see Improvement Loop.

Outputs of these primitives are proposals, not auto-applied changes. They land in the same change-control system as Context Packs, Decision Specs, and policies.

Interfaces

Inputs

OTEL spans from every plane
Tool transcripts and evidence manifests
Operator corrections and approval-gate decisions
Golden sets per intent
Release-gate thresholds

Outputs

Per-run scorecards
Trace bundles consumable by audit and post-incident review
Replay datasets
Improvement proposals (insights, strategy rules, tuning suggestions)
Release-gate verdicts

Failure modes

Only the easy cases are sampled in: production sampling is not stratified by intent or risk, which hides regressions on rare but high-risk paths.
LLM-as-judge drift between evaluator model versions: mitigated by pinning the judge model and rubric, and by re-baselining golden sets on judge upgrades.
Replay non-determinism caused by an unpinned dependency (Context Pack, snapshot, or model).
Trace gaps when a custom adapter forgets to propagate W3C headers — caught by a budget-coverage assertion.
The improvement loop iterates on its own corrections without human review.

Operational concerns

Sampling stratified by intent, risk tier, and tenant.
Tail-based sampling forced for runs crossing destructive gates or failing scorecard thresholds.
Trace retention bands by data classification.
Cost budget per evaluation run separate from the production Run Budget.
Quarterly rebaselining of golden sets and judge rubrics.
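The first two concerns can be combined in one sampling decision: forced retention for flagged runs, and a per-stratum rate for everything else. The sketch below is illustrative only; the run fields (`crossed_approval_gate`, `hit_loop_guard`, `failed_scorecard`, `risk_tier`, `tenant`), the rate table, and the default rate are assumptions, not part of the spec.

```python
import hashlib


def keep_run(run: dict, rates: dict, default_rate: float = 0.01) -> bool:
    """Decide whether a completed run is kept for evaluation."""
    # Forced retention: these runs are always kept, regardless of rate.
    if (run.get("crossed_approval_gate") or run.get("hit_loop_guard")
            or run.get("failed_scorecard")):
        return True
    # Stratified rate keyed by intent, risk tier, and tenant.
    stratum = (run["intent"], run["risk_tier"], run["tenant"])
    rate = rates.get(stratum, default_rate)
    # Hash the run ID so the decision is deterministic and replayable.
    digest = hashlib.sha256(run["run_id"].encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


rates = {
    ("support.refund", "high", "acme"): 1.0,   # rare, high-risk path: keep all
    ("support.faq", "low", "acme"): 0.0,       # illustrative extreme: keep none
}
high_risk = {"run_id": "run_1", "intent": "support.refund",
             "risk_tier": "high", "tenant": "acme"}
low_risk = {"run_id": "run_2", "intent": "support.faq",
            "risk_tier": "low", "tenant": "acme"}
gated = dict(low_risk, run_id="run_3", crossed_approval_gate=True)

print(keep_run(high_risk, rates), keep_run(low_risk, rates), keep_run(gated, rates))
```

Setting higher rates on rare, high-risk strata is what prevents the "only easy cases" failure mode listed earlier.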

Evaluation metrics

Coverage: fraction of production runs with full trace, full scorecard, and resolvable evidence_refs.
Detection lead time: time from regression introduction to release-gate block.
Replay determinism: fraction of replays against the pinned snapshot that reproduce an identical DecisionRecord.
Improvement adoption rate: fraction of proposals that ship after review.
Mean time to safe completion per intent.
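As an example of how one of these metrics might be computed, the sketch below derives coverage from per-run flags. The field names (`has_full_trace`, `has_full_scorecard`, `unresolved_evidence_refs`) are hypothetical; the definition follows the coverage bullet above.

```python
def coverage(runs) -> float:
    """Fraction of runs with full trace, full scorecard, and resolvable evidence_refs."""
    if not runs:
        return 0.0
    covered = sum(
        1 for r in runs
        if r["has_full_trace"]
        and r["has_full_scorecard"]
        and r["unresolved_evidence_refs"] == 0
    )
    return covered / len(runs)


sample = [
    {"has_full_trace": True, "has_full_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": True, "has_full_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": False, "has_full_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": True, "has_full_scorecard": True, "unresolved_evidence_refs": 2},
]
print(coverage(sample))
```

A run counts as covered only when all three conditions hold, so a single trace gap or dangling evidence reference lowers the metric.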

Example

A condensed scorecard for one run:

{
  "run_id": "run_a1b2c3d4e5f60718",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "decision_id": "support.refund.execute",
  "scores": {
    "policy":   { "score": 1.00, "violations": [] },
    "utility":  { "score": 0.92, "judge_rationale": "..." },
    "latency":  { "wall_clock_ms": 1840, "p99_target_ms": 3000 },
    "safety":   { "redaction_success": true, "evidence_coverage": 1.00, "hallucination": false },
    "economics":{ "tokens_used": 4720, "tool_calls": 3, "cost_usd_cents": 0.91 }
  },
  "release_gate": "pass",
  "improvement_proposals": []
}

Common misconceptions

Evaluation is not QA. QA runs once; evaluation runs continuously and gates every change.
Observability is not logging. Logs are unstructured; this plane emits typed spans tied to a Decision Record.
Replay is not optional. Without replay, post-incident analysis becomes guesswork and improvement proposals cannot be validated.
Improvement proposals are not auto-applied. They land in the same change-control system as packs and policies.