
Evaluation and Observability

Evaluators, OTEL traces, replay, and the continuous improvement loop.

Foundational Spec
At a glance
Trust plane: control over the other four

Evaluators + OTEL tracing + replay — the trust-plane stack that scores every run and gates every release.

Inputs
  • OTEL spans from every plane
  • Tool transcripts and evidence manifests
  • Operator corrections and approval-gate decisions
  • Golden sets per intent
  • Release-gate thresholds
Outputs
  • Per-run scorecards
  • Trace bundles for audit and post-incident review
  • Replay datasets
  • Improvement proposals (insights, strategy rules, tuning suggestions)
  • Release-gate verdicts
Lifecycle
  1. trace
  2. score
  3. replay
  4. gate
Canonical types
  • Scorecard
  • TraceBundle
  • ReplayDataset
  • GoldenSet
  • ReleaseVerdict

Evaluation and Observability is the Trust-plane capability that turns runtime behavior into measurable, replayable evidence. Without it, the other planes are unverifiable.

Definition

A coordinated stack of: OTEL-first tracing that records every plane and primitive that touched a request; evaluators that score each run on Policy / Utility / Latency / Safety / Economics; golden sets and replay harnesses that catch regressions before release; and continuous improvement primitives that turn observed failures into proposed fixes.

Why it exists

LLM-driven systems drift. Models change, evidence changes, policies change, tools change. A trustworthy runtime cannot be “tested at release and trusted forever” — it must be measured on every run, replayable for any past run, and improved on a closed loop. This plane is what makes that loop concrete.

How it works

  1. Instrument: every primitive (compiler, planner, executor, critic, gateway, memory) emits OTEL spans with W3C trace context propagated end-to-end.
  2. Score: each completed run is scored against evaluators with rubric-bound metrics.
  3. Sample online: a configurable fraction of production runs route to LLM-as-judge plus rule-based scorers.
  4. Replay offline: golden sets and recorded production runs are replayed on every change to the planner, prompt, pack, or model.
  5. Gate releases: a release is blocked if scorecard deltas exceed configured thresholds.
  6. Improve: failed and corrected runs feed the Insight Synthesizer, Strategy Compiler, and Feedback Store that propose changes back into the catalogs and packs.
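
Step 5 can be sketched as a comparison between a baseline scorecard and a candidate scorecard. This is a minimal illustration rather than the ContextOS gate itself: the function name `gate_release`, the threshold values, and the flat score dictionaries are all assumptions made for the example.

```python
# Hypothetical thresholds: maximum tolerated drop per dimension (assumed values).
MAX_DROP = {"policy": 0.0, "utility": 0.02, "safety": 0.0}


def gate_release(baseline: dict, candidate: dict, max_drop: dict) -> dict:
    """Block the release when any dimension drops by more than its threshold."""
    failures = []
    for dimension, allowed in max_drop.items():
        drop = baseline[dimension] - candidate[dimension]
        if drop > allowed:
            failures.append({"dimension": dimension, "drop": round(drop, 4)})
    return {"verdict": "block" if failures else "pass", "failures": failures}


# Illustrative scores only; real scorecards carry more structure per dimension.
baseline = {"policy": 1.00, "utility": 0.92, "safety": 1.00}
regressed = {"policy": 1.00, "utility": 0.89, "safety": 1.00}
steady = {"policy": 1.00, "utility": 0.91, "safety": 1.00}

print(gate_release(baseline, regressed, MAX_DROP)["verdict"])  # block
print(gate_release(baseline, steady, MAX_DROP)["verdict"])     # pass
```

Setting the policy and safety thresholds to zero in this sketch reflects the idea that any regression on those dimensions blocks, while utility is allowed a small tolerance.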

Evaluators

The Critic scores every completed run on five dimensions:

| Dimension | Examples of metrics |
| --- | --- |
| Policy compliance | rule-violation rate, must-refuse coverage, approval-gate honored rate |
| Utility | task success rate, decision-correctness on golden set, user-corrected rate |
| Latency | p50/p95/p99 per stage, end-to-end Run Context wall-clock |
| Safety | redaction success, evidence-coverage, hallucination rate by intent |
| Economics | tokens per decision, tool-call count per decision, $/decision by tier |

Scorecard deltas are tracked per intent, per tenant, and per Context Pack version, so regressions are localizable to the cause.
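
To show why grouping by intent, tenant, and pack version localizes a regression, here is a small sketch. The record fields, the tenant names, and the helper names `mean_by_key` and `deltas` are hypothetical and not part of any ContextOS schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run utility scores; field names are illustrative only.
RUNS = [
    {"intent": "support.refund", "tenant": "acme", "pack_version": "5.1.0", "utility": 0.94},
    {"intent": "support.refund", "tenant": "acme", "pack_version": "5.1.0", "utility": 0.90},
    {"intent": "support.refund", "tenant": "acme", "pack_version": "5.2.0", "utility": 0.80},
    {"intent": "support.refund", "tenant": "globex", "pack_version": "5.1.0", "utility": 0.91},
    {"intent": "support.refund", "tenant": "globex", "pack_version": "5.2.0", "utility": 0.92},
]


def mean_by_key(runs, metric):
    """Average a metric per (intent, tenant, pack_version) group."""
    groups = defaultdict(list)
    for run in runs:
        key = (run["intent"], run["tenant"], run["pack_version"])
        groups[key].append(run[metric])
    return {key: mean(values) for key, values in groups.items()}


def deltas(runs, metric, baseline, candidate):
    """Candidate minus baseline, per (intent, tenant) pair present in both."""
    means = mean_by_key(runs, metric)
    out = {}
    for (intent, tenant, version), value in means.items():
        if version != candidate:
            continue
        base = means.get((intent, tenant, baseline))
        if base is not None:
            out[(intent, tenant)] = round(value - base, 4)
    return out


utility_deltas = deltas(RUNS, "utility", baseline="5.1.0", candidate="5.2.0")
for key, delta in sorted(utility_deltas.items()):
    print(key, delta)
```

In this made-up data the drop shows up only for one tenant, which an aggregate over all tenants would have partly averaged away.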

Eval-type taxonomy

The five evaluator dimensions above are the axes every run is scored on. Beneath them, a release-grade harness runs multiple eval types, each answering a different question about whether the change is safe to ship:

| Eval type | What it checks | Runs when |
| --- | --- | --- |
| Unit | Function-level correctness of harness code (compiler stages, policy DSL, adapter logic) | On every commit; sub-second typical |
| Contract | Input/output schema compliance against API contracts: RunContext, ContextPack, ToolEnvelope, DecisionRecord | On every commit; pre-merge |
| Interface validation | Lightweight smoke test: import the candidate, instantiate, call public methods on a tiny fixture. Catches malformed candidates in seconds before any golden replay | First gate in the rollout pipeline |
| Policy compliance | Every JsonLogic rule fires correctly on its golden cases; no rule depends on an unbound variable | On policy-bundle change; nightly drift check |
| Tool-use | The right tool is selected with the right arguments; capability-class and approval-mode bindings are honored | Every pack change touching tools |
| Reasoning | The DecisionRecord is supported by its declared evidence_refs; no hallucinated citations | Every pack change touching evidence |
| Regression | Prior golden runs replay deterministically against the candidate harness | Every pack / policy / tool / skill change |
| Simulation | Synthetic edge cases (expired state, supplier outage, partial-failure tools, contradictory memory) exercise the long tail | Nightly + before any major rollout |
| UX | Response is clear, minimal, and useful at the user surface; no internal IDs or stack-shaped strings leak | Every change to user-facing skills |
| Business | Conversion, completion, deflection, SLA, cost-per-verified-success | Continuous on production traffic |
| Safety | No irreversible action without a binding gate; redaction holds; sandbox profile is honored | Continuous; never optional |

Unit, contract, and interface-validation tests run cheaply on every commit. Policy, tool-use, reasoning, regression, simulation, and UX evals run on the relevant change classes. Business and safety evals run continuously against production traffic with stratified sampling. The search set and the test set are kept disjoint: the proposer (human or automated, including the Improvement Loop) iterates against the search set, while the release gate uses the held-out test set.
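
One way to keep the two sets disjoint is to assign each golden case to a split from a hash of its identifier, so the assignment is stable across machines and reruns. This is a sketch under assumptions: the case-ID format, the 20% test fraction, and the function name `assign_split` are illustrative.

```python
import hashlib


def assign_split(case_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a case ID to 'search' or 'test'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "search"


# Hypothetical golden-case identifiers.
case_ids = [f"golden-{i:04d}" for i in range(1000)]
search_set = {c for c in case_ids if assign_split(c) == "search"}
test_set = {c for c in case_ids if assign_split(c) == "test"}

print(len(search_set), len(test_set))
print(search_set.isdisjoint(test_set))
```

Because the split depends only on the case ID, adding new golden cases never moves an existing case from the held-out set into the search set.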

Release gates and feature-flagged rollout

A passing scorecard is necessary but not sufficient. New harness components — pack versions, policy bundles, tool additions, planner skills — flow through staged rollout, with each stage carrying an explicit advance gate:

| Stage | What is true | Gate to advance |
| --- | --- | --- |
| 0%_shadow | Runs in parallel with the pinned baseline; emits scorecard but does not affect outcome | Scorecard delta within bounds for ≥ N runs; no safety / policy regression |
| 1%_internal | Internal cohort only; full telemetry + tail-based sampling on every run | No regression on safety / policy guardrails over the canary window |
| 5%_low_risk | Low-blast-radius intents (read-mostly tiers); bounded by ApprovalMode | Adoption rate of operator corrections trending down; eval pass rate stable |
| 25%_monitored | Broader rollout; tail-based sampling stratified by intent and tenant | Evaluator scorecards stable across cohorts; no per-tenant cliff |
| 100% | Full rollout; previous version pinned for replay | Two-week clean canary; rollback rehearsal completed |

Every stage carries a kill switch: a single control-plane operation that re-pins the prior version and re-routes traffic to it. Replay against the pinned snapshot must reproduce the prior DecisionRecord byte-for-byte; this is the contract that makes rollback meaningful.
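
The stage progression and kill switch can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the `Rollout` class, its method names, and the `in_cohort` flag are hypothetical, and gate evaluation is reduced to a boolean supplied by the caller.

```python
from dataclasses import dataclass

# Stage names taken from the rollout table.
STAGES = ["0%_shadow", "1%_internal", "5%_low_risk", "25%_monitored", "100%"]


@dataclass
class Rollout:
    candidate: str
    baseline: str
    stage_index: int = 0
    killed: bool = False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self, gate_passed: bool) -> str:
        """Move one stage forward only when the current stage's gate passed."""
        if self.killed:
            raise RuntimeError("rollout was killed; open a new rollout to retry")
        if gate_passed and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.stage

    def kill(self) -> str:
        """Kill switch: re-pin the prior version for all traffic."""
        self.killed = True
        return self.baseline

    def serving_version(self, in_cohort: bool) -> str:
        # Shadow runs emit scorecards but never decide the outcome.
        if self.killed or self.stage == "0%_shadow" or not in_cohort:
            return self.baseline
        return self.candidate


rollout = Rollout(candidate="ctxpack.support@5.2.0", baseline="ctxpack.support@5.1.0")
print(rollout.advance(gate_passed=True))         # 1%_internal
print(rollout.serving_version(in_cohort=True))   # ctxpack.support@5.2.0
print(rollout.kill())                            # ctxpack.support@5.1.0
print(rollout.serving_version(in_cohort=True))   # ctxpack.support@5.1.0
```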

OTEL-first tracing

The runtime treats OTEL as a primitive, not a sidecar:

  • W3C Trace Context (traceparent, tracestate, baggage) propagates across compiler → planner → critic → gateway → adapter.
  • Every plane writes spans; key attributes are namespaced (contextos.pack_version, contextos.policy_decision_id, contextos.approval_mode_effective, contextos.evidence_ref_count, contextos.budget_tokens_used).
  • The Run Context’s trace_id and run_id appear on every span and on the resulting Decision Record.
  • Tail-based sampling preserves all runs that crossed an approval gate, hit a loop guard, or failed scorecard thresholds.
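
The propagation and sampling rules above can be illustrated without any tracing library. The sketch below parses and re-emits a `traceparent` header in the version-00 layout defined by W3C Trace Context, then applies a tail-sampling decision. The helper names and the run-record flags (`crossed_approval_gate`, `hit_loop_guard`, `failed_scorecard`) are assumptions for illustration; a real deployment would normally rely on an OpenTelemetry SDK and collector instead.

```python
import re
import secrets

# Version-00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> dict:
    match = _TRACEPARENT.fullmatch(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(header: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing hop."""
    ctx = parse_traceparent(header)
    span_id = secrets.token_hex(8)
    while span_id == "0" * 16:  # all-zero span IDs are not allowed
        span_id = secrets.token_hex(8)
    return f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"


def should_keep_trace(run: dict) -> bool:
    """Tail-sampling rule: always keep gated, loop-guarded, or failing runs."""
    return any(run.get(flag, False) for flag in
               ("crossed_approval_gate", "hit_loop_guard", "failed_scorecard"))


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(outgoing)["trace_id"])
print(should_keep_trace({"crossed_approval_gate": True}))
```

An adapter that drops the header would start a new trace ID on the next hop, which is the kind of trace gap listed under failure modes.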

Replay

Replay is the contract that makes evaluation honest. Given:

  • the Context Pack version,
  • the Knowledge Graph snapshot,
  • the recorded invokeAgent envelope,
  • the recorded tool transcripts,

the Critic verdict and the DecisionRecord must be reproducible without re-executing tools. This lets evaluation re-score historical runs against new evaluators or against a candidate prompt change without paying the live-tool cost.

ContextOS distinguishes four replay modes:

| Replay mode | What it proves | Side-effect rule |
| --- | --- | --- |
| Audit replay | A past record can be reconstructed from pinned context, policy, evidence, and transcripts. | Transcript-only; never re-execute side-effecting tools. |
| Regression replay | A candidate pack, policy, prompt, router, or evaluator change preserves required outcomes on prior cases. | Transcript-only or sandboxed. |
| Environment replay | A bounded task fixture can rerun reset state, actions, observations, rewards, and terminal status. | Sandbox only. |
| Source-pressure replay | A failed run, correction, or environment episode remains open until a replay case is fixed, invalid, superseded, or still failing with evidence. | Determined by the replay packet. |

The portable input is a ReplayPacket. It carries the source reference, pinned pack, compiled-context hash, transcript-chain hash, side-effect policy, and expected outcome.
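
A transcript-only replay can be sketched as follows. The field names mirror the list above, but the concrete `ReplayPacket` dataclass, the `chain_hash` scheme, the policy string `"transcript_only"`, and the tool name `orders.lookup` are assumptions for illustration and not the actual ContextOS types.

```python
import hashlib
import json
from dataclasses import dataclass


def chain_hash(transcripts: list) -> str:
    """Hash each transcript entry together with the previous digest."""
    digest = ""
    for entry in transcripts:
        payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((digest + payload).encode("utf-8")).hexdigest()
    return digest


@dataclass(frozen=True)
class ReplayPacket:
    source_ref: str
    pinned_pack: str
    compiled_context_hash: str  # carried for audit; not checked in this sketch
    transcript_chain_hash: str
    side_effect_policy: str
    expected_outcome: str


def replay(packet: ReplayPacket, transcripts: list, decide) -> bool:
    """Re-run the decision against recorded tool results only."""
    if chain_hash(transcripts) != packet.transcript_chain_hash:
        raise ValueError("transcript chain does not match the packet")
    if packet.side_effect_policy != "transcript_only":
        raise NotImplementedError("only transcript-only replay is sketched here")
    recorded = {(t["tool"], json.dumps(t["args"], sort_keys=True)): t["result"]
                for t in transcripts}

    def tool(name, **args):
        # Serve the recorded result; a live tool is never called.
        return recorded[(name, json.dumps(args, sort_keys=True))]

    return decide(tool) == packet.expected_outcome


transcripts = [{"tool": "orders.lookup", "args": {"order_id": "o-1"},
                "result": {"refundable": True}}]
packet = ReplayPacket(
    source_ref="run_a1b2c3d4e5f60718",
    pinned_pack="ctxpack.support@5.2.0",
    compiled_context_hash=hashlib.sha256(b"compiled-context").hexdigest(),
    transcript_chain_hash=chain_hash(transcripts),
    side_effect_policy="transcript_only",
    expected_outcome="support.refund.execute",
)


def decide(tool):
    order = tool("orders.lookup", order_id="o-1")
    return "support.refund.execute" if order["refundable"] else "support.refund.deny"


print(replay(packet, transcripts, decide))
```

Checking the chain hash before replaying means a tampered or truncated transcript fails loudly instead of producing a misleading verdict.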

Continuous improvement loop

Failures and corrections are first-class signals, not log lines. The full set of improvement-loop primitives — Insight Synthesizer, Strategy Compiler, Feedback Store, Chief-of-Staff, Research Queue, Autotune — has its own foundation doc: see Improvement Loop.

Outputs of these primitives are proposals, not auto-applied changes. They land in the same change-control system as Context Packs, Decision Specs, and policies.

Interfaces

Inputs

  • OTEL spans from every plane
  • Tool transcripts and evidence manifests
  • Operator corrections and approval-gate decisions
  • Golden sets per intent
  • Release-gate thresholds

Outputs

  • Per-run scorecards
  • Trace bundles consumable by audit and post-incident review
  • Replay datasets
  • Improvement proposals (insights, strategy rules, tuning suggestions)
  • Release-gate verdicts

Failure modes

  • Only the easy cases are sampled in: production sampling is not stratified by intent or risk, which hides regressions on rare but high-risk paths.
  • LLM-as-judge drift between evaluator model versions; mitigated by pinning judge model and rubric, and re-baselining golden sets on judge upgrades.
  • Replay non-determinism caused by an unpinned dependency (Context Pack, snapshot, or model).
  • Trace gaps when a custom adapter forgets to propagate W3C headers — caught by a budget-coverage assertion.
  • The improvement loop iterates on its own corrections without human review.

Operational concerns

  • Sampling stratified by intent, risk tier, and tenant.
  • Tail-based sampling forced for runs crossing destructive gates or failing scorecard thresholds.
  • Trace retention bands by data classification.
  • Cost budget per evaluation run separate from the production Run Budget.
  • Quarterly rebaselining of golden sets and judge rubrics.
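
Stratified sampling with a per-stratum floor can be sketched as below. The sampling rate, the floor of three runs, and the record fields are assumed values chosen for illustration; the function name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(runs, rate, min_per_stratum, seed=0):
    """Sample each (intent, risk_tier, tenant) stratum separately."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    strata = defaultdict(list)
    for run in runs:
        strata[(run["intent"], run["risk_tier"], run["tenant"])].append(run)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Take the proportional share, but never fewer than the floor.
        k = min(len(members), max(min_per_stratum, round(rate * len(members))))
        sample.extend(rng.sample(members, k))
    return sample


# Illustrative traffic: many low-risk runs, very few high-risk ones.
runs = (
    [{"intent": "support.faq", "risk_tier": "low", "tenant": "acme", "id": i}
     for i in range(1000)]
    + [{"intent": "support.refund", "risk_tier": "high", "tenant": "acme", "id": 1000 + i}
       for i in range(5)]
)

picked = stratified_sample(runs, rate=0.01, min_per_stratum=3)
print(len(picked))
print(sum(1 for r in picked if r["risk_tier"] == "high"))
```

A uniform 1% sample of the same traffic would usually contain no high-risk runs at all, which is the failure mode the floor is meant to prevent.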

Evaluation metrics

  • Coverage: fraction of production runs with full trace, full scorecard, and resolvable evidence_refs.
  • Detection lead time: time from regression introduction to release-gate block.
  • Replay determinism: fraction of replays against a pinned snapshot that reproduce an identical DecisionRecord.
  • Improvement adoption rate: fraction of proposals that ship after review.
  • Mean time to safe completion per intent.
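
Two of these metrics reduce to simple ratios. The sketch below assumes hypothetical record fields (`has_full_trace`, `has_scorecard`, `unresolved_evidence_refs`, `shipped`) that are not defined by the source.

```python
def coverage(runs: list) -> float:
    """Fraction of runs with full trace, full scorecard, and resolvable evidence."""
    if not runs:
        return 0.0
    covered = sum(
        1 for r in runs
        if r["has_full_trace"] and r["has_scorecard"] and r["unresolved_evidence_refs"] == 0
    )
    return covered / len(runs)


def adoption_rate(proposals: list) -> float:
    """Fraction of reviewed proposals that shipped."""
    if not proposals:
        return 0.0
    return sum(1 for p in proposals if p["shipped"]) / len(proposals)


# Illustrative records only.
sample_runs = [
    {"has_full_trace": True, "has_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": True, "has_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": True, "has_scorecard": True, "unresolved_evidence_refs": 0},
    {"has_full_trace": False, "has_scorecard": True, "unresolved_evidence_refs": 0},
]
sample_proposals = [{"shipped": True}, {"shipped": False}]

print(coverage(sample_runs))            # 0.75
print(adoption_rate(sample_proposals))  # 0.5
```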

Example

A condensed scorecard for one run:

{
  "run_id": "run_a1b2c3d4e5f60718",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "decision_id": "support.refund.execute",
  "scores": {
    "policy":   { "score": 1.00, "violations": [] },
    "utility":  { "score": 0.92, "judge_rationale": "..." },
    "latency":  { "wall_clock_ms": 1840, "p99_target_ms": 3000 },
    "safety":   { "redaction_success": true, "evidence_coverage": 1.00, "hallucination": false },
    "economics":{ "tokens_used": 4720, "tool_calls": 3, "cost_usd_cents": 0.91 }
  },
  "release_gate": "pass",
  "improvement_proposals": []
}

Common misconceptions

  • Evaluation is not QA. QA runs once; evaluation runs continuously and gates every change.
  • Observability is not logging. Logs are unstructured; this plane emits typed spans tied to a Decision Record.
  • Replay is not optional. Without replay, post-incident analysis becomes guesswork and improvement proposals cannot be validated.
  • Improvement proposals are not auto-applied. They land in the same change-control system as packs and policies.