Evaluation Engine

Trust-plane component computing scorecards, running golden replays, and gating releases.

Reference Design
At a glance
Trust plane: control over the other four

Scoring and release-gate component — per-run scorecards, golden-set replays, and pass/block verdicts on every change.

Inputs
  • Trace bundles from the Observability component
  • Tool transcripts and evidence manifests
  • Operator corrections and approval-gate decisions
  • Golden sets per intent
  • Release-gate thresholds per eval_targets entry from the active pack
Outputs
  • Per-run RunScore record bound to trace_id and record_id
  • Replay datasets (re-derive verdicts against pinned snapshot)
  • Release-gate verdict (pass / block) with deltas
  • Improvement proposals (consumed by Insight Synthesizer / Strategy Compiler / Autotune)
Lifecycle
  1. score
  2. replay
  3. compare
  4. gate
Canonical types
  • RunScore
  • GoldenSet
  • ReleaseVerdict
  • ReplayDataset

Reference Architecture

The Evaluation Engine is the scoring and release-gate component of the Trust plane. It computes per-run scorecards, runs golden-set replays on every change, and emits release-gate verdicts.

Definition

A coordinated set of evaluators (rule-based scorers, LLM-as-judge, and cost/latency aggregators) that scores completed runs: a sample of runs in production and every run in CI. Outputs are typed RunScore records, each bound to a trace_id and a Decision Record.
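As a rough illustration of the definition above, the sketch below applies a small set of evaluators to one completed run and binds the result to its identifiers. The field names, metric names, and the shape of the `run` mapping are assumptions made for this example, not the component's actual schema. An LLM-as-judge evaluator is left out to keep the block self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class RunScore:
    """Illustrative per-run score record; every field name here is assumed."""
    trace_id: str    # links the score to its trace bundle
    record_id: str   # assumed here to identify the Decision Record
    metrics: Dict[str, float] = field(default_factory=dict)


# An evaluator maps one completed run to one number.
Evaluator = Callable[[Mapping], float]

EVALUATORS: Dict[str, Evaluator] = {
    # Rule-based scorer: number of policy rules the run violated.
    "policy.rule_violations": lambda run: float(len(run["violations"])),
    # Latency aggregator: end-to-end wall-clock time in milliseconds.
    "latency.end_to_end_ms": lambda run: float(run["ended_ms"] - run["started_ms"]),
    # Cost aggregator: total tokens spent on the decision.
    "economics.tokens": lambda run: float(sum(run["tokens_per_step"])),
}


def score_run(run: Mapping, evaluators: Dict[str, Evaluator] = EVALUATORS) -> RunScore:
    """Apply every evaluator to one run and bind the result to its IDs."""
    metrics = {name: evaluate(run) for name, evaluate in evaluators.items()}
    return RunScore(run["trace_id"], run["record_id"], metrics)


# Hypothetical completed run.
run = {
    "trace_id": "trace-001", "record_id": "rec-001",
    "violations": [], "started_ms": 1_000, "ended_ms": 1_420,
    "tokens_per_step": [310, 95, 120],
}
score = score_run(run)
print(score.metrics)
```

Freezing the dataclass prevents reassigning a record's fields after scoring (the metrics dict itself stays mutable), which suits a record that later replays are compared against.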

Why it exists

LLM-driven systems drift. Without continuous, structured evaluation, regressions ship undetected. The Evaluation Engine is the gate that catches them — both online (sampled production) and offline (golden-set replays for every change to the prompt, pack, model, or policy).

Evaluators

Letter | Dimension         | Examples
P      | Policy compliance | rule-violation rate, must-refuse coverage, approval-gate honored rate
U      | Utility           | task success rate, decision-correctness on golden set, user-corrected rate
L      | Latency           | p50/p95/p99 per stage, end-to-end Run Context wall-clock
S      | Safety            | redaction success, evidence-coverage, hallucination rate by intent
E      | Economics         | tokens per decision, tool-call count per decision, $/decision by tier

Scorecard deltas are tracked per intent, per tenant, and per pack version, so a regression can be localized to the slice that caused it.
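One possible way to localize a regression is to average a metric per (intent, tenant, pack version) slice and subtract the baseline version from the candidate version. The sketch below assumes scores arrive as plain dictionaries; the keys `intent`, `tenant`, `pack_version`, and `metrics`, and the metric name `task_success`, are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(scores, metric):
    """Average one metric per (intent, tenant, pack_version) slice."""
    buckets = defaultdict(list)
    for s in scores:
        key = (s["intent"], s["tenant"], s["pack_version"])
        buckets[key].append(s["metrics"][metric])
    return {key: mean(values) for key, values in buckets.items()}


def slice_deltas(scores, metric, baseline_version, candidate_version):
    """Return candidate-minus-baseline deltas keyed by (intent, tenant)."""
    means = mean_by_slice(scores, metric)
    deltas = {}
    for (intent, tenant, version), value in means.items():
        if version != candidate_version:
            continue
        base = means.get((intent, tenant, baseline_version))
        if base is not None:  # skip slices with no baseline to compare against
            deltas[(intent, tenant)] = value - base
    return deltas


def _score(intent, tenant, version, success):
    """Build one hypothetical score row for the example."""
    return {"intent": intent, "tenant": tenant, "pack_version": version,
            "metrics": {"task_success": success}}


scores = [
    _score("refund", "acme", "v1", 1.0), _score("refund", "acme", "v1", 1.0),
    _score("refund", "acme", "v2", 1.0), _score("refund", "acme", "v2", 0.0),
    _score("billing", "acme", "v1", 1.0), _score("billing", "acme", "v2", 1.0),
]
deltas = slice_deltas(scores, "task_success", "v1", "v2")
print(deltas)
```

In this toy data the drop is confined to the `refund` intent, which is the kind of localization the per-slice tracking is meant to provide.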

Inputs

  • Trace bundles from the Observability component
  • Tool transcripts and evidence manifests
  • Operator corrections and approval-gate decisions
  • Golden sets per intent
  • Release-gate thresholds per eval_targets entry from the active pack

Outputs

  • Per-run RunScore record bound to trace_id and record_id
  • Replay datasets (re-derive verdicts against pinned snapshot)
  • Release-gate verdict (pass / block) with deltas
  • Improvement proposals (consumed by Insight Synthesizer / Strategy Compiler / Autotune)

How it works

  1. Score — every completed run gets a scorecard record (sampled in prod, exhaustive in CI).
  2. Replay — golden runs re-execute against pinned pack + snapshot + transcripts; verdicts re-derived.
  3. Gate — release candidates compared against baseline; block if any scorecard delta exceeds threshold.
  4. Propose — failed and corrected runs become typed proposals routed to the improvement primitives.
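The Gate step can be sketched as a comparison of aggregated candidate metrics against a baseline. This is a minimal sketch under stated assumptions: the threshold format (a direction plus a tolerated regression per metric), the metric names, and the `ReleaseVerdict` fields are all hypothetical, and the choice to block when a metric is missing is a design decision of this example rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReleaseVerdict:
    """Illustrative verdict shape; field names are assumed."""
    verdict: str                       # "pass" or "block"
    deltas: Dict[str, float]           # candidate minus baseline, per metric
    violations: List[str] = field(default_factory=list)


def gate(baseline: Dict[str, float],
         candidate: Dict[str, float],
         thresholds: Dict[str, Tuple[str, float]]) -> ReleaseVerdict:
    """Block if any metric regresses by more than its tolerated limit."""
    deltas: Dict[str, float] = {}
    violations: List[str] = []
    for metric, (direction, limit) in thresholds.items():
        if metric not in baseline or metric not in candidate:
            violations.append(metric)  # fail closed on a missing metric
            continue
        delta = candidate[metric] - baseline[metric]
        deltas[metric] = delta
        # A regression is a drop for higher-is-better metrics, a rise otherwise.
        regression = -delta if direction == "higher_is_better" else delta
        if regression > limit:
            violations.append(metric)
    return ReleaseVerdict("block" if violations else "pass", deltas, violations)


# Hypothetical thresholds, as they might be derived from eval_targets.
thresholds = {
    "task_success": ("higher_is_better", 0.03),
    "p95_latency_ms": ("lower_is_better", 100.0),
}
baseline = {"task_success": 0.92, "p95_latency_ms": 800.0}
candidate = {"task_success": 0.90, "p95_latency_ms": 950.0}

result = gate(baseline, candidate, thresholds)
print(result.verdict, result.violations)
```

Returning the deltas alongside the verdict keeps the block decision explainable: a reviewer can see which metric moved and by how much.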

Failure modes

  • Sampling not stratified by risk or tenant: rare but high-risk regressions go unsampled and therefore undetected.
  • LLM-as-judge drift between judge model versions: mitigated by pinning the judge model and rubric, and re-baselining on every judge upgrade.
  • Replay non-determinism from an unpinned dependency: mitigated by full pinning checks.
  • Trace gaps when a custom adapter fails to propagate W3C Trace Context: caught by a trace-coverage assertion.
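A trace-coverage assertion for the last failure mode could look roughly like the sketch below. It checks that each span carries a well-formed W3C `traceparent` value (version `00`, a 32-hex trace ID, a 16-hex parent ID, and 2-hex flags) and that its trace ID matches the run's. The span layout (`name` and `traceparent` keys) is an assumption for illustration, and the check accepts only version `00`.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}")


def trace_id_of(traceparent):
    """Return the trace ID of a valid traceparent value, else None."""
    if not traceparent:
        return None
    match = TRACEPARENT.fullmatch(traceparent)
    if match is None:
        return None
    trace_id, parent_id = match.groups()
    # All-zero IDs are invalid under the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def trace_coverage(spans, expected_trace_id):
    """Return (fraction of spans on the expected trace, names of the rest)."""
    missing = [s["name"] for s in spans
               if trace_id_of(s.get("traceparent")) != expected_trace_id]
    covered = len(spans) - len(missing)
    return (covered / len(spans) if spans else 0.0), missing


def assert_trace_coverage(spans, expected_trace_id, minimum=1.0):
    """Raise if too few spans propagate the expected trace context."""
    coverage, missing = trace_coverage(spans, expected_trace_id)
    if coverage < minimum:
        raise AssertionError(f"trace coverage {coverage:.2f} < {minimum}; gaps: {missing}")


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    {"name": "planner", "traceparent": f"00-{TRACE_ID}-00f067aa0ba902b7-01"},
    {"name": "tool_call", "traceparent": f"00-{TRACE_ID}-b7ad6b7169203331-01"},
    {"name": "custom_adapter", "traceparent": None},  # propagation was dropped
]
coverage, missing = trace_coverage(spans, TRACE_ID)
print(coverage, missing)
```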

Operational concerns

  • Sampling stratified by intent, risk tier, tenant.
  • Tail-based sampling forced for runs crossing destructive gates or failing thresholds.
  • Cost budget per evaluation run separate from production Run Budget.
  • Quarterly rebaselining of golden sets and judge rubrics.

Evaluation metrics

  • Coverage (fraction of production runs with full scorecard).
  • Detection lead time (regression introduced → release-gate block).
  • Replay determinism (identical record on identical inputs).
  • Improvement adoption rate (proposals that ship after review).
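The metrics above reduce to simple computations once their inputs are available. The following sketch shows one way to compute three of them. The function names are hypothetical, and the use of a canonical JSON fingerprint to test whether replayed records are identical is an assumption of this example.

```python
import hashlib
import json
from datetime import datetime, timedelta


def coverage(total_runs: int, fully_scored_runs: int) -> float:
    """Fraction of production runs that have a full scorecard."""
    return fully_scored_runs / total_runs if total_runs else 0.0


def detection_lead_time(introduced_at: datetime, blocked_at: datetime) -> timedelta:
    """Time from a regression being introduced to the release-gate block."""
    return blocked_at - introduced_at


def record_fingerprint(record: dict) -> str:
    """Hash a record in canonical form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_is_deterministic(records) -> bool:
    """True if every replay of the same inputs produced an identical record."""
    return len({record_fingerprint(r) for r in records}) == 1


replays = [
    {"verdict": "approve", "evidence": ["doc-1", "doc-2"]},
    {"evidence": ["doc-1", "doc-2"], "verdict": "approve"},  # same content, reordered keys
]
print(coverage(200, 150))
print(replay_is_deterministic(replays))
print(detection_lead_time(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 30)))
```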