Evaluation Engine

Trust-plane component computing scorecards, running golden replays, and gating releases.

Reference Design. Last reviewed: 2026-05-04

At a glance

Trust plane: control over the other four

Scoring and release-gate component — per-run scorecards, golden-set replays, and pass/block verdicts on every change.

Inputs

Trace bundles from the Observability component
Tool transcripts and evidence manifests
Operator corrections and approval-gate decisions
Golden sets per intent
Release-gate thresholds per eval_targets entry from the active pack

Outputs

Per-run RunScore record bound to trace_id and record_id
Replay datasets (re-derive verdicts against pinned snapshot)
Release-gate verdict (pass / block) with deltas
Improvement proposals (consumed by Insight Synthesizer / Strategy Compiler / Autotune)

Lifecycle

score
replay
compare
gate

Canonical types

RunScore
GoldenSet
ReleaseVerdict
ReplayDataset

Reference Architecture

The Evaluation Engine is the scoring and release-gate component of the Trust plane. It computes per-run scorecards, runs golden-set replays on every change, and emits release-gate verdicts.

Definition

A coordinated set of evaluators (rule-based scorers, an LLM-as-judge, and cost/latency aggregators) that scores completed runs: a sample of them in production, all of them in CI. Outputs are typed RunScore records, each bound to a trace_id and a Decision Record.
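A minimal sketch of how such a record might be produced, assuming a trace arrives as a plain mapping. The RunScore fields beyond trace_id and record_id, the two example evaluators, and the trace keys (violations, started_ms, ended_ms) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping

# An evaluator maps one completed run's trace to a single number.
Evaluator = Callable[[Mapping], float]


@dataclass(frozen=True)
class RunScore:
    """Hypothetical shape; the source only states the two bindings."""
    trace_id: str             # ties the score to its trace bundle
    record_id: str            # ties the score to its Decision Record
    scores: Dict[str, float]  # one entry per evaluator dimension


def policy_compliance(trace: Mapping) -> float:
    """Rule-based scorer: 1.0 when no rule violation was recorded."""
    return 0.0 if trace.get("violations") else 1.0


def end_to_end_latency_ms(trace: Mapping) -> float:
    """Latency aggregator: wall-clock duration of the run."""
    return float(trace["ended_ms"] - trace["started_ms"])


def score_run(trace: Mapping, evaluators: Dict[str, Evaluator]) -> RunScore:
    """Apply every evaluator to one trace and bind the result to its IDs."""
    return RunScore(
        trace_id=trace["trace_id"],
        record_id=trace["record_id"],
        scores={dim: evaluate(trace) for dim, evaluate in evaluators.items()},
    )


# Assumed trace layout, for illustration only.
trace = {
    "trace_id": "t-1",
    "record_id": "r-1",
    "violations": [],
    "started_ms": 1000,
    "ended_ms": 1420,
}
run_score = score_run(trace, {"P": policy_compliance, "L": end_to_end_latency_ms})
print(run_score)
```

An LLM-as-judge evaluator would fit the same Evaluator signature; it is left out here because it needs a model call.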

Why it exists

LLM-driven systems drift. Without continuous, structured evaluation, regressions ship undetected. The Evaluation Engine is the gate that catches them — both online (sampled production) and offline (golden-set replays for every change to the prompt, pack, model, or policy).

Evaluators

Letter	Dimension	Examples
P	Policy compliance	rule-violation rate, must-refuse coverage, approval-gate honored rate
U	Utility	task success rate, decision-correctness on golden set, user-corrected rate
L	Latency	p50/p95/p99 per stage, end-to-end Run Context wall-clock
S	Safety	redaction success, evidence-coverage, hallucination rate by intent
E	Economics	tokens per decision, tool-call count per decision, $/decision by tier

Scorecard deltas are tracked per intent, per tenant, and per pack version, so a regression can be localized to the slice that caused it.
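One way to picture per-slice delta tracking, as a sketch under stated assumptions: each score record is a dict carrying intent, tenant, pack_version, and a scores mapping. Those field names and the choice of the mean as the aggregate are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def slice_means(records):
    """Mean score per (intent, tenant, pack_version, dimension)."""
    buckets = defaultdict(list)
    for record in records:
        key = (record["intent"], record["tenant"], record["pack_version"])
        for dim, value in record["scores"].items():
            buckets[key + (dim,)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def slice_deltas(records, baseline_pack, candidate_pack):
    """Candidate minus baseline, for every slice present in both packs."""
    means = slice_means(records)
    deltas = {}
    for (intent, tenant, pack, dim), value in means.items():
        if pack != candidate_pack:
            continue
        baseline = means.get((intent, tenant, baseline_pack, dim))
        if baseline is not None:
            deltas[(intent, tenant, dim)] = value - baseline
    return deltas


# Illustrative data: utility (U) drops for one intent and tenant only.
records = [
    {"intent": "refund", "tenant": "acme", "pack_version": "v1", "scores": {"U": 1.0}},
    {"intent": "refund", "tenant": "acme", "pack_version": "v1", "scores": {"U": 0.5}},
    {"intent": "refund", "tenant": "acme", "pack_version": "v2", "scores": {"U": 0.5}},
    {"intent": "refund", "tenant": "acme", "pack_version": "v2", "scores": {"U": 0.5}},
    {"intent": "lookup", "tenant": "acme", "pack_version": "v1", "scores": {"U": 1.0}},
    {"intent": "lookup", "tenant": "acme", "pack_version": "v2", "scores": {"U": 1.0}},
]
deltas = slice_deltas(records, baseline_pack="v1", candidate_pack="v2")
print(deltas)
```

Because the key keeps intent and tenant separate, the drop shows up on the refund slice while the lookup slice reports no change.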

How it works

Score: each completed run selected for evaluation (a sample in production, every run in CI) gets a scorecard record.
Replay: golden runs are re-executed against the pinned pack, snapshot, and transcripts, and their verdicts are re-derived.
Gate: each release candidate is compared against the baseline and blocked if any scorecard delta exceeds its threshold.
Propose: failed and corrected runs become typed proposals routed to the improvement primitives.
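The gate step can be sketched as a comparison of two aggregated scorecards against per-metric thresholds. This is an illustration, not the engine's actual logic: the Threshold and ReleaseVerdict fields, the metric names, and the idea of a direction flag (higher or lower is better) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    max_regression: float          # largest tolerated move in the bad direction
    higher_is_better: bool = True  # False for latency- or cost-style metrics


@dataclass(frozen=True)
class ReleaseVerdict:
    verdict: str                   # "pass" or "block"
    deltas: Dict[str, float]       # candidate minus baseline, per metric
    violations: Dict[str, float]   # the deltas that exceeded their threshold


def gate(baseline, candidate, thresholds) -> ReleaseVerdict:
    """Block the release if any metric regresses past its threshold."""
    deltas = {metric: candidate[metric] - baseline[metric] for metric in thresholds}
    violations = {}
    for metric, threshold in thresholds.items():
        # Express the change as "how much worse", whatever the direction.
        regression = -deltas[metric] if threshold.higher_is_better else deltas[metric]
        if regression > threshold.max_regression:
            violations[metric] = deltas[metric]
    return ReleaseVerdict("block" if violations else "pass", deltas, violations)


# Hypothetical thresholds, standing in for eval_targets entries.
thresholds = {
    "task_success": Threshold(max_regression=0.05),
    "p95_latency_ms": Threshold(max_regression=100.0, higher_is_better=False),
}
baseline = {"task_success": 0.92, "p95_latency_ms": 800.0}
candidate = {"task_success": 0.90, "p95_latency_ms": 1000.0}

result = gate(baseline, candidate, thresholds)
print(result.verdict, result.violations)
```

Here the small drop in task success stays within tolerance, but the 200 ms latency increase exceeds its 100 ms allowance, so the candidate is blocked and the verdict carries the deltas that explain why.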

Failure modes

Sampling not stratified by risk or tenant: rare but high-risk regressions are never sampled, so they go undetected.
LLM-as-judge drift between judge model versions: mitigated by pinning the judge model and rubric, and re-baselining on every judge upgrade.
Replay non-determinism from an unpinned dependency: mitigated by full pinning checks on every replay dependency.
Trace gaps when a custom adapter fails to propagate W3C Trace Context: caught by a trace-coverage assertion.
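Two of the mitigations above lend themselves to simple pre-flight checks. The sketch below assumes a replay manifest is a dict and that spans are dicts with stage and trace_id keys; the list of required pins, the set of floating values, and the stage names are all hypothetical.

```python
# Hypothetical list of inputs a replay must pin to be reproducible.
REQUIRED_PINS = ("pack_version", "snapshot_id", "judge_model", "judge_rubric")
# Values treated as "not pinned": absent, empty, or a floating alias.
FLOATING = (None, "", "latest")


def unpinned_inputs(manifest):
    """Return the required pins that are missing or floating."""
    return [name for name in REQUIRED_PINS if manifest.get(name) in FLOATING]


def trace_coverage(spans, trace_id, expected_stages):
    """Fraction of expected stages with a span carrying the run's trace_id."""
    covered = {span["stage"] for span in spans if span.get("trace_id") == trace_id}
    return len(covered & set(expected_stages)) / len(expected_stages)


manifest = {"pack_version": "2.3.1", "snapshot_id": "snap-0042", "judge_model": "latest"}
missing = unpinned_inputs(manifest)  # judge_model floats, judge_rubric is absent

spans = [
    {"stage": "plan", "trace_id": "t-1"},
    {"stage": "tool", "trace_id": None},  # adapter dropped the trace context
    {"stage": "respond", "trace_id": "t-1"},
]
stage_coverage = trace_coverage(spans, "t-1", ["plan", "tool", "respond"])

print(missing, stage_coverage)
```

A replay would refuse to start while unpinned_inputs returns anything, and a coverage value below 1.0 would fail the trace-coverage assertion.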

Operational concerns

Sampling stratified by intent, risk tier, tenant.
Tail-based sampling forced for runs crossing destructive gates or failing thresholds.
Cost budget per evaluation run separate from production Run Budget.
Quarterly rebaselining of golden sets and judge rubrics.
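The two sampling rules above can be combined in one decision function. This is a sketch under assumptions: the rates, the run fields (risk_tier, crossed_destructive_gate, failed_threshold), and the hash-based bucketing are illustrative choices, not values or mechanisms taken from the source.

```python
import hashlib

# Hypothetical rates: a default per risk tier, plus optional overrides
# for a specific (intent, risk tier, tenant) stratum.
TIER_RATES = {"low": 0.01, "medium": 0.10, "high": 1.0}
STRATUM_OVERRIDES = {("refund", "low", "acme"): 0.0}


def bucket(trace_id: str) -> float:
    """Map a trace_id to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and never rounds up to 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sample_rate(run) -> float:
    """Most specific stratum first, then the risk tier's default."""
    stratum = (run["intent"], run["risk_tier"], run["tenant"])
    return STRATUM_OVERRIDES.get(stratum, TIER_RATES[run["risk_tier"]])


def should_evaluate(run) -> bool:
    # Tail-based override: known only once the run has finished.
    if run.get("crossed_destructive_gate") or run.get("failed_threshold"):
        return True
    return bucket(run["trace_id"]) < sample_rate(run)


routine = {"trace_id": "t-1", "intent": "refund", "risk_tier": "low", "tenant": "acme"}
destructive = dict(routine, crossed_destructive_gate=True)
high_risk = dict(routine, risk_tier="high")

print(should_evaluate(routine), should_evaluate(destructive), should_evaluate(high_risk))
```

Hashing the trace_id instead of drawing a random number makes the decision repeatable: the same run is always in or out of the sample, which helps when a scorecard has to be reproduced later.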

Evaluation metrics

Coverage (fraction of production runs with full scorecard).
Detection lead time (regression introduced → release-gate block).
Replay determinism (identical record on identical inputs).
Improvement adoption rate (proposals that ship after review).
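Two of these metrics are straightforward to compute once records are available. The sketch below assumes scorecards are dicts keyed by the five evaluator letters and that replay records are JSON-serializable; the data layout and function names are hypothetical.

```python
import hashlib
import json

DIMENSIONS = {"P", "U", "L", "S", "E"}


def coverage(production_run_ids, scorecards):
    """Fraction of production runs whose scorecard has every dimension."""
    run_ids = set(production_run_ids)
    full = {rid for rid in run_ids if DIMENSIONS <= set(scorecards.get(rid, {}))}
    return len(full) / len(run_ids)


def fingerprint(record):
    """Hash of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_determinism(replays_by_golden_run):
    """Fraction of golden runs whose repeated replays all yield one record."""
    stable = sum(
        1
        for records in replays_by_golden_run.values()
        if len({fingerprint(record) for record in records}) == 1
    )
    return stable / len(replays_by_golden_run)


scorecards = {
    "a": {"P": 1.0, "U": 1.0, "L": 420.0, "S": 1.0, "E": 0.02},
    "b": {"P": 1.0},  # partial scorecard, does not count as covered
}
run_coverage = coverage(["a", "b", "c", "d"], scorecards)

replays = {
    "g1": [{"verdict": "pass", "U": 0.9}, {"U": 0.9, "verdict": "pass"}],
    "g2": [{"verdict": "pass"}, {"verdict": "block"}],
}
determinism = replay_determinism(replays)

print(run_coverage, determinism)
```

Only run "a" has all five dimensions, so coverage is one in four. Golden run g1 replays to the same record despite a different key order, while g2 does not, so determinism is one in two.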