Evaluation Engine
Trust-plane component computing scorecards, running golden replays, and gating releases.
Scoring and release-gate component — per-run scorecards, golden-set replays, and pass/block verdicts on every change.
- score
- replay
- compare
- gate
- RunScore
- GoldenSet
- ReleaseVerdict
- ReplayDataset
The Evaluation Engine is the scoring and release-gate component of the Trust plane. It computes per-run scorecards, runs golden-set replays on every change, and emits release-gate verdicts.
Definition
A coordinated set of evaluators (rule-based scorers, LLM-as-judge, and cost/latency aggregators) that scores completed runs: a sample of runs in production, every run in CI. Outputs are typed RunScore records bound to a trace_id and a Decision Record.
Why it exists
LLM-driven systems drift. Without continuous, structured evaluation, regressions ship undetected. The Evaluation Engine is the gate that catches them — both online (sampled production) and offline (golden-set replays for every change to the prompt, pack, model, or policy).
Evaluators
| Letter | Dimension | Examples |
|---|---|---|
| P | Policy compliance | rule-violation rate, must-refuse coverage, approval-gate honored rate |
| U | Utility | task success rate, decision-correctness on golden set, user-corrected rate |
| L | Latency | p50/p95/p99 per stage, end-to-end Run Context wall-clock |
| S | Safety | redaction success, evidence-coverage, hallucination rate by intent |
| E | Economics | tokens per decision, tool-call count per decision, $/decision by tier |
Scorecard deltas are tracked per intent, per tenant, and per pack version so that a regression can be localized to the slice where it occurs.
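A minimal sketch of how per-slice deltas could be derived from scorecards. Only `RunScore`, `trace_id`, and `record_id` come from this document; the remaining fields, the `scores` mapping keyed by dimension letter, and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunScore:
    """Hypothetical record shape; only trace_id and record_id are documented."""
    trace_id: str
    record_id: str
    intent: str
    tenant: str
    pack_version: str
    scores: dict[str, float]  # one aggregate per dimension letter, e.g. {"U": 0.9}


def mean_by_slice(runs, dimension, pack_version):
    """Average one dimension per (intent, tenant) slice for a given pack version."""
    buckets = defaultdict(list)
    for run in runs:
        if run.pack_version == pack_version and dimension in run.scores:
            buckets[(run.intent, run.tenant)].append(run.scores[dimension])
    return {key: mean(values) for key, values in buckets.items()}


def slice_deltas(baseline, candidate):
    """Candidate minus baseline for every slice present in both versions."""
    return {key: candidate[key] - baseline[key] for key in baseline.keys() & candidate.keys()}
```

Keying the result by `(intent, tenant)` means a drop that affects only one tenant's traffic for one intent shows up as a single negative entry rather than being averaged away.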
Inputs
- Trace bundles from the Observability component
- Tool transcripts and evidence manifests
- Operator corrections and approval-gate decisions
- Golden sets per intent
- Release-gate thresholds per eval_targets entry from the active pack
Outputs
- Per-run RunScore record bound to trace_id and record_id
- Replay datasets (re-derive verdicts against pinned snapshot)
- Release-gate verdict (pass / block) with deltas
- Improvement proposals (consumed by Insight Synthesizer / Strategy Compiler / Autotune)
How it works
- Score — every completed run gets a scorecard record (sampled in prod, exhaustive in CI).
- Replay — golden runs re-execute against pinned pack + snapshot + transcripts; verdicts re-derived.
- Gate — release candidates compared against baseline; block if any scorecard delta exceeds threshold.
- Propose — failed and corrected runs become typed proposals routed to the improvement primitives.
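The Gate step can be sketched as a comparison of candidate metrics against a baseline. This is an illustration under stated assumptions, not the engine's actual logic: the fields of `ReleaseVerdict`, the metric names, and the `lower_is_better` parameter are hypothetical, and thresholds are assumed to be the maximum tolerated regression per metric.

```python
from dataclasses import dataclass


@dataclass
class ReleaseVerdict:
    """Hypothetical fields; the document only names the type and pass / block."""
    verdict: str               # "pass" or "block"
    deltas: dict[str, float]   # candidate minus baseline, per metric
    breaches: list[str]        # metrics whose regression exceeded its threshold


def gate(baseline, candidate, thresholds, lower_is_better=frozenset()):
    """Block the release if any metric regresses by more than its threshold."""
    deltas, breaches = {}, []
    for metric, limit in thresholds.items():
        delta = candidate[metric] - baseline[metric]
        deltas[metric] = delta
        # A regression is a drop for higher-is-better metrics and a rise for
        # lower-is-better ones (latency, cost).
        regression = delta if metric in lower_is_better else -delta
        if regression > limit:
            breaches.append(metric)
    return ReleaseVerdict("block" if breaches else "pass", deltas, breaches)
```

Returning the full `deltas` map alongside the verdict keeps the "with deltas" part of the output available to reviewers even when the verdict is pass.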
Failure modes
- Sampling not stratified by risk / tenant, which leaves rare-but-high-risk regressions undetected.
- LLM-as-judge drift between judge model versions — mitigated by pinning judge model + rubric and re-baselining on judge upgrade.
- Replay non-determinism from an unpinned dependency — mitigated by full pinning checks.
- Trace gaps when a custom adapter forgets W3C propagation — caught by trace-coverage assertion.
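One way a "full pinning check" might look before a replay starts. The manifest keys below are assumptions chosen to mirror the pack, snapshot, transcripts, judge model, and rubric mentioned in this document; the real manifest format is not specified here.

```python
# Hypothetical manifest keys; the actual names are not given in this document.
REQUIRED_PINS = (
    "pack_version",
    "snapshot_id",
    "transcript_hash",
    "judge_model",
    "judge_rubric",
)

# Values that do not identify a single immutable version.
FLOATING_VALUES = {"", "latest", "*"}


def unpinned_dependencies(manifest):
    """Return the required pins that are missing, empty, or floating."""
    problems = []
    for key in REQUIRED_PINS:
        value = str(manifest.get(key) or "").strip()
        if value in FLOATING_VALUES:
            problems.append(key)
    return problems
```

A replay would be refused when the returned list is non-empty, so that a floating judge model or snapshot cannot silently change the re-derived verdicts.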
Operational concerns
- Sampling stratified by intent, risk tier, tenant.
- Tail-based sampling forced for runs crossing destructive gates or failing thresholds.
- Cost budget per evaluation run separate from production Run Budget.
- Quarterly rebaselining of golden sets and judge rubrics.
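The sampling rules above could be approximated as follows. This is a sketch, not the engine's sampler: the rates, the tier names, and the function signature are hypothetical, and rates are keyed by risk tier only, although a real configuration might also vary them by intent or tenant.

```python
import hashlib

# Hypothetical per-tier rates; real values would come from configuration.
SAMPLE_RATES = {"low": 0.01, "medium": 0.10, "high": 1.0}


def should_evaluate(trace_id, intent, tenant, risk_tier,
                    crossed_destructive_gate=False, failed_threshold=False):
    """Decide whether a completed run gets a scorecard."""
    # Forced capture: these runs are always kept, whatever the sample rate.
    if crossed_destructive_gate or failed_threshold:
        return True
    rate = SAMPLE_RATES.get(risk_tier, 1.0)  # unknown tiers are fully sampled
    # Hash the stratum together with the trace so the decision is reproducible
    # and each (intent, risk tier, tenant) cell draws independently.
    key = f"{intent}|{risk_tier}|{tenant}|{trace_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return bucket < rate
```

Hash-based selection avoids shared random state, so re-running the decision for the same trace always gives the same answer.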
Evaluation metrics
- Coverage (fraction of production runs with full scorecard).
- Detection lead time (regression introduced → release-gate block).
- Replay determinism (identical record on identical inputs).
- Improvement adoption rate (proposals that ship after review).
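Two of these metrics, coverage and replay determinism, reduce to simple ratios. The sketch below assumes records are JSON-serializable dictionaries and that each golden run is replayed twice on identical inputs; the function names are illustrative only.

```python
import hashlib
import json


def coverage(production_trace_ids, scored_trace_ids):
    """Fraction of production runs that have a scorecard."""
    production = set(production_trace_ids)
    if not production:
        return 0.0
    return len(production & set(scored_trace_ids)) / len(production)


def record_fingerprint(record):
    """Stable hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_determinism(replay_pairs):
    """Fraction of (first, second) replay pairs that produced identical records."""
    if not replay_pairs:
        return 0.0  # no replays means no evidence of determinism
    same = sum(
        record_fingerprint(first) == record_fingerprint(second)
        for first, second in replay_pairs
    )
    return same / len(replay_pairs)
```

Comparing fingerprints rather than raw objects makes the check insensitive to dictionary key order while still catching any change in values.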