The sentence “the agent got better” is not an engineering statement.
Better at what? On which intent? Against which dataset? Did safety hold? Did policy hold? Did cost rise? Did latency move? Did it solve more tasks, or did it just answer more confidently?
Agent teams get into trouble when they collapse quality into one global feeling. A single score is seductive because it simplifies release decisions. It is also dangerous because agents can improve one dimension by damaging another. ContextOS uses Evaluation and Observability to keep those dimensions separate.
Cost improves when the harness retrieves less evidence. Latency improves when it avoids tools. Utility improves when it answers instead of escalating. User satisfaction improves when the agent says yes. None of those are improvements if the agent bypassed policy, skipped evidence, or created an irreversible side effect.
Great AI engineers use scorecards over vibes.
Five dimensions
A production agent scorecard needs at least five dimensions:
| Dimension | Question |
|---|---|
| Policy | Did the harness obey the rules outside the model? |
| Safety | Did it avoid unsafe, unsupported, or sensitive behavior? |
| Utility | Did it complete the task correctly and usefully? |
| Latency | Did it finish within the workflow’s time budget? |
| Economics | Did it spend a reasonable amount to reach a verified outcome? |
Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That split is the first guardrail against metric gaming.
The OpenAI evals and trace-grading direction reinforces this: final-output scoring is useful, but agent evaluation needs structured graders and trace-level inspection when tool calls, handoffs, and guardrails matter. A scorecard turns that into an operating practice.
The scorecard shape
Use one scorecard per run, then aggregate by intent, pack version, policy version, tool manifest, evaluator suite, model profile, and risk class.
```json
{
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "evaluator_suite": "evals.refund@2.4.0",
  "scores": {
    "policy": {
      "score": 1.0,
      "violations": [],
      "approval_gate_honored": true
    },
    "safety": {
      "score": 1.0,
      "unsupported_claim": false,
      "redaction_success": true
    },
    "utility": {
      "score": 0.92,
      "task_success": true,
      "operator_correction": false
    },
    "latency": {
      "wall_clock_ms": 1840,
      "p95_target_ms": 3000
    },
    "economics": {
      "tokens": 4720,
      "tool_calls": 3,
      "cost_cents": 0.91
    }
  }
}
```

The exact numbers matter less than the dimensions and joins. If you cannot slice by intent and version, you cannot know what changed.
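A minimal sketch of that slicing, assuming per-run scorecards shaped like the JSON above; the grouping keys and summary statistics are illustrative, not a fixed ContextOS API:

```python
from collections import defaultdict
from statistics import mean

def aggregate(scorecards: list[dict]) -> dict:
    """Group per-run scorecards by release tuple so regressions are attributable."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for card in scorecards:
        key = (card["intent"], card["pack_version"], card["policy_bundle"])
        groups[key].append(card)

    summary = {}
    for key, cards in groups.items():
        summary[key] = {
            "runs": len(cards),
            # Floor dimensions: aggregate by minimum, never by average.
            "policy_min": min(c["scores"]["policy"]["score"] for c in cards),
            "safety_min": min(c["scores"]["safety"]["score"] for c in cards),
            # Optimization dimensions: mean and percentile are acceptable.
            "utility_mean": mean(c["scores"]["utility"]["score"] for c in cards),
            "p95_latency_ms": sorted(
                c["scores"]["latency"]["wall_clock_ms"] for c in cards
            )[int(0.95 * (len(cards) - 1))],
            "cost_cents_mean": mean(c["scores"]["economics"]["cost_cents"] for c in cards),
        }
    return summary
```

Note the split from the previous section expressed in code: floor dimensions aggregate by minimum, optimization dimensions by mean or percentile.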
Policy
Policy answers whether the harness obeyed rules outside model discretion.
Good policy evals are usually boring:
| Check | Example |
|---|---|
| Must-deny | refund above limit without approval returns deny or gate |
| Must-approve-with-gate | destructive action requires the named gate |
| Must-redact | restricted attribute never reaches final answer |
| Must-not-use-tool | tool absent from manifest cannot be called |
| Must-record | policy decision ID appears on DecisionRecord |
A Policy score of 0.99 is not always acceptable. For some workflows, one policy violation blocks release. Treat hard policy rules as floors, not averages.
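A minimal sketch of that floor behavior, assuming the scorecard fields above; the point is that a single violation fails the batch instead of being diluted by an average:

```python
def policy_floor(scorecards: list[dict]) -> dict:
    """Hard policy rules are floors: one violation or gate bypass blocks release."""
    violations = [
        (card["run_id"], v)
        for card in scorecards
        for v in card["scores"]["policy"]["violations"]
    ]
    gate_bypasses = [
        card["run_id"]
        for card in scorecards
        if not card["scores"]["policy"]["approval_gate_honored"]
    ]
    return {
        "pass": not violations and not gate_bypasses,
        "violations": violations,           # each one is individually reviewable
        "approval_bypasses": gate_bypasses,
    }
```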
Safety
Safety is broader than policy. Policy is the explicit rule set. Safety catches unsupported claims, prompt-injection success, harmful tool-use patterns, leakage, and evidence gaps.
Useful safety metrics:
| Metric | Meaning |
|---|---|
| unsupported_claim_rate | final claims without evidence refs |
| evidence_coverage | required evidence present before decision |
| prompt_injection_resistance | untrusted text did not become instruction |
| redaction_success | sensitive fields stayed out of unsafe surfaces |
| irreversible_action_without_gate | should always be zero |
Safety evals should include adversarial and accidental failures. Most production safety failures are not cinematic attacks. They are stale documents, copied user text, confused identity, and missing evidence.
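As a sketch, two of those metrics computed over hypothetical run records; the claim, evidence, and action fields are assumptions about what the trace exposes, not a fixed schema:

```python
def unsupported_claim_rate(runs: list[dict]) -> float:
    """Fraction of final claims that carry no evidence references."""
    claims = [c for run in runs for c in run["claims"]]
    if not claims:
        return 0.0
    unsupported = [c for c in claims if not c.get("evidence_refs")]
    return len(unsupported) / len(claims)

def irreversible_without_gate(runs: list[dict]) -> int:
    """Count of irreversible actions executed without an approval gate. Should be zero."""
    return sum(
        1
        for run in runs
        for action in run["actions"]
        if action["irreversible"] and not action["approval_gate"]
    )
```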
Utility
Utility is where teams are tempted to spend all their time. It matters, but it is not enough.
For agents, utility should be task-grounded:
| Metric | Better than |
|---|---|
| decision correctness | generic “helpfulness” |
| task completion | thumbs-up |
| operator-corrected rate | anecdotal feedback |
| answer completeness | length |
| escalation appropriateness | escalation count alone |
The hard part is deciding expected outcomes. That is why the dataset has owners. An AI engineer should not invent refund policy labels without support-ops and policy ownership.
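A minimal sketch of a task-grounded grader, assuming an expected-outcome record owned by support-ops and policy; the field names are illustrative:

```python
def grade_utility(run: dict, expected: dict) -> dict:
    """Grade one run against an owned expected outcome, not generic helpfulness."""
    decision_correct = run["decision"] == expected["decision"]
    escalated_when_required = run["escalated"] == expected["escalation_required"]
    return {
        "task_success": decision_correct and escalated_when_required,
        "decision_correctness": 1.0 if decision_correct else 0.0,
        "escalation_appropriate": escalated_when_required,
        "operator_correction": run.get("operator_corrected", False),
    }
```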
Latency
Latency is not one number. Agent latency is staged.
| Stage | What to track |
|---|---|
| compile | retrieval, policy resolution, memory recall, budget assembly |
| model | generation time and queueing |
| tool | external service waits and retries |
| critic | validation and scoring time |
| approval | human gate wait, if applicable |
Stage latency matters because fixes differ. A slow retrieval stage needs Context-plane indexing or budget work. A slow tool stage needs Adapter Mesh reliability. A slow Critic needs evaluator design. A slow approval stage needs operations, not model tuning.
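A minimal sketch of that staged view, assuming span-level timings are available from the trace; the span schema is an assumption:

```python
STAGES = ("compile", "model", "tool", "critic", "approval")

def stage_latency_ms(spans: list[dict]) -> dict[str, int]:
    """Sum wall time per stage so a slow run points at the right fix."""
    totals = {stage: 0 for stage in STAGES}
    for span in spans:
        if span["stage"] in totals:
            totals[span["stage"]] += span["end_ms"] - span["start_ms"]
    # End-to-end wall clock for comparison against the workflow budget.
    totals["wall_clock"] = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return totals
```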
Economics
Cost is meaningful only per verified outcome.
Do not report only tokens. Report:
| Metric | Why |
|---|---|
| cents per decision | comparable across models and tools |
| cents per verified success | avoids cheap wrong answers |
| tokens by bucket | shows context pressure |
| tool calls per decision | exposes unnecessary orchestration |
| retry cost | captures unreliable tools and loops |
Cost improvements are suspicious until Policy, Safety, and Utility hold. A cheaper agent that skips evidence is not efficient. It is under-instrumented.
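A minimal sketch of cost per verified success, assuming the scorecard fields above; a success only counts as verified when Policy and Safety held:

```python
def cents_per_verified_success(scorecards: list[dict]) -> float | None:
    """Total spend divided by verified successes, so cheap wrong answers do not look efficient."""
    verified = [
        c for c in scorecards
        if c["scores"]["utility"]["task_success"]
        and c["scores"]["policy"]["score"] == 1.0
        and c["scores"]["safety"]["score"] == 1.0
    ]
    if not verified:
        return None  # no verified outcomes: cost per success is undefined, not zero
    total_cents = sum(c["scores"]["economics"]["cost_cents"] for c in scorecards)
    return total_cents / len(verified)
```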
Scorecard deltas
The scorecard becomes useful when comparing a baseline to a candidate under a release gate.
```
candidate:  ctxpack.support@5.3.0
baseline:   ctxpack.support@5.2.0
intent:     support.refund
set:        heldout release
policy:     1.000 -> 1.000   pass
safety:     1.000 -> 1.000   pass
utility:    0.901 -> 0.924   +0.023
latency:    2180ms -> 2010ms -170ms
economics:  0.91c -> 0.84c   -0.07c
changed:    37 / 900 runs
regressed:  0 destructive, 2 utility review
verdict:    ready_for_shadow
```

The changed count is important. A candidate that changes 37 runs is easier to review than one that changes 650. Blast radius is part of the release decision.
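A minimal sketch of that comparison, assuming baseline and candidate runs keyed by task ID; the fields are illustrative:

```python
def compare(baseline: dict[str, dict], candidate: dict[str, dict]) -> dict:
    """Pair runs by task, count changed outcomes, and compute the utility delta."""
    shared = baseline.keys() & candidate.keys()
    changed = [
        task for task in shared
        if baseline[task]["decision"] != candidate[task]["decision"]
    ]

    def mean_util(runs: dict[str, dict]) -> float:
        return sum(runs[t]["utility"] for t in shared) / len(shared)

    return {
        "changed_runs": len(changed),
        "total_runs": len(shared),
        "utility_delta": mean_util(candidate) - mean_util(baseline),
        "changed_task_ids": changed,  # each one gets a trace-level review
    }
```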
Release thresholds
Write thresholds before running the candidate.
```yaml
release_gate:
  intent: support.refund
  hard_floors:
    policy: "== 1.0"
    safety: "== 1.0"
    approval_bypass_rate: "== 0"
  soft_bounds:
    utility_delta: ">= -0.01"
    p95_latency_delta_ms: "<= 250"
    cost_delta_cents: "<= 0.10"
  review_required_when:
    changed_runs: "> 100"
    utility_regressions: "> 0"
    destructive_path_changed: true
```

The release gate should return discrete verdicts:
| Verdict | Meaning |
|---|---|
| pass | can advance to the next rollout stage |
| needs_human | within bounds but requires owner judgment |
| block | violates a hard floor or unacceptable soft bound |
| invalid | candidate or evaluation data is malformed |
Ambiguous gates produce political releases. Typed gates produce engineering releases.
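A minimal sketch of a typed gate with the thresholds from the YAML above; the summary fields and enum are illustrative, not a ContextOS interface:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    NEEDS_HUMAN = "needs_human"
    BLOCK = "block"
    INVALID = "invalid"

def release_gate(summary: dict) -> Verdict:
    """Map a candidate-vs-baseline summary to a discrete release verdict."""
    required = {"policy", "safety", "utility_delta", "p95_latency_delta_ms",
                "cost_delta_cents", "changed_runs", "utility_regressions"}
    if not required.issubset(summary):
        return Verdict.INVALID
    # Hard floors: never averaged, never traded against soft bounds.
    if summary["policy"] < 1.0 or summary["safety"] < 1.0:
        return Verdict.BLOCK
    if (summary["utility_delta"] < -0.01
            or summary["p95_latency_delta_ms"] > 250
            or summary["cost_delta_cents"] > 0.10):
        return Verdict.BLOCK
    # Within bounds but large blast radius, a utility regression, or a changed
    # destructive path still requires owner judgment.
    if (summary["changed_runs"] > 100
            or summary["utility_regressions"] > 0
            or summary.get("destructive_path_changed", False)):
        return Verdict.NEEDS_HUMAN
    return Verdict.PASS
```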
Common scorecard failures
| Failure | What happens |
|---|---|
| One global average | rare high-risk regressions disappear |
| No version dimensions | nobody can attribute the regression |
| No held-out set | candidate overfits search examples |
| No trace link | score says failed but not why |
| No business owner | expected labels become AI-team guesses |
| No cost per success | cheap wrong answers look good |
| No rejection memory | same bad proposal returns later |
The last one is subtle. Rejected candidates are data. If a proposal failed because it reduced evidence coverage, the next search should know that path is bad.
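As a sketch, a rejection memory keyed by the dimension that failed; the structure is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RejectionMemory:
    """Record rejected candidates so later searches avoid known-bad directions."""
    rejected: list[dict] = field(default_factory=list)

    def record(self, candidate_id: str, reason: str, dimension: str) -> None:
        self.rejected.append(
            {"candidate": candidate_id, "reason": reason, "dimension": dimension}
        )

    def known_bad(self, dimension: str) -> list[dict]:
        # e.g. "evidence_coverage": prior rejections on the same dimension.
        return [r for r in self.rejected if r["dimension"] == dimension]
```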
The ContextOS version
In ContextOS, scorecards attach to DecisionRecords and release tuples:
```
DecisionRecord
  -> evidence_refs
  -> policy_decisions
  -> approvals
  -> tool_transcripts
  -> scorecard
  -> replay_handle
```

This makes scorecards auditable. A utility score can be traced back to evidence refs. A policy score can be traced back to a policy decision ID. A cost score can be traced to tokens and tool calls. A release gate can replay the run against the candidate harness.
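As a sketch, that linkage as a typed record; the field types are assumptions rather than the ContextOS schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    run_id: str
    evidence_refs: tuple[str, ...]     # what the utility score is grounded on
    policy_decisions: tuple[str, ...]  # policy decision IDs behind the policy score
    approvals: tuple[str, ...]         # human gates that were honored
    tool_transcripts: tuple[str, ...]  # tool calls and tokens behind the cost score
    scorecard: dict                    # the five-dimension scores for this run
    replay_handle: str                 # lets the release gate re-run the candidate
```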
The bar
A scorecard is production-grade when:
| Check | Pass line |
|---|---|
| It is multi-dimensional | no single blended score controls release |
| It is intent-scoped | every metric can be sliced by intent and risk |
| It is versioned | pack, policy, tools, evaluator suite, model profile are attached |
| It has hard floors | Policy and Safety cannot be averaged away |
| It links to traces | failures are diagnosable |
| It gates rollout | candidates cannot ship on vibes |
Great AI engineers do not ask whether the agent feels better. They ask which score moved, which guardrail held, which traces changed, and whether the release gate agrees.