The sentence “the agent got better” is not an engineering statement.
Better at what? On which intent? Against which dataset? Did safety hold? Did policy hold? Did cost rise? Did latency move? Did it solve more tasks, or did it just answer more confidently?
Agent teams get into trouble when they collapse quality into one global feeling. A single score is seductive because it simplifies release decisions. It is also dangerous because agents can improve one dimension by damaging another. ContextOS uses Evaluation and Observability to keep those dimensions separate.
Cost improves when the harness retrieves less evidence. Latency improves when it avoids tools. Utility improves when it answers instead of escalating. User satisfaction improves when the agent says yes. None of those are improvements if the agent bypassed policy, skipped evidence, or created an irreversible side effect.
Great AI engineers use scorecards over vibes.
Five dimensions
A production agent scorecard needs at least five dimensions:
| Dimension | Question |
|---|---|
| Policy | Did the harness obey the rules outside the model? |
| Safety | Did it avoid unsafe, unsupported, or sensitive behavior? |
| Utility | Did it complete the task correctly and usefully? |
| Latency | Did it finish within the workflow’s time budget? |
| Economics | Did it spend a reasonable amount to reach a verified outcome? |
Policy and Safety are floor constraints. Utility, Latency, and Economics are optimization dimensions. That split is the first guardrail against metric gaming.
The OpenAI evals and trace-grading direction reinforces this: final-output scoring is useful, but agent evaluation needs structured graders and trace-level inspection when tool calls, handoffs, and guardrails matter. A scorecard turns that into an operating practice.
The scorecard shape
Use one scorecard per run, then aggregate by intent, pack version, policy version, tool manifest, evaluator suite, model profile, and risk class.
```json
{
  "run_id": "run_refund_017",
  "intent": "support.refund",
  "pack_version": "ctxpack.support@5.2.0",
  "policy_bundle": "policy.returns@4.1.0",
  "tool_manifest": "tools.support@3.7.0",
  "evaluator_suite": "evals.refund@2.4.0",
  "scores": {
    "policy": {
      "score": 1.0,
      "violations": [],
      "approval_gate_honored": true
    },
    "safety": {
      "score": 1.0,
      "unsupported_claim": false,
      "redaction_success": true
    },
    "utility": {
      "score": 0.92,
      "task_success": true,
      "operator_correction": false
    },
    "latency": {
      "wall_clock_ms": 1840,
      "p95_target_ms": 3000
    },
    "economics": {
      "tokens": 4720,
      "tool_calls": 3,
      "cost_cents": 0.91
    }
  }
}
```

The exact numbers matter less than the dimensions and joins. If you cannot slice by intent and version, you cannot know what changed.
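A minimal sketch of that slicing, assuming per-run scorecards shaped like the JSON above; the grouping keys and summary statistics are illustrative, not a fixed ContextOS API:

```python
from collections import defaultdict
from statistics import mean

def aggregate(scorecards: list[dict]) -> dict:
    """Group per-run scorecards by release tuple so regressions are attributable."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for card in scorecards:
        key = (card["intent"], card["pack_version"], card["policy_bundle"])
        groups[key].append(card)

    summary = {}
    for key, cards in groups.items():
        summary[key] = {
            "runs": len(cards),
            # Floor dimensions: aggregate by minimum, never by average.
            "policy_min": min(c["scores"]["policy"]["score"] for c in cards),
            "safety_min": min(c["scores"]["safety"]["score"] for c in cards),
            # Optimization dimensions: mean and percentile are acceptable.
            "utility_mean": mean(c["scores"]["utility"]["score"] for c in cards),
            "p95_latency_ms": sorted(
                c["scores"]["latency"]["wall_clock_ms"] for c in cards
            )[int(0.95 * (len(cards) - 1))],
            "cost_cents_mean": mean(c["scores"]["economics"]["cost_cents"] for c in cards),
        }
    return summary
```

Note the split from the previous section expressed in code: floor dimensions aggregate by minimum, optimization dimensions by mean or percentile.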
Policy
Policy answers whether the harness obeyed rules outside model discretion.
Good policy evals are usually boring:
| Check | Example |
|---|---|
| Must-deny | refund above limit without approval returns deny or gate |
| Must-approve-with-gate | destructive action requires the named gate |
| Must-redact | restricted attribute never reaches final answer |
| Must-not-use-tool | tool absent from manifest cannot be called |
| Must-record | policy decision ID appears on DecisionRecord |
A Policy score of 0.99 is not always acceptable. For some workflows, one policy violation blocks release. Treat hard policy rules as floors, not averages.
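A minimal sketch of that floor behavior, assuming the scorecard fields above; the point is that a single violation fails the batch instead of being diluted by an average:

```python
def policy_floor(scorecards: list[dict]) -> dict:
    """Hard policy rules are floors: one violation or gate bypass blocks release."""
    violations = [
        (card["run_id"], v)
        for card in scorecards
        for v in card["scores"]["policy"]["violations"]
    ]
    gate_bypasses = [
        card["run_id"]
        for card in scorecards
        if not card["scores"]["policy"]["approval_gate_honored"]
    ]
    return {
        "pass": not violations and not gate_bypasses,
        "violations": violations,           # each one is individually reviewable
        "approval_bypasses": gate_bypasses,
    }
```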
Safety
Safety is broader than policy. Policy is the explicit rule set. Safety catches unsupported claims, prompt-injection success, harmful tool-use patterns, leakage, and evidence gaps.
Useful safety metrics:
| Metric | Meaning |
|---|---|
| unsupported_claim_rate | final claims without evidence refs |
| evidence_coverage | required evidence present before decision |
| prompt_injection_resistance | untrusted text did not become instruction |
| redaction_success | sensitive fields stayed out of unsafe surfaces |
| irreversible_action_without_gate | should always be zero |
Safety evals should include adversarial and accidental failures. Most production safety failures are not cinematic attacks. They are stale documents, copied user text, confused identity, and missing evidence.
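As a sketch, two of those metrics computed over hypothetical run records; the claim, evidence, and action fields are assumptions about what the trace exposes, not a fixed schema:

```python
def unsupported_claim_rate(runs: list[dict]) -> float:
    """Fraction of final claims that carry no evidence references."""
    claims = [c for run in runs for c in run["claims"]]
    if not claims:
        return 0.0
    unsupported = [c for c in claims if not c.get("evidence_refs")]
    return len(unsupported) / len(claims)

def irreversible_without_gate(runs: list[dict]) -> int:
    """Count of irreversible actions executed without an approval gate. Should be zero."""
    return sum(
        1
        for run in runs
        for action in run["actions"]
        if action["irreversible"] and not action["approval_gate"]
    )
```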
Utility
Utility is where teams are tempted to spend all their time. It matters, but it is not enough.
For agents, utility should be task-grounded:
| Metric | Better than |
|---|---|
| decision correctness | generic “helpfulness” |
| task completion | thumbs-up |
| operator-corrected rate | anecdotal feedback |
| answer completeness | length |
| escalation appropriateness | escalation count alone |
The hard part is deciding expected outcomes. That is why the dataset has owners. An AI engineer should not invent refund policy labels without support-ops and policy ownership.
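A minimal sketch of a task-grounded grader, assuming an expected-outcome record owned by support-ops and policy; the field names are illustrative:

```python
def grade_utility(run: dict, expected: dict) -> dict:
    """Grade one run against an owned expected outcome, not generic helpfulness."""
    decision_correct = run["decision"] == expected["decision"]
    escalated_when_required = run["escalated"] == expected["escalation_required"]
    return {
        "task_success": decision_correct and escalated_when_required,
        "decision_correctness": 1.0 if decision_correct else 0.0,
        "escalation_appropriate": escalated_when_required,
        "operator_correction": run.get("operator_corrected", False),
    }
```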
Latency
Latency is not one number. Agent latency is staged.
| Stage | What to track |
|---|---|
| compile | retrieval, policy resolution, memory recall, budget assembly |
| model | generation time and queueing |
| tool | external service waits and retries |
| critic | validation and scoring time |
| approval | human gate wait, if applicable |
Stage latency matters because fixes differ. A slow retrieval stage needs Context-plane indexing or budget work. A slow tool stage needs Adapter Mesh reliability. A slow Critic needs evaluator design. A slow approval stage needs operations, not model tuning.
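A minimal sketch of that staged view, assuming span-level timings are available from the trace; the span schema is an assumption:

```python
STAGES = ("compile", "model", "tool", "critic", "approval")

def stage_latency_ms(spans: list[dict]) -> dict[str, int]:
    """Sum wall time per stage so a slow run points at the right fix."""
    totals = {stage: 0 for stage in STAGES}
    for span in spans:
        if span["stage"] in totals:
            totals[span["stage"]] += span["end_ms"] - span["start_ms"]
    # End-to-end wall clock for comparison against the workflow budget.
    totals["wall_clock"] = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return totals
```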
Economics
Cost is meaningful only per verified outcome.
Do not report only tokens. Report:
| Metric | Why |
|---|---|
| cents per decision | comparable across models and tools |
| cents per verified success | avoids cheap wrong answers |
| tokens by bucket | shows context pressure |
| tool calls per decision | exposes unnecessary orchestration |
| retry cost | captures unreliable tools and loops |
Cost improvements are suspicious until Policy, Safety, and Utility hold. A cheaper agent that skips evidence is not efficient. It is under-instrumented.
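A minimal sketch of cost per verified success, assuming the scorecard fields above; a success only counts as verified when Policy and Safety held:

```python
def cents_per_verified_success(scorecards: list[dict]) -> float | None:
    """Total spend divided by verified successes, so cheap wrong answers do not look efficient."""
    verified = [
        c for c in scorecards
        if c["scores"]["utility"]["task_success"]
        and c["scores"]["policy"]["score"] == 1.0
        and c["scores"]["safety"]["score"] == 1.0
    ]
    if not verified:
        return None  # no verified outcomes: cost per success is undefined, not zero
    total_cents = sum(c["scores"]["economics"]["cost_cents"] for c in scorecards)
    return total_cents / len(verified)
```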
Scorecard deltas
The scorecard becomes useful when comparing a baseline to a candidate under a release gate.
```
candidate:  ctxpack.support@5.3.0
baseline:   ctxpack.support@5.2.0
intent:     support.refund
set:        heldout release
policy:     1.000 -> 1.000   pass
safety:     1.000 -> 1.000   pass
utility:    0.901 -> 0.924   +0.023
latency:    2180ms -> 2010ms -170ms
economics:  0.91c -> 0.84c   -0.07c
changed:    37 / 900 runs
regressed:  0 destructive, 2 utility review
verdict:    ready_for_shadow
```

The changed count is important. A candidate that changes 37 runs is easier to review than one that changes 650. Blast radius is part of the release decision.
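A minimal sketch of that comparison, assuming baseline and candidate runs keyed by task ID; the fields are illustrative:

```python
def compare(baseline: dict[str, dict], candidate: dict[str, dict]) -> dict:
    """Pair runs by task, count changed outcomes, and compute the utility delta."""
    shared = baseline.keys() & candidate.keys()
    changed = [
        task for task in shared
        if baseline[task]["decision"] != candidate[task]["decision"]
    ]

    def mean_util(runs: dict[str, dict]) -> float:
        return sum(runs[t]["utility"] for t in shared) / len(shared)

    return {
        "changed_runs": len(changed),
        "total_runs": len(shared),
        "utility_delta": mean_util(candidate) - mean_util(baseline),
        "changed_task_ids": changed,  # each one gets a trace-level review
    }
```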
Release thresholds
Write thresholds before running the candidate.
```yaml
release_gate:
  intent: support.refund
  hard_floors:
    policy: "== 1.0"
    safety: "== 1.0"
    approval_bypass_rate: "== 0"
  soft_bounds:
    utility_delta: ">= -0.01"
    p95_latency_delta_ms: "<= 250"
    cost_delta_cents: "<= 0.10"
  review_required_when:
    changed_runs: "> 100"
    utility_regressions: "> 0"
    destructive_path_changed: true
```

The release gate should return discrete verdicts:
| Verdict | Meaning |
|---|---|
| pass | can advance to the next rollout stage |
| needs_human | within bounds but requires owner judgment |
| block | violates a hard floor or unacceptable soft bound |
| invalid | candidate or evaluation data is malformed |
Ambiguous gates produce political releases. Typed gates produce engineering releases.
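A minimal sketch of a typed gate with the thresholds from the YAML above; the summary fields and enum are illustrative, not a ContextOS interface:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    NEEDS_HUMAN = "needs_human"
    BLOCK = "block"
    INVALID = "invalid"

def release_gate(summary: dict) -> Verdict:
    """Map a candidate-vs-baseline summary to a discrete release verdict."""
    required = {"policy", "safety", "utility_delta", "p95_latency_delta_ms",
                "cost_delta_cents", "changed_runs", "utility_regressions"}
    if not required.issubset(summary):
        return Verdict.INVALID
    # Hard floors: never averaged, never traded against soft bounds.
    if summary["policy"] < 1.0 or summary["safety"] < 1.0:
        return Verdict.BLOCK
    if (summary["utility_delta"] < -0.01
            or summary["p95_latency_delta_ms"] > 250
            or summary["cost_delta_cents"] > 0.10):
        return Verdict.BLOCK
    # Within bounds but large blast radius, a utility regression, or a changed
    # destructive path still requires owner judgment.
    if (summary["changed_runs"] > 100
            or summary["utility_regressions"] > 0
            or summary.get("destructive_path_changed", False)):
        return Verdict.NEEDS_HUMAN
    return Verdict.PASS
```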
Common scorecard failures
| Failure | What happens |
|---|---|
| One global average | rare high-risk regressions disappear |
| No version dimensions | nobody can attribute the regression |
| No held-out set | candidate overfits search examples |
| No trace link | score says failed but not why |
| No business owner | expected labels become AI-team guesses |
| No cost per success | cheap wrong answers look good |
| No rejection memory | same bad proposal returns later |
The last one is subtle. Rejected candidates are data. If a proposal failed because it reduced evidence coverage, the next search should know that path is bad.
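As a sketch, a rejection memory keyed by the dimension that failed; the structure is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RejectionMemory:
    """Record rejected candidates so later searches avoid known-bad directions."""
    rejected: list[dict] = field(default_factory=list)

    def record(self, candidate_id: str, reason: str, dimension: str) -> None:
        self.rejected.append(
            {"candidate": candidate_id, "reason": reason, "dimension": dimension}
        )

    def known_bad(self, dimension: str) -> list[dict]:
        # e.g. "evidence_coverage": prior rejections on the same dimension.
        return [r for r in self.rejected if r["dimension"] == dimension]
```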
The ContextOS version
In ContextOS, scorecards attach to DecisionRecords and release tuples:
```
DecisionRecord
  -> evidence_refs
  -> policy_decisions
  -> approvals
  -> tool_transcripts
  -> scorecard
  -> replay_handle
```

This makes scorecards auditable. A utility score can be traced back to evidence refs. A policy score can be traced back to a policy decision ID. A cost score can be traced to tokens and tool calls. A release gate can replay the run against the candidate harness.
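As a sketch, that linkage as a typed record; the field types are assumptions rather than the ContextOS schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    run_id: str
    evidence_refs: tuple[str, ...]     # what the utility score is grounded on
    policy_decisions: tuple[str, ...]  # policy decision IDs behind the policy score
    approvals: tuple[str, ...]         # human gates that were honored
    tool_transcripts: tuple[str, ...]  # tool calls and tokens behind the cost score
    scorecard: dict                    # the five-dimension scores for this run
    replay_handle: str                 # lets the release gate re-run the candidate
```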
The bar
A scorecard is production-grade when:
| Check | Pass line |
|---|---|
| It is multi-dimensional | no single blended score controls release |
| It is intent-scoped | every metric can be sliced by intent and risk |
| It is versioned | pack, policy, tools, evaluator suite, model profile are attached |
| It has hard floors | Policy and Safety cannot be averaged away |
| It links to traces | failures are diagnosable |
| It gates rollout | candidates cannot ship on vibes |
Great AI engineers do not ask whether the agent feels better. They ask which score moved, which guardrail held, which traces changed, and whether the release gate agrees.