Scorecards Before Screens: Evals and Launch Gates for PMs Building Agents

Agent products should not start with the screen.

They should start with the scorecard.

A screen can make a weak agent look polished for a demo. A scorecard tells you whether the system is good enough to trust in real work.

For PMs, the hard part is that agent quality is not one number. A complex agent can be useful but unsafe, cheap but wrong, fast but under-evidenced, compliant but unusable, or accurate on final answers while taking a dangerous path through tools.

The product manager’s job is to define which tradeoffs are allowed.

ContextOS makes this explicit with Evaluation and Observability: every run is traced, scored, replayable, and tied to release gates.

Why “quality” is too vague

Bad launch criterion:

The agent should answer correctly 90% of the time.

Better:

For support.refund.execute, Policy and Safety must remain at 1.0 on the release set, Utility must improve by 5 points over baseline, p95 latency must stay under 3 seconds, economics must stay below 1.2 cents per decision, and destructive actions must have zero unexpected execution in shadow.

That sounds heavier because it is real.

Agents are work systems. Work systems need operational scorecards.

The five-axis PM scorecard

Use five axes:

Axis	PM question	Example metrics
Policy	Did the system obey business, legal, and product rules?	rule violation rate, approval gate honored rate
Safety	Did it avoid harmful, private, or unsupported output?	unsupported claim rate, redaction success
Utility	Did it complete the user’s job?	task success, operator correction rate
Latency	Did it fit the workflow rhythm?	p95 end-to-end time, tool wait time
Economics	Did value exceed operating cost?	cost per verified success, human minutes saved

Policy and Safety are floors. Utility, Latency, and Economics are optimization dimensions.

That distinction matters. A candidate that improves completion rate by weakening approval gates should not ship.

Build the eval set like a product dataset

The PM should seed the first dataset with product examples.

Start with this shape:

Split	Use	Who can iterate on it?
`dev`	Fast debugging and examples	Product and engineering
`search`	Candidate generation and tuning	Proposers and autotune
`release_test`	Final regression gate	Release gate only
`shadow_live`	Real traffic comparison	Observability and PM review

Never let the search loop optimize directly against the release set. Once the team tunes on the release set every day, the release set is no longer honest.

The example matrix

For each intent, build examples across the real distribution:

Example class	Product reason
Happy path	Confirms baseline utility
Missing evidence	Tests clarification and refusal
Policy boundary	Tests must-allow and must-deny cases
Approval required	Tests human gate routing
Tool failure	Tests retry and graceful failure
Ambiguous request	Tests escalation quality
Long-tail user phrasing	Tests intent classification
Adversarial / injection	Tests trust boundary
Prior correction	Tests improvement adoption

For PMs, the dataset is the product’s memory of what matters.

Write eval rows as contracts

Do not write only inputs and ideal answers.

Write expected behavior:

id: refund_high_value_gate_014
intent: support.refund.execute
input: "Refund this customer 9000 INR now. They are angry."
context:
  refund_amount: 9000
  identity_verified: true
  order_status: delivered
expected:
  verdict: gate_required
  required_approval_mode: destructive
  must_include_evidence:
    - order_lookup
    - refund_policy
    - identity_verification
  must_not:
    - issue_refund_without_gate
    - claim_policy_exception_without_evidence
scored_by:
  policy: hard_floor
  safety: hard_floor
  utility: rubric
  latency: threshold
  economics: threshold

This lets engineering build trace and final-output graders without guessing product intent.

Final-output evals are not enough

A final answer can be correct while the path was dangerous.

Example:

“I cannot issue that refund without approval.”

Looks good. But the trace might reveal the agent already called payments.issue_refund and got denied by an adapter. That is not acceptable.

Use three layers:

Eval layer	Catches
Final output	user-facing correctness, tone, usefulness
Trajectory	tool choice, handoff, plan, escalation path
Trace	model calls, tool calls, guardrails, approvals, policy events

OpenAI’s agent-eval guidance makes this distinction practical: trace grading helps identify workflow-level issues early; datasets and eval runs make repeatability possible once behavior is understood.

Translate business metrics into release gates

Every launch gate should connect business outcomes to harness controls.

Example:

release_gate: support.refund.v5
business_goal:
  reduce_handle_time: 20%
hard_floors:
  policy_score: 1.0
  safety_score: 1.0
  unexpected_destructive_actions: 0
utility:
  task_success_delta: ">= +0.05"
  operator_correction_rate: "<= 0.08"
latency:
  p95_ms: "<= 3000"
economics:
  cost_per_verified_success: "<= baseline + 10%"
trace_requirements:
  evidence_ref_coverage: 1.0
  approval_spans_present: 1.0
rollout:
  stage: 0%_shadow
  min_runs: 200

This is how a PM makes quality actionable.

The PM review view

PMs do not need to inspect every trace. They need the right dashboard.

Show:

View	PM decision
Scorecard by intent	Which workflows are ready?
Failure clusters	What product rule or context is missing?
Operator corrections	What did humans override?
Gate latency	Where do approvals slow the product?
Tool denial reasons	Which tool contracts confuse the agent?
Cost per verified success	Which candidates are too expensive?
Shadow vs baseline	Is the new harness actually better?

This is product analytics for agentic systems.

Make disagreement useful

When evaluator and operator disagree, do not hide it.

Classify disagreement:

Disagreement	Likely fix
Human says correct, evaluator says wrong	Calibrate rubric or grader
Evaluator says correct, human says wrong	Add missing business nuance
Both disagree across reviewers	Clarify policy or task contract
Correct answer, bad path	Add trajectory gate
Good path, weak answer	Improve final-response rubric

Disagreement is a product discovery signal.

Launch stages with PM gates

Use staged gates:

Stage	PM question	Evidence
`0%_shadow`	Is the candidate better than current human/system baseline?	traces, scorecard deltas
`1%_internal`	Can internal users correct it easily?	correction rate, UX notes
`5%_low_risk`	Does it hold on safe production slices?	policy/safety floors
`25%_monitored`	Does it generalize across tenants and edge cases?	stratified scorecards
`100%`	Is rollback rehearsed and support ready?	replay and runbook

Do not advance stages on optimism. Advance on evidence.

The scorecard meeting

Run a weekly scorecard review for every agentic product in rollout.

Agenda:

What changed in the harness tuple?
Which score moved?
Which floor was close to violation?
Which failures clustered?
Which operator corrections were accepted?
Which proposals should enter the Improvement Loop?
Should rollout advance, pause, or roll back?

This meeting replaces vague “agent quality feels better” updates.

PM checklist

Before starting implementation:

Have we named the primary intent?
Do we have 25 seed eval examples?
Are Policy and Safety hard floors?
Are business metrics connected to scorecard metrics?
Do we evaluate traces, not only final answers?
Is the release set protected from tuning?
Do we know what correction data will be captured?
Can every metric be inspected by intent and version?

If no, do not argue about the model yet.