The worst way to judge an AI system is to watch one good demo.
Demos are useful for imagination. They are bad for trust.
Real AI work needs a scorecard.
The driving test analogy
You do not give someone a driving license because they drove one clean lap around an empty parking lot.
You test:
- traffic,
- parking,
- signs,
- night driving,
- unexpected pedestrians,
- bad weather,
- judgment,
- safe stopping.
AI agents need the same kind of test. One correct answer does not prove safe operation.
Five questions every scorecard should answer
| Question | Scorecard dimension |
|---|---|
| Did it follow rules? | Policy |
| Was it safe? | Safety |
| Did it complete the work? | Utility |
| Was it fast enough? | Latency |
| Was it worth the cost? | Economics |
This is the ContextOS Evaluation and Observability model.
The important idea is simple: one score is not enough.
Why one score fails
Imagine a support refund AI.
It could improve “customer satisfaction” by approving every refund. That is not good.
It could reduce cost by refusing every refund. That is not good.
It could be fast by skipping evidence. That is not good.
It could be safe by escalating everything. That is not useful.
A scorecard prevents one metric from hiding another.
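The idea can be sketched in a few lines of Python. Everything here is hypothetical (the field names, budgets, and thresholds are illustrative, not part of any real scorecard schema): the point is only that a run passes when every dimension passes, so no single metric can hide another.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    """One run, scored on all five dimensions (hypothetical schema)."""
    policy_violations: int   # Policy: rule breaches
    unsafe_claims: int       # Safety: unsupported statements
    task_completed: bool     # Utility: did the work get done?
    latency_ms: float        # Latency
    cost_usd: float          # Economics

def passes(card: Scorecard, latency_budget_ms: float = 5000,
           cost_budget_usd: float = 0.50) -> bool:
    # A single failing dimension fails the run.
    return (card.policy_violations == 0
            and card.unsafe_claims == 0
            and card.task_completed
            and card.latency_ms <= latency_budget_ms
            and card.cost_usd <= cost_budget_usd)

# A fast, cheap, "successful" run that broke policy still fails:
bad = Scorecard(policy_violations=1, unsafe_claims=0,
                task_completed=True, latency_ms=800, cost_usd=0.02)
print(passes(bad))  # False
```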
Start with examples
Non-technical teams can help create the first examples.
Collect:
| Example type | Why |
|---|---|
| Easy success | Shows the expected path |
| Missing information | Tests whether AI asks instead of guessing |
| Policy boundary | Tests rule-following |
| High-risk action | Tests approval gate |
| Tool failure | Tests graceful recovery |
| Ambiguous request | Tests escalation |
| Prior mistake | Tests whether the system learned |
Each example should say what good looks like.
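The example types above can be collected as a small suite. The inputs and field names below are invented for illustration; the structure is what matters: every example carries its type and a statement of what good looks like.

```python
# A minimal example suite (all inputs and expectations are hypothetical).
example_suite = [
    {"type": "easy_success",
     "input": "Refund order 123, it arrived broken.",
     "expect": "completes the standard refund path"},
    {"type": "missing_information",
     "input": "Refund my order.",
     "expect": "asks which order instead of guessing"},
    {"type": "policy_boundary",
     "input": "Refund this $5,000 order.",
     "expect": "escalates above the delegated threshold"},
    {"type": "tool_failure",
     "input": "Refund order 456.",
     "expect": "recovers gracefully when the order lookup fails"},
]

for example in example_suite:
    print(example["type"], "->", example["expect"])
```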
Write expected behavior
Bad example:
Customer asks for refund. AI should answer correctly.
Good example:

```yaml
input: "Refund my order. It was late."
expected:
  intent: support.refund.evaluate
  should:
    - look up order
    - check refund policy
    - ask for missing identity evidence if needed
    - explain result with policy evidence
  must_not:
    - issue refund without approval if amount is above threshold
    - invent delivery status
    - promise refund without policy support
```

This is understandable to business reviewers and useful to technical teams.
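A spec like this is checkable by a few lines of code. The sketch below assumes one simplification (that actions are compared as plain strings; a real harness would match on structured tool calls): it reports which expected steps were missed and which forbidden ones occurred.

```python
def check_run(actions: list[str], should: list[str],
              must_not: list[str]) -> dict:
    """Compare the actions an AI actually took against an example's expectations."""
    return {
        "missing": [a for a in should if a not in actions],
        "violations": [a for a in must_not if a in actions],
    }

result = check_run(
    actions=["look up order", "check refund policy", "invent delivery status"],
    should=["look up order", "check refund policy",
            "explain result with policy evidence"],
    must_not=["invent delivery status"],
)
print(result)
# {'missing': ['explain result with policy evidence'],
#  'violations': ['invent delivery status']}
```

A run passes only when both lists come back empty.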
Judge the path, not only the answer
AI can get the final answer right for the wrong reason.
So ask:
- What did it see?
- What did it decide?
- What tool did it use?
- What did the tool return?
- What policy applied?
- Why did it accept or escalate?
That path is called a trace.
OpenAI’s agent-eval guidance emphasizes traces for understanding workflow issues because final answers alone are too thin.
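A trace can be as simple as one record per step answering the questions above. The field names here are hypothetical, but they show why traces matter: a step that acted without a policy reference is visible in the trace even when the final answer looked right.

```python
# A minimal trace: one entry per step (all field names are illustrative).
trace = [
    {"saw": "refund request for order 789",
     "decided": "need delivery status before evaluating",
     "tool": "orders.lookup",
     "tool_returned": "delivered 6 days late",
     "policy": "refunds.late_delivery",
     "outcome": "accepted: within delegated threshold"},
]

def steps_without_policy(trace: list[dict]) -> list[dict]:
    """Flag steps that acted without citing a policy; answer-only evals miss these."""
    return [step for step in trace if not step.get("policy")]

print(steps_without_policy(trace))  # []
```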
Human corrections are gold
When a human corrects AI, capture the correction.
Not:
AI was wrong.
Capture:
```yaml
correction:
  what_ai_did: approved refund
  what_human_changed: escalated to finance
  why: refund amount exceeded delegated threshold
  evidence: policy section 4.2
  future_rule: high-value refund requires approval gate
```

That correction can improve the harness later.
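One way that improvement can happen, sketched below with invented field names: a captured correction is mechanically turned into a new regression example, so the same mistake is tested forever after.

```python
correction = {
    "what_ai_did": "approved refund",
    "what_human_changed": "escalated to finance",
    "why": "refund amount exceeded delegated threshold",
    "evidence": "policy section 4.2",
    "future_rule": "high-value refund requires approval gate",
}

def to_test_case(c: dict) -> dict:
    """Turn a captured correction into a regression example for the harness."""
    return {
        "type": "prior_mistake",
        "must_not": [c["what_ai_did"]],       # what the AI did wrongly
        "should": [c["what_human_changed"]],  # what the human did instead
        "evidence": c["evidence"],
    }

case = to_test_case(correction)
print(case["type"], case["must_not"])
```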
The launch gate
Before launch, decide what must be true.
Example:
| Gate | Required |
|---|---|
| Policy | no rule violations |
| Safety | no unsupported claims |
| Utility | fewer than 10% human corrections |
| Latency | p95 below workflow target |
| Economics | cost below value threshold |
| Trust | every high-risk action has receipt |
If a candidate fails the gate, it does not launch. That is discipline, not pessimism.
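The gate table above translates directly into code. The metric names and thresholds below are assumptions for illustration; the shape is the point: the function returns every failed gate, and a candidate launches only when the list is empty.

```python
def launch_gate(m: dict) -> list[str]:
    """Return the failed gates; an empty list means the candidate may launch."""
    failures = []
    if m["policy_violations"] > 0:
        failures.append("Policy")
    if m["unsupported_claims"] > 0:
        failures.append("Safety")
    if m["human_correction_rate"] >= 0.10:
        failures.append("Utility")
    if m["latency_p95_ms"] > m["latency_target_ms"]:
        failures.append("Latency")
    if m["cost_per_task"] > m["value_threshold"]:
        failures.append("Economics")
    if m["high_risk_without_receipt"] > 0:
        failures.append("Trust")
    return failures

metrics = {"policy_violations": 0, "unsupported_claims": 0,
           "human_correction_rate": 0.12, "latency_p95_ms": 2000,
           "latency_target_ms": 3000, "cost_per_task": 0.20,
           "value_threshold": 0.50, "high_risk_without_receipt": 0}
print(launch_gate(metrics))  # ['Utility']
```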
Business review questions
When reviewing an AI scorecard, ask:
- Were the examples realistic?
- Did we include hard cases?
- Did we test missing information?
- Did we test high-risk actions?
- Did humans review failures?
- Did we separate practice examples from final test examples?
- Did the AI leave receipts?
- Did the score improve for the right reason?
These questions are powerful even without technical details.
What good looks like
A good AI review says:
On 200 shadow runs, the system completed 82% without correction, escalated 13%, failed gracefully 5%, had zero policy violations, had zero unexpected high-risk actions, and reduced average handling time by 22%. The remaining failures cluster around missing bank verification evidence.
That is much better than:
The demo looked good.
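A report in that shape falls straight out of logged outcomes. A minimal sketch, assuming outcomes are recorded as simple labels per run:

```python
from collections import Counter

def shadow_report(outcomes: list[str]) -> dict:
    """Summarize shadow-run outcomes as percentages."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# 200 shadow runs, with the same split as the example review above:
outcomes = (["completed"] * 164 + ["escalated"] * 26
            + ["failed_gracefully"] * 10)
print(shadow_report(outcomes))
# {'completed': 82.0, 'escalated': 13.0, 'failed_gracefully': 5.0}
```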