Tag

Evaluation

14 essays tagged with Evaluation.

July 12, 2026·9 min read·Intermediate

Red-Team Agent Hijacking: Build a Security Eval Gate for Repeat Attacks

A practical agent-hijacking evaluation harness: scenario design, adaptive and repeated attempts, path-aware metrics, deterministic release gates, and production replay.

May 19, 2026·33 min read·Intermediate

Agent Harness: An Architectural Framework for Production AI Agents

A whitepaper on typed contracts, policy gates, traces, verification loops, and release control for production AI agents.

May 16, 2026·28 min read·Intermediate

ContextOS: A Research-Grounded Architecture for Governed Agent Runtimes

A research-grounded framing of ContextOS as a governed runtime for context, tools, memory, security, evaluation, replay, and optimization.

May 14, 2026·7 min read·Intermediate

Harness Improvement Loops Need Replayable Environments

Why harness improvement needs replayable episodes, bounded mutations, scorecards, source closure, and promotion gates.

May 13, 2026·4 min read·Beginner

How to Judge AI Work: Scorecards, Not Vibes

A practical guide for business teams evaluating AI agents with scorecards, examples, traces, human corrections, and launch gates instead of demos and vibes.

May 13, 2026·6 min read·Beginner

Scorecards Before Screens: Evals and Launch Gates for PMs Building Agents

A PM guide to defining agent quality with datasets, trace reviews, scorecards, release gates, and business metrics before building the agent UI.

May 12, 2026·11 min read·Intermediate

Autotune the Harness: Baking the Improvement Loop into ContextOS

How ContextOS treats autotune as a gated loop over traces, scorecards, replay sets, bounded candidates, approval, and rollout.

May 12, 2026·7 min read·Intermediate

Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents

A practical guide to golden sets, task distributions, corrected runs, held-out releases, and production slices for agent engineering.

May 12, 2026·13 min read·Intermediate

How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve

Why strong AI engineers build datasets, scorecards, traces, and improvement loops instead of treating agents as prompts plus tools.

May 12, 2026·6 min read·Intermediate

Scorecards Over Vibes: The Five Metrics That Keep Agents Honest

The five metrics that keep agents honest: policy, utility, latency, safety, and economics.

May 12, 2026·6 min read·Intermediate

Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer

How trace review grades the path, not just the answer, by inspecting context, plans, tools, guardrails, critic verdicts, and corrections.

May 9, 2026·18 min read·Intermediate

Agentic AI Systems Before and After ContextOS

A table-first guide to why agentic systems need bounded context, governed tools, typed decisions, replay, evaluation, and controlled improvement.

March 18, 2026·4 min read·Intermediate

Building a Reliability Reviewer Agent: 70 Lines Past the Compliance One

How to extend the reviewer pattern for reliability: timeouts, retries, idempotency, fallback behavior, and rollback declarations.

March 15, 2026·6 min read·Intermediate

Building a Compliance Reviewer Agent in 60 Lines and a Golden Set

How to build a compliance reviewer agent with a typed verdict envelope, rubric, golden set, and change-control queue.
