Agent engineering series
How strong AI engineers build agents with datasets, scorecards, traces, and harness improvement loops.
How to Develop an Agent with an Agent Harness, End to End
A field guide to building production agents as measurable harnesses: intent, RunContext, Context Pack compilation, Planner-Critic-Executor, Tool Gateway, DecisionRecords, trace evals, staged rollout, and the improvement loop.
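As a rough sketch of how those pieces might compose: RunContext and DecisionRecord are the article's terms, but every field name below, and the stubbed planner, Tool Gateway, and critic calls, are illustrative assumptions rather than ContextOS APIs.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RunContext:
    intent: str                 # classified user intent; selects the Context Pack
    context_pack: list[str]     # compiled snippets/documents for this run
    harness_version: str        # the versioned harness artifact serving the run

@dataclass
class DecisionRecord:
    step: str                   # "plan" | "tool_call" | "critic" | ...
    inputs: dict[str, Any]
    output: Any
    verdict: str | None = None  # critic verdict, when the step has one

def run(ctx: RunContext, task: str) -> list[DecisionRecord]:
    """Planner-Critic-Executor loop in which every stage emits a DecisionRecord."""
    trace: list[DecisionRecord] = []
    plan = f"plan({ctx.intent}): {task}"      # stand-in for a planner model call
    trace.append(DecisionRecord("plan", {"task": task}, plan))
    result = f"executed: {plan}"              # stand-in for a Tool Gateway call
    trace.append(DecisionRecord("tool_call", {"plan": plan}, result))
    verdict = "accept"                        # stand-in for a critic model call
    trace.append(DecisionRecord("critic", {"result": result}, result, verdict))
    return trace
```

The point of the shape, not the stubs: every stage writes a DecisionRecord, so the trace evals and staged rollout described later have something concrete to grade.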
How Great AI Engineers Build Agents: Datasets, Scores, and Harnesses That Improve
Great agent engineers do not treat an agent as a prompt plus tools. They build datasets, scorecards, traces, and improvement loops, then treat the harness itself as a versioned artifact that can be measured and improved like a model.
Dataset-First Agent Engineering: The Golden Sets Behind Reliable Agents
Great agent teams build datasets before they tune prompts. A practical guide to task distributions, golden sets, corrected runs, held-out release sets, and production slices for ContextOS-style harness engineering.
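One way "held-out release sets and production slices" could look in code; the dataclass fields and the split routine are assumptions for illustration, not a ContextOS API.

```python
import random
from dataclasses import dataclass

@dataclass
class GoldenExample:
    task: str        # input as observed in production
    slice: str       # production slice it came from, e.g. "refunds"
    expected: str    # corrected answer approved during trace review

def split_release_set(examples: list[GoldenExample],
                      holdout_fraction: float = 0.2,
                      seed: int = 7) -> tuple[list[GoldenExample], list[GoldenExample]]:
    """Hold out a fixed release set per slice: dev is tuned on, release never is."""
    rng = random.Random(seed)  # fixed seed keeps the release set stable across runs
    by_slice: dict[str, list[GoldenExample]] = {}
    for ex in examples:
        by_slice.setdefault(ex.slice, []).append(ex)
    dev: list[GoldenExample] = []
    release: list[GoldenExample] = []
    for slice_examples in by_slice.values():
        rng.shuffle(slice_examples)
        k = max(1, int(len(slice_examples) * holdout_fraction))
        release.extend(slice_examples[:k])
        dev.extend(slice_examples[k:])
    return dev, release
```

Splitting per slice rather than globally keeps rare production slices represented in the release set instead of being washed out by the dominant ones.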
Scorecards Over Vibes: The Five Metrics That Keep Agents Honest
Agent quality cannot be managed with one score. Production teams need scorecards that separate policy, safety, utility, latency, and economics, then gate harness changes by intent and version.
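A minimal sketch of a five-metric scorecard gated per intent. The five dimensions are the article's; the threshold numbers and the "refunds" intent are invented examples.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    policy: float             # compliance with business rules, 0..1
    safety: float             # 1 - rate of unsafe or leaky outputs, 0..1
    utility: float            # task success on the golden set, 0..1
    latency_p95_s: float      # 95th-percentile latency, seconds
    cost_usd_per_task: float  # economics: average spend per task

# Illustrative per-intent gates; real thresholds come from the team, not from here.
GATES = {"refunds": Scorecard(0.99, 0.995, 0.85, 8.0, 0.05)}

def passes_gate(intent: str, card: Scorecard) -> bool:
    """A harness change ships for an intent only if every metric clears its gate."""
    g = GATES[intent]
    return (card.policy >= g.policy and card.safety >= g.safety
            and card.utility >= g.utility
            and card.latency_p95_s <= g.latency_p95_s
            and card.cost_usd_per_task <= g.cost_usd_per_task)
```

Keeping the five numbers separate is the point: a single blended score lets a latency win hide a policy regression, and a per-intent gate catches it.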
Trace Review Is the Agent Debugger: Grade the Path, Not Just the Answer
Final-answer evals miss how agents actually fail. Trace review captures context, plans, tool calls, guardrails, critic verdicts, and corrections so teams can improve the harness where the failure really happened.
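One hedged sketch of "grade the path": walk the step records and attribute the run to the first failing stage. The record fields below are assumptions, not a fixed schema; the substance is only that each stage (context, plan, tool call, critic, correction) is graded separately.

```python
# A trace is a list of step records; field names here are illustrative.
trace = [
    {"step": "context", "ok": True},
    {"step": "plan", "ok": True},
    {"step": "tool_call", "tool": "search_orders", "ok": False, "error": "timeout"},
    {"step": "critic", "verdict": "reject"},
    {"step": "correction", "ok": True},
]

def first_failure(trace: list[dict]) -> str | None:
    """Attribute the run to the first failing stage, so the fix lands on the
    harness component that actually broke, not on the final answer."""
    for record in trace:
        if record.get("ok") is False or record.get("verdict") == "reject":
            return record["step"]
    return None

assert first_failure(trace) == "tool_call"  # a final-answer eval would miss this
```

In this trace the correction step salvaged the answer, so a final-answer eval scores the run as a pass; path grading still surfaces the tool timeout that needs fixing.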
Harness Candidates Are Model Checkpoints: How to Improve Agents Without Silent Mutation
If the harness is the real optimization target, then every prompt, retrieval, planner, tool, policy, and evaluator change should be treated like a checkpoint candidate: scored, reviewed, rolled out, and reversible.
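To make "checkpoint candidate" concrete, here is a minimal sketch: the field names and the single-number release score are simplifying assumptions (a real gate would use the per-intent scorecard above), but the frozen-config and compare-then-promote shape is the idea.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HarnessCandidate:
    """Frozen: once a candidate is scored, it cannot mutate silently."""
    version: str        # e.g. "harness-2024.06.1+planner-v3" (illustrative)
    prompt_id: str
    retriever_id: str
    planner_id: str
    policy_pack_id: str

def promote(incumbent: HarnessCandidate,
            candidate: HarnessCandidate,
            release_score: Callable[[HarnessCandidate], float]) -> HarnessCandidate:
    """Checkpoint-style promotion: the candidate ships only if it beats the
    incumbent on the held-out release set; rollback is keeping the incumbent."""
    if release_score(candidate) > release_score(incumbent):
        return candidate
    return incumbent
```

Because every component change produces a new immutable version, "scored, reviewed, rolled out, and reversible" falls out of the data model rather than depending on discipline.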