Most product teams start an agent project with the wrong noun.
They say: “We need an agent.”
That sounds concrete, but it is usually too vague. A useful agent is not a personality, a chat box, or a workflow diagram with model calls between boxes. A useful agent is a controlled operating system for work: it knows what job it is doing, what evidence it may use, what tools it may touch, what counts as success, when it must ask for approval, and how the product improves after every failure.
The product manager’s job is not to “write the prompt.” The PM’s job is to define the work system.
The best analogy is an airport.
An airport is not “a plane.” It is flight plans, gates, control tower clearances, weather reports, ground crew, baggage routing, security, maintenance logs, incident review, and a flight recorder. Planes are the visible part. The airport system is what makes complex movement safe.
A complex agentic product is the same. The model is the plane. The harness is the airport.
In ContextOS, that harness is decomposed into five planes: Intelligence, Context, Decision, Action, and Trust. This post is a PM guide to using those constructs without losing the product thread.
The PM’s shift: from features to work systems
A normal feature spec says:
Let users ask questions about invoices.
An agentic systems spec says:
For finance.invoice.dispute, help an AP operator investigate an invoice mismatch, gather evidence from ERP and contract systems, propose a resolution, draft the supplier message, and execute the credit adjustment only after the right approval gate. Every run must produce a DecisionRecord, scorecard, and replay handle.
The second version is not just more detailed. It names the work, the authority boundary, the evidence, the tools, the risk, and the receipt.
That is the product manager’s new unit of design.
Research signal: what PMs should internalize
Current agent guidance is converging on five practical lessons:
- Google’s People + AI Guidebook starts with user needs, defining success, data/evaluation, mental models, feedback/control, and graceful failure. That is the right PM order: product value before model architecture.
- Anthropic’s “Building Effective Agents” guidance separates predictable workflows from autonomous agents and recommends the simplest solution that works before adding agentic complexity.
- Anthropic’s context engineering work treats context as finite and emphasizes just-in-time retrieval, compaction, structured notes, and subagents for long-horizon tasks.
- OpenAI’s agent evals guidance starts with traces while behavior is still being understood, then moves to datasets and eval runs for repeatability.
- OpenTelemetry’s GenAI semantic conventions make model calls, tool definitions, and tool responses observable primitives. That matters because product quality must be debugged from the trajectory, not only the final answer.
Translated for PMs: do not spec an agent. Spec the job, the evidence, the authority, the scorecard, the failure modes, and the improvement loop.
ContextOS translation table for PMs
Use this table as the bridge between product language and harness language:
| PM question | ContextOS construct | Product artifact |
|---|---|---|
| What job is the product doing? | Intent-Task Catalog | Intent map, workflow taxonomy |
| Who is asking, under what authority, with what budget? | RunContext and RunBudget | Runtime assumptions, delegation model, limits |
| What should the model know right now? | Context Pack and CompiledContext | Evidence policy, memory policy, context budget |
| What decisions must be made? | Decision Catalog and DecisionRecord | Decision specs, acceptance criteria, receipts |
| Which tools may affect the world? | Tool Gateway | Tool inventory, side-effect classes, API contracts |
| What requires approval? | Governance and approval modes | Risk matrix, human-in-the-loop policy |
| How do we know it worked? | Evaluation and Observability | Scorecards, trace review, eval sets |
| How does it improve? | Improvement Loop | Feedback loop, proposal queue, rollout process |
If a PRD does not answer these questions, engineering will invent the answers in code. That is how agent products become brittle.
Step 1: Choose a real work system, not a demo
Start with the work people already do. Do not start with the UI.
Pick a workflow with:
- a clear business outcome,
- recurring volume,
- visible pain,
- bounded authority,
- evidence that can be gathered,
- a real operator who can judge correctness,
- failure modes you can name.
Bad first agent:
“An agent that helps with operations.”
Good first agent:
“For customer.onboarding.enterprise, coordinate contract review, KYC checks, workspace provisioning, billing setup, data migration, and kickoff scheduling for new enterprise customers, while escalating legal and finance exceptions.”
The second version can become a ContextOS intent. The first version is a slogan.
The automation versus augmentation decision
PMs must decide whether the agent should do the work, prepare the work, or coach the human through the work.
| Mode | Use when | ContextOS implication |
|---|---|---|
| Assist | Human remains primary decision-maker | read_only tools, recommendation DecisionRecords |
| Draft | Agent prepares artifacts for review | local_write tools, explicit review gates |
| Delegate | Agent completes bounded work on user authority | delegated tools, user claims in RunContext |
| Execute | Agent performs high-risk side effects | destructive tools, approval gates and frozen evidence |
This decision should be made per intent, not per product. A finance agent may assist on tax classification, draft supplier emails, delegate invoice lookup, and execute credit issuance only behind a destructive-action approval gate.
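That per-intent authority matrix can be captured as configuration rather than prose. A minimal sketch, assuming hypothetical field names rather than a ContextOS-mandated schema:

```yaml
# Illustrative authority matrix for a finance agent; field names are assumptions.
finance.tax_classification:
  mode: assist
  tools: read_only
finance.supplier_email:
  mode: draft
  tools: local_write
  review_gate: human_visible_draft
finance.invoice_lookup:
  mode: delegate
  tools: delegated
  authority: user_claims_in_run_context
finance.credit_issuance:
  mode: execute
  tools: destructive
  approval: finance_manager
  evidence: frozen_snapshot
```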
Step 2: Map the existing workflow like an operations diagram
Before naming agents, draw the current workflow.
For each step, capture:
| Map item | PM asks | Harness consequence |
|---|---|---|
| Actor | Who does this today? | Owner role and escalation path |
| Evidence | What facts do they inspect? | Context Pack buckets and evidence refs |
| System | Which tool or database is touched? | Tool Gateway manifest |
| Decision | What judgment is made? | Decision Spec and DecisionRecord field |
| Risk | What can go wrong? | Approval mode and policy bundle |
| Exception | When do they pause? | Critic verdict and escalation state |
| Feedback | How do they learn? | FeedbackStore and Improvement Loop |
Do this with operators in the room. The workflow map is not a diagram for executives; it is the raw material for the harness.
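The map can be captured step by step in the same seven fields. A sketch of one step, with an illustrative schema:

```yaml
# One step from a workflow map; schema is illustrative, not a ContextOS contract.
step: validate_invoice_match
actor: ap_operator
evidence: [erp_invoice, purchase_order, supplier_contract]
system: erp
decision: approve_or_dispute_line_items
risk: incorrect_credit_adjustment
exception: amount_mismatch_over_threshold -> escalate_to_finance_lead
feedback: correction_logged_to_feedback_store
```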
Step 3: Write the PRD as a harness contract
For complex systems, a PRD should not be a story about screens. It should be a contract for a governed runtime.
Use this skeleton:
```yaml
intent: customer.onboarding.enterprise
user: implementation_manager
business_outcome: reduce time from signed contract to active workspace
mode: mixed_assist_delegate_execute
risk_class: destructive
must_never:
  - create production workspace without signed contract evidence
  - send supplier/customer email without human-visible draft
  - provision regulated-data tenant without compliance approval
success_metrics:
  utility:
    - onboarding_cycle_time_days
    - operator_correction_rate
    - customer_blocker_count
  trust:
    - evidence_coverage
    - approval_gate_honored_rate
    - audit_gap_rate
  economics:
    - tool_calls_per_onboarding
    - human_minutes_per_onboarding
launch_gate:
  shadow_runs: 50
  policy_floor: 1.0
  safety_floor: 1.0
  replay_drift: 0_unexpected_destructive_actions
```

This is still product work. It just uses the right shape for agentic systems.
Step 4: Define intents before agents
A multi-agent system should not start with agent names. It should start with canonical intents.
Example:
| Intent | Description | Risk | First runtime shape |
|---|---|---|---|
| onboarding.intake | Gather contract, stakeholders, environment, constraints | read_only | Fixed workflow |
| onboarding.kyc_check | Validate KYC, compliance, and sanctions evidence | network | Workflow + critic |
| onboarding.workspace_provision | Create or configure tenant workspace | delegated | Plan/execute/critic |
| onboarding.billing_setup | Configure billing account and entitlements | destructive | Human-gated workflow |
| onboarding.customer_update | Draft status update for customer | local_write | Draft + review |
This is the Intent-Task Catalog. It is the PM’s flight schedule. Without it, every conversation becomes “what should the agent do now?” With it, the runtime can route, score, compare, and improve work by intent.
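Each row of that table can live as a catalog entry the runtime reads. A minimal sketch; the field names are assumptions, not a fixed ContextOS schema:

```yaml
# One Intent-Task Catalog entry; exact schema will vary by implementation.
intent: onboarding.kyc_check
description: validate KYC, compliance, and sanctions evidence
risk_class: network
runtime_shape: workflow_plus_critic
owner: compliance_pm
evidence_sources: [kyc_provider, sanctions_list, crm]
scorecard: onboarding.kyc_check.v1
```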
Step 5: Decide whether you need multiple agents
Multi-agent architecture is useful when it creates real separation:
- different context,
- different tools,
- different risk,
- different owner,
- different eval rubric,
- parallel work that reduces time,
- independent expertise that improves quality.
It is harmful when it is only an org chart made of prompts.
Use this decision rule:
| Split into a specialist agent when… | Keep it in one workflow when… |
|---|---|
| The subtask needs a different evidence set | The same context serves every step |
| The subtask can run in parallel | Steps are strictly sequential |
| The subtask has a different approval mode | Risk is uniform |
| The subtask needs a different evaluator | One scorecard covers the whole path |
| The subtask output is a typed artifact | It returns vague prose |
| The parent can reject or accept the result | The worker can mutate final state directly |
In ContextOS terms, specialist agents should be subagent lanes under Orchestration, not independent actors with unrestricted authority. The parent orchestrator owns the final DecisionRecord.
A practical multi-agent pattern
For enterprise onboarding:
```
Parent Orchestrator: onboarding.enterprise
├── Intake Agent: normalize customer facts and missing fields
├── Contract Agent: extract obligations and signed terms
├── Compliance Agent: KYC, risk flags, policy obligations
├── Provisioning Agent: workspace setup plan, no direct execution
├── Billing Agent: entitlement and invoice setup proposal
└── Customer Comms Agent: draft updates, never send directly

Critic:
  verifies evidence, approvals, tool permissions, and final receipt
```

The parent is the control tower. Specialists are crews. A crew can inspect, prepare, and recommend. Only the tower clears the runway.
Step 6: Specify context like a briefing packet
An agent with too little context guesses. An agent with too much context loses focus.
The PM should define the briefing packet for each intent:
| Context bucket | PM decision | Example |
|---|---|---|
| Mission | What work is this run doing? | onboarding.workspace_provision |
| User and authority | Who is the operator and what can they delegate? | implementation manager, tenant admin claim |
| Evidence | What sources are allowed to support decisions? | signed contract, CRM, KYC result, SKU catalog |
| Policy | Which rules are active? | regulated data, export control, billing approval |
| Tools | Which capabilities are visible? | read contract, create workspace draft, request approval |
| Memory | What promoted facts can be recalled? | customer region, prior onboarding blockers |
| Budget | What are the limits? | max tool calls, max cost, max wall time |
That is the Context Pack. Think of it as the flight briefing. The Planner should not discover the runway rules halfway through takeoff.
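A briefing packet for one intent might be specified like the sketch below; the schema uses illustrative names, not a prescribed ContextOS format:

```yaml
# Illustrative Context Pack spec for one intent; buckets follow the table above.
context_pack: onboarding.workspace_provision.v3
mission: provision tenant workspace per signed contract
authority:
  operator: implementation_manager
  claims: [tenant_admin]
evidence:
  allowed: [signed_contract, crm_account, kyc_result, sku_catalog]
policy: [regulated_data, export_control, billing_approval]
tools: [read_contract, create_workspace_draft, request_approval]
memory: [customer_region, prior_onboarding_blockers]
budget:
  max_tool_calls: 40
  max_cost_usd: 5
  max_wall_time_s: 900
```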
Step 7: Make tools product surfaces
PMs often treat tools as engineering details. That is a mistake.
Tool design shapes product behavior. A vague tool produces vague action. A precise tool creates a safer product.
Every high-value tool should have:
| Tool field | PM-level meaning |
|---|---|
| Name | What job the tool exists for |
| Description | When to use it and when not to use it |
| Arguments | Which business facts must be known first |
| Result | What evidence the tool returns |
| Side effect | Whether it reads, drafts, writes, delegates, or destroys |
| Owner | Which team owns correctness and uptime |
| Approval mode | What clearance is required |
| Error states | What the agent should do when it fails |
The Tool Gateway is the boundary that turns “the agent wants to call an API” into a governed action. For PMs, it is the difference between a product feature and an operational liability.
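Put together, one Tool Gateway manifest entry might look like the following sketch; field names mirror the table above but are illustrative:

```yaml
# Sketch of one Tool Gateway manifest entry; schema is an assumption.
tool: create_workspace_draft
description: >
  Prepare a tenant workspace configuration for review.
  Do not use for production provisioning; that requires request_approval first.
arguments:
  contract_id: string   # signed contract must already be verified
  region: string
result: workspace_draft_ref    # returned as evidence for the DecisionRecord
side_effect: local_write
owner: provisioning_team
approval_mode: none_for_draft
on_error: return_error_evidence_and_escalate   # never retry destructively
```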
Step 8: Define the receipt before the answer
For complex agentic work, the final answer is not the artifact of record. The receipt is.
A DecisionRecord should answer:
- What intent was handled?
- Which context pack and policy versions were active?
- What evidence was used?
- Which tools were called?
- Which approvals were required and obtained?
- What did the Critic accept or reject?
- What changed in the world?
- Which trace can replay the run?
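Assembled, those answers become the receipt itself. A minimal DecisionRecord sketch, one field per question above; the schema is illustrative, not a ContextOS contract:

```yaml
# Illustrative DecisionRecord; field names are assumptions.
decision_record:
  intent: onboarding.workspace_provision
  context_pack: onboarding.workspace_provision.v3
  policy_bundle: onboarding_policies.v12
  evidence: [signed_contract_rev4, kyc_result_2024_118]
  tool_calls: [read_contract, create_workspace_draft, request_approval]
  approvals:
    required: [compliance]
    obtained: [compliance]
  critic: accepted
  world_changes: [workspace_draft_created]
  replay: trace://onboarding/runs/8842
```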
PMs should write acceptance criteria against the DecisionRecord, not only the UI.
Bad acceptance criterion:
The agent tells the user onboarding is complete.
Good acceptance criterion:
The agent emits a DecisionRecord showing signed-contract evidence, KYC pass, workspace provisioning result, billing entitlement result, customer email draft approval, no unresolved policy obligations, and replay handle.
If the receipt is correct, the UI can be improved later. If the receipt is missing, the product is not production-grade.
Step 9: Build the scorecard before the prototype
Complex agents fail in more than one way. A single “quality” metric hides the important failures.
Use a five-axis scorecard:
| Score | PM question | Example metric |
|---|---|---|
| Policy | Did it obey product and regulatory rules? | approval gate honored rate |
| Safety | Did it avoid harmful, unsupported, or private output? | unsupported claim rate |
| Utility | Did it complete the work? | operator correction rate |
| Latency | Did it finish within the work rhythm? | p95 onboarding run time |
| Economics | Did it create enough value for the cost? | cost per verified onboarding |
Then create three datasets:
| Dataset | Who can use it | Purpose |
|---|---|---|
| dev | product + engineering | fast iteration and examples |
| search | proposer / autotune | candidate generation |
| release_test | release gate only | honest regression check |
The release set is sacred. If the team tunes against it every day, it stops being a release gate.
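The split and the floors can be enforced in configuration rather than convention. A sketch, with assumed field names:

```yaml
# Sketch of an eval configuration enforcing the dataset split; names are illustrative.
datasets:
  dev:          {access: [product, engineering], use: iteration}
  search:       {access: [autotune],             use: candidate_generation}
  release_test: {access: [release_gate],         use: regression_check}
release_gate:
  dataset: release_test
  floors:
    policy: 1.0
    safety: 1.0
  block_release_if_used_for_tuning: true
```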
Step 10: Use traces to debug product behavior
When an agent fails, PMs need to ask a better question than “what did it answer?”
Ask:
- Did we classify the right intent?
- Did the context pack include the right evidence?
- Did the Planner choose a sensible path?
- Did the Critic reject a bad step early enough?
- Did the Tool Gateway deny or require approval correctly?
- Did the final DecisionRecord explain the outcome?
This is why Evaluation and Observability is a product capability. Trace review lets PMs identify whether the issue is user need, context, tool design, policy, orchestration, or model behavior.
Step 11: Roll out like an operations team, not a SaaS launch
Do not launch a complex multi-agent system by flipping it on.
Use staged rollout:
| Stage | Product posture | PM reads |
|---|---|---|
| 0%_shadow | Agent observes and produces receipts, no user impact | scorecard vs human baseline |
| 1%_internal | Internal users only | operator corrections, missing evidence |
| 5%_low_risk | Low-risk intents and tenants | policy and safety floors |
| 25%_monitored | Broader traffic with tail sampling | tenant cliffs, long-tail failures |
| 100% | Full release with pinned rollback | weekly drift review |
Every stage needs a kill switch: re-pin the prior harness tuple and stop the new path without redeploying the product.
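A staged rollout with its kill switch can be expressed as a single config. An illustrative sketch; stage and field names are assumptions:

```yaml
# Illustrative staged-rollout config with a rollback pin; stages follow the table.
rollout: onboarding.enterprise.v4
stages:
  - {traffic: "0%",   posture: shadow,        watch: scorecard_vs_human_baseline}
  - {traffic: "1%",   posture: internal_only, watch: operator_corrections}
  - {traffic: "5%",   posture: low_risk_only, watch: policy_safety_floors}
  - {traffic: "25%",  posture: tail_sampled,  watch: tenant_cliffs}
  - {traffic: "100%", posture: full,          watch: weekly_drift_review}
kill_switch:
  repin: onboarding.enterprise.v3   # prior known-good harness tuple
  requires_redeploy: false
```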
Step 12: Make feedback a first-class product loop
The product should improve from real work.
Every operator correction should become one of:
| Feedback signal | ContextOS primitive | Example |
|---|---|---|
| “This was wrong” | FeedbackStore | correction tied to DecisionRecord |
| “This keeps happening” | InsightSynthesizer | recurring missing-evidence pattern |
| “Do this next time” | StrategyCompiler | planner rule proposal |
| “We need new knowledge” | ResearchQueue | knowledge patch request |
| “This can be tuned” | Autotune | context or prompt candidate |
| “This needs attention” | ChiefOfStaff | open-loop note |
The critical PM rule: feedback is not a Slack thread. Feedback is a product event with provenance, owner, and release path.
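Concretely, that means a correction arrives as a structured event, not a message. A sketch of one such event, with assumed field names:

```yaml
# Sketch of a structured feedback event; field names are assumptions.
feedback_event:
  kind: correction                 # "this was wrong"
  decision_record: dr_8842         # provenance
  operator: implementation_manager
  correction: billing entitlement missed usage-based SKU
  owner: billing_pm
  release_path: strategy_rule_proposal -> eval -> staged_rollout
```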
The PM artifact stack
For a serious agentic product, the PM should own this artifact set:
| Artifact | Why it exists |
|---|---|
| Workflow map | Shows where real work, systems, people, and decisions live |
| Intent catalog | Gives every workflow a canonical name and risk class |
| Authority matrix | Defines automation, augmentation, delegation, approval |
| Context policy | Names what the agent may know and how it is compiled |
| Tool inventory | Exposes product-owned side effects and evidence sources |
| DecisionRecord spec | Defines the receipt for work done |
| Scorecard | Makes success multi-dimensional |
| Eval datasets | Keeps iteration honest |
| Rollout plan | Prevents big-bang failure |
| Feedback loop | Turns production corrections into harness improvements |
This is the agentic PRD.
The hard parts PMs must not outsource
PMs do not need to implement the Planner, Gateway, or Critic. But they must own the product decisions those systems enforce.
1. The cost of being wrong
False positives and false negatives are product decisions.
For an onboarding agent:
- false positive: provision a workspace before compliance is clear,
- false negative: block a good customer unnecessarily,
- ambiguous case: route to human review with a useful evidence bundle.
The PM must decide which error is worse by intent and risk class.
2. The human role
Humans are not only fallback.
They may be:
- approvers,
- teachers,
- exception handlers,
- auditors,
- policy owners,
- customer-facing reviewers,
- operators who correct the harness.
Design the human role explicitly. “Human-in-the-loop” is not enough.
3. The shape of trust
Trust is not “the model seems smart.”
Trust is:
- the right context,
- deterministic policy,
- clear tool boundaries,
- high-signal traces,
- evidence-backed decisions,
- reversible workflows where possible,
- staged rollout,
- a visible correction path.
That is why ContextOS puts governance, evaluation, replay, and improvement in the Trust plane.
A worked example: enterprise onboarding
Suppose the product goal is:
Reduce enterprise onboarding time from 14 days to 5 days without increasing compliance or billing errors.
Product framing
| Question | Answer |
|---|---|
| User | Implementation manager |
| Customer | Enterprise admin |
| Business outcome | Faster time to active workspace |
| High-risk actions | Workspace provisioning, billing setup, regulated-data enablement |
| Human approvals | Legal, finance, compliance |
| Primary metric | onboarding cycle time |
| Guardrails | no compliance bypass, no unapproved billing setup |
ContextOS blueprint
| Plane | Design |
|---|---|
| Intelligence | Customer graph, contract facts, prior onboarding memory |
| Context | Pack per intent: intake, KYC, provisioning, billing, comms |
| Decision | Parent orchestrator with specialist lanes |
| Action | Tool Gateway to CRM, contract repository, provisioning API, billing |
| Trust | Policy gates, scorecards, DecisionRecords, replay, feedback |
Multi-agent layout
| Agent lane | Owns | Cannot do |
|---|---|---|
| Intake | normalize customer facts, missing fields | approve anything |
| Contract | extract obligations and signed terms | change contract terms |
| Compliance | evaluate KYC and regulated-data constraints | override policy |
| Provisioning | draft workspace setup plan | execute without delegated authority |
| Billing | draft entitlements and invoice setup | commit destructive billing changes |
| Comms | draft customer updates | send without approval |
| Critic | verify plan, evidence, tools, approvals | call external tools directly |
The parent orchestrator owns final synthesis. The Critic owns acceptance. The Gateway owns side effects. The DecisionRecord owns audit.
The 30-60-90 day plan
Days 0-30: prove one vertical slice
Ship one intent, one path, one receipt.
- Map the current workflow.
- Define one canonical intent.
- Create the task template.
- Compile a Context Pack.
- Wire read-only tools.
- Emit DecisionRecords.
- Run shadow traces against real work.
Exit criteria: operators agree the receipts explain the work, even if the agent is not yet faster.
Days 31-60: add authority and gates
Move from advice to bounded action.
- Add delegated or destructive tools only behind Gateway policy.
- Add approval gates with frozen evidence snapshots.
- Add Critic verdicts.
- Build release evals.
- Start low-risk internal rollout.
Exit criteria: the agent can safely complete bounded work with approvals and replay.
Days 61-90: expand to multi-agent lanes
Split only where specialization improves scorecard outcomes.
- Add specialist lanes for separable work.
- Add lane-specific context packs.
- Add lane-specific evals.
- Route production corrections into the Improvement Loop.
- Start staged tenant rollout.
Exit criteria: the multi-agent system improves utility or latency without policy, safety, or audit regression.
PM launch checklist
Before launch, ask:
- Can we name the intent and risk class?
- Can we show the workflow map?
- Can we show the Context Pack and evidence policy?
- Can we list every tool and approval mode?
- Can we inspect a DecisionRecord for a successful run?
- Can we inspect a DecisionRecord for a denied or escalated run?
- Can we replay a failed run?
- Can we explain the scorecard?
- Can we prove the release set was not used for tuning?
- Can we roll back the harness tuple?
- Can operators correct the system in a structured way?
- Can a non-engineering reviewer understand why a high-risk action was allowed?
If the answer is no, the product is still a prototype.
What PMs should write in the roadmap
Do not write:
Q2: Build onboarding agent.
Write:
```yaml
Q2:
  outcome: reduce enterprise onboarding cycle time from 14d to 5d
  slice_1:
    intent: onboarding.intake
    mode: assist
    success: 90% evidence completeness on shadow runs
  slice_2:
    intent: onboarding.workspace_provision
    mode: delegated_with_gate
    success: 0 policy violations, <10% operator correction rate
  slice_3:
    intent: onboarding.customer_update
    mode: draft
    success: 80% drafts accepted with minor edits
  trust:
    required: DecisionRecord, trace grading, release set, rollback tuple
  improvement:
    required: FeedbackStore, StrategyRule proposals, weekly review
```

This is legible to executives, operators, engineers, and auditors.
The simplest summary
For PMs, ContextOS is a way to keep agentic ambition tied to product reality.
- The Intent Catalog keeps the product scope named.
- The RunContext keeps authority and budget explicit.
- The Context Pack keeps the model’s working memory honest.
- The Planner / Executor / Critic keeps work bounded and debuggable.
- The Tool Gateway keeps side effects governed.
- The DecisionRecord keeps every run accountable.
- The Scorecard keeps quality multi-dimensional.
- The Rollout gate keeps launches reversible.
- The Improvement Loop keeps learning from becoming folklore.
Product managers who learn these constructs do not become backend engineers. They become better owners of real AI products.
They stop asking “what can the agent do?”
They ask:
What work system are we building, what evidence governs it, what authority does it have, how will we know it worked, and how will it improve safely after launch?
That is the question complex agentic products need.
What to read next
ContextOS
- Harness Engineering
- Intent-Task Catalog
- Context Pack
- Orchestration
- Governance
- Evaluation and Observability
- Improvement Loop
Blog series
- Product Management series
- Agent Engineering series
- How to Develop an Agent with an Agent Harness, End to End
- Dataset-first agent engineering
- Scorecards over vibes
- Trace review is the agent debugger