Most product teams start an agent project with the wrong noun.
They say: “We need an agent.”
That sounds concrete, but it is usually too vague. A useful agent is not a personality, a chat box, or a workflow diagram with model calls between boxes. A useful agent is a controlled operating system for work: it knows what job it is doing, what evidence it may use, what tools it may touch, what counts as success, when it must ask for approval, and how the product improves after every failure.
The product manager’s job is not to “write the prompt.” The PM’s job is to define the work system.
The best analogy is an airport.
An airport is not “a plane.” It is flight plans, gates, control tower clearances, weather reports, ground crew, baggage routing, security, maintenance logs, incident review, and a flight recorder. Planes are the visible part. The airport system is what makes complex movement safe.
A complex agentic product is the same. The model is the plane. The harness is the airport.
In ContextOS, that harness is decomposed into five planes: Intelligence, Context, Decision, Action, and Trust. This post is a PM guide to using those constructs without losing the product thread.
The PM’s shift: from features to work systems
A normal feature spec says:
Let users ask questions about invoices.
An agentic systems spec says:
For finance.invoice.dispute, help an AP operator investigate an invoice mismatch, gather evidence from ERP and contract systems, propose a resolution, draft the supplier message, and execute the credit adjustment only after the right approval gate. Every run must produce a DecisionRecord, scorecard, and replay handle.
The second version is not just more detailed. It names the work, the authority boundary, the evidence, the tools, the risk, and the receipt.
That is the product manager’s new unit of design.
Research signal: what PMs should internalize
Current agent guidance is converging on five practical lessons:
- Google’s People + AI Guidebook starts with user needs, defining success, data/evaluation, mental models, feedback/control, and graceful failure. That is the right PM order: product value before model architecture.
- Anthropic’s “Building Effective Agents” guidance separates predictable workflows from autonomous agents and recommends the simplest solution that works before adding agentic complexity.
- Anthropic’s context engineering work treats context as finite and emphasizes just-in-time retrieval, compaction, structured notes, and subagents for long-horizon tasks.
- OpenAI’s agent evals guidance starts with traces while behavior is still being understood, then moves to datasets and eval runs for repeatability.
- OpenTelemetry’s GenAI semantic conventions make model calls, tool definitions, and tool responses observable primitives. That matters because product quality must be debugged from the trajectory, not only the final answer.
Translated for PMs: do not spec an agent. Spec the job, the evidence, the authority, the scorecard, the failure modes, and the improvement loop.
ContextOS translation table for PMs
Use this table as the bridge between product language and harness language:
| PM question | ContextOS construct | Product artifact |
|---|---|---|
| What job is the product doing? | Intent-Task Catalog | Intent map, workflow taxonomy |
| Who is asking, under what authority, with what budget? | RunContext and RunBudget | Runtime assumptions, delegation model, limits |
| What should the model know right now? | Context Pack and CompiledContext | Evidence policy, memory policy, context budget |
| What decisions must be made? | Decision Catalog and DecisionRecord | Decision specs, acceptance criteria, receipts |
| Which tools may affect the world? | Tool Gateway | Tool inventory, side-effect classes, API contracts |
| What requires approval? | Governance and approval modes | Risk matrix, human-in-the-loop policy |
| How do we know it worked? | Evaluation and Observability | Scorecards, trace review, eval sets |
| How does it improve? | Improvement Loop | Feedback loop, proposal queue, rollout process |
If a PRD does not answer these questions, engineering will invent the answers in code. That is how agent products become brittle.
Step 1: Choose a real work system, not a demo
Start with the work people already do. Do not start with the UI.
Pick a workflow with:
- a clear business outcome,
- recurring volume,
- visible pain,
- bounded authority,
- evidence that can be gathered,
- a real operator who can judge correctness,
- failure modes you can name.
Bad first agent:
“An agent that helps with operations.”
Good first agent:
“For customer.onboarding.enterprise, coordinate contract review, KYC checks, workspace provisioning, billing setup, data migration, and kickoff scheduling for new enterprise customers, while escalating legal and finance exceptions.”
The second version can become a ContextOS intent. The first version is a slogan.
The automation versus augmentation decision
PMs must decide whether the agent should do the work, prepare the work, or coach the human through the work.
| Mode | Use when | ContextOS implication |
|---|---|---|
| Assist | Human remains primary decision-maker | read_only tools, recommendation DecisionRecords |
| Draft | Agent prepares artifacts for review | local_write tools, explicit review gates |
| Delegate | Agent completes bounded work on user authority | delegated tools, user claims in RunContext |
| Execute | Agent performs high-risk side effects | destructive tools, approval gates and frozen evidence |
This decision should be made per intent, not per product. A finance agent may assist on tax classification, draft supplier emails, delegate invoice lookup, and execute credit issuance only behind a destructive-action approval gate.
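That per-intent authority matrix can be captured as configuration rather than prose. A minimal sketch, assuming hypothetical field names rather than a ContextOS-mandated schema:

```yaml
# Illustrative authority matrix for a finance agent; field names are assumptions.
finance.tax_classification:
  mode: assist
  tools: read_only
finance.supplier_email:
  mode: draft
  tools: local_write
  review_gate: human_visible_draft
finance.invoice_lookup:
  mode: delegate
  tools: delegated
  authority: user_claims_in_run_context
finance.credit_issuance:
  mode: execute
  tools: destructive
  approval: finance_manager
  evidence: frozen_snapshot
```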
Step 2: Map the existing workflow like an operations diagram
Before naming agents, draw the current workflow.
For each step, capture:
| Map item | PM asks | Harness consequence |
|---|---|---|
| Actor | Who does this today? | Owner role and escalation path |
| Evidence | What facts do they inspect? | Context Pack buckets and evidence refs |
| System | Which tool or database is touched? | Tool Gateway manifest |
| Decision | What judgment is made? | Decision Spec and DecisionRecord field |
| Risk | What can go wrong? | Approval mode and policy bundle |
| Exception | When do they pause? | Critic verdict and escalation state |
| Feedback | How do they learn? | FeedbackStore and Improvement Loop |
Do this with operators in the room. The workflow map is not a diagram for executives; it is the raw material for the harness.
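The map can be captured step by step in the same seven fields. A sketch of one step, with an illustrative schema:

```yaml
# One step from a workflow map; schema is illustrative, not a ContextOS contract.
step: validate_invoice_match
actor: ap_operator
evidence: [erp_invoice, purchase_order, supplier_contract]
system: erp
decision: approve_or_dispute_line_items
risk: incorrect_credit_adjustment
exception: amount_mismatch_over_threshold -> escalate_to_finance_lead
feedback: correction_logged_to_feedback_store
```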
Step 3: Write the PRD as a harness contract
For complex systems, a PRD should not be a story about screens. It should be a contract for a governed runtime.
Use this skeleton:
```yaml
intent: customer.onboarding.enterprise
user: implementation_manager
business_outcome: reduce time from signed contract to active workspace
mode: mixed_assist_delegate_execute
risk_class: destructive
must_never:
  - create production workspace without signed contract evidence
  - send supplier/customer email without human-visible draft
  - provision regulated-data tenant without compliance approval
success_metrics:
  utility:
    - onboarding_cycle_time_days
    - operator_correction_rate
    - customer_blocker_count
  trust:
    - evidence_coverage
    - approval_gate_honored_rate
    - audit_gap_rate
  economics:
    - tool_calls_per_onboarding
    - human_minutes_per_onboarding
launch_gate:
  shadow_runs: 50
  policy_floor: 1.0
  safety_floor: 1.0
  replay_drift: 0_unexpected_destructive_actions
```

This is still product work. It just uses the right shape for agentic systems.
Step 4: Define intents before agents
A multi-agent system should not start with agent names. It should start with canonical intents.
Example:
| Intent | Description | Risk | First runtime shape |
|---|---|---|---|
| onboarding.intake | Gather contract, stakeholders, environment, constraints | read_only | Fixed workflow |
| onboarding.kyc_check | Validate KYC, compliance, and sanctions evidence | network | Workflow + critic |
| onboarding.workspace_provision | Create or configure tenant workspace | delegated | Plan/execute/critic |
| onboarding.billing_setup | Configure billing account and entitlements | destructive | Human-gated workflow |
| onboarding.customer_update | Draft status update for customer | local_write | Draft + review |
This is the Intent-Task Catalog. It is the PM’s flight schedule. Without it, every conversation becomes “what should the agent do now?” With it, the runtime can route, score, compare, and improve work by intent.
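Each row of that table can live as a catalog entry the runtime reads. A minimal sketch; the field names are assumptions, not a fixed ContextOS schema:

```yaml
# One Intent-Task Catalog entry; exact schema will vary by implementation.
intent: onboarding.kyc_check
description: validate KYC, compliance, and sanctions evidence
risk_class: network
runtime_shape: workflow_plus_critic
owner: compliance_pm
evidence_sources: [kyc_provider, sanctions_list, crm]
scorecard: onboarding.kyc_check.v1
```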
Step 5: Decide whether you need multiple agents
Multi-agent architecture is useful when it creates real separation:
- different context,
- different tools,
- different risk,
- different owner,
- different eval rubric,
- parallel work that reduces time,
- independent expertise that improves quality.
It is harmful when it is only an org chart made of prompts.
Use this decision rule:
| Split into a specialist agent when… | Keep it in one workflow when… |
|---|---|
| The subtask needs a different evidence set | The same context serves every step |
| The subtask can run in parallel | Steps are strictly sequential |
| The subtask has a different approval mode | Risk is uniform |
| The subtask needs a different evaluator | One scorecard covers the whole path |
| The subtask output is a typed artifact | It returns vague prose |
| The parent can reject or accept the result | The worker can mutate final state directly |
In ContextOS terms, specialist agents should be subagent lanes under Orchestration, not independent actors with unrestricted authority. The parent orchestrator owns the final DecisionRecord.
A practical multi-agent pattern
For enterprise onboarding:
```
Parent Orchestrator: onboarding.enterprise
├── Intake Agent: normalize customer facts and missing fields
├── Contract Agent: extract obligations and signed terms
├── Compliance Agent: KYC, risk flags, policy obligations
├── Provisioning Agent: workspace setup plan, no direct execution
├── Billing Agent: entitlement and invoice setup proposal
└── Customer Comms Agent: draft updates, never send directly

Critic:
  verifies evidence, approvals, tool permissions, and final receipt
```

The parent is the control tower. Specialists are crews. A crew can inspect, prepare, and recommend. Only the tower clears the runway.
Step 6: Specify context like a briefing packet
An agent with too little context guesses. An agent with too much context loses focus.
The PM should define the briefing packet for each intent:
| Context bucket | PM decision | Example |
|---|---|---|
| Mission | What work is this run doing? | onboarding.workspace_provision |
| User and authority | Who is the operator and what can they delegate? | implementation manager, tenant admin claim |
| Evidence | What sources are allowed to support decisions? | signed contract, CRM, KYC result, SKU catalog |
| Policy | Which rules are active? | regulated data, export control, billing approval |
| Tools | Which capabilities are visible? | read contract, create workspace draft, request approval |
| Memory | What promoted facts can be recalled? | customer region, prior onboarding blockers |
| Budget | What are the limits? | max tool calls, max cost, max wall time |
That is the Context Pack. Think of it as the flight briefing. The Planner should not discover the runway rules halfway through takeoff.
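A briefing packet for one intent might be specified like the sketch below; the schema uses illustrative names, not a prescribed ContextOS format:

```yaml
# Illustrative Context Pack spec for one intent; buckets follow the table above.
context_pack: onboarding.workspace_provision.v3
mission: provision tenant workspace per signed contract
authority:
  operator: implementation_manager
  claims: [tenant_admin]
evidence:
  allowed: [signed_contract, crm_account, kyc_result, sku_catalog]
policy: [regulated_data, export_control, billing_approval]
tools: [read_contract, create_workspace_draft, request_approval]
memory: [customer_region, prior_onboarding_blockers]
budget:
  max_tool_calls: 40
  max_cost_usd: 5
  max_wall_time_s: 900
```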
Step 7: Make tools product surfaces
PMs often treat tools as engineering details. That is a mistake.
Tool design shapes product behavior. A vague tool produces vague action. A precise tool creates a safer product.
Every high-value tool should have:
| Tool field | PM-level meaning |
|---|---|
| Name | What job the tool exists for |
| Description | When to use it and when not to use it |
| Arguments | Which business facts must be known first |
| Result | What evidence the tool returns |
| Side effect | Whether it reads, drafts, writes, delegates, or destroys |
| Owner | Which team owns correctness and uptime |
| Approval mode | What clearance is required |
| Error states | What the agent should do when it fails |
The Tool Gateway is the boundary that turns “the agent wants to call an API” into a governed action. For PMs, it is the difference between a product feature and an operational liability.
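Put together, one Tool Gateway manifest entry might look like the following sketch; field names mirror the table above but are illustrative:

```yaml
# Sketch of one Tool Gateway manifest entry; schema is an assumption.
tool: create_workspace_draft
description: >
  Prepare a tenant workspace configuration for review.
  Do not use for production provisioning; that requires request_approval first.
arguments:
  contract_id: string   # signed contract must already be verified
  region: string
result: workspace_draft_ref    # returned as evidence for the DecisionRecord
side_effect: local_write
owner: provisioning_team
approval_mode: none_for_draft
on_error: return_error_evidence_and_escalate   # never retry destructively
```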
Step 8: Define the receipt before the answer
For complex agentic work, the final answer is not the artifact of record. The receipt is.
A DecisionRecord should answer:
- What intent was handled?
- Which context pack and policy versions were active?
- What evidence was used?
- Which tools were called?
- Which approvals were required and obtained?
- What did the Critic accept or reject?
- What changed in the world?
- Which trace can replay the run?
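Assembled, those answers become the receipt itself. A minimal DecisionRecord sketch, one field per question above; the schema is illustrative, not a ContextOS contract:

```yaml
# Illustrative DecisionRecord; field names are assumptions.
decision_record:
  intent: onboarding.workspace_provision
  context_pack: onboarding.workspace_provision.v3
  policy_bundle: onboarding_policies.v12
  evidence: [signed_contract_rev4, kyc_result_2024_118]
  tool_calls: [read_contract, create_workspace_draft, request_approval]
  approvals:
    required: [compliance]
    obtained: [compliance]
  critic: accepted
  world_changes: [workspace_draft_created]
  replay: trace://onboarding/runs/8842
```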
PMs should write acceptance criteria against the DecisionRecord, not only the UI.
Bad acceptance criterion:
The agent tells the user onboarding is complete.
Good acceptance criterion:
The agent emits a DecisionRecord showing signed-contract evidence, KYC pass, workspace provisioning result, billing entitlement result, customer email draft approval, no unresolved policy obligations, and replay handle.
If the receipt is correct, the UI can be improved later. If the receipt is missing, the product is not production-grade.
Step 9: Build the scorecard before the prototype
Complex agents fail in more than one way. A single “quality” metric hides the important failures.
Use a five-axis scorecard:
| Score | PM question | Example metric |
|---|---|---|
| Policy | Did it obey product and regulatory rules? | approval gate honored rate |
| Safety | Did it avoid harmful, unsupported, or private output? | unsupported claim rate |
| Utility | Did it complete the work? | operator correction rate |
| Latency | Did it finish within the work rhythm? | p95 onboarding run time |
| Economics | Did it create enough value for the cost? | cost per verified onboarding |
Then create three datasets:
| Dataset | Who can use it | Purpose |
|---|---|---|
| dev | product + engineering | fast iteration and examples |
| search | proposer / autotune | candidate generation |
| release_test | release gate only | honest regression check |
The release set is sacred. If the team tunes against it every day, it stops being a release gate.
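The split and the floors can be enforced in configuration rather than convention. A sketch, with assumed field names:

```yaml
# Sketch of an eval configuration enforcing the dataset split; names are illustrative.
datasets:
  dev:          {access: [product, engineering], use: iteration}
  search:       {access: [autotune],             use: candidate_generation}
  release_test: {access: [release_gate],         use: regression_check}
release_gate:
  dataset: release_test
  floors:
    policy: 1.0
    safety: 1.0
  block_release_if_used_for_tuning: true
```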
Step 10: Use traces to debug product behavior
When an agent fails, PMs need to ask a better question than “what did it answer?”
Ask:
- Did we classify the right intent?
- Did the context pack include the right evidence?
- Did the Planner choose a sensible path?
- Did the Critic reject a bad step early enough?
- Did the Tool Gateway deny or require approval correctly?
- Did the final DecisionRecord explain the outcome?
This is why Evaluation and Observability is a product capability. Trace review lets PMs identify whether the issue is user need, context, tool design, policy, orchestration, or model behavior.
Step 11: Roll out like an operations team, not a SaaS launch
Do not launch a complex multi-agent system by flipping it on.
Use staged rollout:
| Stage | Product posture | PM reads |
|---|---|---|
| 0%_shadow | Agent observes and produces receipts, no user impact | scorecard vs human baseline |
| 1%_internal | Internal users only | operator corrections, missing evidence |
| 5%_low_risk | Low-risk intents and tenants | policy and safety floors |
| 25%_monitored | Broader traffic with tail sampling | tenant cliffs, long-tail failures |
| 100% | Full release with pinned rollback | weekly drift review |
Every stage needs a kill switch: re-pin the prior harness tuple and stop the new path without redeploying the product.
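A staged rollout with its kill switch can be expressed as a single config. An illustrative sketch; stage and field names are assumptions:

```yaml
# Illustrative staged-rollout config with a rollback pin; stages follow the table.
rollout: onboarding.enterprise.v4
stages:
  - {traffic: "0%",   posture: shadow,        watch: scorecard_vs_human_baseline}
  - {traffic: "1%",   posture: internal_only, watch: operator_corrections}
  - {traffic: "5%",   posture: low_risk_only, watch: policy_safety_floors}
  - {traffic: "25%",  posture: tail_sampled,  watch: tenant_cliffs}
  - {traffic: "100%", posture: full,          watch: weekly_drift_review}
kill_switch:
  repin: onboarding.enterprise.v3   # prior known-good harness tuple
  requires_redeploy: false
```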
Step 12: Make feedback a first-class product loop
The product should improve from real work.
Every operator correction should become one of:
| Feedback signal | ContextOS primitive | Example |
|---|---|---|
| “This was wrong” | FeedbackStore | correction tied to DecisionRecord |
| “This keeps happening” | InsightSynthesizer | recurring missing-evidence pattern |
| “Do this next time” | StrategyCompiler | planner rule proposal |
| “We need new knowledge” | ResearchQueue | knowledge patch request |
| “This can be tuned” | Autotune | context or prompt candidate |
| “This needs attention” | ChiefOfStaff | open-loop note |
The critical PM rule: feedback is not a Slack thread. Feedback is a product event with provenance, owner, and release path.
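Concretely, that means a correction arrives as a structured event, not a message. A sketch of one such event, with assumed field names:

```yaml
# Sketch of a structured feedback event; field names are assumptions.
feedback_event:
  kind: correction                 # "this was wrong"
  decision_record: dr_8842         # provenance
  operator: implementation_manager
  correction: billing entitlement missed usage-based SKU
  owner: billing_pm
  release_path: strategy_rule_proposal -> eval -> staged_rollout
```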
The PM artifact stack
For a serious agentic product, the PM should own this artifact set:
| Artifact | Why it exists |
|---|---|
| Workflow map | Shows where real work, systems, people, and decisions live |
| Intent catalog | Gives every workflow a canonical name and risk class |
| Authority matrix | Defines automation, augmentation, delegation, approval |
| Context policy | Names what the agent may know and how it is compiled |
| Tool inventory | Exposes product-owned side effects and evidence sources |
| DecisionRecord spec | Defines the receipt for work done |
| Scorecard | Makes success multi-dimensional |
| Eval datasets | Keeps iteration honest |
| Rollout plan | Prevents big-bang failure |
| Feedback loop | Turns production corrections into harness improvements |
This is the agentic PRD.
The hard parts PMs must not outsource
PMs do not need to implement the Planner, Gateway, or Critic. But they must own the product decisions those systems enforce.
1. The cost of being wrong
False positives and false negatives are product decisions.
For an onboarding agent:
- false positive: provision a workspace before compliance is clear,
- false negative: block a good customer unnecessarily,
- ambiguous case: route to human review with a useful evidence bundle.
The PM must decide which error is worse by intent and risk class.
2. The human role
Humans are not only fallback.
They may be:
- approvers,
- teachers,
- exception handlers,
- auditors,
- policy owners,
- customer-facing reviewers,
- operators who correct the harness.
Design the human role explicitly. “Human-in-the-loop” is not enough.
3. The shape of trust
Trust is not “the model seems smart.”
Trust is:
- the right context,
- deterministic policy,
- clear tool boundaries,
- high-signal traces,
- evidence-backed decisions,
- reversible workflows where possible,
- staged rollout,
- a visible correction path.
That is why ContextOS puts governance, evaluation, replay, and improvement in the Trust plane.
A worked example: enterprise onboarding
Suppose the product goal is:
Reduce enterprise onboarding time from 14 days to 5 days without increasing compliance or billing errors.
Product framing
| Question | Answer |
|---|---|
| User | Implementation manager |
| Customer | Enterprise admin |
| Business outcome | Faster time to active workspace |
| High-risk actions | Workspace provisioning, billing setup, regulated-data enablement |
| Human approvals | Legal, finance, compliance |
| Primary metric | onboarding cycle time |
| Guardrails | no compliance bypass, no unapproved billing setup |
ContextOS blueprint
| Plane | Design |
|---|---|
| Intelligence | Customer graph, contract facts, prior onboarding memory |
| Context | Pack per intent: intake, KYC, provisioning, billing, comms |
| Decision | Parent orchestrator with specialist lanes |
| Action | Tool Gateway to CRM, contract repository, provisioning API, billing |
| Trust | Policy gates, scorecards, DecisionRecords, replay, feedback |
Multi-agent layout
| Agent lane | Owns | Cannot do |
|---|---|---|
| Intake | normalize customer facts, missing fields | approve anything |
| Contract | extract obligations and signed terms | change contract terms |
| Compliance | evaluate KYC and regulated-data constraints | override policy |
| Provisioning | draft workspace setup plan | execute without delegated authority |
| Billing | draft entitlements and invoice setup | commit destructive billing changes |
| Comms | draft customer updates | send without approval |
| Critic | verify plan, evidence, tools, approvals | call external tools directly |
The parent orchestrator owns final synthesis. The Critic owns acceptance. The Gateway owns side effects. The DecisionRecord owns audit.
The 30-60-90 day plan
Days 0-30: prove one vertical slice
Ship one intent, one path, one receipt.
- Map the current workflow.
- Define one canonical intent.
- Create the task template.
- Compile a Context Pack.
- Wire read-only tools.
- Emit DecisionRecords.
- Run shadow traces against real work.
Exit criteria: operators agree the receipts explain the work, even if the agent is not yet faster.
Days 31-60: add authority and gates
Move from advice to bounded action.
- Add delegated or destructive tools only behind Gateway policy.
- Add approval gates with frozen evidence snapshots.
- Add Critic verdicts.
- Build release evals.
- Start low-risk internal rollout.
Exit criteria: the agent can safely complete bounded work with approvals and replay.
Days 61-90: expand to multi-agent lanes
Split only where specialization improves scorecard outcomes.
- Add specialist lanes for separable work.
- Add lane-specific context packs.
- Add lane-specific evals.
- Route production corrections into the Improvement Loop.
- Start staged tenant rollout.
Exit criteria: the multi-agent system improves utility or latency without policy, safety, or audit regression.
PM launch checklist
Before launch, ask:
- Can we name the intent and risk class?
- Can we show the workflow map?
- Can we show the Context Pack and evidence policy?
- Can we list every tool and approval mode?
- Can we inspect a DecisionRecord for a successful run?
- Can we inspect a DecisionRecord for a denied or escalated run?
- Can we replay a failed run?
- Can we explain the scorecard?
- Can we prove the release set was not used for tuning?
- Can we roll back the harness tuple?
- Can operators correct the system in a structured way?
- Can a non-engineering reviewer understand why a high-risk action was allowed?
If the answer is no, the product is still a prototype.
What PMs should write in the roadmap
Do not write:
Q2: Build onboarding agent.
Write:
```yaml
Q2:
  outcome: reduce enterprise onboarding cycle time from 14d to 5d
  slice_1:
    intent: onboarding.intake
    mode: assist
    success: 90% evidence completeness on shadow runs
  slice_2:
    intent: onboarding.workspace_provision
    mode: delegated_with_gate
    success: 0 policy violations, <10% operator correction rate
  slice_3:
    intent: onboarding.customer_update
    mode: draft
    success: 80% drafts accepted with minor edits
  trust:
    required: DecisionRecord, trace grading, release set, rollback tuple
  improvement:
    required: FeedbackStore, StrategyRule proposals, weekly review
```

This is legible to executives, operators, engineers, and auditors.
The simplest summary
For PMs, ContextOS is a way to keep agentic ambition tied to product reality.
- The Intent Catalog keeps the product scope named.
- The RunContext keeps authority and budget explicit.
- The Context Pack keeps the model’s working memory honest.
- The Planner / Executor / Critic keeps work bounded and debuggable.
- The Tool Gateway keeps side effects governed.
- The DecisionRecord keeps every run accountable.
- The Scorecard keeps quality multi-dimensional.
- The Rollout gate keeps launches reversible.
- The Improvement Loop keeps learning from becoming folklore.
Product managers who learn these constructs do not become backend engineers. They become better owners of real AI products.
They stop asking “what can the agent do?”
They ask:
What work system are we building, what evidence governs it, what authority does it have, how will we know it worked, and how will it improve safely after launch?
That is the question complex agentic products need.
What to read next
ContextOS
- Harness Engineering
- Intent-Task Catalog
- Context Pack
- Orchestration
- Governance
- Evaluation and Observability
- Improvement Loop
Blog series
- Product Management series
- Agent Engineering series
- How to Develop an Agent with an Agent Harness, End to End
- Dataset-first agent engineering
- Scorecards over vibes
- Trace review is the agent debugger