Product management series
May 13, 2026
by Piyush · 18 min read

Product Managers: How to Think About and Build Complex Agentic Systems

ContextOS
Product Management
Agents
Harness Engineering
Multi-Agent Systems

Most product teams start an agent project with the wrong noun.

They say: “We need an agent.”

That sounds concrete, but it is usually too vague. A useful agent is not a personality, a chat box, or a workflow diagram with model calls between boxes. A useful agent is a controlled operating system for work: it knows what job it is doing, what evidence it may use, what tools it may touch, what counts as success, when it must ask for approval, and how the product improves after every failure.

The product manager’s job is not to “write the prompt.” The PM’s job is to define the work system.

The best analogy is an airport.

An airport is not “a plane.” It is flight plans, gates, control tower clearances, weather reports, ground crew, baggage routing, security, maintenance logs, incident review, and a flight recorder. Planes are the visible part. The airport system is what makes complex movement safe.

A complex agentic product is the same. The model is the plane. The harness is the airport.

In ContextOS, that harness is decomposed into five planes: Intelligence, Context, Decision, Action, and Trust. This post is a PM guide to using those constructs without losing the product thread.

The PM’s shift: from features to work systems

A normal feature spec says:

Let users ask questions about invoices.

An agentic systems spec says:

For finance.invoice.dispute, help an AP operator investigate an invoice mismatch, gather evidence from ERP and contract systems, propose a resolution, draft the supplier message, and execute the credit adjustment only after the right approval gate. Every run must produce a DecisionRecord, scorecard, and replay handle.

The second version is not just more detailed. It names the work, the authority boundary, the evidence, the tools, the risk, and the receipt.

That is the product manager’s new unit of design.

Research signal: what PMs should internalize

Current agent guidance is converging on five practical lessons:

  • Google’s People + AI Guidebook starts with user needs, defining success, data/evaluation, mental models, feedback/control, and graceful failure. That is the right PM order: product value before model architecture.
  • Anthropic’s building effective agents guidance separates predictable workflows from autonomous agents and recommends the simplest solution that works before adding agentic complexity.
  • Anthropic’s context engineering work treats context as finite and emphasizes just-in-time retrieval, compaction, structured notes, and subagents for long-horizon tasks.
  • OpenAI’s agent evals guidance starts with traces while behavior is still being understood, then moves to datasets and eval runs for repeatability.
  • OpenTelemetry’s GenAI spans are making model calls, tool definitions, and tool responses observable primitives. That matters because product quality must be debugged from the trajectory, not only the final answer.

Translated for PMs: do not spec an agent. Spec the job, the evidence, the authority, the scorecard, the failure modes, and the improvement loop.

PM rule
A complex agentic product is an airport, not a pilot.
The visible agent is only one actor. The product is the control tower, flight plan, gates, crews, receipts, scorecards, and incident review that make the work safe enough to run repeatedly.

ContextOS translation table for PMs

Use this table as the bridge between product language and harness language:

| PM question | ContextOS construct | Product artifact |
| --- | --- | --- |
| What job is the product doing? | Intent-Task Catalog | Intent map, workflow taxonomy |
| Who is asking, under what authority, with what budget? | RunContext and RunBudget | Runtime assumptions, delegation model, limits |
| What should the model know right now? | Context Pack and CompiledContext | Evidence policy, memory policy, context budget |
| What decisions must be made? | Decision Catalog and DecisionRecord | Decision specs, acceptance criteria, receipts |
| Which tools may affect the world? | Tool Gateway | Tool inventory, side-effect classes, API contracts |
| What requires approval? | Governance and approval modes | Risk matrix, human-in-the-loop policy |
| How do we know it worked? | Evaluation and Observability | Scorecards, trace review, eval sets |
| How does it improve? | Improvement Loop | Feedback loop, proposal queue, rollout process |

If a PRD does not answer these questions, engineering will invent the answers in code. That is how agent products become brittle.

Step 1: Choose a real work system, not a demo

Start with the work people already do. Do not start with the UI.

Pick a workflow with:

  • a clear business outcome,
  • recurring volume,
  • visible pain,
  • bounded authority,
  • evidence that can be gathered,
  • a real operator who can judge correctness,
  • failure modes you can name.

Bad first agent:

“An agent that helps with operations.”

Good first agent:

“For customer.onboarding.enterprise, coordinate contract review, KYC checks, workspace provisioning, billing setup, data migration, and kickoff scheduling for new enterprise customers, while escalating legal and finance exceptions.”

The second version can become a ContextOS intent. The first version is a slogan.

The automation versus augmentation decision

PMs must decide whether the agent should do the work, prepare the work, or coach the human through the work.

| Mode | Use when | ContextOS implication |
| --- | --- | --- |
| Assist | Human remains primary decision-maker | read_only tools, recommendation DecisionRecords |
| Draft | Agent prepares artifacts for review | local_write tools, explicit review gates |
| Delegate | Agent completes bounded work on user authority | delegated tools, user claims in RunContext |
| Execute | Agent performs high-risk side effects | destructive tools, approval gates and frozen evidence |

This decision should be made per intent, not per product. A finance agent may assist on tax classification, draft supplier emails, delegate invoice lookup, and require an approval gate before executing a destructive action like credit issuance.

Step 2: Map the existing workflow like an operations diagram

Before naming agents, draw the current workflow.

For each step, capture:

| Map item | PM asks | Harness consequence |
| --- | --- | --- |
| Actor | Who does this today? | Owner role and escalation path |
| Evidence | What facts do they inspect? | Context Pack buckets and evidence refs |
| System | Which tool or database is touched? | Tool Gateway manifest |
| Decision | What judgment is made? | Decision Spec and DecisionRecord field |
| Risk | What can go wrong? | Approval mode and policy bundle |
| Exception | When do they pause? | Critic verdict and escalation state |
| Feedback | How do they learn? | FeedbackStore and Improvement Loop |

Do this with operators in the room. The workflow map is not a diagram for executives; it is the raw material for the harness.

Step 3: Write the PRD as a harness contract

For complex systems, a PRD should not be a story about screens. It should be a contract for a governed runtime.

Use this skeleton:

intent: customer.onboarding.enterprise
user: implementation_manager
business_outcome: reduce time from signed contract to active workspace
mode: mixed_assist_delegate_execute
risk_class: destructive
must_never:
  - create production workspace without signed contract evidence
  - send supplier/customer email without human-visible draft
  - provision regulated-data tenant without compliance approval
success_metrics:
  utility:
    - onboarding_cycle_time_days
    - operator_correction_rate
    - customer_blocker_count
  trust:
    - evidence_coverage
    - approval_gate_honored_rate
    - audit_gap_rate
  economics:
    - tool_calls_per_onboarding
    - human_minutes_per_onboarding
launch_gate:
  shadow_runs: 50
  policy_floor: 1.0
  safety_floor: 1.0
  replay_drift: 0_unexpected_destructive_actions

This is still product work. It just uses the right shape for agentic systems.
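The launch_gate section of a contract like this can be enforced mechanically rather than by spreadsheet. A minimal sketch, assuming the field names from the skeleton above; the RunResult shape is a hypothetical illustration, not a ContextOS API:

```python
# Minimal launch-gate check, mirroring the PRD skeleton above.
# shadow_runs, policy_floor, and safety_floor come from the skeleton;
# the RunResult shape is a hypothetical illustration.
from dataclasses import dataclass

@dataclass
class RunResult:
    policy_score: float          # 1.0 = every policy rule honored
    safety_score: float          # 1.0 = no unsafe output
    unexpected_destructive: int  # destructive actions replay did not predict

def launch_gate_passes(runs, min_runs=50, policy_floor=1.0, safety_floor=1.0):
    """True only if enough shadow runs exist and every floor holds on every run."""
    if len(runs) < min_runs:
        return False
    return all(
        r.policy_score >= policy_floor
        and r.safety_score >= safety_floor
        and r.unexpected_destructive == 0
        for r in runs
    )
```

The point is not the code; it is that a launch gate written this way is checkable by anyone, not just the person who remembers the PRD.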

Step 4: Define intents before agents

A multi-agent system should not start with agent names. It should start with canonical intents.

Example:

| Intent | Description | Risk | First runtime shape |
| --- | --- | --- | --- |
| onboarding.intake | Gather contract, stakeholders, environment, constraints | read_only | Fixed workflow |
| onboarding.kyc_check | Validate KYC, compliance, and sanctions evidence | network | Workflow + critic |
| onboarding.workspace_provision | Create or configure tenant workspace | delegated | Plan/execute/critic |
| onboarding.billing_setup | Configure billing account and entitlements | destructive | Human-gated workflow |
| onboarding.customer_update | Draft status update for customer | local_write | Draft + review |

This is the Intent-Task Catalog. It is the PM’s flight schedule. Without it, every conversation becomes “what should the agent do now?” With it, the runtime can route, score, compare, and improve work by intent.
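In code, the catalog is just a small registry that routing consults before anything else runs. A sketch using the intents from the table above; the dict shape is a hypothetical illustration, not a ContextOS API:

```python
# A tiny Intent-Task Catalog sketch: canonical intent names mapped to
# risk class and first runtime shape, as in the table above.
CATALOG = {
    "onboarding.intake":              {"risk": "read_only",   "shape": "fixed_workflow"},
    "onboarding.kyc_check":           {"risk": "network",     "shape": "workflow_plus_critic"},
    "onboarding.workspace_provision": {"risk": "delegated",   "shape": "plan_execute_critic"},
    "onboarding.billing_setup":       {"risk": "destructive", "shape": "human_gated_workflow"},
    "onboarding.customer_update":     {"risk": "local_write", "shape": "draft_plus_review"},
}

def route(intent: str) -> dict:
    """Routing refuses uncataloged intents instead of guessing."""
    if intent not in CATALOG:
        raise KeyError(f"uncataloged intent: {intent}")
    return CATALOG[intent]
```

The useful property is the refusal: an intent that is not in the catalog is a product gap to be named, not a prompt to improvise around.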

Step 5: Decide whether you need multiple agents

Multi-agent architecture is useful when it creates real separation:

  • different context,
  • different tools,
  • different risk,
  • different owner,
  • different eval rubric,
  • parallel work that reduces time,
  • independent expertise that improves quality.

It is harmful when it is only an org chart made of prompts.

Use this decision rule:

| Split into a specialist agent when… | Keep it in one workflow when… |
| --- | --- |
| The subtask needs a different evidence set | The same context serves every step |
| The subtask can run in parallel | Steps are strictly sequential |
| The subtask has a different approval mode | Risk is uniform |
| The subtask needs a different evaluator | One scorecard covers the whole path |
| The subtask output is a typed artifact | It returns vague prose |
| The parent can reject or accept the result | The worker can mutate final state directly |

In ContextOS terms, specialist agents should be subagent lanes under Orchestration, not independent actors with unrestricted authority. The parent orchestrator owns the final DecisionRecord.

A practical multi-agent pattern

For enterprise onboarding:

Parent Orchestrator: onboarding.enterprise
  ├── Intake Agent: normalize customer facts and missing fields
  ├── Contract Agent: extract obligations and signed terms
  ├── Compliance Agent: KYC, risk flags, policy obligations
  ├── Provisioning Agent: workspace setup plan, no direct execution
  ├── Billing Agent: entitlement and invoice setup proposal
  └── Customer Comms Agent: draft updates, never send directly
 
Critic:
  verifies evidence, approvals, tool permissions, and final receipt

The parent is the control tower. Specialists are crews. A crew can inspect, prepare, and recommend. Only the tower clears the runway.
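The control-tower relationship above can be sketched in a few lines: lanes return typed artifacts, a critic verdict gates each one, and only the parent assembles the final result. Lane and critic behavior here are hypothetical illustrations, not the ContextOS implementation:

```python
# Parent-orchestrator sketch: specialist lanes inspect and recommend;
# the critic accepts or rejects each artifact; only the parent
# "control tower" produces the final record.
def orchestrate(lanes: dict, critic) -> dict:
    artifacts = {}
    for name, lane_fn in lanes.items():
        result = lane_fn()                # lanes never mutate final state
        verdict = critic(name, result)    # acceptance is owned by the critic
        if not verdict["accepted"]:
            return {"status": "escalated", "lane": name, "reason": verdict["reason"]}
        artifacts[name] = result
    return {"status": "complete", "artifacts": artifacts}
```

Notice that a single rejected lane escalates the whole run; the parent never papers over a failed specialist.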

Step 6: Specify context like a briefing packet

An agent with too little context guesses. An agent with too much context loses focus.

The PM should define the briefing packet for each intent:

| Context bucket | PM decision | Example |
| --- | --- | --- |
| Mission | What work is this run doing? | onboarding.workspace_provision |
| User and authority | Who is the operator and what can they delegate? | implementation manager, tenant admin claim |
| Evidence | What sources are allowed to support decisions? | signed contract, CRM, KYC result, SKU catalog |
| Policy | Which rules are active? | regulated data, export control, billing approval |
| Tools | Which capabilities are visible? | read contract, create workspace draft, request approval |
| Memory | What promoted facts can be recalled? | customer region, prior onboarding blockers |
| Budget | What are the limits? | max tool calls, max cost, max wall time |

That is the Context Pack. Think of it as the flight briefing. The Planner should not discover the runway rules halfway through takeoff.
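As a data shape, the briefing packet is small: one struct per run, with the buckets from the table above and a hard budget. Field names here are hypothetical illustrations, not the ContextOS schema:

```python
# Context Pack sketch: one briefing packet per run, with the buckets
# from the table above and a hard budget the runtime can enforce.
from dataclasses import dataclass, field

@dataclass
class ContextPack:
    mission: str                 # which intent this run serves
    operator: str                # who is asking, under what authority
    evidence_refs: list          # allowed sources, by reference
    policies: list               # active policy bundles
    visible_tools: list          # capabilities the planner may see
    promoted_memory: list = field(default_factory=list)
    max_tool_calls: int = 20

    def within_budget(self, tool_calls_so_far: int) -> bool:
        return tool_calls_so_far < self.max_tool_calls
```

Because the pack is compiled before the run, the PM decision "what may the model know" is reviewable as data, not buried in a prompt.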

Step 7: Make tools product surfaces

PMs often treat tools as engineering details. That is a mistake.

Tool design shapes product behavior. A vague tool produces vague action. A precise tool creates a safer product.

Every high-value tool should have:

| Tool field | PM-level meaning |
| --- | --- |
| Name | What job the tool exists for |
| Description | When to use it and when not to use it |
| Arguments | Which business facts must be known first |
| Result | What evidence the tool returns |
| Side effect | Whether it reads, drafts, writes, delegates, or destroys |
| Owner | Which team owns correctness and uptime |
| Approval mode | What clearance is required |
| Error states | What the agent should do when it fails |

The Tool Gateway is the boundary that turns “the agent wants to call an API” into a governed action. For PMs, it is the difference between a product feature and an operational liability.
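The gateway pattern reduces to one check before every call: look up the tool's declared side-effect class and approval mode, then allow, hold, or deny. The manifest fields mirror the table above; the tool names and approval rule are hypothetical illustrations:

```python
# Tool Gateway sketch: every call is checked against the tool's declared
# side-effect class before it runs. Unknown tools are denied outright.
MANIFEST = {
    "read_contract":          {"side_effect": "read_only",   "needs_approval": False},
    "create_workspace_draft": {"side_effect": "local_write",  "needs_approval": False},
    "issue_credit":           {"side_effect": "destructive",  "needs_approval": True},
}

def gateway_call(tool: str, approved: bool = False) -> str:
    spec = MANIFEST.get(tool)
    if spec is None:
        return "denied: unknown tool"
    if spec["needs_approval"] and not approved:
        return "pending_approval"
    return "allowed"
```

The product decision lives in the manifest; the runtime decision is a table lookup. That separation is what makes tool behavior reviewable by a PM.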

Step 8: Define the receipt before the answer

For complex agentic work, the final answer is not the artifact of record. The receipt is.

A DecisionRecord should answer:

  • What intent was handled?
  • Which context pack and policy versions were active?
  • What evidence was used?
  • Which tools were called?
  • Which approvals were required and obtained?
  • What did the Critic accept or reject?
  • What changed in the world?
  • Which trace can replay the run?

PMs should write acceptance criteria against the DecisionRecord, not only the UI.

Bad acceptance criterion:

The agent tells the user onboarding is complete.

Good acceptance criterion:

The agent emits a DecisionRecord showing signed-contract evidence, KYC pass, workspace provisioning result, billing entitlement result, customer email draft approval, no unresolved policy obligations, and replay handle.

If the receipt is correct, the UI can be improved later. If the receipt is missing, the product is not production-grade.
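The receipt can itself be a typed artifact with an acceptance check over it. A sketch; every field name here is a hypothetical illustration of the questions listed above, not the ContextOS DecisionRecord schema:

```python
# DecisionRecord sketch: a receipt that must answer the questions above
# before a run counts as done.
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    intent: str
    context_pack_version: str
    policy_version: str
    evidence_refs: list
    tool_calls: list
    approvals: list
    critic_verdict: str       # "accepted" or "rejected"
    world_changes: list
    replay_handle: str

def is_production_grade(rec: DecisionRecord) -> bool:
    """Acceptance criterion written against the receipt, not the UI."""
    return (
        bool(rec.evidence_refs)
        and rec.critic_verdict == "accepted"
        and bool(rec.replay_handle)
    )
```

Writing acceptance criteria as a predicate over the record keeps the bar objective: either the receipt answers the questions or the run is not done.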

Step 9: Build the scorecard before the prototype

Complex agents fail in more than one way. A single “quality” metric hides the important failures.

Use a five-axis scorecard:

| Score | PM question | Example metric |
| --- | --- | --- |
| Policy | Did it obey product and regulatory rules? | approval gate honored rate |
| Safety | Did it avoid harmful, unsupported, or private output? | unsupported claim rate |
| Utility | Did it complete the work? | operator correction rate |
| Latency | Did it finish within the work rhythm? | p95 onboarding run time |
| Economics | Did it create enough value for the cost? | cost per verified onboarding |

Then create three datasets:

| Dataset | Who can use it | Purpose |
| --- | --- | --- |
| dev | product + engineering | fast iteration and examples |
| search | proposer / autotune | candidate generation |
| release_test | release gate only | honest regression check |

The release set is sacred. If the team tunes against it every day, it stops being a release gate.
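The reason the scorecard has five axes is that they must be gated independently: a run passes only if every axis clears its floor, so a high utility score cannot hide a policy breach. A sketch; the floor values are hypothetical, not recommendations:

```python
# Five-axis scorecard sketch: per-axis floors, no averaging, so one
# strong axis can never mask a failing one. Floor values are illustrative.
FLOORS = {"policy": 1.0, "safety": 1.0, "utility": 0.8, "latency": 0.7, "economics": 0.6}

def scorecard_passes(scores: dict) -> bool:
    """A missing axis counts as 0.0, i.e. an automatic failure."""
    return all(scores.get(axis, 0.0) >= floor for axis, floor in FLOORS.items())
```

Averaging the axes into one "quality" number is exactly the mistake the scorecard exists to prevent.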

Step 10: Use traces to debug product behavior

When an agent fails, PMs need to ask a better question than “what did it answer?”

Ask:

  • Did we classify the right intent?
  • Did the context pack include the right evidence?
  • Did the Planner choose a sensible path?
  • Did the Critic reject a bad step early enough?
  • Did the Tool Gateway deny or require approval correctly?
  • Did the final DecisionRecord explain the outcome?

This is why Evaluation and Observability is a product capability. Trace review lets PMs identify whether the issue is user need, context, tool design, policy, orchestration, or model behavior.

Step 11: Roll out like operations, not SaaS copy

Do not launch a complex multi-agent system by flipping it on.

Use staged rollout:

| Stage | Product posture | PM reads |
| --- | --- | --- |
| 0%_shadow | Agent observes and produces receipts, no user impact | scorecard vs human baseline |
| 1%_internal | Internal users only | operator corrections, missing evidence |
| 5%_low_risk | Low-risk intents and tenants | policy and safety floors |
| 25%_monitored | Broader traffic with tail sampling | tenant cliffs, long-tail failures |
| 100% | Full release with pinned rollback | weekly drift review |

Every stage needs a kill switch: re-pin the prior harness tuple and stop the new path without redeploying the product.
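A kill switch is only real if the prior configuration is pinned and recoverable. A sketch of the re-pin mechanic, where the active harness is a versioned tuple; the tuple contents are hypothetical illustrations:

```python
# Kill-switch sketch: the active harness is the last entry in a pinned
# history of versioned tuples; rollback pops back to the prior tuple
# without a redeploy.
history = [("prompt_v1", "policy_v1", "tools_v1")]

def promote(new_tuple):
    """Stage a new harness tuple as the active configuration."""
    history.append(new_tuple)

def kill_switch():
    """Re-pin the prior harness tuple and abandon the new path."""
    if len(history) > 1:
        history.pop()
    return history[-1]
```

The property to test before launch is not "can we deploy fast" but "can we return to the last known-good tuple without engineering heroics."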

Step 12: Make feedback a first-class product loop

The product should improve from real work.

Every operator correction should become one of:

| Feedback signal | ContextOS primitive | Example |
| --- | --- | --- |
| “This was wrong” | FeedbackStore | correction tied to DecisionRecord |
| “This keeps happening” | InsightSynthesizer | recurring missing-evidence pattern |
| “Do this next time” | StrategyCompiler | planner rule proposal |
| “We need new knowledge” | ResearchQueue | knowledge patch request |
| “This can be tuned” | Autotune | context or prompt candidate |
| “This needs attention” | ChiefOfStaff | open-loop note |

The critical PM rule: feedback is not a Slack thread. Feedback is a product event with provenance, owner, and release path.
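"A product event with provenance, owner, and release path" can be made concrete: each signal becomes a structured record routed to one primitive from the table above. The signal keys and event shape are hypothetical illustrations:

```python
# Feedback-routing sketch: each operator signal becomes a structured
# event with provenance (the DecisionRecord it corrects) and an owner,
# routed to one ContextOS primitive from the table above.
ROUTES = {
    "wrong":          "FeedbackStore",
    "recurring":      "InsightSynthesizer",
    "next_time_rule": "StrategyCompiler",
    "new_knowledge":  "ResearchQueue",
    "tunable":        "Autotune",
    "attention":      "ChiefOfStaff",
}

def feedback_event(signal: str, decision_record_id: str, owner: str) -> dict:
    if signal not in ROUTES:
        raise ValueError(f"unrouted feedback signal: {signal}")
    return {
        "primitive": ROUTES[signal],
        "decision_record": decision_record_id,  # provenance
        "owner": owner,                         # accountable team
    }
```

A Slack thread has none of these three fields; that is the whole difference.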

The PM artifact stack

For a serious agentic product, the PM should own this artifact set:

| Artifact | Why it exists |
| --- | --- |
| Workflow map | Shows where real work, systems, people, and decisions live |
| Intent catalog | Gives every workflow a canonical name and risk class |
| Authority matrix | Defines automation, augmentation, delegation, approval |
| Context policy | Names what the agent may know and how it is compiled |
| Tool inventory | Exposes product-owned side effects and evidence sources |
| DecisionRecord spec | Defines the receipt for work done |
| Scorecard | Makes success multi-dimensional |
| Eval datasets | Keeps iteration honest |
| Rollout plan | Prevents big-bang failure |
| Feedback loop | Turns production corrections into harness improvements |

This is the agentic PRD.

The hard parts PMs must not outsource

PMs do not need to implement the Planner, Gateway, or Critic. But they must own the product decisions those systems enforce.

1. The cost of being wrong

False positives and false negatives are product decisions.

For an onboarding agent:

  • false positive: provision a workspace before compliance is clear,
  • false negative: block a good customer unnecessarily,
  • ambiguous case: route to human review with a useful evidence bundle.

The PM must decide which error is worse by intent and risk class.

2. The human role

Humans are not only fallback.

They may be:

  • approvers,
  • teachers,
  • exception handlers,
  • auditors,
  • policy owners,
  • customer-facing reviewers,
  • operators who correct the harness.

Design the human role explicitly. “Human-in-the-loop” is not enough.

3. The shape of trust

Trust is not “the model seems smart.”

Trust is:

  • the right context,
  • deterministic policy,
  • clear tool boundaries,
  • high-signal traces,
  • evidence-backed decisions,
  • reversible workflows where possible,
  • staged rollout,
  • a visible correction path.

That is why ContextOS puts governance, evaluation, replay, and improvement in the Trust plane.

A worked example: enterprise onboarding

Suppose the product goal is:

Reduce enterprise onboarding time from 14 days to 5 days without increasing compliance or billing errors.

Product framing

| Question | Answer |
| --- | --- |
| User | Implementation manager |
| Customer | Enterprise admin |
| Business outcome | Faster time to active workspace |
| High-risk actions | Workspace provisioning, billing setup, regulated-data enablement |
| Human approvals | Legal, finance, compliance |
| Primary metric | onboarding cycle time |
| Guardrails | no compliance bypass, no unapproved billing setup |

ContextOS blueprint

| Plane | Design |
| --- | --- |
| Intelligence | Customer graph, contract facts, prior onboarding memory |
| Context | Pack per intent: intake, KYC, provisioning, billing, comms |
| Decision | Parent orchestrator with specialist lanes |
| Action | Tool Gateway to CRM, contract repository, provisioning API, billing |
| Trust | Policy gates, scorecards, DecisionRecords, replay, feedback |

Multi-agent layout

| Agent lane | Owns | Cannot do |
| --- | --- | --- |
| Intake | normalize customer facts, missing fields | approve anything |
| Contract | extract obligations and signed terms | change contract terms |
| Compliance | evaluate KYC and regulated-data constraints | override policy |
| Provisioning | draft workspace setup plan | execute without delegated authority |
| Billing | draft entitlements and invoice setup | commit destructive billing changes |
| Comms | draft customer updates | send without approval |
| Critic | verify plan, evidence, tools, approvals | call external tools directly |

The parent orchestrator owns final synthesis. The Critic owns acceptance. The Gateway owns side effects. The DecisionRecord owns audit.

The 30-60-90 day plan

Days 0-30: prove one vertical slice

Ship one intent, one path, one receipt.

  • Map the current workflow.
  • Define one canonical intent.
  • Create the task template.
  • Compile a Context Pack.
  • Wire read-only tools.
  • Emit DecisionRecords.
  • Run shadow traces against real work.

Exit criteria: operators agree the receipts explain the work, even if the agent is not yet faster.

Days 31-60: add authority and gates

Move from advice to bounded action.

  • Add delegated or destructive tools only behind Gateway policy.
  • Add approval gates with frozen evidence snapshots.
  • Add Critic verdicts.
  • Build release evals.
  • Start low-risk internal rollout.

Exit criteria: the agent can safely complete bounded work with approvals and replay.

Days 61-90: expand to multi-agent lanes

Split only where specialization improves scorecard outcomes.

  • Add specialist lanes for separable work.
  • Add lane-specific context packs.
  • Add lane-specific evals.
  • Route production corrections into the Improvement Loop.
  • Start staged tenant rollout.

Exit criteria: the multi-agent system improves utility or latency without policy, safety, or audit regression.

PM launch checklist

Before launch, ask:

  • Can we name the intent and risk class?
  • Can we show the workflow map?
  • Can we show the Context Pack and evidence policy?
  • Can we list every tool and approval mode?
  • Can we inspect a DecisionRecord for a successful run?
  • Can we inspect a DecisionRecord for a denied or escalated run?
  • Can we replay a failed run?
  • Can we explain the scorecard?
  • Can we prove the release set was not used for tuning?
  • Can we roll back the harness tuple?
  • Can operators correct the system in a structured way?
  • Can a non-engineering reviewer understand why a high-risk action was allowed?

If the answer is no, the product is still a prototype.

What PMs should write in the roadmap

Do not write:

Q2: Build onboarding agent.

Write:

Q2:
  outcome: reduce enterprise onboarding cycle time from 14d to 5d
  slice_1:
    intent: onboarding.intake
    mode: assist
    success: 90% evidence completeness on shadow runs
  slice_2:
    intent: onboarding.workspace_provision
    mode: delegated_with_gate
    success: 0 policy violations, <10% operator correction rate
  slice_3:
    intent: onboarding.customer_update
    mode: draft
    success: 80% drafts accepted with minor edits
  trust:
    required: DecisionRecord, trace grading, release set, rollback tuple
  improvement:
    required: FeedbackStore, StrategyRule proposals, weekly review

This is legible to executives, operators, engineers, and auditors.

The simplest summary

For PMs, ContextOS is a way to keep agentic ambition tied to product reality.

  • The Intent Catalog keeps the product scope named.
  • The RunContext keeps authority and budget explicit.
  • The Context Pack keeps the model’s working memory honest.
  • The Planner / Executor / Critic keeps work bounded and debuggable.
  • The Tool Gateway keeps side effects governed.
  • The DecisionRecord keeps every run accountable.
  • The Scorecard keeps quality multi-dimensional.
  • The Rollout gate keeps launches reversible.
  • The Improvement Loop keeps learning from becoming folklore.

Product managers who learn these constructs do not become backend engineers. They become better owners of real AI products.

They stop asking “what can the agent do?”

They ask:

What work system are we building, what evidence governs it, what authority does it have, how will we know it worked, and how will it improve safely after launch?

That is the question complex agentic products need.
