Multi-agent systems are easy to draw and hard to operate.
The common failure is an org chart made of prompts:
Research Agent -> Planning Agent -> Execution Agent -> QA Agent

It looks sophisticated. In practice it often creates more latency, murkier ownership, more context loss, and more places where nobody knows why the final answer happened.
Product managers need a stricter rule:
Add another agent only when the product needs a separate context, authority, tool surface, or scorecard.
Use the airport analogy again. A control tower does not create a “landing agent,” “runway agent,” and “weather agent” because the diagram looks cleaner. It separates responsibilities because the work has different data, timing, authority, and failure modes.
That is the ContextOS view of multi-agent products: a parent orchestrator, specialist lanes, a Critic, a Tool Gateway, and one final receipt.
Workflow, agent, or multi-agent?
Before designing a multi-agent product, decide the runtime shape.
| Shape | Product fit | PM warning |
|---|---|---|
| Single call | Short, low-risk answer | Do not overbuild |
| Fixed workflow | Known steps, predictable handoffs | Better than “agent” for many products |
| Planner / Executor / Critic | Adaptive tool use and recovery needed | Requires trace and budget discipline |
| Orchestrator + lanes | Parallel or specialized work creates measurable value | Must preserve one owner for final decision |
| Long-running session | Work spans hours, days, or systems | Requires checkpoints and progress contracts |
Anthropic’s effective-agent guidance makes the same practical distinction: workflows use predefined code paths, while agents dynamically direct tool use. The PM implication is simple: do not buy autonomy you cannot score.
The control tower pattern
In ContextOS, a complex multi-agent system should look like this:
- Parent Orchestrator: owns intent, RunContext, budget, and the final DecisionRecord
- Specialist Lanes: run bounded subtasks with scoped tools and context
- Critic: verifies plans, scores lane outputs, accepts or rejects synthesis
- Tool Gateway: enforces schemas, policy, approval modes, and audit
- DecisionRecord: records final outcome, evidence, approvals, trace, and replay handle

The parent orchestrator is the control tower. Specialist agents are crews. Crews can inspect and prepare. They do not clear the runway.
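The components above can be sketched as typed records. This is a minimal illustration, not a reference implementation; the field names (`RunContext`, `LaneOutput`, `DecisionRecord`, and their attributes) are assumptions drawn from the descriptions in this section.

```python
from dataclasses import dataclass

# Hypothetical types for the control-tower roles described above.

@dataclass
class RunContext:
    intent: str                 # e.g. "renewal.prepare_proposal"
    budget_tokens: int          # hard spend ceiling for this run
    allowed_tools: list[str]    # scoped tool surface for one lane

@dataclass
class LaneOutput:
    lane: str
    artifact: dict              # typed result, not a full transcript
    evidence_refs: list[str]    # every claim points at evidence

@dataclass
class DecisionRecord:
    outcome: str
    evidence: list[str]
    approvals: list[str]
    trace_id: str               # replay handle for audit

# Only the parent orchestrator mints the final DecisionRecord.
record = DecisionRecord(
    outcome="proposal_approved",
    evidence=["contract:123#clause-7"],
    approvals=["deal_desk"],
    trace_id="run-001",
)
```

The point of the types is the authority boundary: lanes emit `LaneOutput`, and only the parent assembles a `DecisionRecord`.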
When to split into specialist lanes
Use this test:
| Split condition | Example |
|---|---|
| Different evidence set | Contract review needs signed terms; billing needs SKU catalog |
| Different tool surface | Compliance can call screening tools; comms can draft emails |
| Different risk class | Intake is read-only; payment activation is destructive |
| Different evaluator | Legal accuracy and customer tone need different rubrics |
| Parallelizable work | KYC, contract extraction, and environment setup can run concurrently |
| Different owner | Legal, finance, support, and implementation have separate accountability |
If none of these are true, keep it in one workflow.
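The split test above can be made executable. This is a sketch under assumed names: the boundary fields are taken from the table, and `should_split` is a hypothetical helper, not an API.

```python
# Hypothetical predicate for the split test: a subtask earns its own lane
# only if at least one real boundary differs, or the work parallelizes.
BOUNDARIES = ("evidence_set", "tool_surface", "risk_class", "evaluator", "owner")

def should_split(parent: dict, subtask: dict) -> bool:
    differs = any(parent.get(b) != subtask.get(b) for b in BOUNDARIES)
    return differs or subtask.get("parallelizable", False)

# Same evidence, tools, risk, evaluator, and owner: keep one workflow.
assert not should_split({"owner": "legal"}, {"owner": "legal"})
```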
Multi-agent product anti-patterns
| Anti-pattern | Why it fails | Better pattern |
|---|---|---|
| Agent per department | Mirrors org politics, not work boundaries | Intent and evidence-based lanes |
| Worker can mutate final state | No single accountable decision | Parent accepts worker output before effects |
| Every worker sees everything | Context bloat and leakage | Scoped Context Pack per lane |
| Agent debate without evidence | More tokens, same uncertainty | Require evidence refs and Critic verdicts |
| No lane-specific evals | Cannot tell which specialist regressed | Score by lane and final outcome |
| Shared tool pool | Risk bleed across lanes | Tool Gateway per lane authority |
The PM should reject multi-agent diagrams that do not show authority, evidence, and final ownership.
Worked example: enterprise renewal desk
Goal:
Help account teams prepare, approve, and send enterprise renewal proposals.
The naive product idea:
A renewal agent that handles renewals.
The control tower version:
| Lane | Job | Context | Tools | Risk |
|---|---|---|---|---|
| Account Intake | Normalize account, renewal date, owners | CRM, account notes | read CRM | read_only |
| Usage Analyst | Analyze adoption and expansion signals | product analytics | query metrics | network |
| Contract Reviewer | Extract terms, renewal clauses, restrictions | contract repo | read contracts | read_only |
| Pricing Specialist | Draft pricing options | price book, discount policy | create quote draft | local_write |
| Risk Reviewer | Identify churn, legal, and finance risks | history, exceptions | policy eval | network |
| Comms Drafter | Draft customer-facing renewal narrative | approved facts | draft email | local_write |
| Deal Desk Gate | Approve discount or non-standard terms | full packet | approval gate | destructive |
The parent orchestrator owns the renewal packet and final DecisionRecord.
The PM spec for each lane
Each specialist lane needs a mini-spec:
```yaml
lane: pricing_specialist
parent_intent: renewal.prepare_proposal
mission: draft pricing options and discount rationale
context_pack:
  required:
    - account_tier
    - current_contract_value
    - usage_trend
    - approved_price_book
    - discount_policy
tools:
  allowed:
    - pricebook.lookup
    - quote.create_draft
  denied:
    - quote.send_to_customer
approval_mode: local_write
output:
  type: pricing_recommendation
  fields:
    - recommended_package
    - discount_percent
    - rationale
    - evidence_refs
evals:
  - discount_policy_compliance
  - margin_floor_preserved
  - rationale_evidence_coverage
```

If a lane cannot be specified this way, it is not ready to be a separate agent.
Parent orchestration rules
The parent orchestrator should have rules like:
- It may spawn lanes only from approved task templates.
- It must pass each lane a scoped RunContext.
- It must set lane budgets.
- It must reject lane outputs without required evidence refs.
- It must not let lane outputs directly produce side effects.
- It must synthesize one final plan.
- It must produce one final DecisionRecord.
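Several of these rules reduce to an acceptance gate the parent runs on every lane output. A sketch under assumed output shape (`evidence_refs`, `side_effects` keys are illustrative, not a standard schema):

```python
# Sketch of the parent's acceptance gate: lane outputs without evidence
# refs, or that attempted side effects, are rejected before synthesis.
def accept_lane_output(output: dict, budget_remaining: int) -> tuple[bool, str]:
    if budget_remaining <= 0:
        return False, "budget_exhausted"
    if not output.get("evidence_refs"):
        return False, "missing_evidence_refs"
    if output.get("side_effects"):
        return False, "lanes_may_not_produce_side_effects"
    return True, "accepted"

# A draft with no evidence refs is rejected, not silently merged.
ok, reason = accept_lane_output({"artifact": {"discount_percent": 12}},
                                budget_remaining=500)
```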
This is orchestration, not "coordination by vibes."
The Critic is the product safety net
The Critic is not a “QA agent” bolted on at the end.
It verifies:
| Check | Product question |
|---|---|
| Plan validity | Is this path allowed for the intent? |
| Evidence sufficiency | Do we have the facts needed to decide? |
| Tool authorization | Are these tools allowed for this RunContext? |
| Approval mode | Is the right gate required before side effects? |
| Lane quality | Did each specialist return a typed, usable result? |
| Final receipt | Does the DecisionRecord explain the work? |
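Each row in the table above can become an executable check. A hedged sketch: `critic_verdict` and the plan/context field names are assumptions chosen to mirror the table, not a real Critic API.

```python
# Hypothetical Critic pass: each product question becomes a check, and an
# empty failure list means the synthesis is accepted.
def critic_verdict(plan: dict, run_context: dict) -> list[str]:
    failures = []
    if plan["intent"] not in run_context["allowed_intents"]:
        failures.append("plan_validity")
    if not plan.get("evidence_refs"):
        failures.append("evidence_sufficiency")
    unauthorized = set(plan.get("tools", [])) - set(run_context["allowed_tools"])
    if unauthorized:
        failures.append("tool_authorization")
    if plan.get("has_side_effects") and not plan.get("approval_gate"):
        failures.append("approval_mode")
    return failures  # empty list == accept
```

This is why the Critic is where acceptance criteria become executable: a PM-written rubric row maps one-to-one onto a check like the ones above.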
For PMs, the Critic is where many acceptance criteria become executable.
Context management for multi-agent products
Do not share one giant prompt across all agents.
Use per-lane context:
| Context strategy | PM meaning |
|---|---|
| Up-front briefing | Stable mission, policy, owner, output shape |
| Just-in-time retrieval | Let lane fetch specific evidence when needed |
| Compaction | Preserve decisions and open questions, drop raw chatter |
| Structured notes | Persist progress outside the context window |
| Parent summary | Return typed output, not full lane transcript |
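Per-lane scoping can be sketched directly. The evidence store and field names here are hypothetical; the point is that a lane's Context Pack is built from its spec's `required` list, never from a shared transcript.

```python
# Illustrative shared evidence store; a lane never sees all of it.
EVIDENCE_STORE = {
    "account_tier": "enterprise",
    "usage_trend": "expanding",
    "contract_text": "...",   # other lanes' evidence stays out of scope
}

def build_context_pack(required: list[str], briefing: str) -> dict:
    """Scoped Context Pack: stable briefing plus only the required evidence."""
    return {
        "briefing": briefing,
        "evidence": {k: EVIDENCE_STORE[k] for k in required},
    }

pack = build_context_pack(
    ["account_tier", "usage_trend"],
    briefing="Draft pricing options per discount policy.",
)
# "contract_text" is absent from the pack: scoped context, not a shared prompt.
```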
This follows the practical lesson from context engineering: context is finite and should be treated as an attention budget.
Product metrics for multi-agent systems
Do not only measure final task success.
Measure the system shape:
| Metric | Why it matters |
|---|---|
| Lane spawn rate | Detects unnecessary decomposition |
| Lane acceptance rate | Shows whether specialists produce useful artifacts |
| Parent rejection reasons | Reveals unclear lane contracts |
| Cross-lane contradiction rate | Shows context or policy conflicts |
| Tool denial rate by lane | Reveals authority mismatch |
| Critical path latency | Measures whether parallelism actually helps |
| Final DecisionRecord completeness | Determines audit readiness |
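Most of these metrics fall out of the trace log. A minimal sketch, assuming a hypothetical flat event format with one record per lane attempt:

```python
# Hypothetical trace events: one record per lane output the parent judged.
events = [
    {"lane": "pricing", "accepted": True},
    {"lane": "pricing", "accepted": False},
    {"lane": "risk",    "accepted": True},
]

def acceptance_rate(events: list[dict], lane: str) -> float:
    """Lane acceptance rate: share of a lane's outputs the parent accepted."""
    hits = [e for e in events if e["lane"] == lane]
    return sum(e["accepted"] for e in hits) / len(hits)
```

A lane whose acceptance rate stays low is the signal from the table above: the lane contract is unclear, and the parent's rejection reasons say how.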
If multi-agent architecture does not improve utility, latency, or risk control, remove it.
Rollout path
Roll out multi-agent systems by lanes:
- Shadow the parent workflow with no lane side effects.
- Enable one read-only lane.
- Add lane-specific scorecards.
- Enable parallel lanes only after trace review shows value.
- Add delegated actions behind approval gates.
- Add destructive paths last, with rollback rehearsed.
The safest launch is not “all agents on.” It is “one lane earns trust at a time.”
PM checklist
Before approving a multi-agent design, ask:
- Why is a fixed workflow not enough?
- Which lanes have different context, tools, risk, or evals?
- Who owns each lane?
- What typed artifact does each lane return?
- Can the parent reject a lane output?
- Which lane can create side effects?
- Which approval gates apply?
- What is the final DecisionRecord?
- Which trace shows the full parent/child path?
- What metric proves multi-agent is better than single-agent?
If the diagram cannot answer these questions, it is not architecture. It is decoration.