ContextOS Metrics Glossary
Unified metric glossary across the five planes — Intelligence, Context, Decision, Action, Trust.
This page is the stable metrics contract for ContextOS. It defines the names, dimensions, source artifacts, owners, and minimum scorecard that every implementation should keep consistent across the five planes.
It is not an exhaustive dashboard inventory. Teams can add local metrics, but release gates, incidents, and executive scorecards should roll up through the contract below.
Naming Conventions
Metric names use one namespace and one shape:
`contextos.<plane>.<component>.<signal>`

Examples:
| Metric | Meaning |
|---|---|
| contextos.intelligence.gateway.request_duration_ms | AI Gateway request latency. |
| contextos.context.pack.evidence_coverage_rate | Share of required evidence represented in the compiled Context Pack. |
| contextos.decision.plan.validity_rate | Share of generated plans passing structural validation. |
| contextos.action.tool.success_rate | Share of tool calls returning a completed ToolResultEnvelope. |
| contextos.trust.trace.completeness_rate | Share of runs with the required trace and audit artifacts. |
Rules:
- Use lowercase snake case for components and signals.
- Use `_duration_ms` for latency and wall-clock duration.
- Use `_rate` for fractions, `_count` for event counts, `_total` for monotonic counters, and `_cost_usd` or `_cost_inr` for cost.
- Record latency as distributions with at least p50, p95, and p99 rollups.
- Keep model names, provider names, tool IDs, and policy IDs in dimensions, not in metric names.
- Do not encode raw user text, prompts, document titles, or error messages in metric names or labels.
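
The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the contract: the function names `check_metric_name` and `unit_suffix` are hypothetical, and the check covers only the four-segment plane form. Scorecard names such as `contextos.run.end_to_end_duration_ms` carry no plane segment, so they would need separate handling.

```python
import re
from typing import List, Optional

# The five planes named by the contract.
PLANES = {"intelligence", "context", "decision", "action", "trust"}

# Unit suffixes listed in the rules. Some contract signals (for example
# precision_at_k) carry no suffix, so the suffix is reported, not enforced.
SUFFIXES = ("_duration_ms", "_cost_usd", "_cost_inr", "_rate", "_count", "_total")

# Lowercase snake case: letters and digits separated by single underscores.
SEGMENT = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")


def check_metric_name(name: str) -> List[str]:
    """Return a list of problems; an empty list means the name fits the shape."""
    parts = name.split(".")
    if len(parts) != 4 or parts[0] != "contextos":
        return ["expected contextos.<plane>.<component>.<signal>"]
    _, plane, component, signal = parts
    problems = []
    if plane not in PLANES:
        problems.append(f"unknown plane: {plane}")
    for label, segment in (("component", component), ("signal", signal)):
        if not SEGMENT.match(segment):
            problems.append(f"{label} is not lowercase snake case: {segment}")
    return problems


def unit_suffix(name: str) -> Optional[str]:
    """Return the unit suffix of the signal, or None when it has none."""
    signal = name.rsplit(".", 1)[-1]
    for suffix in SUFFIXES:
        if signal.endswith(suffix):
            return suffix
    return None
```

A check like this could run in CI against a metric registry, so that model names, tool IDs, or free text do not leak into metric names unnoticed.
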
Required Dimensions
Every metric must be joinable to the run trace. High-cardinality IDs such as run_id and trace_id belong on spans, events, exemplars, and rollup records; only attach them as time-series labels when the backend is designed for that cardinality.
| Dimension | Required on | Purpose |
|---|---|---|
| tenant_id | span, event, rollup | Tenant-level slicing and data isolation. |
| environment | span, event, metric | dev, staging, prod, or equivalent. |
| release_version | span, event, metric | Runtime or service release attribution. |
| plane | span, event, metric | One of intelligence, context, decision, action, trust. |
| component | span, event, metric | Gateway, compiler, planner, tool manager, evaluator, or observability component. |
| component_version | span, event, rollup | Localizes regressions to a deployed component version. |
| workflow_id | span, event, rollup | Workflow or journey-level rollup. |
| intent_id | span, event, rollup | Intent-level quality, safety, and cost comparison. |
| task_type | span, event, rollup | Stable task taxonomy used by scorecards. |
| risk_class | span, event, rollup | Approval and safety tier. |
| run_id | span, event, exemplar | Joins artifacts for a single run. |
| trace_id | span, event, exemplar | Joins telemetry with the trace bundle. |
Use the following dimensions when they apply:
| Dimension | Applies to |
|---|---|
| model_profile_id, provider, routing_policy_id | AI Gateway and LLM Router calls. |
| pack_version, context_pack_id, evidence_source_type | Context Pack compilation and evidence metrics. |
| memory_tier, knowledge_snapshot_id | Memory and knowledge-substrate metrics. |
| tool_id, adapter_id, approval_mode_declared, approval_mode_effective | Tool Manager and adapter calls. |
| policy_profile, policy_decision_id | Policy and approval metrics. |
| evaluator_id, golden_set_id, replay_dataset_id | Evaluation and Observability metrics. |
| channel, locale, user_cohort | User-facing adoption and UX metrics. |
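
The "Required on" column of the required-dimensions table can be turned into a lookup and checked at emit time. This is a minimal sketch under that reading of the table. The helper names `missing_dimensions` and `high_cardinality_labels` are hypothetical, and the sketch assumes telemetry attributes arrive as a plain dictionary.

```python
from typing import Dict, Iterable, List, Set

# Transcribed from the required-dimensions table: dimension -> telemetry kinds.
REQUIRED_ON: Dict[str, Set[str]] = {
    "tenant_id": {"span", "event", "rollup"},
    "environment": {"span", "event", "metric"},
    "release_version": {"span", "event", "metric"},
    "plane": {"span", "event", "metric"},
    "component": {"span", "event", "metric"},
    "component_version": {"span", "event", "rollup"},
    "workflow_id": {"span", "event", "rollup"},
    "intent_id": {"span", "event", "rollup"},
    "task_type": {"span", "event", "rollup"},
    "risk_class": {"span", "event", "rollup"},
    "run_id": {"span", "event", "exemplar"},
    "trace_id": {"span", "event", "exemplar"},
}

# IDs the contract keeps off time-series labels unless the backend allows it.
HIGH_CARDINALITY = {"run_id", "trace_id"}


def missing_dimensions(kind: str, attributes: Dict[str, str]) -> List[str]:
    """List the required dimensions absent from one span, event, metric, etc."""
    return sorted(
        name
        for name, kinds in REQUIRED_ON.items()
        if kind in kinds and name not in attributes
    )


def high_cardinality_labels(labels: Iterable[str]) -> List[str]:
    """List time-series labels that should normally live on spans or exemplars."""
    return sorted(HIGH_CARDINALITY.intersection(labels))
```

For example, a metric point that carries `environment`, `release_version`, and `component` but no `plane` would be reported as missing `plane`, while a metric labeled with `run_id` would be flagged for review against the backend's cardinality limits.
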
Minimum Scorecard
Every production scorecard should expose these metrics by tenant_id, intent_id, risk_class, release_version, and pack_version when available.
| Metric | Definition | Owner | Direction |
|---|---|---|---|
| contextos.decision.task.verified_success_rate | Tasks that pass the configured verifier divided by tasks started. | Decision plane | Up |
| contextos.decision.task.safe_completion_rate | Tasks completed without policy or approval violations divided by tasks started. | Trust plane | Up |
| contextos.run.end_to_end_duration_ms | Time from accepted user or system request to terminal run state. | Platform runtime | Down |
| contextos.run.first_useful_response_duration_ms | Time from request acceptance to first useful response or action proposal. | Product + runtime | Down |
| contextos.budget.cost_per_verified_success | Total run cost divided by verified successes in the rollup window. | Platform + product | Down |
| contextos.context.answer.evidence_backed_rate | Share of responses requiring evidence that include valid evidence_refs. | Context plane | Up |
| contextos.action.tool.success_rate | Successful tool results divided by attempted tool calls. | Action plane | Up |
| contextos.trust.policy.violation_rate | Runs with policy violations divided by completed runs. | Trust plane | Down |
| contextos.trust.trace.completeness_rate | Share of runs with the required spans, scorecard, evidence, and audit links. | Observability | Up |
| contextos.trust.replay.determinism_rate | Share of replay runs that reproduce the pinned expected verdict or record. | Evaluation | Up |
Thresholds are environment- and intent-specific. The contract defines formulas and owners; each deployment defines alert thresholds, sampling policy, and release gates.
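
To make the scorecard formulas concrete, here is a minimal sketch of a rollup over one window. The `RunSummary` record and its fields are hypothetical. The sketch also assumes that each run carries exactly one task, and that the violation rate counts violations among completed runs only. A real deployment would derive these values from the emitted artifacts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunSummary:
    """Hypothetical per-run summary for one rollup window."""
    verified: bool          # the configured verifier passed
    completed: bool         # the run reached a completed terminal state
    policy_violation: bool  # at least one policy violation was recorded
    cost_usd: float         # total run cost


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def scorecard_rollup(runs: List[RunSummary]) -> Dict[str, Optional[float]]:
    started = len(runs)
    verified = sum(1 for r in runs if r.verified)
    completed = sum(1 for r in runs if r.completed)
    violations = sum(1 for r in runs if r.completed and r.policy_violation)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        # Tasks that pass the verifier divided by tasks started.
        "contextos.decision.task.verified_success_rate": ratio(verified, started),
        # Runs with policy violations divided by completed runs.
        "contextos.trust.policy.violation_rate": ratio(violations, completed),
        # Total run cost divided by verified successes in the window.
        "contextos.budget.cost_per_verified_success": ratio(total_cost, verified),
    }
```

Returning `None` for an empty denominator keeps a window with no verified successes from reporting a misleading cost of zero.
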
Per-Plane Metrics
Intelligence Plane
The Intelligence plane owns model invocation, model routing, provider behavior, and model-side budgets through the AI Gateway and LLM Router.
| Area | Contract metrics |
|---|---|
| Availability | contextos.intelligence.gateway.availability_rate, contextos.intelligence.provider.error_rate, contextos.intelligence.provider.timeout_rate |
| Latency | contextos.intelligence.gateway.request_duration_ms, contextos.intelligence.router.decision_duration_ms, contextos.intelligence.provider.request_duration_ms |
| Routing | contextos.intelligence.router.fallback_rate, contextos.intelligence.router.model_switch_rate, contextos.intelligence.router.policy_rejection_count |
| Quality controls | contextos.intelligence.output.invalid_schema_rate, contextos.intelligence.output.refusal_rate, contextos.intelligence.output.repair_rate |
| Budget | contextos.intelligence.tokens.input_total, contextos.intelligence.tokens.output_total, contextos.intelligence.gateway.cost_usd, contextos.intelligence.gateway.cache_hit_rate |
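
The latency metrics in this table are recorded as distributions, per the naming rules. The following sketch shows one way to compute the p50, p95, and p99 rollups from raw samples. It uses the nearest-rank percentile definition, which is an assumption: the contract does not name a percentile method, and a metrics backend may interpolate or use histogram buckets instead. The function names are hypothetical.

```python
from typing import Dict, Sequence


def nearest_rank(sorted_values: Sequence[float], percent: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(percent / 100 * n)."""
    n = len(sorted_values)
    # Integer ceiling division avoids floating-point rounding at the boundary.
    rank = max(1, (percent * n + 99) // 100)
    return sorted_values[rank - 1]


def latency_rollup(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Compute the minimum rollups for one _duration_ms metric and window."""
    if not durations_ms:
        raise ValueError("no samples in the rollup window")
    ordered = sorted(durations_ms)
    return {f"p{p}": nearest_rank(ordered, p) for p in (50, 95, 99)}
```

With only a handful of samples, p95 and p99 both collapse to the maximum, so tail rollups are only meaningful on windows with enough traffic.
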
Context Plane
The Context plane owns Context Packs, retrieval, memory, evidence, conflict handling, and knowledge snapshots. See Context Pack, Memory Model, and Knowledge Graph.
| Area | Contract metrics |
|---|---|
| Pack build | contextos.context.pack.build_duration_ms, contextos.context.pack.token_count, contextos.context.pack.context_window_utilization_rate |
| Evidence | contextos.context.pack.evidence_coverage_rate, contextos.context.answer.evidence_backed_rate, contextos.context.claim.attribution_rate |
| Retrieval | contextos.context.retrieval.precision_at_k, contextos.context.retrieval.recall_at_k, contextos.context.retrieval.stale_source_rate |
| Memory | contextos.context.memory.promotion_accept_rate, contextos.context.memory.correction_rate, contextos.context.memory.stale_read_rate |
| Context hazards | contextos.context.noise.irrelevant_token_rate, contextos.context.conflict.detected_rate, contextos.context.conflict.resolved_rate, contextos.context.poisoning.suspected_rate |
Decision Plane
The Decision plane owns planning, execution choice, critique, loop controls, and the Decision Record. See also the Decision Catalog.
| Area | Contract metrics |
|---|---|
| Planning | contextos.decision.plan.validity_rate, contextos.decision.plan.feasibility_rate, contextos.decision.plan.revision_count |
| Execution control | contextos.decision.executor.step_success_rate, contextos.decision.executor.loop_guard_trigger_rate, contextos.decision.executor.escalation_rate |
| Critique | contextos.decision.critic.veto_rate, contextos.decision.critic.repair_success_rate, contextos.decision.critic.false_pass_rate |
| Outcomes | contextos.decision.task.success_rate, contextos.decision.task.verified_success_rate, contextos.decision.task.abandonment_rate |
| Records | contextos.decision.record.completeness_rate, contextos.decision.record.evidence_ref_count, contextos.decision.record.policy_ref_count |
Action Plane
The Action plane owns side effects through the Tool Manager and the Adapter Mesh.
| Area | Contract metrics |
|---|---|
| Tool calls | contextos.action.tool.success_rate, contextos.action.tool.error_rate, contextos.action.tool.request_duration_ms, contextos.action.tool.retry_rate |
| Approval binding | contextos.action.approval.required_rate, contextos.action.approval.honored_rate, contextos.action.approval.denied_rate |
| Idempotency | contextos.action.idempotency.replay_hit_rate, contextos.action.idempotency.duplicate_effect_rate |
| Adapter health | contextos.action.adapter.availability_rate, contextos.action.adapter.schema_validation_error_rate, contextos.action.adapter.version_drift_count |
| Evidence return | contextos.action.tool.evidence_return_rate, contextos.action.tool.audit_link_rate |
Trust Plane
The Trust plane owns policy, approval gates, evaluation, observability, audit, replay, and security posture. See Evaluation and Observability, Observability, and the Policy Engine.
| Area | Contract metrics |
|---|---|
| Policy | contextos.trust.policy.violation_rate, contextos.trust.policy.must_refuse_coverage_rate, contextos.trust.policy.decision_duration_ms |
| Evaluation | contextos.trust.scorecard.coverage_rate, contextos.trust.eval.pass_rate, contextos.trust.eval.regression_rate, contextos.trust.eval.judge_agreement_rate |
| Observability | contextos.trust.trace.completeness_rate, contextos.trust.audit.completeness_rate, contextos.trust.trace.fetch_duration_ms |
| Replay | contextos.trust.replay.determinism_rate, contextos.trust.replay.dataset_coverage_rate, contextos.trust.replay.duration_ms |
| Security | contextos.trust.security.event_count, contextos.trust.security.cross_tenant_denial_count, contextos.trust.security.redaction_failure_rate |
| Adoption | contextos.trust.user_correction_rate, contextos.trust.operator_override_rate, contextos.trust.human_escalation_rate |
Emitted Artifacts
Metrics are only useful when their source artifacts are stable. Each run should emit the artifacts needed for its path; for example, a read-only answer may not emit an approval decision, but any governed action must.
| Artifact | Required contents | Primary owner |
|---|---|---|
| ContextPackManifest | pack version, item IDs, evidence refs, token spans, source timestamps, retrieval query refs | Context plane |
| RoutingDecision | model profile, provider adapter, routing policy, rejected candidates summary, fallback index, usage, estimated cost | Intelligence plane |
| PlanRecord | plan ID, steps, feasibility checks, revisions, loop-guard state | Decision plane |
| DecisionRecord | final decision, verifier result, evidence refs, policy refs, scorecard ref | Decision plane |
| ToolCall / ToolResult | tool ID, adapter ID, schema refs, approval mode, idempotency key, result status, evidence refs | Action plane |
| PolicyDecision | policy profile, decision ID, rule refs, gate status, approver ref when applicable | Trust plane |
| ConflictLedger | conflicting sources, severity, chosen resolution, rule or reviewer reference | Context + Trust planes |
| MemoryWriteProposal | proposed fact, provenance, promotion decision, reviewer or policy result | Context plane |
| Scorecard | evaluator IDs, dimension scores, thresholds, release-gate verdict | Evaluation |
| TraceBundle | W3C trace context, plane span chain, artifact refs, audit refs, sampling reason | Observability |
| ReplayDataset | pinned input envelope, pack version, snapshot refs, tool transcripts, expected verdict | Evaluation |
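
Because the required artifacts depend on the run's path, a completeness check needs a per-path requirement set. The sketch below illustrates the idea. The path names and the exact artifact sets are assumptions made for illustration. The contract only states that a read-only answer may omit an approval decision while a governed action must emit one.

```python
from typing import Dict, List, Optional, Set, Tuple

# Illustrative baseline; a deployment would declare its own sets per path.
BASE_ARTIFACTS: Set[str] = {
    "ContextPackManifest",
    "RoutingDecision",
    "DecisionRecord",
    "TraceBundle",
}

REQUIRED_BY_PATH: Dict[str, Set[str]] = {
    "read_only_answer": BASE_ARTIFACTS,
    # Governed actions add planning, tool, and policy artifacts.
    "governed_action": BASE_ARTIFACTS
    | {"PlanRecord", "ToolCall", "ToolResult", "PolicyDecision"},
}


def missing_artifacts(path: str, emitted: Set[str]) -> List[str]:
    """List the artifacts a run on this path should have emitted but did not."""
    return sorted(REQUIRED_BY_PATH[path] - emitted)


def completeness_rate(runs: List[Tuple[str, Set[str]]]) -> Optional[float]:
    """Share of runs with no missing artifacts; None for an empty window."""
    if not runs:
        return None
    complete = sum(1 for path, emitted in runs if not missing_artifacts(path, emitted))
    return complete / len(runs)
```

A governed action that emits everything except `PolicyDecision` counts as incomplete, which is the case the contract singles out.
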
Owners
Every contract metric needs a named owner before it is used in release gates or incident review.
| Owner | Responsibilities |
|---|---|
| Platform runtime | Gateway latency, routing, cost, token accounting, run-level duration, service availability. |
| Context engineering | Context Pack quality, retrieval, memory promotion, evidence attribution, conflict and poisoning signals. |
| Decision engineering | Plan validity, execution control, verifier outcomes, Decision Record completeness. |
| Action platform | Tool Manager reliability, approval-mode binding, adapter health, idempotency, tool evidence return. |
| Trust, security, and SRE | Policy decisions, audit, trace completeness, replay, security events, redaction, incident scorecards. |
| Product or domain owner | Intent taxonomy, done criteria, golden sets, acceptable thresholds, user correction interpretation. |
Owner duties:
- Maintain the formula, numerator, denominator, unit, and rollup windows.
- Declare the source artifact and required dimensions.
- Define alert thresholds and release-gate thresholds per environment.
- Review metric behavior after any schema, policy, model, tool, or Context Pack version change.
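
The owner duties above amount to a small record that each contract metric should carry before it feeds a release gate. Here is a minimal sketch of such a record. The `MetricDefinition` class, its field names, and the `undeclared_fields` helper are hypothetical, and the rollup windows in the usage example are placeholders rather than contract values.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical registry entry covering the owner duties."""
    name: str
    owner: str
    numerator: str
    denominator: str
    unit: str
    rollup_windows: Tuple[str, ...]
    source_artifacts: Tuple[str, ...]
    dimensions: Tuple[str, ...]


def undeclared_fields(definition: MetricDefinition) -> List[str]:
    """List fields left empty, in declaration order; empty means gate-ready."""
    return [f.name for f in fields(definition) if not getattr(definition, f.name)]


# Example entry built from the minimum scorecard row for verified success.
VERIFIED_SUCCESS = MetricDefinition(
    name="contextos.decision.task.verified_success_rate",
    owner="Decision plane",
    numerator="tasks that pass the configured verifier",
    denominator="tasks started",
    unit="ratio",
    rollup_windows=("1h", "1d"),  # placeholder windows, not from the contract
    source_artifacts=("DecisionRecord",),
    dimensions=("tenant_id", "intent_id", "risk_class", "release_version"),
)
```

A release-gate pipeline could refuse any metric for which `undeclared_fields` returns a non-empty list, which enforces the rule that a contract metric needs a named owner before it is used in gates or incident review.
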
Link Map
- AI Gateway and LLM Router: model routing, provider calls, budgets, and token metrics.
- Tool Manager: tool-call success, approval binding, adapter health, and idempotency metrics.
- Evaluation and Observability: scorecards, replay, release gates, evaluator dimensions.
- Observability: trace bundles, audit records, sampling, and replay substrate.
- Policy Engine: policy decisions and approval gates.