ContextOS Metrics Glossary

Unified metric glossary across the five planes — Intelligence, Context, Decision, Action, Trust.

Reference Design
At a glance

This page is the stable metrics contract for ContextOS. It defines the names, dimensions, source artifacts, owners, and minimum scorecard that every implementation should keep consistent across the five planes.

It is not an exhaustive dashboard inventory. Teams can add local metrics, but release gates, incidents, and executive scorecards should roll up through the contract below.

Naming Conventions

Metric names use one namespace and one shape:

contextos.<plane>.<component>.<signal>

Examples:

| Metric | Meaning |
| --- | --- |
| contextos.intelligence.gateway.request_duration_ms | AI Gateway request latency. |
| contextos.context.pack.evidence_coverage_rate | Share of required evidence represented in the compiled Context Pack. |
| contextos.decision.plan.validity_rate | Share of generated plans passing structural validation. |
| contextos.action.tool.success_rate | Share of tool calls returning a completed ToolResultEnvelope. |
| contextos.trust.trace.completeness_rate | Share of runs with the required trace and audit artifacts. |

Rules:

  • Use lowercase snake case for components and signals.
  • Use _duration_ms for latency and wall-clock duration.
  • Use _rate for fractions, _count for event counts, _total for monotonic counters, and _cost_usd or _cost_inr for cost.
  • Record latency as distributions with at least p50, p95, and p99 rollups.
  • Keep model names, provider names, tool IDs, and policy IDs in dimensions, not in metric names.
  • Do not encode raw user text, prompts, document titles, or error messages in metric names or labels.
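The shape and suffix rules above can be checked mechanically. The following is a minimal sketch, not part of the contract: `check_metric_name` and its messages are hypothetical names chosen for illustration, and it enforces only the namespace shape, the five planes, and the `_duration_ms` rule.

```python
import re

# The five planes named in this glossary.
PLANES = {"intelligence", "context", "decision", "action", "trust"}

# One lowercase snake_case segment, e.g. "request_duration_ms".
_SEGMENT = r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*"
_NAME_RE = re.compile(
    rf"^contextos\.(?P<plane>{_SEGMENT})\.(?P<component>{_SEGMENT})\.(?P<signal>{_SEGMENT})$"
)


def check_metric_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes.

    Hypothetical helper: the contract does not define a validator.
    """
    match = _NAME_RE.match(name)
    if match is None:
        return ["expected contextos.<plane>.<component>.<signal> in lowercase snake case"]
    problems = []
    if match["plane"] not in PLANES:
        problems.append(f"unknown plane: {match['plane']}")
    signal = match["signal"]
    # Latency and wall-clock duration must use the _duration_ms suffix.
    if "latency" in signal or (signal.endswith("_ms") and not signal.endswith("_duration_ms")):
        problems.append("use the _duration_ms suffix for latency and duration")
    return problems
```

Run-level names in the Minimum Scorecard, such as contextos.run.end_to_end_duration_ms and contextos.budget.cost_per_verified_success, do not fit the four-part shape, so a validator along these lines would need an explicit allow-list for them.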

Required Dimensions

Every metric must be joinable to the run trace. High-cardinality IDs such as run_id and trace_id belong on spans, events, exemplars, and rollup records; only attach them as time-series labels when the backend is designed for that cardinality.

| Dimension | Required on | Purpose |
| --- | --- | --- |
| tenant_id | span, event, rollup | Tenant-level slicing and data isolation. |
| environment | span, event, metric | dev, staging, prod, or equivalent. |
| release_version | span, event, metric | Runtime or service release attribution. |
| plane | span, event, metric | One of intelligence, context, decision, action, trust. |
| component | span, event, metric | Gateway, compiler, planner, tool manager, evaluator, or observability component. |
| component_version | span, event, rollup | Localizes regressions to a deployed component version. |
| workflow_id | span, event, rollup | Workflow or journey-level rollup. |
| intent_id | span, event, rollup | Intent-level quality, safety, and cost comparison. |
| task_type | span, event, rollup | Stable task taxonomy used by scorecards. |
| risk_class | span, event, rollup | Approval and safety tier. |
| run_id | span, event, exemplar | Joins artifacts for a single run. |
| trace_id | span, event, exemplar | Joins telemetry with the trace bundle. |

Use the following dimensions when they apply:

| Dimension | Applies to |
| --- | --- |
| model_profile_id, provider, routing_policy_id | AI Gateway and LLM Router calls. |
| pack_version, context_pack_id, evidence_source_type | Context Pack compilation and evidence metrics. |
| memory_tier, knowledge_snapshot_id | Memory and knowledge-substrate metrics. |
| tool_id, adapter_id, approval_mode_declared, approval_mode_effective | Tool Manager and adapter calls. |
| policy_profile, policy_decision_id | Policy and approval metrics. |
| evaluator_id, golden_set_id, replay_dataset_id | Evaluation and Observability metrics. |
| channel, locale, user_cohort | User-facing adoption and UX metrics. |
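The required-dimension table can be encoded as data so that emitters are checked before a record leaves the process. This is a rough sketch under stated assumptions: `REQUIRED_ON` mirrors the "Required on" column of the required-dimensions table, while the function name `missing_dimensions` and the record kinds as plain strings are illustrative choices, not part of the contract.

```python
# Which record kinds must carry each required dimension.
# Values are transcribed from the "Required on" column of the table.
REQUIRED_ON = {
    "tenant_id": {"span", "event", "rollup"},
    "environment": {"span", "event", "metric"},
    "release_version": {"span", "event", "metric"},
    "plane": {"span", "event", "metric"},
    "component": {"span", "event", "metric"},
    "component_version": {"span", "event", "rollup"},
    "workflow_id": {"span", "event", "rollup"},
    "intent_id": {"span", "event", "rollup"},
    "task_type": {"span", "event", "rollup"},
    "risk_class": {"span", "event", "rollup"},
    "run_id": {"span", "event", "exemplar"},
    "trace_id": {"span", "event", "exemplar"},
}


def missing_dimensions(kind: str, dims: dict[str, str]) -> list[str]:
    """Return the required dimensions that are absent or empty for this record kind.

    `kind` is one of "span", "event", "metric", "rollup", "exemplar"
    (hypothetical string tags for the record types named in the table).
    """
    return sorted(
        name
        for name, kinds in REQUIRED_ON.items()
        if kind in kinds and not dims.get(name)
    )
```

Note that run_id and trace_id are never required on the "metric" kind here, which matches the guidance to keep high-cardinality IDs off time-series labels unless the backend is designed for them.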

Minimum Scorecard

Every production scorecard should expose these metrics by tenant_id, intent_id, risk_class, release_version, and pack_version when available.

| Metric | Definition | Owner | Direction |
| --- | --- | --- | --- |
| contextos.decision.task.verified_success_rate | Tasks that pass the configured verifier divided by tasks started. | Decision plane | Up |
| contextos.decision.task.safe_completion_rate | Tasks completed without policy or approval violations divided by tasks started. | Trust plane | Up |
| contextos.run.end_to_end_duration_ms | Time from accepted user or system request to terminal run state. | Platform runtime | Down |
| contextos.run.first_useful_response_duration_ms | Time from request acceptance to first useful response or action proposal. | Product + runtime | Down |
| contextos.budget.cost_per_verified_success | Total run cost divided by verified successes in the rollup window. | Platform + product | Down |
| contextos.context.answer.evidence_backed_rate | Responses requiring evidence that include valid evidence_refs. | Context plane | Up |
| contextos.action.tool.success_rate | Successful tool results divided by attempted tool calls. | Action plane | Up |
| contextos.trust.policy.violation_rate | Runs with policy violations divided by completed runs. | Trust plane | Down |
| contextos.trust.trace.completeness_rate | Runs with required spans, scorecard, evidence, and audit links. | Observability | Up |
| contextos.trust.replay.determinism_rate | Replay runs that reproduce the pinned expected verdict or record. | Evaluation | Up |

Thresholds are environment- and intent-specific. The contract defines formulas and owners; each deployment defines alert thresholds, sampling policy, and release gates.
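To make the scorecard formulas concrete, here is a small sketch that computes three of them over one rollup window. `RunSummary` and its fields are hypothetical; a real implementation would derive them from the emitted artifacts. Two assumptions are baked in: every record in the list counts as a started task, and the violation numerator counts only completed runs so the rate stays within 0 to 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-run record for one rollup window."""
    verified: bool          # passed the configured verifier
    completed: bool         # reached a completed terminal state
    policy_violation: bool  # at least one policy violation recorded
    cost_usd: float         # total run cost


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    # An empty denominator yields None rather than a misleading 0.0.
    return numerator / denominator if denominator else None


def scorecard(runs: list[RunSummary]) -> dict[str, Optional[float]]:
    started = len(runs)  # assumption: every record is a started task
    verified = sum(r.verified for r in runs)
    completed = sum(r.completed for r in runs)
    # Assumption: only violations on completed runs enter the numerator.
    violations = sum(r.policy_violation and r.completed for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "contextos.decision.task.verified_success_rate": _ratio(verified, started),
        "contextos.trust.policy.violation_rate": _ratio(violations, completed),
        "contextos.budget.cost_per_verified_success": _ratio(total_cost, verified),
    }
```

Returning None for an empty denominator keeps a window with no verified successes from reporting a cost of zero, which would otherwise read as an improvement.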

Per-Plane Metrics

Intelligence Plane

The Intelligence plane owns model invocation, model routing, provider behavior, and model-side budgets through the AI Gateway and LLM Router.

| Area | Contract metrics |
| --- | --- |
| Availability | contextos.intelligence.gateway.availability_rate, contextos.intelligence.provider.error_rate, contextos.intelligence.provider.timeout_rate |
| Latency | contextos.intelligence.gateway.request_duration_ms, contextos.intelligence.router.decision_duration_ms, contextos.intelligence.provider.request_duration_ms |
| Routing | contextos.intelligence.router.fallback_rate, contextos.intelligence.router.model_switch_rate, contextos.intelligence.router.policy_rejection_count |
| Quality controls | contextos.intelligence.output.invalid_schema_rate, contextos.intelligence.output.refusal_rate, contextos.intelligence.output.repair_rate |
| Budget | contextos.intelligence.tokens.input_total, contextos.intelligence.tokens.output_total, contextos.intelligence.gateway.cost_usd, contextos.intelligence.gateway.cache_hit_rate |
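Latency metrics such as contextos.intelligence.gateway.request_duration_ms are recorded as distributions with at least p50, p95, and p99 rollups. The sketch below shows one way to compute those rollups from raw samples using the nearest-rank percentile definition. That choice is an assumption, since the contract does not name a percentile method, and production backends typically use histograms or quantile sketches instead of sorting raw samples.

```python
def nearest_rank(ordered: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it. `ordered` must be sorted ascending and non-empty."""
    n = len(ordered)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * n + 99) // 100)
    return ordered[rank - 1]


def latency_rollup(durations_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 for a window of duration samples in milliseconds."""
    if not durations_ms:
        raise ValueError("cannot roll up an empty window")
    ordered = sorted(durations_ms)
    return {f"p{p}": nearest_rank(ordered, p) for p in (50, 95, 99)}
```

With only a handful of samples, p95 and p99 collapse onto the maximum, so small windows are better reported alongside their sample count.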

Context Plane

The Context plane owns Context Packs, retrieval, memory, evidence, conflict handling, and knowledge snapshots. See Context Pack, Memory Model, and Knowledge Graph.

| Area | Contract metrics |
| --- | --- |
| Pack build | contextos.context.pack.build_duration_ms, contextos.context.pack.token_count, contextos.context.pack.context_window_utilization_rate |
| Evidence | contextos.context.pack.evidence_coverage_rate, contextos.context.answer.evidence_backed_rate, contextos.context.claim.attribution_rate |
| Retrieval | contextos.context.retrieval.precision_at_k, contextos.context.retrieval.recall_at_k, contextos.context.retrieval.stale_source_rate |
| Memory | contextos.context.memory.promotion_accept_rate, contextos.context.memory.correction_rate, contextos.context.memory.stale_read_rate |
| Context hazards | contextos.context.noise.irrelevant_token_rate, contextos.context.conflict.detected_rate, contextos.context.conflict.resolved_rate, contextos.context.poisoning.suspected_rate |
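This glossary names precision_at_k and recall_at_k without spelling out their formulas. The sketch below uses the standard information-retrieval definitions, which is an assumption about what the contract intends. The function signatures, the use of string document IDs, and the treatment of an empty relevant set are illustrative choices.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant.

    Assumes `retrieved` is ranked best-first and contains no duplicates.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k retrieved items."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not relevant:
        # Convention chosen for this sketch: nothing to recall scores 0.0.
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)
```

Both metrics depend on a labeled relevant set, so in practice they would be computed against a golden set and sliced by the golden_set_id dimension.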

Decision Plane

The Decision plane owns planning, execution choice, critique, loop controls, and the Decision Record. See also the Decision Catalog.

| Area | Contract metrics |
| --- | --- |
| Planning | contextos.decision.plan.validity_rate, contextos.decision.plan.feasibility_rate, contextos.decision.plan.revision_count |
| Execution control | contextos.decision.executor.step_success_rate, contextos.decision.executor.loop_guard_trigger_rate, contextos.decision.executor.escalation_rate |
| Critique | contextos.decision.critic.veto_rate, contextos.decision.critic.repair_success_rate, contextos.decision.critic.false_pass_rate |
| Outcomes | contextos.decision.task.success_rate, contextos.decision.task.verified_success_rate, contextos.decision.task.abandonment_rate |
| Records | contextos.decision.record.completeness_rate, contextos.decision.record.evidence_ref_count, contextos.decision.record.policy_ref_count |

Action Plane

The Action plane owns side effects through the Tool Manager and the Adapter Mesh.

| Area | Contract metrics |
| --- | --- |
| Tool calls | contextos.action.tool.success_rate, contextos.action.tool.error_rate, contextos.action.tool.request_duration_ms, contextos.action.tool.retry_rate |
| Approval binding | contextos.action.approval.required_rate, contextos.action.approval.honored_rate, contextos.action.approval.denied_rate |
| Idempotency | contextos.action.idempotency.replay_hit_rate, contextos.action.idempotency.duplicate_effect_rate |
| Adapter health | contextos.action.adapter.availability_rate, contextos.action.adapter.schema_validation_error_rate, contextos.action.adapter.version_drift_count |
| Evidence return | contextos.action.tool.evidence_return_rate, contextos.action.tool.audit_link_rate |

Trust Plane

The Trust plane owns policy, approval gates, evaluation, observability, audit, replay, and security posture. See Evaluation and Observability, Observability, and the Policy Engine.

| Area | Contract metrics |
| --- | --- |
| Policy | contextos.trust.policy.violation_rate, contextos.trust.policy.must_refuse_coverage_rate, contextos.trust.policy.decision_duration_ms |
| Evaluation | contextos.trust.scorecard.coverage_rate, contextos.trust.eval.pass_rate, contextos.trust.eval.regression_rate, contextos.trust.eval.judge_agreement_rate |
| Observability | contextos.trust.trace.completeness_rate, contextos.trust.audit.completeness_rate, contextos.trust.trace.fetch_duration_ms |
| Replay | contextos.trust.replay.determinism_rate, contextos.trust.replay.dataset_coverage_rate, contextos.trust.replay.duration_ms |
| Security | contextos.trust.security.event_count, contextos.trust.security.cross_tenant_denial_count, contextos.trust.security.redaction_failure_rate |
| Adoption | contextos.trust.user_correction_rate, contextos.trust.operator_override_rate, contextos.trust.human_escalation_rate |

Emitted Artifacts

Metrics are only useful when their source artifacts are stable. Each run should emit the artifacts needed for its path; for example, a read-only answer may not emit an approval decision, but any governed action must.

| Artifact | Required contents | Primary owner |
| --- | --- | --- |
| ContextPackManifest | pack version, item IDs, evidence refs, token spans, source timestamps, retrieval query refs | Context plane |
| RoutingDecision | model profile, provider adapter, routing policy, rejected candidates summary, fallback index, usage, estimated cost | Intelligence plane |
| PlanRecord | plan ID, steps, feasibility checks, revisions, loop-guard state | Decision plane |
| DecisionRecord | final decision, verifier result, evidence refs, policy refs, scorecard ref | Decision plane |
| ToolCall / ToolResult | tool ID, adapter ID, schema refs, approval mode, idempotency key, result status, evidence refs | Action plane |
| PolicyDecision | policy profile, decision ID, rule refs, gate status, approver ref when applicable | Trust plane |
| ConflictLedger | conflicting sources, severity, chosen resolution, rule or reviewer reference | Context + Trust planes |
| MemoryWriteProposal | proposed fact, provenance, promotion decision, reviewer or policy result | Context plane |
| Scorecard | evaluator IDs, dimension scores, thresholds, release-gate verdict | Evaluation |
| TraceBundle | W3C trace context, plane span chain, artifact refs, audit refs, sampling reason | Observability |
| ReplayDataset | pinned input envelope, pack version, snapshot refs, tool transcripts, expected verdict | Evaluation |
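Because each run emits only the artifacts needed for its path, a completeness check has to be path-aware. The sketch below illustrates the idea. The path names and the per-path artifact sets in `REQUIRED_ARTIFACTS` are assumptions made for this example; the only requirement taken from this page is that a governed action must emit a PolicyDecision while a read-only answer need not.

```python
from typing import Optional

# Artifacts assumed for a read-only answer (illustrative, not contractual).
_BASE = {"ContextPackManifest", "RoutingDecision", "DecisionRecord", "TraceBundle"}

# Hypothetical run paths mapped to the artifacts each must emit.
REQUIRED_ARTIFACTS = {
    "read_only_answer": set(_BASE),
    # A governed action additionally needs the tool and policy artifacts.
    "governed_action": _BASE | {"ToolCall", "ToolResult", "PolicyDecision"},
}


def missing_artifacts(path: str, emitted: set[str]) -> set[str]:
    """Return the artifacts required for this path that the run did not emit."""
    return REQUIRED_ARTIFACTS[path] - emitted


def artifact_completeness_rate(runs: list[tuple[str, set[str]]]) -> Optional[float]:
    """Share of runs whose emitted artifacts cover their path's required set.

    Each run is a (path, emitted artifact names) pair. Returns None for an
    empty window.
    """
    if not runs:
        return None
    complete = sum(1 for path, emitted in runs if not missing_artifacts(path, emitted))
    return complete / len(runs)
```

A check of this kind could feed contextos.trust.trace.completeness_rate, though that metric's definition also covers spans and audit links, which this sketch does not model.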

Owners

Every contract metric needs a named owner before it is used in release gates or incident review.

| Owner | Responsibilities |
| --- | --- |
| Platform runtime | Gateway latency, routing, cost, token accounting, run-level duration, service availability. |
| Context engineering | Context Pack quality, retrieval, memory promotion, evidence attribution, conflict and poisoning signals. |
| Decision engineering | Plan validity, execution control, verifier outcomes, Decision Record completeness. |
| Action platform | Tool Manager reliability, approval-mode binding, adapter health, idempotency, tool evidence return. |
| Trust, security, and SRE | Policy decisions, audit, trace completeness, replay, security events, redaction, incident scorecards. |
| Product or domain owner | Intent taxonomy, done criteria, golden sets, acceptable thresholds, user correction interpretation. |

Owner duties:

  • Maintain the formula, numerator, denominator, unit, and rollup windows.
  • Declare the source artifact and required dimensions.
  • Define alert thresholds and release-gate thresholds per environment.
  • Review metric behavior after any schema, policy, model, tool, or Context Pack version change.
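The owner duties above amount to a small record that each contract metric should carry. One way to represent it, as a sketch under stated assumptions: `MetricDefinition` and `missing_for_release_gate` are hypothetical names, the numerator and denominator fields fit rate metrics best, and thresholds are omitted because they are set per deployment.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical registry entry for one contract metric (rate-style)."""
    name: str
    owner: str = ""
    numerator: str = ""
    denominator: str = ""
    unit: str = ""
    rollup_windows: tuple[str, ...] = ()
    source_artifact: str = ""
    required_dimensions: tuple[str, ...] = ()


def missing_for_release_gate(definition: MetricDefinition) -> list[str]:
    """Return the names of fields that are still empty.

    A metric with any missing field should not yet be used in release
    gates or incident review.
    """
    return [f.name for f in fields(definition) if not getattr(definition, f.name)]


# Example entry; the rollup windows are illustrative values, not contract values.
verified_success = MetricDefinition(
    name="contextos.decision.task.verified_success_rate",
    owner="Decision plane",
    numerator="tasks that pass the configured verifier",
    denominator="tasks started",
    unit="fraction",
    rollup_windows=("1h", "1d"),
    source_artifact="DecisionRecord",
    required_dimensions=("tenant_id", "intent_id", "risk_class", "release_version"),
)
```

Calling `missing_for_release_gate(verified_success)` returns an empty list, while a draft entry without an owner would list `owner` among its gaps.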