Span Metrics Without the Cardinality Explosion

Turn the spans you already emit into RED metrics with the spanmetrics connector, and avoid the one mistake that quietly multiplies your time series into a five-figure metrics bill.

Anas Milhem
June 13, 2026 · 14 min read

You've finally instrumented your application, whether with auto-instrumentation, hand-written spans, or some mix of the two, and now you want RED metrics (Rate, Errors, Duration) for every service. Metrics are what most dashboards and alerts are built on. They're the first thing that tells you something is wrong.

You might ask yourself: do I even need to generate these myself? Doesn't my backend do this automatically? The answer is most likely yes, it does. But this is where teams miss a very important detail: how accurate are those metrics if you are doing any sort of sampling?

Doesn't my backend already do this?

Frequently, yes. Several backends derive metrics from traces during ingestion, with no extra work from you.

In the Grafana stack, Tempo's metrics-generator computes span (RED) and service-graph metrics from ingested traces and writes them to a Prometheus-compatible store like Mimir, as traces_spanmetrics_* and traces_service_graph_* series, switched on per tenant. It's the backend-side equivalent of the spanmetrics connector. The commercial APMs do it too: Datadog, Dynatrace, New Relic, and Honeycomb all compute request/error/latency and service-level aggregates from the spans they receive.

So why run the connector yourself? Two reasons.

Control. Backend generation is convenient and easy, but it runs on the backend's terms: cardinality limits, custom dimensions, and exemplar handling follow its config, not yours. The connector gives you that control.

Sampling. If you sample traces before they reach the backend (tail sampling at a gateway, head sampling at the edge), backend-generated metrics only count the spans that survived sampling. Your "request rate" quietly becomes "request rate among the 5% of traces we kept," which is wrong in a way nobody notices until an incident. A spanmetrics connector placed ahead of the sampler sees every span and computes metrics on the full stream, before sampling drops anything.

That second point is the whole case for generating metrics yourself. You get an accurate representation of the health of your services even if you are not sending all your traces.
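To make "ahead of the sampler" concrete, here is a minimal sketch of one way to wire it. It assumes the contrib distribution with the tail_sampling processor and the forward connector available. The pipeline names traces/ingest and traces/sampled, the policy name keep-5-percent, and the 5% rate are illustrative choices, not requirements:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-5-percent          # hypothetical policy name
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

connectors:
  spanmetrics: {}
  forward:                            # passes spans from one traces pipeline to the next

exporters:
  otlp/traces:
    endpoint: tracing-backend:4317
  otlphttp/metrics:
    metrics_endpoint: https://metrics-backend/v1/metrics

service:
  pipelines:
    traces/ingest:                    # unsampled stream: every span reaches the connector
      receivers: [otlp]
      exporters: [spanmetrics, forward]
    traces/sampled:                   # sampling happens only on the way to the tracing backend
      receivers: [forward]
      processors: [tail_sampling]
      exporters: [otlp/traces]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp/metrics]
```

The design point is the split: if tail_sampling sat in the same pipeline that exports to spanmetrics, the connector would only ever see the sampled spans.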

How a connector is different

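[Diagram: your services send spans to the OpenTelemetry Collector. In the traces pipeline, the otlp receiver feeds the processors (batch, tail_sampling) and the sampled traces go to the tracing backend. The spanmetrics connector receives every span as an exporter of the traces pipeline and acts as the receiver of the metrics pipeline, which batches the metrics and sends them to the metrics backend.]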

A receiver brings data in. An exporter sends data out. A connector is both at once: it's the exporter at the end of one pipeline and the receiver at the start of another. The spanmetrics connector consumes spans and produces metrics, so it lives at the end of a traces pipeline and the start of a metrics pipeline.

What it produces

With nothing but the connector added, you get two metrics, namespaced under traces.span.metrics by default:

  • traces.span.metrics.calls is a monotonic counter of how many spans matched each dimension set. That's your Rate, and once you filter by status, your Errors.
  • traces.span.metrics.duration is a histogram of span durations. That's your Duration, the source of every p50/p95/p99 latency panel.

Both come pre-stamped with four default dimensions, lifted straight off each span:

  • service.name
  • span.name (the operation)
  • span.kind (server, client, internal, producer, consumer)
  • status.code (unset, ok, error)

Hold onto those four. Each one is a label on every series the connector emits. What happens when you add a fifth — or a fifth with the wrong values — is the rest of this post.
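As a rough illustration of how those two metrics turn into RED panels, here is a sketch of Prometheus recording rules. It rests on several assumptions: that your metrics land in a Prometheus-compatible store, that the exporter translates the names to traces_span_metrics_calls_total and traces_span_metrics_duration_milliseconds_bucket, and that the labels arrive as service_name, span_kind, and status_code with values such as SPAN_KIND_SERVER and STATUS_CODE_ERROR. Check the actual names in your backend before relying on any of this. The rule names are made up for the example:

```yaml
groups:
  - name: span_red_metrics            # hypothetical group name
    rules:
      # Rate: server-side calls per second, per service
      - record: service:span_calls:rate5m
        expr: |
          sum by (service_name) (
            rate(traces_span_metrics_calls_total{span_kind="SPAN_KIND_SERVER"}[5m])
          )
      # Errors: share of those calls whose span status is error
      - record: service:span_errors:ratio_rate5m
        expr: |
          sum by (service_name) (
            rate(traces_span_metrics_calls_total{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[5m])
          )
          /
          sum by (service_name) (
            rate(traces_span_metrics_calls_total{span_kind="SPAN_KIND_SERVER"}[5m])
          )
      # Duration: p95 latency in milliseconds, per service
      - record: service:span_duration_ms:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (service_name, le) (
              rate(traces_span_metrics_duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[5m])
            )
          )
```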

A working config

Here's the minimum that does something useful. The connector works with an empty config block, since the defaults are sensible; the block below just spells out the histogram buckets and sets a flush interval. The real work is in the pipeline wiring:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s]
    metrics_flush_interval: 15s

exporters:
  otlp/traces:
    endpoint: tracing-backend:4317
  otlphttp/metrics:
    # full per-signal URL; a bare `endpoint` is a base URL and gets /v1/metrics appended
    metrics_endpoint: https://metrics-backend/v1/metrics

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/traces, spanmetrics]   # fan out: real backend AND the connector
    metrics:
      receivers: [spanmetrics]                 # the connector feeds this pipeline
      exporters: [otlphttp/metrics]

Two things to notice. First, the traces pipeline lists two exporters, your tracing backend and the connector, so the connector doesn't consume the spans; it just watches them on the way out. Second, spanmetrics shows up as a receiver in the metrics pipeline. That cross-wiring is what makes it a connector rather than a processor.

The histogram buckets above are the connector's defaults, shown explicitly because they're worth tuning to your latency profile. The default unit is milliseconds; switch to seconds with histogram.unit: s if your backend expects it.

Want metric-to-trace jumps? Set exemplars.enabled: true. Each metric data point then carries a few example trace IDs (exemplars.max_per_data_point defaults to 5), so a spike on a latency panel links straight to an exemplar trace. It's off by default because exemplars add storage; turn it on when your backend supports the exemplar-to-trace handoff.
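A minimal sketch of that setting, using only the two options named above (the value 5 simply restates the default mentioned there):

```yaml
connectors:
  spanmetrics:
    exemplars:
      enabled: true            # attach example trace IDs to metric data points
      max_per_data_point: 5    # upper bound on exemplars kept per data point
```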

Adding dimensions

The four defaults answer "which operation, on which service, succeeded or failed, how fast." Real questions need more: which HTTP method, which status code, which route, which tenant. You add dimensions by naming the span or resource attributes you want promoted to labels:

connectors:
  spanmetrics:
    dimensions:
      - name: http.request.method
        default: GET
      - name: http.response.status_code
      - name: url.scheme
        default: https

Each entry pulls an attribute off the span. default supplies a value when the attribute is missing, which keeps a dimension from silently splitting into a "present" and "absent" pair of series. You can target individual metrics too: histogram.dimensions adds labels only to the duration histogram, calls_dimensions only to the calls counter. And you can drop a default you don't want with exclude_dimensions: [status.code].
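Put together, the per-metric options described above might look like the sketch below. Which attribute goes where is an illustrative choice, not a recommendation, and it is worth checking the exact option names against the README for your Collector version:

```yaml
connectors:
  spanmetrics:
    dimensions:                        # added to both calls and duration
      - name: http.request.method
        default: GET
    histogram:
      dimensions:                      # added to the duration histogram only
        - name: url.scheme
          default: https
    calls_dimensions:                  # added to the calls counter only
      - name: http.response.status_code
    exclude_dimensions: [status.code]  # drop a default dimension you don't slice by
```

Keeping a dimension off the histogram matters more than it looks: every extra label value on the duration metric is multiplied again by the number of buckets.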

This is the knob that makes the connector powerful. It's also the knob that blows up your bill.

The enemy of metrics: cardinality

Now you have metrics. You also have a new way to hurt yourself. The enemy of metrics is cardinality: the number of distinct time series. Most backends bill per series, so cardinality is the dial wired straight to your invoice, and the connector makes it very easy to turn.

Here's the scenario that gets teams. Someone ships a new checkout service whose instrumentation names spans after the request path, cart ID and all: POST /cart/8f2c9a/checkout, POST /cart/1b7e0d/checkout, one name per cart. Overnight the connector starts emitting a fresh calls and duration series for every cart anyone has ever checked out with. By morning that single service has added millions of series and the metrics bill for the whole org has doubled — from a deploy that didn't touch a single dashboard.

Why does it blow up so fast? Every unique combination of dimension values is a separate time series. Your metrics backend stores, indexes, and bills for each one. So the series count for the calls metric is, roughly, a product:

services × operations × span.kinds × status.codes × (every dimension you added)

Multiplication, not addition. A tidy setup (40 services, 30 operations each, 3 span kinds, 3 status codes) is already 40 × 30 × 3 × 3 = 10,800 series before you add anything. Add http.response.status_code at, say, 12 distinct values and you're at ~130,000. Still fine.

The explosion doesn't come from adding more dimensions. It comes from adding one dimension with unbounded values. And the most common culprit isn't a dimension you added at all. It's span.name itself.
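The contrast can be written out as a short worked calculation. Here S is the number of services, O the operations per service, K the span kinds, C the status codes, and d_i the number of distinct values of each added dimension. The one-million figure for the leaky service is a hypothetical, chosen only to show the scale:

```latex
\begin{align*}
N_{\text{calls}} &\approx S \times O \times K \times C \times \prod_i d_i \\
N_{\text{defaults}} &= 40 \times 30 \times 3 \times 3 = 10{,}800 \\
N_{\text{with status code}} &= 10{,}800 \times 12 = 129{,}600 \\
N_{\text{one leaky service}} &\ge 1{,}000{,}000 \quad \text{(one series per distinct span name)}
\end{align*}
```

A bounded dimension multiplies the total by a small constant. An unbounded one replaces a factor of 30 with a number that grows with traffic, so a single service can outweigh the other 39 combined.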

The OpenTelemetry semantic conventions require span names to be low cardinality: GET /product/{id}, not GET /product/1YMWWN1N4O. But plenty of instrumentation gets this wrong and bakes the raw path into the name. When that happens, every product ID, every session token, every ?_ga=GA1.2.569539246.1760114706 query string becomes its own value of span.name, and therefore its own row in every metric the connector produces. The README puts it plainly:

High cardinality issues in span metrics commonly manifest in APM dashboards as an excessive number of service operations with non-unique names. Examples include URIs with unique identifiers (e.g., GET /product/1YMWWN1N4O) or HTTP parameters with random values.

Your operation list fills up with thousands of one-hit-wonder entries, your dashboards turn to mush, and your metrics bill starts tracking the number of unique URLs your users happen to hit. Here's how to stop it, roughly in the order you should reach for each.

Fix it at the source: normalize the span name

The right fix is to make the span name comply with semantic conventions before the connector ever sees it. The Transform processor ships an OTTL function built for exactly this, set_semconv_span_name, which rewrites each span's name to the low-cardinality form the spec prescribes (GET /product/{id}) using the span's semconv attributes:

processors:
  transform/sanitize_spans:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set_semconv_span_name("1.37.0")   # arg = semconv version to target

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/sanitize_spans]   # sanitize BEFORE the connector taps the stream
      exporters: [otlp/traces, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlphttp/metrics]

The placement is the whole point. The transform runs in the traces pipeline, ahead of the connector in the exporter fan-out, so the connector counts the sanitized name. error_mode: ignore keeps a span that's missing the expected attributes from failing the batch; it just keeps its original name and moves on.

This is also the only lever that fixes what the data means, not just how much of it there is. A normalized operation list is one you can actually read.
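For spans that lack the attributes set_semconv_span_name relies on, a hand-written pattern can act as a stopgap. The sketch below is an assumption-laden example rather than a general recipe: it targets the checkout scenario from earlier, assumes cart IDs are lowercase hex, and uses a placeholder name ({cart_id}) invented for illustration. replace_pattern is a standard OTTL editor; the regex is the part you would adapt:

```yaml
processors:
  transform/sanitize_spans:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set_semconv_span_name("1.37.0")
          # Stopgap for names the statement above leaves untouched:
          # "POST /cart/8f2c9a/checkout" becomes "POST /cart/{cart_id}/checkout"
          - replace_pattern(name, "/cart/[0-9a-f]+/checkout", "/cart/{cart_id}/checkout")
```

Names that are already normalized don't match the pattern, so the second statement is a no-op for them.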

The circuit breaker: cap the cardinality

Sometimes you can't fix the instrumentation today: third-party libraries, a service you don't own. For that, the connector gives you a hard ceiling. aggregation_cardinality_limit caps the number of unique dimension combinations it will track:

connectors:
  spanmetrics:
    aggregation_cardinality_limit: 10000

Once the limit is hit, new combinations don't get their own series. They're all folded into a single overflow entry tagged otel.metric.overflow="true". Your series count stops growing and your bill stops climbing. (This needs Collector 0.130.0 or later.)

The limit is a fuse, not a fix. Overflowed data lands in one undifferentiated bucket, so you protect cost and Collector memory but lose the detail for everything past the cap. And you've still got meaningless operation names underneath. Treat aggregation_cardinality_limit as a safety net while you ship the real fix above, not as a substitute for it.

If you're on an older Collector you may still see dimensions_cache_size in configs and docs — it's deprecated in favor of aggregation_cardinality_limit. Use the limit.

Prune what you collect

Two more levers shrink the multiplication directly:

  • exclude_dimensions drops default dimensions you don't need. If you never slice by span.kind, removing it divides your series count by however many kinds you emit.
  • resource_metrics_key_attributes pins which resource attributes define a metric's identity. Without it, a deploy that changes a volatile resource attribute (a pod name, a build hash) can fork every series into a new one. Restrict identity to the stable attributes you actually group by:

connectors:
  spanmetrics:
    exclude_dimensions: [span.kind]
    resource_metrics_key_attributes:
      - service.name
      - service.namespace
      - telemetry.sdk.language

Shed dead series over time

Even with clean dimensions, series pile up. An operation that ran during last week's deploy is still a cumulative series the connector keeps reporting forever. Expiration lets it forget series it hasn't seen recently:

connectors:
  spanmetrics:
    metrics_expiration: 5m     # drop a metric with no new spans for 5m
    series_expiration: 5m      # drop individual stale series

Both default to 0 (never expire). Setting them trades a little long-tail history for a metrics store that reflects what's actually serving traffic right now.

Cumulative vs delta. The connector emits AGGREGATION_TEMPORALITY_CUMULATIVE by default, which is what Prometheus-style backends expect. If your backend wants delta temporality, set aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA. Delta also pairs better with expiration, since stale series just stop reporting instead of lingering at their last cumulative value.
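A minimal sketch of the delta variant, assuming your backend accepts delta temporality (many Prometheus-style stores do not, so confirm before switching):

```yaml
connectors:
  spanmetrics:
    aggregation_temporality: "AGGREGATION_TEMPORALITY_DELTA"
    metrics_expiration: 5m     # same expiration as above; the 5m value is illustrative
```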

Build it in Telflo

You can wire the whole thing visually — the traces pipeline, the sanitizing transform, the connector, the metrics pipeline — and watch the fan-out resolve as you connect nodes.

Get the span metrics template in Telflo →

Test the fix before you ship it

Eyeballing the operation list works once; a test works every deploy. Telflo's testing panel runs your pipeline against a fixture and checks validation rules against the output — and the one that matters here is the cardinality rule: it counts the distinct values a JSONPath resolves to and compares that to a threshold.

The rule that pins this fix:

cardinality of  $..spans[*].name  ≤ 1

Distinct span names are the number of calls / duration series the connector emits for that route, so the rule reads: "this route collapses to a single series." Run the same five-request fixture through the pipeline twice — once without the transform, once with it:

Cardinality test result: without set_semconv_span_name the span-name cardinality rule fails with 5 distinct values found; with it, the rule passes with 1 distinct value. The record_count rule stays 5 in both runs, so volume is unchanged.

  • Without the transform: five distinct span names, so the rule fails — expected at most 1 distinct value for $..spans[*].name, found 5.
  • With set_semconv_span_name: every name normalizes to GET /product/{id}, one distinct value, the rule passes.

A companion record_count = 5 rule stays green in both runs. That's the whole story in one line: the span count never moved — only the cardinality did. Wire the rule into the config once and any future change that reintroduces an unbounded name fails the test instead of your bill.

Verifying

After the connector is live, look for these in your metrics backend:

  • traces.span.metrics.calls and traces.span.metrics.duration, each carrying service.name, span.name, span.kind, and status.code.
  • An operation list (span.name values) that reads like routes, GET /product/{id}, not like a log of individual requests. If you still see raw IDs, your transform isn't running ahead of the connector.
  • A flat or slowly-growing series count. If it climbs without bound, find the unbounded dimension: it's almost always span.name, occasionally a URL or user ID you promoted by hand.
  • No otel.metric.overflow="true" series. If you do see them, that's the alarm that you've hit your cap and the real fix is still pending.
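The last two checks can be automated. The sketch below expresses them as Prometheus alerting rules, under the same naming assumptions as any Prometheus-style setup: that the calls metric is stored as traces_span_metrics_calls_total and that the overflow attribute arrives as the label otel_metric_overflow. The alert names and the 50,000 threshold are hypothetical; pick a threshold from your own baseline:

```yaml
groups:
  - name: spanmetrics_cardinality        # hypothetical group name
    rules:
      # The connector hit aggregation_cardinality_limit and is folding series together
      - alert: SpanMetricsOverflow
        expr: count(traces_span_metrics_calls_total{otel_metric_overflow="true"}) > 0
        for: 5m
        labels:
          severity: warning
      # Total calls series count is above the expected ceiling
      - alert: SpanMetricsSeriesCountHigh
        expr: count(traces_span_metrics_calls_total) > 50000
        for: 15m
        labels:
          severity: warning
```

To find the offender when the second alert fires, a query such as count by (service_name) (traces_span_metrics_calls_total) usually points at a single service.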

Get the dimensions right and the spanmetrics connector is close to free: RED metrics for every service, derived from spans you were already paying to collect. Get them wrong and it's the most expensive line on your observability bill. The difference is one transform processor, placed one step before the connector.
