Distributed tracing on a budget with OpenTelemetry and Grafana

Krystian Wiewiór · 5 min read

TL;DR: You don’t need Datadog’s $23-per-host pricing to get production visibility. An OpenTelemetry Collector with tail-based sampling, Grafana Tempo for traces, Loki for correlated logs, and Grafana dashboards gets you 90% of the APM experience at roughly 3% of the cost. This post covers the exact collector config, sampling policies, and trace-to-logs correlation that keeps storage under $50/month at 10,000 requests per minute.


The problem: observability costs scale faster than revenue

Most teams get observability wrong the same way: they instrument everything, ship everything, and then get a bill that rivals their compute costs. At 10k RPM, a naive trace-everything approach generates roughly 14.4 million traces per day (10,000 requests per minute × 1,440 minutes in a day). Datadog charges $31/million spans ingested after the free tier. Do the math.

Solution                            | Estimated monthly cost (10k RPM) | Trace retention         | Log correlation
Datadog APM                         | $800-$2,500+                     | 15 days                 | Built-in
New Relic                           | $500-$1,200+                     | 8 days (free tier caps) | Built-in
Grafana Cloud (free tier + storage) | $0-$80                           | 30 days                 | Manual setup
Self-hosted Grafana stack           | $30-$50 (storage only)           | 30+ days                | Config below

The self-hosted stack wins on cost, but you pay in configuration time.

The architecture: four components, one pipeline

Services → OTel SDK (auto-instrumentation)
         → OTel Collector (tail-based sampling)
         → Tempo (trace storage) + Loki (log storage)
         → Grafana (dashboards + correlation)
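
If you want to stand the pipeline up locally first, a minimal docker-compose sketch looks like this (image tags, ports, and config paths are illustrative, and Tempo and Loki each need their own config file, omitted here):

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest   # contrib build ships the tail_sampling processor
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from your services

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  loki:
    image: grafana/loki:latest   # runs with its bundled default config

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"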

Step 1: Auto-instrumentation with zero code changes

OpenTelemetry’s auto-instrumentation libraries cover most frameworks out of the box. For a typical Node.js or Kotlin/Spring backend:

# Node.js -- add to your entrypoint
node --require @opentelemetry/auto-instrumentations-node/register app.js

# Kotlin/Spring -- use the Java agent
java -javaagent:opentelemetry-javaagent.jar -jar your-service.jar

The Java agent automatically instruments Spring Web, gRPC, JDBC, Kafka, and HTTP clients. No code changes. I’ve found auto-instrumentation covers about 80% of what you need on day one.
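
Both the Node register and the Java agent read the standard OTEL_* environment variables, so pointing them at your collector is configuration, not code. A sketch (the service name and collector host are placeholders for your own values):

export OTEL_SERVICE_NAME=checkout-service              # how the service appears in Tempo and dashboards
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc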

Step 2: The collector config that controls costs

The important piece here is the OpenTelemetry Collector’s tail-based sampling processor. Unlike head-based sampling (which decides at trace start), tail-based sampling waits for the complete trace before deciding. You keep 100% of error traces and slow requests while aggressively sampling successful fast paths.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    decision_wait: 10s       # how long to hold a trace's spans before deciding
    num_traces: 50000        # traces kept in memory while waiting
    policies:
      # Keep 100% of traces that contain an error
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep 100% of traces slower than 2 seconds
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      # Drop health check, readiness, and metrics endpoint traces entirely
      - name: high-cardinality-filter
        type: string_attribute
        string_attribute:
          key: http.target
          values: ["/health", "/ready", "/metrics"]
          enabled_regex_matching: true
          invert_match: true
      # Sample 5% of everything else as a baseline
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
    decision_cache:
      sampled_cache_size: 100000   # remember sampled trace IDs so late-arriving spans are kept too

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]

What this config actually does: keeps every error and every request over 2 seconds, drops health check and metrics endpoint noise entirely, and samples only 5% of normal traffic. At 10k RPM, that reduces stored traces from ~14.4M/day to roughly 720k/day plus all errors and slow requests. Tempo’s storage cost at that volume sits comfortably under $30/month on S3-compatible object storage.

Step 3: Trace-to-log correlation

This is the pattern that replaces expensive APM tools. Inject the trace ID into every log line, then configure Grafana to link them.
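
How the trace ID lands in the log line depends on your logger. A minimal Node.js sketch using the OpenTelemetry API (the log format is illustrative, but it matches the regex used in the derived field below):

const { trace } = require("@opentelemetry/api");

function logWithTrace(msg) {
  // Pull the trace ID from whatever span is currently active
  const span = trace.getActiveSpan();
  const traceID = span ? span.spanContext().traceId : "none";
  console.log(`traceID=${traceID} msg="${msg}"`);
}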

In your Loki logging config, include the traceID field as a label or structured metadata. Then in Grafana, set up a derived field on your Loki data source:

Name: TraceID
Regex: traceID=(\w+)
Internal link → Target data source: Tempo
Query: ${__value.raw}
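
If you provision Grafana data sources as code, the same derived field can be declared in a provisioning file instead of the UI. A sketch, assuming a Tempo data source with uid "tempo" and default local URLs:

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"
          url: "$${__value.raw}"      # $$ escapes env-var interpolation in provisioning files
          datasourceUid: tempo        # opens the matched ID as a Tempo trace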

Clicking any trace ID in your logs now jumps directly to the full distributed trace in Tempo. This one correlation pattern covers most of what teams actually use Datadog for day to day. Honestly, if you only set up one thing from this post, make it this.

Step 4: The dashboard that tells you what matters

Build a Grafana dashboard with these panels sourced from Tempo’s metrics-generator:

  • R.E.D. metrics (Rate, Error rate, Duration) from traces_spanmetrics_latency_bucket (example queries below)
  • Service map using Tempo’s built-in service graph
  • Top-N slow endpoints via TraceQL: {status = error} | avg(duration) > 1s
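
Assuming the metrics-generator remote-writes into a Prometheus-compatible data source (it needs a remote_write target configured, and metric/label names can vary by Tempo version), the R.E.D. panels can be driven by queries along these lines:

# Rate: requests per second, per service
sum(rate(traces_spanmetrics_calls_total[5m])) by (service)

# Errors: error ratio per service
sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service)
  / sum(rate(traces_spanmetrics_calls_total[5m])) by (service)

# Duration: p95 latency per service
histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))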

Storage budget breakdown

Component      | Storage backend         | Monthly cost
Tempo traces   | S3/MinIO (~50 GB)       | ~$20
Loki logs      | S3/MinIO (~80 GB)       | ~$25
Grafana        | Stateless (no storage)  | $0
OTel Collector | Stateless               | $0
Total          |                         | ~$45/month

Where to start

Start with tail-based sampling from day one. Retrofitting sampling policies after you’ve already committed to a vendor is painful. The collector config above is ready to drop into your setup and immediately cuts trace volume by 90%+ while keeping every trace that actually matters.

Instrument first, optimize later. Auto-instrumentation libraries give you immediate coverage. Add manual spans for business-critical paths once the baseline pipeline is running.
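
A manual span is only a few lines with the OpenTelemetry API. A Node.js sketch (the tracer name, paymentProvider call, and attribute are hypothetical):

const { trace, SpanStatusCode } = require("@opentelemetry/api");
const tracer = trace.getTracer("checkout");

async function chargeCard(order) {
  return tracer.startActiveSpan("charge-card", async (span) => {
    try {
      span.setAttribute("order.id", order.id);
      return await paymentProvider.charge(order);      // hypothetical payment call
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });  // picked up by the errors-always policy
      throw err;
    } finally {
      span.end();
    }
  });
}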

Set up trace-to-log correlation before you build dashboards. A single derived field in Grafana connecting Loki logs to Tempo traces replaces the core workflow that teams pay thousands per month for. It’s the single most valuable thing you can wire up in this whole stack.


TAGS: devops, backend, cloud, architecture, docker

