eBPF-Based APM for Kotlin Backend Services: Zero-Instrumentation Latency Profiling, Continuous CPU Flame Graphs, and the Observability Pipeline That Replaces Your OpenTelemetry Agent
TL;DR
eBPF lets you profile Kotlin/JVM backend services at the kernel level. No SDK dependencies, no code changes, no restarts. Tools like Grafana Beyla and Pyroscope attach to kernel probes and perf events and use JVM perf-map files for symbol resolution, producing continuous CPU flame graphs at a fraction of the overhead of a traditional OpenTelemetry Java agent. Running this on production Kotlin services, we cut observability-related CPU overhead by 60-70% while catching tail-latency regressions that our old setup missed entirely.
The problem with agent-based instrumentation
Most teams get observability wrong by treating it as an application concern. You add the OpenTelemetry Java agent, sprinkle in some custom spans, wire up a collector, and call it a day. But that agent is a -javaagent bytecode transformer running inside your JVM. It shares your heap, your GC pauses, and your thread pool.
For Kotlin backend services running on coroutines, this creates a specific problem. The OTel agent’s context propagation was designed around threads, not structured concurrency. You end up fighting the instrumentation library instead of observing your actual application.
eBPF sidesteps this entirely. It runs in kernel space, attached to syscall tracepoints and kprobes, completely outside your JVM process.
How eBPF profiling works on the JVM
The pipeline has three layers:
- Kernel-level hooks: eBPF programs attach to perf_event for CPU sampling, or to tracepoints like sys_enter_write/sys_enter_read for I/O profiling
- Stack unwinding: The BPF program walks the stack using frame pointers or DWARF info
- JVM symbolization: The JVM’s -XX:+PreserveFramePointer flag keeps frame pointers intact in JIT-compiled code so the stack can be walked, and perf-map-agent (or the built-in diagnostic option -XX:+DumpPerfMapAtExit in JDK 17+) produces /tmp/perf-<pid>.map files that map JIT-compiled addresses back to Kotlin method names
This means you see com.myapp.service.OrderService.processPayment in your flame graphs, not 0x7f3a2b1c4d50.
Tool comparison: eBPF vs. OpenTelemetry agent
| Dimension | OpenTelemetry Java Agent | Grafana Beyla (eBPF) | Pyroscope (eBPF) |
|---|---|---|---|
| Code changes required | None (agent attach) | None | None |
| JVM restart required | Yes | No | No (after the one-time JVM flag rollout) |
| Typical CPU overhead | 3-8% | <1% | 1-2% |
| Memory overhead | 50-150 MB heap | ~10 MB (kernel) | ~20 MB |
| Kotlin coroutine-aware | Partial (requires extensions) | N/A (kernel-level) | N/A (kernel-level) |
| Continuous profiling | Requires additional setup | Built-in | Built-in |
| Distributed tracing | Full support | HTTP/gRPC auto-detection | Not primary focus |
| Flame graphs | Via additional exporters | Via Grafana integration | Native |
The overhead difference is not marginal. It is the difference between profiling being “something we turn on during incidents” and “something that runs continuously in production.” That shift changes how you think about performance work.
Building the continuous profiling pipeline
This is the architecture I have deployed across multiple Kotlin microservice environments:
┌─────────────┐     ┌──────────────┐     ┌───────────────┐
│ Kotlin/JVM  │────▶│  eBPF Agent  │────▶│  Pyroscope /  │
│ Service     │     │  (kernel)    │     │ Grafana Cloud │
│ + perf-map  │     │              │     │               │
└─────────────┘     └──────────────┘     └───────────────┘
       │                                         │
       │ JVM flag:                               │
       │ -XX:+PreserveFramePointer               ▼
       │                                  ┌──────────────┐
       └─────────────────────────────────▶│ Alert on P99 │
                                          │  regression  │
                                          └──────────────┘
The JVM flags you need for your Kotlin services:
-XX:+PreserveFramePointer
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints
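PreserveFramePointer costs roughly 1-2% CPU on modern JVMs, a well-documented tradeoff. DebugNonSafepoints is a diagnostic option, which is why UnlockDiagnosticVMOptions is on the list. It makes the JIT record debug information at non-safepoint locations, so profiling samples resolve to the code actually executing, not the nearest safepoint.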
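One easy way to lose these flags is a deployment manifest that silently drops or reorders them. The sketch below checks a JVM options string for the three flags. It assumes HotSpot wants -XX:+UnlockDiagnosticVMOptions to appear before the diagnostic options it unlocks, which matches my understanding of how HotSpot parses arguments. check_profiling_flags is a hypothetical helper, not part of any tool mentioned here.

```python
import shlex
from typing import List

PRESERVE_FP = "-XX:+PreserveFramePointer"
UNLOCK_DIAG = "-XX:+UnlockDiagnosticVMOptions"
DEBUG_NON_SAFEPOINTS = "-XX:+DebugNonSafepoints"
REQUIRED = [PRESERVE_FP, UNLOCK_DIAG, DEBUG_NON_SAFEPOINTS]


def check_profiling_flags(java_opts: str) -> List[str]:
    """Return a list of problems; an empty list means the flags look fine."""
    args = shlex.split(java_opts)
    problems = [f"missing {flag}" for flag in REQUIRED if flag not in args]
    # Assumption: the unlock flag should come before the option it unlocks.
    if UNLOCK_DIAG in args and DEBUG_NON_SAFEPOINTS in args:
        if args.index(UNLOCK_DIAG) > args.index(DEBUG_NON_SAFEPOINTS):
            problems.append(f"{UNLOCK_DIAG} should precede {DEBUG_NON_SAFEPOINTS}")
    return problems


opts = "-Xmx2g -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"
print(check_profiling_flags(opts))
```

Running a check like this against the rendered JAVA_TOOL_OPTIONS value in CI is cheap insurance against a refactor quietly turning your flame graphs back into hex addresses.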
Catching tail-latency regressions
Continuous differential flame graphs are where this gets interesting. When a new deployment rolls out, you automatically compare the flame graph profile of the canary against the baseline. If P99 latency shifts or a new hot path appears in your Kotlin coroutine dispatchers, the alert fires before the rollout completes.
This works differently from threshold-based alerting. You are not waiting for SLOs to breach. You are detecting the structural change in execution patterns that will breach them.
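As a rough illustration of what "structural change" means, the sketch below compares two profiles in the folded-stack text format (semicolon-separated frames followed by a sample count) and reports frames whose share of total samples grew by at least a threshold. The frame names, counts, and 5-point threshold are invented for the example, and this is a simplification of what a real differential flame graph computes.

```python
from collections import Counter
from typing import Dict


def parse_folded(text: str) -> Counter:
    """Parse folded stacks ('frame;frame;frame count') into stack -> samples."""
    stacks: Counter = Counter()
    for line in text.splitlines():
        stack, _, count = line.rpartition(" ")
        if stack and count.isdigit():
            stacks[stack] += int(count)
    return stacks


def frame_shares(stacks: Counter) -> Counter:
    """Fraction of all samples in which each frame appears on the stack."""
    total = sum(stacks.values())
    shares: Counter = Counter()
    for stack, count in stacks.items():
        for frame in set(stack.split(";")):  # count recursion once per stack
            shares[frame] += count / total
    return shares


def regressions(baseline: Counter, canary: Counter, min_delta: float = 0.05) -> Dict[str, float]:
    """Frames whose sample share grew by at least min_delta in the canary."""
    base, can = frame_shares(baseline), frame_shares(canary)
    return {f: round(can[f] - base[f], 4) for f in can if can[f] - base[f] >= min_delta}


# Hypothetical profiles: the canary spends more time in a serialization frame.
BASELINE = "main;handle;OrderService.processPayment 80\nmain;handle;OrderCodec.encode 20\n"
CANARY = "main;handle;OrderService.processPayment 60\nmain;handle;OrderCodec.encode 40\n"

print(regressions(parse_folded(BASELINE), parse_folded(CANARY)))
```

Comparing shares rather than raw sample counts keeps the comparison meaningful when the canary receives only a small slice of traffic and therefore far fewer samples than the baseline.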
I saw this pay off firsthand when it caught a Kotlin serialization regression: a single kotlinx.serialization codec change that added 12ms at P99. The alert fired within the first 5% of a canary rollout. Traditional metrics-based alerting would not have flagged it until the full deployment was live and customer-facing latency had already degraded. That is the kind of thing that makes you wonder how many similar regressions you shipped before without noticing.
When NOT to use eBPF profiling
eBPF is not a full replacement for OpenTelemetry. It does not give you distributed trace context propagation, custom business metrics, or structured log correlation out of the box. If you need to trace a request across 15 microservices, you still need distributed tracing.
The right architecture is layered: eBPF for continuous profiling and system-level observability, and a lightweight OTel SDK (not the full agent) for distributed tracing where you actually need it. Trying to pick one or the other is a false choice.
What to do next
Add -XX:+PreserveFramePointer and -XX:+DebugNonSafepoints (together with -XX:+UnlockDiagnosticVMOptions, which the latter requires) to your JVM flags now. These are prerequisites for accurate JVM stack traces under eBPF profiling, and there is no reason to wait. Deploy them so the data is ready when you need it.
Once the flags are in place, start with Grafana Beyla for HTTP/gRPC auto-instrumentation. It requires zero application changes, runs as a sidecar or DaemonSet, and gives you request-level latency metrics from kernel space. I had it running in under an hour.
Then build differential flame graph comparisons into your CI/CD pipeline. Continuous profiling only pays off when you automate the comparison. Wire canary profile diffs into your deployment gates so regressions get caught before they reach full production traffic, not after.
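A deployment gate can start out very simple. The sketch below computes a nearest-rank P99 over latency samples for baseline and canary and fails the gate when the canary's P99 exceeds the baseline by more than a budget. The sample data, the 5 ms budget, and the names percentile and gate are assumptions for illustration. In practice the samples would come from whatever your metrics backend exposes.

```python
import math
from typing import List, Tuple


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def gate(baseline_ms: List[float], canary_ms: List[float],
         max_p99_increase_ms: float = 5.0) -> Tuple[bool, float]:
    """Return (passed, p99_delta_ms) for a canary compared with the baseline."""
    delta = percentile(canary_ms, 99) - percentile(baseline_ms, 99)
    return delta <= max_p99_increase_ms, delta


# Hypothetical latency samples in milliseconds.
baseline = [10.0] * 98 + [40.0, 40.0]
canary = [10.0] * 98 + [52.0, 52.0]

passed, delta = gate(baseline, canary)
print(passed, delta)
```

A pipeline step would exit non-zero when passed is False and halt the rollout. Pairing a latency gate like this with the profile diff tells you both that something regressed and which frames are responsible.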