MVP Factory
ai startup development

Compile-Time Memory Layout Optimization for On-Device ML Models: How ART Profile-Guided Allocation and Object Pinning Cut GC Pauses During Inference by 90%

KW
Krystian Wiewiór · · 4 min read

The problem: GC stalls kill inference latency

When you run an ML model on-device — whether TFLite, ONNX Runtime, or MediaPipe — the inference pipeline generates rapid bursts of intermediate tensor allocations. On ART, these allocations land in RegionSpace, the primary managed heap region used by the Concurrent Copying (CC) collector.

The CC collector is designed for low-pause collection, but it has a weakness: when allocation rate spikes exceed the collector’s concurrent reclamation rate, ART triggers blocking GC pauses. During inference, I’ve seen these pauses range from 5ms to 40ms — enough to blow through a 16ms frame budget and cause visible jank.

After building production systems that run real-time inference alongside UI rendering, I’m convinced the fix isn’t faster models. It’s smarter memory layout.

How ART’s CC collector behaves during inference

Here’s the allocation path that matters:

Allocation eventWhere it landsGC risk
Small tensors (<12KB)RegionSpace TLABLow — thread-local, fast
Medium tensors (12KB-128KB)RegionSpace shared regionsMedium — contention + region exhaustion
Large tensors (>128KB)Large Object Space (LOS)High — LOS collections are expensive
JNI native buffersNative heap (outside ART)None — invisible to GC

The thing that bit me hardest: most inference frameworks allocate intermediate buffers in the 16KB-256KB range. That’s the danger zone where RegionSpace fills quickly and LOS triggers costly collections.

Strategy 1: Profile-guided allocation hints

ART’s profile-guided compilation (PGC) doesn’t just optimize hot methods. Since Android 9, baseline profiles can influence allocation behavior by marking hot allocation sites for pre-tenuring or region pre-sizing.

// In your baseline profile rules (baseline-prof.txt)
// Mark inference-heavy classes for optimized allocation
HSPLcom/myapp/ml/InferenceSession;->runInference([F)[F
HSPLcom/myapp/ml/TensorBuffer;-><init>(I)V

By ensuring your inference pipeline classes appear in baseline profiles, ART compiles them with optimized allocation sequences that reduce TLAB overflow and region contention. This alone can cut minor GC events by 30-40% during inference bursts.

Strategy 2: Large object space pinning

For tensors that must live on the managed heap, pinning prevents the CC collector from relocating them during concurrent copying — eliminating the copy overhead for large, short-lived buffers:

// Use direct ByteBuffers for large tensor I/O
val inputBuffer = ByteBuffer.allocateDirect(modelInputSize * 4)
    .order(ByteOrder.nativeOrder())

// These live in native memory, completely outside ART's GC
val outputBuffer = ByteBuffer.allocateDirect(modelOutputSize * 4)
    .order(ByteOrder.nativeOrder())

Direct ByteBuffer allocations bypass RegionSpace entirely. For buffers that must remain as managed objects, the sun.misc.Unsafe-based pinning APIs available through ART internals prevent relocation during CC phases.

Strategy 3: JNI boundary as GC firewall

This is where most teams go wrong: they run inference through managed Kotlin wrappers that create dozens of intermediate managed objects per frame. The fix is pushing the entire inference pipeline below the JNI boundary.

class NativeInferenceEngine {
    // All tensor allocation happens in native heap
    external fun initModel(modelPath: String): Long  // returns native handle
    external fun runInference(handle: Long, input: FloatArray): FloatArray
    
    // Only crossing JNI for input/output —
    // intermediate tensors never touch managed heap
    external fun releaseModel(handle: Long)
}
StrategyGC pause reductionImplementation effort
Baseline profile hints30-40%Low — profile rules only
Direct ByteBuffer for I/O50-60%Medium — buffer management
Full JNI-boundary isolation80-90%High — native pipeline
All three combined~90%High — but worth it for real-time inference

RegionSpace tuning for remaining managed allocations

For managed allocations you can’t eliminate, tune RegionSpace behavior through system properties on debug builds or through ART runtime flags:

  • Larger regions (512KB vs default 256KB) reduce region exhaustion during bursts
  • Increasing thread-local allocation buffer size absorbs more burst allocations before falling back to shared regions
  • Adjusting the CC collector urgency threshold prevents premature blocking collections

What to do with all this

Profile your inference allocation pattern first. Use adb shell setprop dalvik.vm.gcstats 1 to capture allocation rates during inference. Target the 12KB-256KB range — that’s where GC pressure concentrates.

Then push tensor buffers below the JNI boundary. Direct ByteBuffer for I/O, native allocation for intermediates. Every tensor you keep off the managed heap is a GC pause you’ll never see.

And ship baseline profiles that cover your inference path. This is the lowest-effort, highest-impact change you can make. ART’s compiler generates better allocation code for profiled methods, and most teams simply forget to include ML pipeline classes in their profile rules.

The managed heap isn’t your enemy — uncontrolled allocation patterns are. Control the pattern, and GC pauses during inference stop being a problem.


Tags: android, kotlin, architecture, mobile, kmp


Share: Twitter LinkedIn