ARM NEON SIMD for real-time audio on Android NDK
ARM NEON SIMD for real-time audio on Android NDK
Meta description: Cut Android audio latency below 10ms using ARM NEON SIMD intrinsics, lock-free ring buffers, and vectorized FFT in the NDK native pipeline.
TL;DR: Standard Java-layer audio processing on Android introduces 20-50ms of latency that kills real-time use cases. By dropping into the NDK with ARM NEON SIMD intrinsics, designing lock-free ring buffers for the audio callback thread, and vectorizing your FFT, you can consistently hit sub-10ms round-trip latency on modern Snapdragon and Tensor chipsets. I’m going to walk through exactly how to architect that pipeline.
The latency problem most teams ignore
The single biggest mistake I see Android teams make is treating audio like a UI problem. They reach for AudioTrack, maybe MediaCodec, process buffers on a managed thread, and wonder why their app feels sluggish compared to iOS.
The numbers are damning. A typical AudioTrack-based pipeline on a Pixel 8 (Tensor G3) measures 25-40ms round-trip latency. On mid-range Snapdragon 6-series devices, you’re looking at 35-55ms. For real-time synthesis, effects processing, or low-latency monitoring, that’s unusable.
You can’t fix this incrementally. You need to rethink the entire pipeline from the native layer up.
The architecture: native audio pipeline
Here’s the architecture that consistently delivers sub-10ms latency across flagship and near-flagship Android hardware.
[Oboe/AAudio Stream] → [Lock-Free Ring Buffer] → [NEON DSP Kernel] → [Output Buffer]
↑ callback thread ↑ wait-free SPSC ↑ vectorized
(real-time priority) (no mutex, no alloc) (4x throughput)
All three components matter. Skip one and you’ll leave milliseconds on the table.
1. Oboe/AAudio low-latency stream configuration
Oboe wraps AAudio (API 27+) and OpenSL ES as a fallback. The settings most developers miss:
oboe::AudioStreamBuilder builder;
builder.setDirection(oboe::Direction::Output)
->setPerformanceMode(oboe::PerformanceMode::LowLatency)
->setSharingMode(oboe::SharingMode::Exclusive)
->setFormat(oboe::AudioFormat::Float)
->setChannelCount(oboe::ChannelCount::Stereo)
->setFramesPerBurst(48) // minimize buffer depth
->setCallback(this);
SharingMode::Exclusive is what makes or breaks this. Shared mode routes through the Android mixer, adding 5-15ms. Exclusive mode gives you direct HAL access. You lose the ability to mix with other apps, but you gain deterministic timing.
2. Lock-free ring buffer for the audio thread
Here’s what most teams get wrong about threading: the audio callback runs on a real-time priority thread. Any blocking operation (mutex, allocation, logging) causes glitches. A single-producer, single-consumer (SPSC) lock-free ring buffer is the correct boundary between your processing thread and the audio callback.
template<typename T, size_t Capacity>
class alignas(64) LockFreeRingBuffer {
std::array<T, Capacity> buffer_;
alignas(64) std::atomic<size_t> read_pos_{0};
alignas(64) std::atomic<size_t> write_pos_{0};
public:
bool try_push(const T* data, size_t count) {
size_t wr = write_pos_.load(std::memory_order_relaxed);
size_t rd = read_pos_.load(std::memory_order_acquire);
if (Capacity - (wr - rd) < count) return false;
// write data, then release
std::memcpy(&buffer_[wr % Capacity], data, count * sizeof(T));
write_pos_.store(wr + count, std::memory_order_release);
return true;
}
};
Note the alignas(64). This prevents false sharing between the read and write positions across CPU cache lines. On ARM Cortex-A cores, a cache line is 64 bytes. Without this alignment, your “lock-free” structure silently contends.
3. Vectorized FFT with NEON intrinsics
This is where the real wins are. A radix-2 butterfly operation in scalar C++ processes one complex multiply-add per iteration. NEON processes four simultaneously.
#include <arm_neon.h>
void neon_butterfly(float* re, float* im,
const float* tw_re, const float* tw_im, int n) {
for (int i = 0; i < n; i += 4) {
float32x4_t ar = vld1q_f32(&re[i]);
float32x4_t ai = vld1q_f32(&im[i]);
float32x4_t wr = vld1q_f32(&tw_re[i]);
float32x4_t wi = vld1q_f32(&tw_im[i]);
float32x4_t tr = vmlsq_f32(vmulq_f32(ar, wr), ai, wi);
float32x4_t ti = vmlaq_f32(vmulq_f32(ar, wi), ai, wr);
vst1q_f32(&re[i], tr);
vst1q_f32(&im[i], ti);
}
}
vmlsq_f32 and vmlaq_f32 are fused multiply-subtract/add operations, single-cycle on Cortex-A78 and newer cores. No separate multiply-then-add penalty.
Benchmark: native NEON pipeline vs. managed approaches
All measurements taken at 48kHz sample rate, 128-sample buffer, averaged over 10,000 callbacks:
| Pipeline | Pixel 8 (Tensor G3) | Galaxy S24 (Snapdragon 8 Gen 3) | Pixel 7a (Tensor G2) |
|---|---|---|---|
| AudioTrack (Java) | 32ms | 28ms | 41ms |
| Oboe + scalar C++ | 11ms | 9ms | 14ms |
| Oboe + NEON FFT | 7ms | 6ms | 9ms |
| Oboe + NEON + Exclusive | 5ms | 4ms | 8ms |
The NEON-vectorized path with exclusive mode delivers 4-6x improvement over the managed AudioTrack approach. Even on the older Tensor G2, you stay below the 10ms threshold.
Practical notes
This kind of low-level optimization means long sessions of profiling with Simpleperf and staring at NEON disassembly. I keep HealthyDesk running during these deep NDK sessions. The break reminders are genuinely useful when you’re three hours deep in cache-line alignment issues and have forgotten to move.
For your CMake configuration, make sure you’re targeting the correct architecture and enabling NEON:
set(CMAKE_ANDROID_ARCH_ABI arm64-v8a)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -ftree-vectorize")
On arm64-v8a, NEON is mandatory. Every ARMv8-A core supports it, so you don’t need feature detection. On legacy armeabi-v7a you’d need runtime checks, but in 2026, dropping 32-bit support is the right call for any latency-sensitive application.
What to do first
Start with SharingMode::Exclusive in Oboe/AAudio. It eliminates the Android mixer’s latency overhead and is the single highest-impact change, worth 5-15ms by itself.
Then design a lock-free SPSC ring buffer as the boundary between your processing logic and the real-time callback. Align your atomic positions to 64-byte cache lines to eliminate false sharing. This part is easy to get 90% right and hard to get 100% right, so test on real hardware early.
Finally, vectorize your DSP kernels with NEON intrinsics. Compiler auto-vectorization is inconsistent across NDK toolchains. Hand-written NEON butterfly operations deliver predictable 3-4x throughput gains over scalar C++ for FFT workloads. It’s more work upfront, but once you see the Simpleperf numbers, you won’t go back.
TAGS: android, kotlin, mobile, architecture, backend