MVP Factory

ARM NEON SIMD Intrinsics for Mobile Text Embedding: Building a Sub-10ms Semantic Search Pipeline That Runs Entirely On-Device

Krystian Wiewiór · 5 min read

TL;DR

By replacing ONNX Runtime with hand-tuned ARM NEON SIMD kernels for int8 quantized matrix multiplication, you can run small embedding models like E5-small entirely on-device and hit sub-10ms query latency over 100K+ document indices. I’ll walk through the architecture, the NEON intrinsics that do the heavy lifting, and benchmarks showing this approach beats generic runtime inference by 3-5x on modern ARM chips.


The problem with runtime-based on-device inference

Most teams reaching for on-device semantic search default to ONNX Runtime or TFLite as their inference backend. These are solid general-purpose tools, but they carry overhead that matters at the margins mobile demands. Building production systems that serve real-time search on resource-constrained devices, I’ve found that generic runtime dispatch, memory allocation patterns, and operator fusion gaps in these frameworks leave real performance on the table.

For a semantic search pipeline, the bottleneck is clear: the embedding forward pass (specifically the dense matrix multiplications in transformer layers) and the subsequent dot-product similarity scan across your index. Both are embarrassingly parallelizable, and that’s exactly what ARM NEON is built for.

Architecture overview

The pipeline breaks into three stages:

| Stage | Operation | Target latency |
| --- | --- | --- |
| Tokenization | BPE tokenize query string | < 1ms |
| Embedding | Int8 quantized forward pass via NEON GEMM | < 6ms |
| Search | Vectorized dot-product over 100K embeddings | < 3ms |

The key architectural decision: bypass the inference runtime entirely for the embedding step and write NEON-native GEMM (General Matrix Multiply) kernels that operate on pre-quantized int8 weights.
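The pipeline relies on pre-quantized int8 weights, but the quantization step itself is not shown. As a minimal sketch, assuming symmetric per-tensor quantization (one common scheme, not necessarily the one behind the benchmarks in this post), the conversion and the matching rescale of int32 accumulators could look like this. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Symmetric per-tensor quantization (assumed scheme):
//   scale = max|x| / 127,  q = round(x / scale), clamped to [-127, 127].
// Returns the scale needed to map the int8 values back to floats.
float quantize_symmetric_int8(const float *src, int8_t *dst, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = src[i] < 0.0f ? -src[i] : src[i];
        if (a > max_abs) max_abs = a;
    }
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = src[i] / scale;
        // Round half away from zero without pulling in libm.
        int q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f));
        if (q > 127) q = 127;
        if (q < -127) q = -127;
        dst[i] = (int8_t)q;
    }
    return scale;
}

// An int32 accumulator from an int8 x int8 GEMM maps back to float by
// multiplying with the scales of both operands.
float dequantize_acc(int32_t acc, float scale_a, float scale_b) {
    return (float)acc * scale_a * scale_b;
}
```

Weights would be quantized once offline and shipped as int8 plus their scale; activations are quantized per call, and each GEMM output is rescaled with `dequantize_acc` before the next floating-point operation.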

The NEON kernels that matter

ARM NEON gives you 128-bit SIMD registers, processing 16 int8 values simultaneously. For quantized matrix multiplication, these are the intrinsics you care about:

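#include <arm_neon.h>
#include <stdint.h>

// Core int8 dot-product accumulation kernel: C[i][j] = dot(A row i, B row j).
// A is M x K row-major. B is stored transposed (N x K row-major) so both
// operands stream contiguously. K must be a multiple of 16.
void neon_gemm_int8(const int8_t* A, const int8_t* B,
                     int32_t* C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int32x4_t acc = vdupq_n_s32(0);
            for (int k = 0; k < K; k += 16) {
                int8x16_t a_vec = vld1q_s8(&A[i * K + k]);
                int8x16_t b_vec = vld1q_s8(&B[j * K + k]);
                // Widening multiply: int8 x int8 -> int16, cannot overflow
                int16x8_t prod_lo = vmull_s8(vget_low_s8(a_vec),
                                              vget_low_s8(b_vec));
                int16x8_t prod_hi = vmull_s8(vget_high_s8(a_vec),
                                              vget_high_s8(b_vec));
                // Pairwise-add adjacent int16 products into the int32 lanes
                acc = vpadalq_s16(acc, prod_lo);
                acc = vpadalq_s16(acc, prod_hi);
            }
            // The four lanes are partial sums of ONE output element,
            // so reduce them horizontally before storing.
            C[i * N + j] = vaddvq_s32(acc);
        }
    }
}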

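On devices with the dot-product extension (optional in ARMv8.2, mandatory from ARMv8.4, and present on most phones shipped since 2019), you also get vdotq_s32, a fused dot-product instruction. Each call multiplies 16 int8 pairs and accumulates every group of four products into one of four int32 lanes, replacing the two widening multiplies and two pairwise accumulates in the kernel above with a single instruction: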

// ARMv8.2+ dot product path (a_vec, b_vec: int8x16_t, loaded as in the kernel above)
int32x4_t acc = vdupq_n_s32(0);
acc = vdotq_s32(acc, a_vec, b_vec);  // 16 multiplies; each int32 lane sums 4 products

This single intrinsic is the difference between “workable” and “instant” on modern silicon.
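To make the lane layout concrete, here is a hedged sketch of a full int8 dot product. It uses vdotq_s32 when the compiler targets the dot-product extension (signalled by the ACLE macro `__ARM_FEATURE_DOTPROD`) and otherwise falls back to a scalar model with the same lane grouping. The function name `dot_int8` is hypothetical, and the length is assumed to be a multiple of 16.

```c
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

// Dot product of two int8 vectors of length k (k must be a multiple of 16).
int32_t dot_int8(const int8_t *a, const int8_t *b, int k) {
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < k; i += 16) {
        // Lane n accumulates the products of elements 4n .. 4n+3.
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(acc);  // horizontal reduction of the four lanes
#else
    // Scalar model of the same grouping, used when the build does not
    // target the dot-product extension.
    int32_t lanes[4] = {0, 0, 0, 0};
    for (int i = 0; i < k; i += 16) {
        for (int lane = 0; lane < 4; lane++) {
            for (int e = 0; e < 4; e++) {
                int idx = i + 4 * lane + e;
                lanes[lane] += (int32_t)a[idx] * (int32_t)b[idx];
            }
        }
    }
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
}
```

Both branches produce the same result, which makes the scalar branch a convenient reference when testing the intrinsic path on a device.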

Similarity search: vectorized dot product at scale

Once you have your query embedding (typically 384 dimensions for E5-small), scanning 100K pre-computed document embeddings becomes a vectorized dot-product problem. NEON keeps this under 3ms:

#include <arm_neon.h>

// dim must be a multiple of 4 (384 is).
float neon_dot_f32(const float* a, const float* b, int dim) {
    float32x4_t sum = vdupq_n_f32(0.0f);
    for (int i = 0; i < dim; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        sum = vfmaq_f32(sum, va, vb);  // fused multiply-add
    }
    return vaddvq_f32(sum);  // horizontal reduction
}

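For 100K documents at 384 dimensions, that’s 38.4M multiply-adds. With one 4-lane FMA per cycle at 2.5 GHz on a typical big core, that is 9.6M cycles, or about 3.8ms. We consistently beat that in practice, most likely because big cores can issue more than one vector FMA per cycle and the hardware prefetcher copes well with a sequential scan. Note that an fp32 index of this size (100K × 384 × 4 bytes ≈ 154MB) is far too large for cache, so the scan is ultimately bounded by memory bandwidth rather than L1 locality.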
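A search needs the best few matches rather than 100K raw scores. The following is a minimal sketch of a top-k scan, assuming a flat row-major float32 index and a small k; `top_k_scan` and `dot_f32` are hypothetical names. The dot product uses NEON on AArch64 and a scalar loop elsewhere, so the logic can be checked on a development machine.

```c
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Dot product; dim must be a multiple of 4 on the NEON path.
static float dot_f32(const float *a, const float *b, int dim) {
#if defined(__aarch64__)
    float32x4_t sum = vdupq_n_f32(0.0f);
    for (int i = 0; i < dim; i += 4)
        sum = vfmaq_f32(sum, vld1q_f32(a + i), vld1q_f32(b + i));
    return vaddvq_f32(sum);
#else
    float sum = 0.0f;
    for (int i = 0; i < dim; i++) sum += a[i] * b[i];
    return sum;
#endif
}

// Scans n_docs embeddings and keeps the k highest-scoring document ids,
// best first, in out_ids/out_scores (each with room for k entries).
// Returns the number of results written (min(k, n_docs)).
size_t top_k_scan(const float *query, const float *index, size_t n_docs,
                  int dim, size_t k, size_t *out_ids, float *out_scores) {
    if (k == 0) return 0;
    size_t found = 0;
    for (size_t d = 0; d < n_docs; d++) {
        float s = dot_f32(query, index + d * (size_t)dim, dim);
        // Once the list is full, skip anything not better than the worst.
        if (found == k && s <= out_scores[k - 1]) continue;
        size_t pos = (found < k) ? found++ : k - 1;
        // Shift lower scores down to keep the list sorted descending.
        while (pos > 0 && out_scores[pos - 1] < s) {
            out_scores[pos] = out_scores[pos - 1];
            out_ids[pos] = out_ids[pos - 1];
            pos--;
        }
        out_scores[pos] = s;
        out_ids[pos] = d;
    }
    return found;
}
```

For small k the insertion step is negligible next to the dot products, so the scan stays a single sequential pass over the index.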

Benchmarks: NEON kernels vs. ONNX Runtime

Measured on a Snapdragon 8 Gen 2 (Cortex-X3 big core) running E5-small (33M parameters, 384-dim output):

| Metric | ONNX Runtime (fp32) | ONNX Runtime (int8) | Hand-tuned NEON (int8) |
| --- | --- | --- | --- |
| Embedding latency | 28ms | 14ms | 4.7ms |
| 100K similarity search | 8ms | 8ms | 2.1ms |
| Total pipeline | 36ms | 22ms | 6.8ms |
| Peak memory | 142MB | 89MB | 61MB |
| APK size overhead | +8MB (runtime) | +8MB | +0.2MB (kernel lib) |

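The hand-tuned path is about 3x faster than quantized ONNX Runtime (22ms vs. 6.8ms) and about 5x faster than fp32 (36ms vs. 6.8ms). Peak memory drops by more than half against the fp32 baseline (142MB to 61MB) and by about a third against int8 ONNX Runtime (89MB to 61MB), while the binary size overhead shrinks from 8MB to 0.2MB.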

Cross-platform strategy

On iOS, the same NEON intrinsics compile directly via Clang since Apple Silicon shares the ARMv8 ISA. Wrap your kernels in a C library, expose via JNI on Android and a C bridging header on iOS, and you have a single optimized core shared across platforms. If you’re already using Kotlin Multiplatform for your application layer, this native SIMD layer sits cleanly beneath your shared Kotlin search API.

What to do with this

Quantize your model to int8 and write NEON GEMM kernels directly. The operator dispatch and memory management overhead of general-purpose runtimes is measurable. For latency-sensitive paths, bypass them.

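Target vdotq_s32 on ARMv8.2+ with a fallback path. The dot-product extension is optional in ARMv8.2, so check for it: runtime feature detection via getauxval(AT_HWCAP) (the HWCAP_ASIMDDP bit) on Android, or compile-time targeting on iOS, lets you ship both paths safely. The dot-product instruction replaces four instructions in the widening multiply-accumulate path with one, which is where the roughly 4x throughput gain comes from.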
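As a hedged sketch of that detection step: on Linux and Android for AArch64, the kernel reports the dot-product extension through the `HWCAP_ASIMDDP` bit of `getauxval(AT_HWCAP)`. The function name `cpu_has_dotprod` is hypothetical, and the non-Linux branches are assumptions about how you might handle compile-time targeting.

```c
#include <stdbool.h>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMDDP
#endif

// Returns true when the dot-product (vdotq_s32) path is safe to call.
bool cpu_has_dotprod(void) {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ASIMDDP)
    // Android and Linux: ask the kernel what the CPU supports.
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#elif defined(__ARM_FEATURE_DOTPROD)
    // Compile-time targeting (e.g. iOS): the whole build already
    // requires a dot-product-capable CPU.
    return true;
#else
    // Unknown platform: take the widening multiply-accumulate fallback.
    return false;
#endif
}
```

One way to use this is to call it once at startup and store a function pointer to the chosen kernel. The vdotq_s32 kernel then needs to live in its own translation unit built with a flag such as `-march=armv8.2-a+dotprod`, so the rest of the library still runs on older cores.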

Pre-compute and memory-map your document embeddings. Store your 100K index as a flat binary file and mmap it. This eliminates deserialization cost, lets the OS page cache hold your working set, and lets the NEON scan operate directly on mapped memory with zero copy.
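A minimal sketch of the memory-mapped index, assuming a headerless file of raw float32 rows in the device's native byte order. The `embedding_index` struct and the `index_open` / `index_close` names are hypothetical; a production format would likely add a header with the dimension and a version.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const float *data;  // n_docs * dim floats, row-major
    size_t n_docs;
    size_t bytes;       // mapping length, needed for munmap
} embedding_index;

// Maps the file read-only. Returns 0 on success, -1 on any failure,
// including a file size that is not a whole number of rows.
int index_open(const char *path, int dim, embedding_index *out) {
    if (dim <= 0) return -1;
    size_t row_bytes = (size_t)dim * sizeof(float);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0 ||
        (size_t)st.st_size % row_bytes != 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;
    out->data = (const float *)p;
    out->bytes = (size_t)st.st_size;
    out->n_docs = out->bytes / row_bytes;
    return 0;
}

void index_close(embedding_index *idx) {
    if (idx->data) munmap((void *)idx->data, idx->bytes);
    idx->data = NULL;
}
```

The scan can then read `idx.data` directly; pages are faulted in on first touch and remain in the page cache for later queries as long as memory pressure allows.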

