MVP Factory

WebGPU Compute Shaders for On-Device LLM Inference in Android WebViews: The GPU Pipeline That Bypasses NNAPI Limitations

Krystian Wiewiór · 5 min read

TL;DR

NNAPI has been Android’s default for on-device ML acceleration, but its operator coverage gaps and vendor-specific bugs make it unreliable for LLM inference at scale. WebGPU compute shaders, accessible through Android WebView, offer a real alternative: direct GPU compute without the NNAPI abstraction layer. In this post, I walk through a hybrid architecture where WebGPU handles attention-layer matrix multiplications while CPU threads manage non-matmul operations, covering workgroup tuning, JS-to-native buffer mapping, and the real-world trade-offs.


The NNAPI problem no one talks about

Building production systems that target diverse Android hardware has taught me that NNAPI is a minefield. On paper, it delegates to the best available accelerator: GPU, DSP, NPU. In practice, you hit three walls:

  1. Operator coverage gaps. Custom or fused ops silently fall back to CPU.
  2. Vendor-specific bugs. Identical models produce different results on Qualcomm vs. MediaTek vs. Samsung Exynos.
  3. Quantization inconsistencies. INT8/INT4 support varies wildly across HAL implementations.

For LLM inference specifically, transformer attention layers rely heavily on batched matrix multiplications (GEMM), softmax, and layer normalization. NNAPI’s coverage of these ops, especially quantized variants, is incomplete on most shipping devices.

Why WebGPU is worth your attention

WebGPU compute shaders give you a standardized, portable GPU compute interface. Chrome shipped WebGPU on desktop in version 113 and on Android in 121, and Android WebView inherits the same Chromium engine, so you get compute shader access on any device running a recent WebView update. No vendor HAL required.
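Before committing to the GPU path, the page should feature-detect WebGPU at runtime, since `navigator.gpu` may be absent on older WebViews. A minimal sketch (the backend labels and fallback policy are illustrative):

```javascript
// Decide which inference backend the page can use. `nav` is the page's
// navigator object, passed in so the check is testable in isolation.
function selectInferenceBackend(nav) {
  if (nav && nav.gpu) {
    return "webgpu"; // recent WebView with WebGPU enabled
  }
  return "cpu"; // hypothetical WASM/CPU fallback path
}
```

In the page itself you would call `selectInferenceBackend(navigator)`, and even on the WebGPU path still verify that `navigator.gpu.requestAdapter()` resolves to a non-null adapter before building pipelines.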

| Factor | NNAPI | WebGPU via WebView |
| --- | --- | --- |
| GPU access | Via vendor HAL | Direct via standardized API |
| Operator coverage | Vendor-dependent, partial | You write the shaders, full control |
| Quantization support | INT8 on some, INT4 rare | Custom, implement what you need |
| Update mechanism | OS/firmware update | Play Store WebView update |
| Debugging | Opaque vendor stack | Chrome DevTools, shader logging |
| Overhead | HAL + driver abstraction | JS bridge + WebView process |

The trade-off is bridge overhead versus reliability. For LLM workloads where each token generation involves hundreds of GEMM operations, predictable GPU execution beats unpredictable fallback-to-CPU every time.

The hybrid architecture

Most teams get this wrong: you shouldn’t run the entire LLM pipeline in WebGPU. The better architecture splits work between GPU and CPU.

WebGPU handles QKV projections, attention score computation, and feed-forward GEMM: anything that is a dense matrix multiply on quantized weights.

CPU threads handle tokenization, embedding lookups, layer norm, residual connections, and sampling: operations that are memory-bound or sequential.

// Native side: orchestrate the pipeline
class HybridLLMEngine(private val webView: WebView) {

    suspend fun generateToken(inputIds: IntArray): Int {
        // Memory-bound lookup stays on CPU threads
        val embeddings = cpuEmbeddingLookup(inputIds)

        // Ship to WebGPU for attention + FFN; the page-side
        // runTransformerBlock returns the hidden state as a JSON string
        val hiddenState = webView.evaluateJavascriptSuspend(
            "runTransformerBlock(${embeddings.toJSArrayBuffer()})"
        )

        // Back to CPU for final norm + sampling
        return cpuSampleFromLogits(hiddenState)
    }
}

// Suspend wrapper over WebView.evaluateJavascript's callback API;
// must be invoked on the main thread (uses kotlinx.coroutines)
private suspend fun WebView.evaluateJavascriptSuspend(script: String): String =
    suspendCancellableCoroutine { cont ->
        evaluateJavascript(script) { result -> cont.resume(result) }
    }
// WebGPU compute shader: quantized GEMM (INT4 weights × activations;
// activations held as f32 here, since true f16 needs the shader-f16 feature)
@group(0) @binding(0) var<storage, read> weights: array<u32>;      // 8 × INT4 per u32, row-major
@group(0) @binding(1) var<storage, read> scales: array<f32>;       // per-row dequantization scale
@group(0) @binding(2) var<storage, read> activations: array<f32>;  // K × N
@group(0) @binding(3) var<storage, read_write> output: array<f32>; // M × N
@group(0) @binding(4) var<uniform> dims: vec3<u32>;                // (M, N, K)

@compute @workgroup_size(8, 8, 1)
fn matmul_q4_f16(
    @builtin(global_invocation_id) gid: vec3<u32>
) {
    let row = gid.x;
    let col = gid.y;
    let N = dims.y;
    let K = dims.z;
    if (row >= dims.x || col >= N) {
        return; // guard threads dispatched past the matrix edge
    }

    var acc: f32 = 0.0;
    for (var k: u32 = 0u; k < K / 8u; k = k + 1u) {
        let packed = weights[row * (K / 8u) + k]; // 8 x INT4 packed
        for (var i: u32 = 0u; i < 8u; i = i + 1u) {
            // Unpack one 4-bit weight (zero-point 8) and dequantize
            let q = f32((packed >> (i * 4u)) & 0xFu) - 8.0;
            acc = acc + q * scales[row] * activations[(k * 8u + i) * N + col];
        }
    }
    output[row * N + col] = acc;
}
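The INT4 packing scheme the shader consumes can be mirrored on the JS side when preparing weight buffers. This sketch assumes a specific layout, eight 4-bit values per u32, low nibble first, zero-point 8; the layout itself is an illustrative choice, not a fixed standard:

```javascript
// Pack 8 quantized values (each in [0, 15]) into one 32-bit word.
function packQ4(values8) {
  let word = 0;
  for (let i = 0; i < 8; i++) {
    word |= (values8[i] & 0xF) << (i * 4);
  }
  return word >>> 0; // keep as unsigned 32-bit
}

// Inverse: unpack one word and dequantize with a per-row scale.
function dequantQ4(word, scale) {
  const out = new Array(8);
  for (let i = 0; i < 8; i++) {
    out[i] = (((word >>> (i * 4)) & 0xF) - 8) * scale;
  }
  return out;
}
```

Whatever layout you choose, the JS packer and the WGSL unpacker must agree bit-for-bit, so a round-trip test like the one below is worth keeping in CI.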

Workgroup tuning: the performance lever

Workgroup size is the single biggest performance lever in your compute shaders. Mobile GPUs have different wavefront/warp sizes than desktop. Adreno typically operates on 64-wide waves, Mali on 16-wide warps.

A practical approach:

  • Start with @workgroup_size(8, 8, 1) (64 threads per workgroup, aligns with Adreno)
  • Profile with @workgroup_size(4, 4, 1) (16 threads, better for Mali GPUs)
  • Query the adapter limits at runtime and select the appropriate shader variant

The performance delta between tuned and untuned workgroups is real on mobile silicon. I’ve seen 2-3x differences on the same device just from workgroup sizing.
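The runtime selection step can be reduced to a small lookup keyed on the adapter's vendor string. A sketch, with the vendor substrings being assumptions you should verify against `GPUAdapterInfo` values on real devices:

```javascript
// Pick a precompiled shader variant from the adapter's vendor string.
function pickWorkgroupSize(vendor) {
  const v = (vendor || "").toLowerCase();
  if (v.includes("arm") || v.includes("mali")) {
    return [4, 4, 1]; // 16 threads, matches Mali's narrower warps
  }
  return [8, 8, 1]; // 64 threads, aligns with Adreno waves
}
```

In the page you would feed this the vendor field from the adapter's info and dispatch the matching shader variant; unknown vendors fall back to the 64-thread default.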

Buffer memory mapping: the bridge tax

The JS-to-native bridge is your bottleneck. Every token generation requires shipping activation tensors across the WebView boundary. Minimize crossings by batching: run all transformer layers in a single WebGPU dispatch sequence rather than bouncing back to native between layers.

Your buffer strategy should keep activation tensors GPU-resident between layers. Only the final logits vector crosses the bridge back to native code.

// Bad: cross the bridge per layer (12 round trips for a 12-layer model)
// Good: single dispatch, all layers GPU-side
webView.evaluateJavascript("runAllLayers(inputBuffer, 12)", null)

Use GPUBuffer with MAP_READ only on the final output. Intermediate buffers should be STORAGE only, never mapped, never crossing the bridge.
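A sketch of that allocation strategy, with the device and sizes illustrative; the numeric fallback flags mirror the WebGPU spec's `GPUBufferUsage` constants for environments where the global is unavailable:

```javascript
// STORAGE-only intermediates, one MAP_READ readback buffer for logits.
const USAGE = globalThis.GPUBufferUsage ??
  { MAP_READ: 0x01, COPY_DST: 0x08, STORAGE: 0x80 };

function allocatePipelineBuffers(device, hiddenBytes, logitsBytes) {
  const intermediate = device.createBuffer({
    size: hiddenBytes,
    usage: USAGE.STORAGE, // GPU-resident, never mapped
  });
  const readback = device.createBuffer({
    size: logitsBytes,
    usage: USAGE.COPY_DST | USAGE.MAP_READ, // the only buffer that crosses the bridge
  });
  return { intermediate, readback };
}
```

After the final dispatch you copy the logits into `readback`, call `mapAsync(GPUMapMode.READ)` on it, and ship that one small vector over the bridge.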

When not to use this approach

This architecture has clear limits. Avoid it when:

  • Target devices run Android < 10 or have outdated WebView versions without WebGPU
  • Model size exceeds available GPU memory (mobile GPUs share system RAM rather than having dedicated VRAM; practical budgets cap around 1-3 GB)
  • Latency requirements are sub-50ms per token, since the JS bridge adds measurable overhead

For models in the 1-4B parameter range with INT4 quantization, fitting within mobile GPU memory is feasible, and the compute density of GEMM operations amortizes the bridge cost.
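A back-of-envelope sizing check makes the feasibility claim concrete. The 20% overhead factor is an assumption covering quantization scales, KV cache headroom, and alignment padding:

```javascript
// Rough footprint of a quantized model in GB (factors are assumptions).
function modelFootprintGB(params, bitsPerWeight = 4, overhead = 1.2) {
  return (params * (bitsPerWeight / 8) * overhead) / 1e9;
}
// A 3B-parameter model at INT4 lands around 1.8 GB, inside a 3 GB budget.
```

Run the same arithmetic against your target device's reported limits before committing; a 7B model at INT4 already pushes past 4 GB under the same assumptions.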

What to do with all this

First, audit your NNAPI operator coverage. Run your model through nnapi-check on your target devices. If more than 20% of ops fall back to CPU, the WebGPU path likely wins on aggregate throughput.

Second, split your pipeline at the GEMM boundary. Keep memory-bound and sequential ops on CPU threads, push compute-dense matrix multiplications to WebGPU shaders, and minimize bridge crossings by batching all transformer layers into a single dispatch.

Third, profile workgroup sizes per GPU family. Ship at least two shader variants, one tuned for Adreno (64-wide) and one for Mali (16-wide), and select at runtime based on GPUAdapterInfo. Don’t skip this step. The defaults are rarely optimal on mobile.

