Related Posts
KV Cache Quantization for On-Device LLM Inference on Android: INT4 Attention States, Sliding Window Eviction, and the Memory Architecture That Fits a 7B Model in 4GB RAM
Deep dive into KV cache memory management for on-device LLM inference — covering how quantizing key-value attention caches from FP16 to INT4 with group-wise scaling… · 1 min read
Streaming LLM Tokens to 10K Concurrent Users: Backpressure, Coroutine Channels, and the SSE Fan-Out Architecture That Scales Without Melting Your Server
Engineering deep-dive into scaling server-sent event streams for LLM token-by-token delivery — coroutine-per-connection with structured concurrency, bounded channels… · 5 min read
Eliminating Android ANRs in Production: Strict Mode Traps, Binder Transaction Limits, and the Background Thread Architecture That Dropped Our ANR Rate From 2.1% to 0.08%
Deep dive into the three most common ANR root causes in production Android apps — accidental main-thread disk I/O triggered by SharedPreferences.apply() during… · 5 min read