Related Posts
KV Cache Quantization for On-Device LLM Inference on Android: INT4 Attention States, Sliding Window Eviction, and the Memory Architecture That Fits a 7B Model in 4GB RAM
Deep dive into KV cache memory management for on-device LLM inference — covering how quantizing key-value attention caches from FP16 to INT4 with group-wise scaling… · 1 min read
Streaming LLM Tokens to 10K Concurrent Users: Backpressure, Coroutine Channels, and the SSE Fan-Out Architecture That Scales Without Melting Your Server
Engineering deep-dive into scaling server-sent event streams for LLM token-by-token delivery — coroutine-per-connection with structured concurrency, bounded channels… · 5 min read
Eliminating Android ANRs in Production: Strict Mode Traps, Binder Transaction Limits, and the Background Thread Architecture That Dropped Our ANR Rate From 2.1% to 0.08%
Deep dive into the three most common ANR root causes in production Android apps — accidental main-thread disk I/O triggered by SharedPreferences.apply() during… · 5 min read