What Happens in the 400ms Between Your API Call and the LLM Response
The full journey: 7 stages
Most engineers treat LLM APIs as a black box: prompt goes in, text comes out. But between those two events, your request moves through an infrastructure stack that looks roughly the same across OpenAI, Anthropic, and Google. The mistake I see most teams make is optimizing their prompt without understanding where time actually goes.
Latency breakdown by stage
| Stage | Latency | % of Total | Key operations |
|---|---|---|---|
| 1. API gateway | ~5ms | ~1.2% | TLS termination, auth, rate limiting, schema validation, billing start |
| 2. Load balancer | ~2ms | ~0.5% | Geographic routing, least-connections, health checks |
| 3. Tokenization | ~3ms | ~0.7% | BPE/SentencePiece/WordPiece encoding, context window check |
| 4. Model router | ~1ms | ~0.2% | GPU cluster selection, queue management |
| 5. Inference | ~300-800ms | ~95% | Prefill + decode (the actual model work) |
| 6. Post-processing | ~5ms | ~1.2% | Detokenization, safety classifier, stop sequences, JSON packaging |
| 7. Billing | <1ms | - | Token counting, cost calculation |
Inference dominates everything. Nothing else comes close.
Stages 1-4: the fast path (~11ms)
The API gateway (~5ms) terminates TLS, authenticates your key, enforces rate limits, validates the request schema, and starts the billing clock. If you’ve ever hit a 429 Too Many Requests, this is where your request died, before a GPU ever saw it.
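If you do get rate limited at the gateway, retrying with backoff is cheap because nothing downstream was ever touched. Here's a minimal sketch using `requests` against a hypothetical endpoint (the URL, payload shape, and retry limits are assumptions, not any provider's real API); most provider SDKs also ship built-in retry settings.

```python
import time
import requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint

def post_with_backoff(payload, headers, max_retries=5):
    """Retry on 429s from the gateway, honoring Retry-After when present."""
    for attempt in range(max_retries):
        resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # The gateway rejected us before a GPU was involved; back off and retry.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")
```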
The load balancer (~2ms) routes your request using geographic proximity and least-connections algorithms while checking backend cluster health. This explains why latency varies between identical calls: your request may land on a different node each time.
Tokenization (~3ms) converts your text into tokens using algorithms like BPE, SentencePiece, or WordPiece. The rough conversion is ~4 characters per token. This is also where the context window check happens. Exceed the limit and your request gets rejected. Token count equals cost, so this stage determines your bill.
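You can count tokens yourself before sending a request. The sketch below uses `tiktoken` with the `cl100k_base` encoding as one example; other providers ship their own tokenizers, so exact counts (and therefore costs) are provider-specific.

```python
import tiktoken

# cl100k_base is the BPE encoding used by several OpenAI models; counts will
# differ slightly for other providers' tokenizers.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Explain the difference between prefill and decode in one paragraph."
tokens = enc.encode(prompt)

print(len(prompt) / len(tokens))  # roughly 4 characters per token for English text
print(len(tokens))                # this number, not character count, drives cost
```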
The model router (~1ms) decides where your request runs. Large models go to multi-GPU clusters, smaller models to single-GPU instances, embedding requests to dedicated clusters. Queue management happens here too. If all GPUs are saturated, you wait.
Stage 5: inference, where 95% of time goes
This is where the real work happens, and it splits into two phases:
Prefill phase
Your entire input is processed in parallel. The model computes query-key (QK) attention scores across all input tokens and generates the KV cache, a stored representation of your prompt that avoids redundant computation during generation.
Decode phase
This part is sequential: one token per forward pass. Each step reuses the KV cache from prefill, applies temperature and top-p sampling to select the next token, and (if streaming is enabled) sends each token to you immediately. This is why streaming feels faster. You see tokens as they’re generated rather than waiting for the full response.
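Here is a toy sketch of the two phases with temperature and top-p sampling spelled out. It's illustrative only: the stand-in "model", its cache, and the vocabulary size are assumptions, not any provider's serving code.

```python
import numpy as np

def top_p_sample(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                       # tokens sorted by probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]                              # smallest set covering top_p mass
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

def toy_forward(token, kv_cache, vocab_size=50_000):
    """Stand-in for one forward pass: a real model reuses K/V tensors, not token ids."""
    kv_cache.append(token)
    return np.random.default_rng(token).standard_normal(vocab_size)

prompt_tokens = [101, 2023, 2003, 1037]
kv_cache = list(prompt_tokens[:-1])       # prefill: the whole prompt is cached in one parallel pass
logits = toy_forward(prompt_tokens[-1], kv_cache)

generated = []
for _ in range(20):                       # decode: strictly one token per forward pass
    next_token = int(top_p_sample(logits))
    generated.append(next_token)          # with streaming, this token is sent to the client now
    logits = toy_forward(next_token, kv_cache)

print(generated)
```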
The hardware
The hardware layer matters more than most people realize:
- GPUs are typically A100, H100, or H200 with 80GB+ HBM (high-bandwidth memory)
- Tensor parallelism splits a single model across multiple GPUs
- Multiple requests get batched together to maximize GPU utilization
- Flash Attention reduces memory overhead; Grouped-Query Attention (GQA) cuts KV cache size (sized in the sketch below)
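To put numbers on that last point: per-sequence KV cache size is 2 (K and V) x layers x KV heads x head dim x sequence length x bytes per value. A back-of-the-envelope sketch with a Llama-2-70B-style configuration; the figures are illustrative, not measurements:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-sequence KV cache size: K and V for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Llama-2-70B-style configuration: 80 layers, head_dim 128, 4k context, fp16.
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
with_gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)

print(full_mha / 2**30)  # ~10 GiB per sequence with full multi-head attention
print(with_gqa / 2**30)  # ~1.25 GiB with 8 KV heads -> far more sequences per batch
```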
GPU compute runs about $2-3/hr per card. Multiply that across thousands of GPUs serving millions of requests and the pricing starts to make sense.
Stages 6-7: the exit path
Post-processing (~5ms) detokenizes the output back into text, runs a safety classifier, checks for stop sequences, and packages everything into JSON.
Billing calculates your final cost. One thing worth knowing: output tokens cost 3-5x more than input tokens. Each output token requires a full forward pass through the model, while input tokens are processed in parallel during prefill. Prompt caching, where repeated prefixes reuse cached KV states, can cut input token costs significantly.
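A toy cost calculator makes the asymmetry concrete. The per-million-token rates and the 90% discount on cached input below are illustrative placeholders, not any provider's actual price sheet:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 in_price=3.0, out_price=15.0, cached_price=0.30):
    """Dollar cost given illustrative per-million-token rates (not real pricing)."""
    uncached = input_tokens - cached_tokens
    return (uncached * in_price
            + cached_tokens * cached_price
            + output_tokens * out_price) / 1e6

# A 4k-token prompt with a 3k-token cached prefix and 500 tokens of output:
print(request_cost(4_000, 500, cached_tokens=3_000))
```

Even with most of the prompt served from cache, the 500 output tokens dominate the bill in this example.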
What this means for your architecture
Understanding this pipeline changes how you build systems on top of LLM APIs:
| Optimization target | Action |
|---|---|
| Reduce input tokens | Shorter prompts, prompt caching |
| Reduce output tokens | Constrained output, max_tokens limits |
| Reduce latency | Streaming, smaller models, geographic routing |
| Reduce cost | Cache prefixes, batch requests, right-size models |
Takeaways
Optimize for inference, not the edges. ~95% of latency lives in the prefill/decode cycle. Cutting 200 tokens from your system prompt or enabling prompt caching will do more than any amount of gateway tuning.
Output tokens are your biggest cost lever. At 3-5x the price of input tokens, controlling output length with max_tokens, structured output schemas, and precise instructions has the most direct impact on your bill.
Use streaming. The decode phase generates tokens one at a time. Streaming delivers each token as it’s produced, so a response that takes 600ms to finish starts appearing almost immediately instead of arriving all at once. If you’re not streaming, you’re making users stare at a spinner for no reason.
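A minimal streaming sketch using the OpenAI Python SDK as one example (the model name and token limit are arbitrary; other providers' SDKs expose equivalent streaming interfaces):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the request pipeline in two sentences."}],
    max_tokens=150,   # cap output tokens: the biggest cost lever
    stream=True,      # receive tokens as decode produces them
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```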