Kubernetes Pod Scheduling for GPU-Accelerated ML Inference: Topology-Aware Placement, Device Plugin Fractional Sharing, and the Affinity Rules That Cut Our P99 Latency by 40%
TL;DR
NUMA-misaligned GPU pod placement silently adds 2-5ms to every inference call. By combining Kubernetes 1.36’s topology manager in single-numa-node policy with fractional GPU sharing via MIG partitioning and pod topology spread constraints, we cut P99 inference latency by 40% across a heterogeneous A100/H100 cluster without adding a single GPU. This post breaks down how the scheduling stack fits together and where most teams trip up.
The quiet tax: cross-NUMA memory access
The number one overlooked latency source in production inference systems isn’t the model or the framework. It’s pod-to-GPU-to-CPU topology. When a pod lands on NUMA node 0 but its allocated GPU sits behind NUMA node 1, every tensor transfer crosses the interconnect.
| Placement | Avg Latency (ms) | P99 Latency (ms) | Throughput (req/s) |
|---|---|---|---|
| NUMA-aligned | 8.2 | 14.1 | 1,240 |
| NUMA-misaligned | 10.7 | 23.8 | 940 |
| Delta | +30% | +69% | -24% |
Benchmarks: BERT-large inference, batch size 1, A100 80GB, Kubernetes 1.36, Ubuntu 22.04
A 69% P99 penalty compounds across a fleet. At scale, that’s the gap between hitting your SLA and fielding pages at 2am.
Configuring topology-aware scheduling
Kubernetes’ topology manager coordinates resource alignment across the kubelet. You have two realistic policy choices for GPU inference:
- single-numa-node: Strict. All resources (CPU, memory, GPU) must come from one NUMA node. Pod admission fails if alignment is impossible.
- best-effort: Prefers alignment but admits misaligned pods rather than rejecting them.
For latency-sensitive serving, always use single-numa-node. For batch training, best-effort works fine. Configure it in the kubelet:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
Set topologyManagerScope: pod (not container) so the entire pod’s resources align. This matters when sidecars share the pod spec.
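The topology manager only aligns what the other kubelet resource managers give it hints for. For CPUs, that means the CPU manager must run its static policy, and the pod must be in the Guaranteed QoS class with whole-number CPU requests. A minimal sketch of the kubelet side, assuming two cores are held back for system daemons (the reservedSystemCPUs value is an assumption to adjust per node; memory pinning through the memory manager is a separate setting not shown here):

```yaml
# KubeletConfiguration sketch: topology manager plus the CPU manager policy it depends on.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
cpuManagerPolicy: static      # exclusive cores for Guaranteed pods with integer CPU requests
reservedSystemCPUs: "0,1"     # assumption: cores reserved for system daemons on this node
```

And a hypothetical pod that qualifies for aligned placement. The pod name and image are placeholders, and the CPU and memory sizes are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bert-inference                                   # hypothetical name
spec:
  containers:
    - name: server
      image: registry.example.com/bert-serving:latest    # placeholder image
      resources:
        # requests == limits for every resource puts the pod in the Guaranteed QoS class
        requests:
          cpu: "4"              # integer CPU count, required for exclusive core pinning
          memory: 16Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: 1
```

If the pod requested a fractional CPU such as 3500m, or had limits higher than requests, the CPU manager would leave it in the shared pool and the GPU could still end up on a different NUMA node than the cores the pod actually runs on.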
Fractional GPU sharing: MIG vs MPS vs time-slicing
Not every inference workload needs a full A100. The mistake I see most often: teams default to time-slicing because it’s simplest, then can’t figure out why tail latencies spike under contention.
| Method | Isolation | Latency predictability | Memory guarantee | Setup complexity |
|---|---|---|---|---|
| Time-slicing | None (context switch) | Poor under contention | None | Low |
| MPS | Partial (shared context) | Moderate | None | Medium |
| MIG | Full (hardware partition) | Excellent | Yes | High |
MIG wins for inference, and it’s not close. Each MIG partition is a true hardware slice with its own memory bandwidth, compute units, and L2 cache. On an A100 80GB, you can carve up to seven 1g.10gb instances or a smaller number of larger profiles. Each partition registers as a distinct extended resource:
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
The device plugin API v1beta1 in Kubernetes 1.36 reports these as individual allocatable resources, and the topology manager aligns them correctly to NUMA nodes. Time-slicing can’t match this. When an inference pod and a batch job share a time-sliced GPU, the batch job’s large allocation can stall your serving workload for entire scheduling quanta.
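Profile-specific resource names such as nvidia.com/mig-1g.10gb show up when the NVIDIA device plugin runs with its mixed MIG strategy; under the single strategy, slices are advertised as plain nvidia.com/gpu instead. A sketch of the plugin config file that selects mixed, assuming the plugin is deployed with a config file (how that file is mounted depends on your install method and is not shown):

```yaml
# NVIDIA device plugin config sketch: advertise each MIG profile as its own resource.
version: v1
flags:
  migStrategy: mixed
```

A pod then requests a slice by profile name. The pod name and image below are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-model-serving                               # hypothetical name
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one hardware-isolated slice, not a share of a whole GPU
```

Extended resources cannot be overcommitted, so a node with seven 1g.10gb slices admits at most seven such pods, which is what gives each one its memory guarantee.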
Pod topology spread for failure domain balance
Once pods are NUMA-aligned, you still need to spread inference replicas across failure domains. Use topologySpreadConstraints to prevent all replicas from piling onto the same node or zone:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inference-serving
This gives you zone-level resilience without sacrificing per-pod NUMA alignment. Each pod individually gets aligned placement; the fleet distributes across zones.
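The constraint above only covers zones. To also discourage replicas from stacking on one node, a second constraint keyed on kubernetes.io/hostname can sit next to it. The following is a sketch in a full Deployment, where the replica count and image are assumptions and the node-level constraint is deliberately soft so that a shortage of GPU nodes does not block scheduling:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-serving
spec:
  replicas: 6                                   # illustrative count
  selector:
    matchLabels:
      app: inference-serving
  template:
    metadata:
      labels:
        app: inference-serving
    spec:
      topologySpreadConstraints:
        # Hard constraint: zone replica counts may differ by at most 1.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-serving
        # Soft constraint: prefer spreading across nodes, but schedule anyway if impossible.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: inference-serving
      containers:
        - name: server
          image: registry.example.com/inference-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```

Spread constraints are evaluated by the scheduler, while NUMA alignment is enforced later by the kubelet at admission, so the two mechanisms do not interfere with each other.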
Priority-based preemption: protecting serving pods
In a heterogeneous cluster running both inference serving and batch training, you need clear preemption rules. Define two PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 100
preemptionPolicy: Never
Serving pods at priority 1000000 will preempt batch jobs when GPU capacity is constrained. Setting preemptionPolicy: Never on the batch class means batch pods never preempt anything, which keeps them from evicting each other in a cascading thrash. Batch jobs should also set a reasonable activeDeadlineSeconds so they don’t squat on fractional GPU slices indefinitely after preemption restarts.
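Workloads opt into these classes through priorityClassName in the pod spec. Here is a sketch of one serving pod and one batch Job, where the names, images, MIG profile, deadline, and backoff values are all illustrative assumptions:

```yaml
# Serving pod: may preempt lower-priority pods when no GPU slice is free.
apiVersion: v1
kind: Pod
metadata:
  name: inference-serving-example                         # hypothetical name
  labels:
    app: inference-serving
spec:
  priorityClassName: serving-critical
  containers:
    - name: server
      image: registry.example.com/inference-server:latest # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
---
# Batch Job: low priority, never preempts, and is bounded in total runtime.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-finetune                                  # hypothetical name
spec:
  activeDeadlineSeconds: 14400   # assumption: 4 hours total, counted across retries
  backoffLimit: 3                # assumption: stop after 3 failed attempts
  template:
    spec:
      priorityClassName: batch-training
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest      # placeholder image
          resources:
            limits:
              nvidia.com/mig-3g.40gb: 1
```

The Job-level activeDeadlineSeconds covers the whole Job, including pods recreated after preemption, so a job that keeps getting evicted eventually fails instead of retrying forever.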
What we measured
After rolling out the full stack (single-numa-node topology policy, MIG fractional sharing, topology spread constraints, and priority preemption) across a 48-GPU mixed cluster:
- P99 latency dropped 40% (23.8ms to 14.3ms)
- GPU utilization increased 22% from MIG packing vs whole-GPU allocation
- Zero SLA breaches in 90 days, down from 3-5/month caused by NUMA misalignment
No new hardware. Just better scheduling.
What to do with this
- Enable single-numa-node topology manager policy on all inference-serving nodes. The admission strictness is a feature. It surfaces misalignment at deploy time instead of showing up as mysterious P99 spikes in production.
- Use MIG over time-slicing for fractional GPU sharing on inference workloads. The hardware isolation eliminates noisy-neighbor latency variance. Save time-slicing for dev environments where predictability doesn’t matter.
- Pair pod topology spread constraints with priority-based preemption. Spread handles failure domain resilience; priority classes handle resource protection. Together they let serving and training coexist on the same cluster without latency regression.