Kubernetes Pod Scheduling for GPU-Accelerated ML Inference: Topology-Aware Placement, Device Plugin Fractional Sharing, and the Affinity Rules That Cut Our P99 Latency by 40%

TL;DR

NUMA-misaligned GPU pod placement silently adds 2-5ms to every inference call. By combining Kubernetes 1.36’s topology manager in single-numa-node policy with fractional GPU sharing via MIG partitioning and pod topology spread constraints, we cut P99 inference latency by 40% across a heterogeneous A100/H100 cluster without adding a single GPU. This post breaks down how the scheduling stack fits together and where most teams trip up.

The quiet tax: cross-NUMA memory access

The number one overlooked latency source in production inference systems isn’t the model or the framework. It’s pod-to-GPU-to-CPU topology. When a pod lands on NUMA node 0 but its allocated GPU sits behind NUMA node 1, every tensor transfer crosses the interconnect.

Placement	Avg Latency (ms)	P99 Latency (ms)	Throughput (req/s)
NUMA-aligned	8.2	14.1	1,240
NUMA-misaligned	10.7	23.8	940
Delta	+30%	+69%	-24%

Benchmarks: BERT-large inference, batch size 1, A100 80GB, Kubernetes 1.36, Ubuntu 22.04

A 69% P99 penalty compounds across a fleet. At scale, that’s the gap between hitting your SLA and fielding pages at 2am.

Configuring topology-aware scheduling

Kubernetes’ topology manager coordinates resource alignment across the kubelet. You have two realistic policy choices for GPU inference:

single-numa-node: Strict. All resources (CPU, memory, GPU) must come from one NUMA node. Pod admission fails if alignment is impossible.
best-effort: Prefers alignment but admits misaligned pods rather than rejecting them.

For latency-sensitive serving, always use single-numa-node. For batch training, best-effort works fine. Configure it in the kubelet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod

Set topologyManagerScope: pod (not container) so the entire pod’s resources align. This matters when sidecars share the pod spec.
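
For the policy above to have any effect on a given pod, the pod generally has to be in the Guaranteed QoS class (requests equal to limits for every container) with whole-number CPU requests, and CPU pinning additionally assumes the kubelet runs with cpuManagerPolicy: static. A minimal sketch of a pod that qualifies is below. The pod name, image, and resource sizes are placeholders for illustration, not values from our cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bert-serving                 # hypothetical name
  labels:
    app: inference-serving
spec:
  containers:
    - name: server
      image: registry.example.com/inference/bert-large:latest  # placeholder image
      resources:
        # requests == limits for every resource puts the pod in Guaranteed QoS,
        # which the topology manager needs in order to align CPU, memory, and GPU.
        requests:
          cpu: "4"                   # whole number, so the static CPU manager can pin cores
          memory: 16Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: 1
```

With topologyManagerScope: pod, any sidecar in the same spec needs matching requests and limits too, otherwise the pod falls out of the Guaranteed class.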

Fractional GPU sharing: MIG vs MPS vs time-slicing

Not every inference workload needs a full A100. The mistake I see most often: teams default to time-slicing because it’s simplest, then can’t figure out why tail latencies spike under contention.

Method	Isolation	Latency predictability	Memory guarantee	Setup complexity
Time-slicing	None (context switch)	Poor under contention	None	Low
MPS	Partial (shared context)	Moderate	None	Medium
MIG	Full (hardware partition)	Excellent	Yes	High

MIG wins for inference, and it’s not close. Each MIG partition is a true hardware slice with its own memory bandwidth, compute units, and L2 cache. On an A100 80GB, you can carve up to seven 1g.10gb instances or various larger profiles. With the NVIDIA device plugin’s mixed MIG strategy, each partition profile registers as a distinct extended resource:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

The device plugin reports these to the kubelet through the v1beta1 device plugin API as individual allocatable resources, and the topology manager aligns them correctly to NUMA nodes. Time-slicing can’t match this. When an inference pod shares a time-sliced GPU with a batch job, the batch job can hold the GPU for entire scheduling quanta and stall your serving workload.
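
As a rough sketch, assuming the NVIDIA k8s-device-plugin is what advertises GPUs in your cluster, the per-profile resource names come from its mixed MIG strategy. The config file below is illustrative. How you deliver it (Helm values, a ConfigMap) depends on your install, and the time-slicing block is commented out and shown only for contrast.

```yaml
# Config file for the NVIDIA k8s-device-plugin (assumed; adapt to your install).
version: v1
flags:
  # "mixed" advertises each MIG profile as its own resource,
  # e.g. nvidia.com/mig-1g.10gb, instead of a generic nvidia.com/gpu.
  migStrategy: mixed

# For contrast: time-slicing oversubscribes a whole-GPU resource.
# The replicas share one GPU with no memory or fault isolation.
# sharing:
#   timeSlicing:
#     resources:
#       - name: nvidia.com/gpu
#         replicas: 4                # illustrative value
```

The MIG partitions themselves still have to be created on the GPU before the plugin can advertise them. That step is outside this config.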

Pod topology spread for failure domain balance

Once pods are NUMA-aligned, you still need to spread inference replicas across failure domains. Use topologySpreadConstraints to prevent all replicas from piling onto the same node or zone:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inference-serving

This gives you zone-level resilience without sacrificing per-pod NUMA alignment. Each pod individually gets aligned placement; the fleet distributes across zones.
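
The constraint above only covers zones. A sketch of how it might sit in a full Deployment, with a second, softer constraint for node-level spread, is below. The Deployment name, replica count, and image are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-serving            # hypothetical name
spec:
  replicas: 6                        # illustrative
  selector:
    matchLabels:
      app: inference-serving
  template:
    metadata:
      labels:
        app: inference-serving
    spec:
      topologySpreadConstraints:
        # Hard requirement: zones may differ by at most one replica.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-serving
        # Soft preference: spread across nodes, but still schedule
        # if a perfectly even spread is not possible.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: inference-serving
      containers:
        - name: server
          image: registry.example.com/inference/server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```

Keeping the node-level constraint as ScheduleAnyway is a judgment call: a hard node constraint combined with strict NUMA admission can leave replicas pending on a small GPU pool.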

Priority-based preemption: protecting serving pods

In a heterogeneous cluster running both inference serving and batch training, you need clear preemption rules. Define two PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 100
preemptionPolicy: Never

Serving pods at priority 1000000 will evict batch jobs when GPU capacity is constrained. Setting preemptionPolicy: Never on the batch class means batch pods never preempt anything; they wait in the scheduling queue until capacity frees up, which avoids cascading eviction thrash. One thing worth calling out: batch jobs should also set a reasonable activeDeadlineSeconds so they don’t squat on fractional GPU slices indefinitely after preemption restarts.
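
To show where these settings attach, here is a minimal sketch of a batch Job that uses the batch-training class and a deadline. The Job name, image, deadline, and backoff values are hypothetical. Serving pods would set priorityClassName: serving-critical in their pod template in the same way.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-finetune             # hypothetical name
spec:
  activeDeadlineSeconds: 14400       # illustrative: terminate the Job after 4 hours
  backoffLimit: 3                    # illustrative retry budget
  template:
    spec:
      priorityClassName: batch-training   # low priority, never preempts
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/training/finetune:latest  # placeholder image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```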

What we measured

After rolling out the full stack (single-numa-node topology policy, MIG fractional sharing, topology spread constraints, and priority preemption) across a 48-GPU mixed cluster:

P99 latency dropped 40% (23.8ms to 14.3ms)
GPU utilization increased 22% from MIG packing vs whole-GPU allocation
Zero SLA breaches in 90 days, down from 3-5/month caused by NUMA misalignment

No new hardware. Just better scheduling.

What to do with this

Enable single-numa-node topology manager policy on all inference-serving nodes. The admission strictness is a feature. It surfaces misalignment at deploy time instead of showing up as mysterious P99 spikes in production.
Use MIG over time-slicing for fractional GPU sharing on inference workloads. The hardware isolation eliminates noisy-neighbor latency variance. Save time-slicing for dev environments where predictability doesn’t matter.
Pair pod topology spread constraints with priority-based preemption. Spread handles failure domain resilience; priority classes handle resource protection. Together they let serving and training coexist on the same cluster without latency regression.
