MVP Factory
ai startup development

Kubernetes Pod Scheduling for GPU-Accelerated ML Inference: Topology-Aware Placement, Device Plugin Fractional Sharing, and the Affinity Rules That Cut Our P99 Latency by 40%

Krystian Wiewiór · 5 min read

TL;DR

NUMA-misaligned GPU pod placement silently adds 2-5ms to every inference call. By combining Kubernetes 1.36’s topology manager under the single-numa-node policy with fractional GPU sharing via MIG partitioning and pod topology spread constraints, we cut P99 inference latency by 40% across a heterogeneous A100/H100 cluster without adding a single GPU. This post breaks down how the scheduling stack fits together and where most teams trip up.


The quiet tax: cross-NUMA memory access

The most commonly overlooked latency source in production inference systems isn’t the model or the framework. It’s pod-to-GPU-to-CPU topology. When a pod’s CPUs and memory are allocated from NUMA node 0 but its GPU sits behind NUMA node 1, every tensor transfer between host and device crosses the interconnect between NUMA nodes.

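| Placement       | Avg Latency (ms) | P99 Latency (ms) | Throughput (req/s) |
|-----------------|------------------|------------------|--------------------|
| NUMA-aligned    | 8.2              | 14.1             | 1,240              |
| NUMA-misaligned | 10.7             | 23.8             | 940                |
| Delta           | +30%             | +69%             | -24%               |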

Benchmarks: BERT-large inference, batch size 1, A100 80GB, Kubernetes 1.36, Ubuntu 22.04

A 69% P99 penalty compounds across a fleet. At scale, that’s the gap between hitting your SLA and fielding pages at 2am.

Configuring topology-aware scheduling

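Kubernetes’ topology manager is a kubelet component. It collects NUMA affinity hints from the CPU manager, the memory manager, and device plugins, then admits or rejects each pod according to the configured policy. Of the four available policies (none, best-effort, restricted, single-numa-node), two are realistic choices for GPU inference: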

  • single-numa-node: Strict. All resources (CPU, memory, GPU) must come from one NUMA node. Pod admission fails if alignment is impossible.
  • best-effort: Prefers alignment but admits misaligned pods rather than rejecting them.

For latency-sensitive serving, always use single-numa-node. For batch training, best-effort works fine. Configure it in the kubelet:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod

Set topologyManagerScope: pod (not container) so the entire pod’s resources align. This matters when sidecars share the pod spec.
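One thing the kubelet snippet above leaves implicit: the topology manager can only align what its hint providers manage. For CPUs, that means the static CPU manager policy on the node and a Guaranteed QoS pod (requests equal to limits, whole CPUs). Here is a minimal sketch of both sides. The reserved CPU range, pod name, image, and resource sizes are placeholder assumptions, not values from our cluster.

```yaml
# Kubelet side: topology manager settings plus the CPU manager they depend on.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
cpuManagerPolicy: static      # hands out exclusive CPUs to Guaranteed pods
reservedSystemCPUs: "0-1"     # assumed range; the static policy needs some CPU reservation
---
# Pod side: requests == limits with integer CPUs gives Guaranteed QoS.
apiVersion: v1
kind: Pod
metadata:
  name: inference-serving-example                         # hypothetical name
  labels:
    app: inference-serving
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest # placeholder image
      resources:
        requests:
          cpu: "4"            # whole CPUs, so they can be pinned to one NUMA node
          memory: 16Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: 1
```

With the pod scope, every container in the pod, sidecars included, needs requests equal to limits. Otherwise the pod drops out of Guaranteed QoS and gets no exclusive CPUs to align.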

Fractional GPU sharing: MIG vs MPS vs time-slicing

Not every inference workload needs a full A100. The mistake I see most often: teams default to time-slicing because it’s simplest, then can’t figure out why tail latencies spike under contention.

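| Method       | Isolation                 | Latency predictability | Memory guarantee | Setup complexity |
|--------------|---------------------------|------------------------|------------------|------------------|
| Time-slicing | None (context switch)     | Poor under contention  | None             | Low              |
| MPS          | Partial (shared context)  | Moderate               | None             | Medium           |
| MIG          | Full (hardware partition) | Excellent              | Yes              | High             |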

MIG wins for inference, and it’s not close. Each MIG partition is a true hardware slice with its own memory bandwidth, compute units, and L2 cache. On an A100 80GB, you can carve seven 1g.10gb instances or various larger profiles. Each partition registers as a distinct extended resource:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

The device plugin API v1beta1 in Kubernetes 1.36 reports these as individual allocatable resources, and the topology manager aligns them correctly to NUMA nodes. Time-slicing can’t match this. When an inference pod shares a time-sliced GPU with a batch job, the batch job’s work can hold the GPU for entire scheduling quanta while your serving request waits.
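The per-profile resource name used above depends on how the device plugin is configured. As a sketch, assuming the NVIDIA k8s-device-plugin and its config file format, the mixed MIG strategy is what exposes each profile under its own name:

```yaml
# NVIDIA k8s-device-plugin config file (assumed deployment; adapt to how you ship plugin config).
version: v1
flags:
  # "mixed" advertises each MIG profile as its own resource, e.g. nvidia.com/mig-1g.10gb.
  # "single" would advertise MIG devices under the generic nvidia.com/gpu name instead.
  migStrategy: mixed
```

The GPUs themselves still have to be partitioned into the profiles you want before the plugin starts. That step happens outside this file.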

Pod topology spread for failure domain balance

Once pods are NUMA-aligned, you still need to spread inference replicas across failure domains. Use topologySpreadConstraints to prevent all replicas from piling onto the same node or zone:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inference-serving

This gives you zone-level resilience without sacrificing per-pod NUMA alignment. Each pod individually gets aligned placement; the fleet distributes across zones.
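The constraint above only covers zones. To also discourage replicas from stacking on one node, a second constraint keyed on the hostname label can sit next to it. This is a sketch rather than our exact config, and the choice of ScheduleAnyway for the node-level rule is an assumption:

```yaml
topologySpreadConstraints:
  # Hard rule: replica counts may differ by at most 1 between zones.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inference-serving
  # Soft rule: prefer an even spread across nodes, but still schedule if that is impossible.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: inference-serving
```

Keeping the node-level rule soft matters on small GPU pools: a hard rule there can leave replicas Pending when only a few nodes have free MIG slices.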

Priority-based preemption: protecting serving pods

In a heterogeneous cluster running both inference serving and batch training, you need clear preemption rules. Define two PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 100
preemptionPolicy: Never

Serving pods at priority 1000000 will evict batch jobs when GPU capacity is constrained. Setting preemptionPolicy: Never on the batch class means batch pods never preempt anything themselves. They wait in the scheduling queue until capacity frees up, which avoids cascading eviction thrash. One thing worth calling out: batch jobs should also set a reasonable activeDeadlineSeconds so they don’t squat on fractional GPU slices indefinitely after preemption restarts.
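Priority classes do nothing until workloads reference them. A minimal sketch of a batch Job wired to the batch-training class with a deadline follows. The Job name, image, and the four-hour deadline are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-training-example                      # hypothetical name
spec:
  activeDeadlineSeconds: 14400                      # assumed 4h cap on the whole Job
  template:
    spec:
      priorityClassName: batch-training             # low priority, never preempts
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest  # placeholder image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```

Serving Deployments take the same one-line change in their pod template: priorityClassName: serving-critical.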

What we measured

After rolling out the full stack (single-numa-node topology policy, MIG fractional sharing, topology spread constraints, and priority preemption) across a 48-GPU mixed cluster:

  • P99 latency dropped 40% (23.8ms to 14.3ms)
  • GPU utilization increased 22% from MIG packing vs whole-GPU allocation
  • Zero SLA breaches in 90 days, down from 3-5/month caused by NUMA misalignment

No new hardware. Just better scheduling.

What to do with this

  1. Enable single-numa-node topology manager policy on all inference-serving nodes. The admission strictness is a feature. It surfaces misalignment at deploy time instead of showing up as mysterious P99 spikes in production.

  2. Use MIG over time-slicing for fractional GPU sharing on inference workloads. The hardware isolation eliminates noisy-neighbor latency variance. Save time-slicing for dev environments where predictability doesn’t matter.

  3. Pair pod topology spread constraints with priority-based preemption. Spread handles failure domain resilience; priority classes handle resource protection. Together they let serving and training coexist on the same cluster without latency regression.

