MVP Factory
AI startup development

Krystian Wiewiór · 5 min read

TL;DR

Running mixed-priority LLM inference on shared GPU nodes is probably the highest-leverage infrastructure investment you can make at moderate scale. By combining Kubernetes device plugins, NVIDIA MPS for GPU time-slicing, and a custom priority queue that preempts batch jobs for real-time requests, we cut our GPU serving costs by 70% compared to API-based inference and 55% compared to naive dedicated-node deployments. This is the resource architecture that made it work.


The trigger

When Canva’s AI-powered Magic Layers feature recently replaced the word “Palestine” in user designs — a failure in their content generation pipeline — it was a sharp reminder of something every engineering team running AI at scale already knows: the infrastructure behind inference matters as much as the model itself. Reliability, priority handling, and resource governance aren’t optional. They’re the architecture.

From building production systems that serve both real-time user-facing requests and background batch workloads, I can tell you the single biggest cost lever is not model optimization. It’s scheduling.

The problem: GPU utilization is embarrassingly low

Most teams deploying LLMs on Kubernetes fall into one of two traps:

| Approach | Avg GPU Utilization | Monthly Cost (8x A100 cluster) | Latency P99 |
|---|---|---|---|
| Dedicated nodes per workload | 15-25% | ~$52,000 | Low, stable |
| Naive shared scheduling | 40-60% | ~$35,000 | Unpredictable spikes |
| Priority queue + MPS time-slicing | 70-85% | ~$16,000 | Low for P0, relaxed for batch |
| External API calls (comparable throughput) | N/A | ~$55,000 | Variable |

Dedicated nodes waste 75%+ of your GPU compute. Naive sharing creates latency nightmares. The priority-aware approach splits the difference — and actually lands it.

The architecture: three layers

Layer 1: NVIDIA MPS for GPU time-slicing

NVIDIA’s Multi-Process Service (MPS) lets multiple pods share a single GPU’s compute concurrently, rather than just splitting its memory. In your Kubernetes device plugin config:

# nvidia-device-plugin ConfigMap
version: v1
sharing:
  timeSlicing:
    renameByDefault: false
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # 4 virtual GPUs per physical GPU

This gives you 4 schedulable GPU slices per physical device. Each slice gets fair-share access to the device, and MPS funnels the clients’ kernels through a single shared CUDA context, which is far more efficient than container-level time-sharing with its full context switches.
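
Because renameByDefault is false, the slices are still advertised as nvidia.com/gpu, so a workload requests one exactly as it would a full device. A minimal pod spec might look like this (the image name is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: server
      image: registry.example.com/llm-server:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1  # one time-sliced replica, not a full physical GPU

The scheduler now sees 4 allocatable units per physical GPU, so four such pods can land on a single card.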

Layer 2: Priority classes and preemption

Define Kubernetes PriorityClasses that map to your workload tiers:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: realtime-inference
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "User-facing real-time LLM requests"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-inference
value: 100
preemptionPolicy: Never
globalDefault: false
description: "Background summarization, embeddings, batch jobs"

When a real-time inference pod needs GPU resources and the node is full, Kubernetes evicts batch pods automatically. Batch pods are designed to be idempotent and restart-safe; they pick up where they left off via checkpointed job queues.
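
Wiring a workload into a tier is a single field on the pod template. A sketch, with illustrative deployment and image names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: realtime-llm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: realtime-llm
  template:
    metadata:
      labels:
        app: realtime-llm
    spec:
      priorityClassName: realtime-inference  # may preempt batch-inference pods
      containers:
        - name: server
          image: registry.example.com/llm-server:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1

Batch deployments carry priorityClassName: batch-inference instead; because their preemptionPolicy is Never, they queue behind higher-priority pods rather than evicting anything themselves, while their low value leaves them eligible to be evicted.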

Layer 3: The custom priority queue

The scheduler alone isn’t enough. You need an application-level priority queue sitting in front of your inference servers. Most teams get this wrong by trying to solve it entirely at the Kubernetes layer. But pod scheduling operates at seconds-to-minutes granularity, while request-level prioritization needs millisecond decisions. Those are different problems.

We run a lightweight Go service that:

  1. Accepts inference requests tagged with priority (P0 real-time, P1 near-real-time, P2 batch)
  2. Routes P0 requests to a reserved capacity pool (guaranteed 30% of GPU slices)
  3. Allows P1/P2 to fill remaining capacity with preemption semantics
  4. Tracks per-tenant quotas via Redis-backed counters

The result: P0 latency stays under 200ms at P99, while batch throughput fills every idle GPU cycle.
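
The full service also handles the Redis-backed quotas, retries, and backpressure, but the core routing idea fits on a page. Here is a stripped-down Go sketch under simplified assumptions: one worker per GPU slice, hypothetical type and function names, and the reserved pool modeled as workers that only ever drain the P0 queue.

package main

import (
	"fmt"
	"time"
)

type Priority int

const (
	P0 Priority = iota // real-time, user-facing
	P1                 // near-real-time
	P2                 // batch
)

type Request struct {
	ID       string
	Priority Priority
}

// Dispatcher keeps one queue per tier. Reserved workers serve only P0
// (the guaranteed ~30% of slices); shared workers serve any tier, but
// always in strict priority order.
type Dispatcher struct {
	queues   [3]chan Request
	reserved int
	shared   int
}

func NewDispatcher(reserved, shared int) *Dispatcher {
	d := &Dispatcher{reserved: reserved, shared: shared}
	for i := range d.queues {
		d.queues[i] = make(chan Request, 1024)
	}
	return d
}

func (d *Dispatcher) Submit(r Request) { d.queues[r.Priority] <- r }

// next returns the highest-priority waiting request, polling so that a
// P0 request is never stuck behind an already-queued batch request.
func (d *Dispatcher) next() Request {
	for {
		for p := P0; p <= P2; p++ {
			select {
			case r := <-d.queues[p]:
				return r
			default:
			}
		}
		time.Sleep(time.Millisecond) // nothing queued; avoid busy-spinning
	}
}

func (d *Dispatcher) Run(serve func(Request)) {
	for i := 0; i < d.reserved; i++ { // reserved capacity: P0 only
		go func() {
			for r := range d.queues[P0] {
				serve(r)
			}
		}()
	}
	for i := 0; i < d.shared; i++ { // shared capacity: strict priority fill
		go func() {
			for {
				serve(d.next())
			}
		}()
	}
}

func main() {
	d := NewDispatcher(2, 6) // e.g. 8 slices: 2 reserved for P0, 6 shared
	d.Run(func(r Request) {
		fmt.Printf("serving %s (P%d)\n", r.ID, r.Priority)
		time.Sleep(50 * time.Millisecond) // stand-in for a model forward pass
	})
	d.Submit(Request{ID: "batch-001", Priority: P2})
	d.Submit(Request{ID: "user-042", Priority: P0})
	time.Sleep(500 * time.Millisecond)
}

Per-tenant quotas sit in front of Submit in the real service; they are omitted here to keep the sketch readable.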

The cost model

At moderate scale — roughly 2M-10M inference requests per month — self-hosting with this architecture breaks even against API pricing at around the 3M request mark. Beyond that, savings compound:

| Monthly Requests | API Cost (est.) | Self-Hosted (this arch) | Savings |
|---|---|---|---|
| 1M | $6,800 | $16,000 | -$9,200 (API wins) |
| 3M | $20,400 | $16,000 | $4,400 |
| 5M | $34,000 | $17,500 | $16,500 |
| 10M | $68,000 | $21,000 | $47,000 (69%) |

Infrastructure cost scales sub-linearly because GPU utilization increases with request volume. That’s the whole point of the architecture.

What to do with this

Enable MPS time-slicing before you buy more nodes. Most teams are running at 20% GPU utilization. NVIDIA MPS with 4x replicas per GPU can double or triple your effective capacity with a single ConfigMap change. It’s the cheapest win in this whole stack.

Separate scheduling concerns by timescale. Use Kubernetes PriorityClasses for pod-level preemption (seconds to minutes) and an application-level priority queue for request-level routing (milliseconds). Neither layer alone is sufficient, and I’ve watched teams burn weeks trying to force one layer to do both jobs.

Model your crossover point before committing. Self-hosted inference only wins at moderate scale. Below 3M requests/month, API calls are cheaper. Run the numbers with your actual token distributions and latency requirements before building infrastructure. I mean actually run them — not a back-of-napkin guess, but a week of production traffic logs through a cost simulator.
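
A toy version of that simulator, to make the crossover arithmetic concrete. Every constant below is an illustrative assumption loosely fitted to the cost table above, not a measured price; swap in your own token mix, instance pricing, and traffic profile.

package main

import "fmt"

func main() {
	// Illustrative assumptions only, roughly matching the cost table above.
	const apiCostPerRequest = 0.0068 // ~$6,800 per 1M requests
	const baseClusterCost = 15500.0  // fixed monthly cost of the shared GPU pool
	const marginalPerMillion = 550.0 // rough extra cost per additional 1M requests

	for _, millions := range []float64{1, 3, 5, 10} {
		api := millions * 1e6 * apiCostPerRequest
		selfHosted := baseClusterCost + millions*marginalPerMillion
		fmt.Printf("%5.0fM req/mo   API: $%8.0f   self-hosted: $%8.0f   delta: $%+9.0f\n",
			millions, api, selfHosted, api-selfHosted)
	}
}

The exact constants matter less than the shape: the fixed-cost floor of the cluster dominates until utilization rises, which is why the crossover sits in the low-millions range.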

The GPU cost problem in AI serving is real, but it’s an architecture problem, not a hardware problem. Schedule smarter before you spend bigger.

