TL;DR
GitHub-hosted runners are convenient but expensive at scale. I migrated to self-hosted runners on spot instances using actions-runner-controller (ARC), added persistent Gradle and Docker layer caches on shared NVMe volumes, and built cost-per-build dashboards to keep things honest. The result: 85% lower CI/CD spend with build reliability above 99%. The trick is treating your build pipeline as an engineering cost center, not a blank check.
The problem: nobody watches the CI bill
CI/CD is the last budget line most teams scrutinize. I’ve watched orgs obsess over shaving pennies from their application infrastructure while their build pipeline quietly burns thousands a month. GitHub-hosted runners bill per-minute with no volume discount, and costs scale linearly with team size and merge frequency.
The numbers make the case on their own:
| Runner type | vCPU | RAM | Cost/min (Linux) | Monthly cost (2,000 build-hrs) |
|---|---|---|---|---|
| GitHub-hosted (4-core) | 4 | 16 GB | $0.064 | ~$7,680 |
| Self-hosted on-demand (c6a.xlarge) | 4 | 8 GB | ~$0.025 | ~$3,000 |
| Self-hosted spot (c6a.xlarge) | 4 | 8 GB | ~$0.008 | ~$960 |
That bottom row is where the 85% reduction lives.
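A quick sanity check of the table's arithmetic, with the per-minute prices taken straight from the rows above:

```python
# 2,000 build-hours/month = 120,000 runner-minutes.
minutes = 2_000 * 60

github_hosted = minutes * 0.064  # GitHub-hosted 4-core rate from the table
spot = minutes * 0.008           # spot c6a.xlarge rate from the table

print(f"GitHub-hosted: ${github_hosted:,.0f}/month")  # $7,680
print(f"Spot:          ${spot:,.0f}/month")           # $960
print(f"Reduction:     {1 - spot / github_hosted:.1%}")  # 87.5%
```

The raw compute delta is 87.5%; storage and overhead for the caching layer eat a couple of points, which is where the headline 85% comes from.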
Architecture: actions-runner-controller on spot instances
actions-runner-controller (ARC) manages self-hosted runner pods in Kubernetes. Here’s how I set it up.
Cluster setup
Dedicate a node pool to CI runners using spot/preemptible instances. Taints keep production workloads off these nodes:
```yaml
# Spot node pool configuration
nodePool:
  name: ci-runners
  machineType: c6a.xlarge
  spotInstances: true
  taints:
    - key: workload-type
      value: ci
      effect: NoSchedule
  labels:
    role: ci-runner
```
ARC’s RunnerDeployment targets this pool with matching tolerations and a nodeSelector, so runners only land on spot nodes.
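A minimal sketch of that RunnerDeployment, using ARC's `RunnerDeployment` CRD — the repository name and replica count here are placeholders, and in practice a HorizontalRunnerAutoscaler would manage `replicas`:

```yaml
# RunnerDeployment pinned to the spot node pool via
# nodeSelector + toleration matching the taint above.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners
spec:
  replicas: 4                      # placeholder; autoscaler would own this
  template:
    spec:
      repository: my-org/my-repo   # placeholder
      labels:
        - ci-runner
      nodeSelector:
        role: ci-runner
      tolerations:
        - key: workload-type
          operator: Equal
          value: ci
          effect: NoSchedule
```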
Graceful preemption handling
Spot instances can be reclaimed with a two-minute warning. If you don’t handle this, builds get corrupted mid-run. The approach has three pieces:
- A termination handler DaemonSet (e.g., AWS’s open-source aws-node-termination-handler) watches the cloud provider’s metadata endpoint for interruption notices.
- On notice, the handler cordons the node and sends `SIGTERM` to the runner process.
- ARC’s runner reports the failure gracefully, and the workflow’s retry strategy re-queues the job on a healthy node.
```yaml
# Workflow with preemption-safe retry. GitHub Actions has no native
# job-level retry, so the re-queue lives in a wrapper step
# (nick-fields/retry is one widely used option).
jobs:
  build:
    runs-on: self-hosted
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Build (re-run on spot eviction)
        uses: nick-fields/retry@v3
        with:
          max_attempts: 2
          timeout_minutes: 25
          command: ./gradlew build
```
Spot eviction rates for compute-optimized instance families (which are widely available) tend to sit in the 3-8% range. With retry logic, actual build failures from preemption drop below 1%.
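The arithmetic behind that claim: if evictions are roughly independent across attempts, one retry squares the failure probability. Checking the quoted range:

```python
# A job only fails outright when both attempts are evicted, so
# (assuming independence) eviction rate p -> effective failure rate p**2.
for p in (0.03, 0.05, 0.08):
    print(f"eviction rate {p:.0%} -> effective failure rate {p * p:.2%}")
```

Even at the pessimistic 8% end, the effective failure rate is 0.64% — comfortably under 1%.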
The caching layer that makes it work
Spot savings are worthless if every evicted job restarts from scratch. You need persistent caching. Full stop.
Shared NVMe cache architecture
Provision a persistent volume (EBS io2 or local NVMe with a replication layer) mounted to all runner pods:
| Cache target | Typical size | Cold build | Warm build | Savings |
|---|---|---|---|---|
| Gradle dependencies + build cache | 2-5 GB | 8-12 min | 1-3 min | ~75% |
| Docker layer cache (BuildKit) | 5-15 GB | 6-10 min | 1-2 min | ~80% |
| Node modules (hashed) | 1-3 GB | 2-4 min | 10-20s | ~90% |
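The mount itself is unremarkable. A sketch of the claim and its mount in the runner pod template — the StorageClass name and size are placeholders, and note that block storage like EBS is single-node (`ReadWriteOnce`), so a multi-node pool needs either node-local NVMe per node or an RWX filesystem:

```yaml
# Placeholder StorageClass "ci-cache-ssd"; size to the cache budget
# from the table above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ci-cache
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ci-cache-ssd
  resources:
    requests:
      storage: 50Gi
---
# Then in the runner pod spec:
#   volumes:
#     - name: ci-cache
#       persistentVolumeClaim:
#         claimName: ci-cache
#   volumeMounts:
#     - name: ci-cache
#       mountPath: /mnt/ci-cache
```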
The Gradle build cache matters most for Kotlin/Android projects. Point it at the shared volume:
```properties
# gradle.properties
org.gradle.caching=true
# The local cache directory is not a property; set it in settings.gradle(.kts):
#   buildCache { local { directory = File("/mnt/ci-cache/gradle/build-cache") } }
```
For Docker, point BuildKit at the shared cache:
```shell
docker buildx build \
  --cache-from type=local,src=/mnt/ci-cache/docker \
  --cache-to type=local,dest=/mnt/ci-cache/docker,mode=max \
  .
```
Cache eviction
Without eviction, caches grow forever. A daily CronJob that prunes entries older than 7 days and caps total size at a fixed threshold handles this. Simple LRU based on access time works fine.
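A sketch of that CronJob, assuming the cache volume is claimed as `ci-cache` and using a simple age-based prune (the size cap would be a second pass that deletes in least-recently-accessed order until under the threshold):

```yaml
# Nightly prune: delete cache entries not accessed in 7 days.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ci-cache-prune
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: prune
              image: busybox:1.36
              command: ["sh", "-c", "find /mnt/ci-cache -type f -atime +7 -delete"]
              volumeMounts:
                - name: ci-cache
                  mountPath: /mnt/ci-cache
          volumes:
            - name: ci-cache
              persistentVolumeClaim:
                claimName: ci-cache   # placeholder claim name
```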
Cost-per-build metrics
If you’re not measuring cost per build, you’re guessing. Export these from every build:
- `cost_per_build` — (instance cost/min × duration) + amortized storage cost
- `cache_hit_rate` — percentage of tasks served from cache
- `spot_eviction_rate` — evictions / total jobs
- `queue_wait_time` — time from trigger to runner assignment
Push them to Prometheus via a post-job hook and build Grafana dashboards. When cost-per-build trends upward, you can see exactly which cache degraded or which workflow lost parallelism. No guessing, no “I think builds got slower.”
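As a sketch of the post-job hook's core — metric names come from the list above, while the prices and the Pushgateway wiring are assumptions; this just shows the computation and the Prometheus text exposition format:

```python
# Hypothetical post-job cost computation. A hook would push these lines
# to a Pushgateway; prices here are illustrative.
INSTANCE_COST_PER_MIN = 0.008    # spot c6a.xlarge, from the table above
STORAGE_COST_PER_BUILD = 0.002   # amortized cache-volume cost (assumed)

def build_metrics(duration_min: float, cache_hits: int, cache_tasks: int) -> str:
    """Render the two core metrics in Prometheus text format."""
    cost = INSTANCE_COST_PER_MIN * duration_min + STORAGE_COST_PER_BUILD
    hit_rate = cache_hits / cache_tasks if cache_tasks else 0.0
    return (
        f"cost_per_build {cost:.4f}\n"
        f"cache_hit_rate {hit_rate:.3f}\n"
    )

print(build_metrics(duration_min=12, cache_hits=180, cache_tasks=200))
```

For a 12-minute build with a 90% cache hit rate, this emits `cost_per_build 0.0980` and `cache_hit_rate 0.900`.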
Where to start
Start with ARC and a spot node pool. Even a bare-bones setup with retry logic cuts costs by 60%+ with minimal reliability risk. The infrastructure payoff is immediate.
Before you scale parallelism, invest in shared caches. Adding more runners without caching just multiplies cold-build costs. Gradle build cache and Docker layer cache give you the biggest return.
And instrument cost-per-build from day one. Dashboards and alerts make optimization conversations concrete instead of vibes-based, and they keep savings durable as your team grows. Without them, costs creep back up and nobody notices until the next quarterly review.