MVP Factory
ai startup development

Krystian Wiewiór · 4 min read

TL;DR

GitHub-hosted runners are convenient but expensive at scale. I migrated to self-hosted runners on spot instances using actions-runner-controller (ARC), added persistent Gradle and Docker layer caches on shared NVMe volumes, and built cost-per-build dashboards to keep things honest. The result: 85% lower CI/CD spend with build reliability above 99%. The trick is treating your build pipeline as an engineering cost center, not a blank check.


The problem: nobody watches the CI bill

CI/CD is the last budget line most teams scrutinize. I’ve watched orgs obsess over shaving pennies from their application infrastructure while their build pipeline quietly burns thousands a month. GitHub-hosted runners bill per-minute with no volume discount, and costs scale linearly with team size and merge frequency.

The numbers make the case on their own:

| Runner type | vCPU | RAM | Cost/min (Linux) | Monthly cost (2,000 build-hrs) |
|---|---|---|---|---|
| GitHub-hosted (4-core) | 4 | 16 GB | $0.064 | ~$7,680 |
| Self-hosted on-demand (c6a.xlarge) | 4 | 8 GB | ~$0.025 | ~$3,000 |
| Self-hosted spot (c6a.xlarge) | 4 | 8 GB | ~$0.008 | ~$960 |

That bottom row is where the 85% reduction lives.


Architecture: actions-runner-controller on spot instances

actions-runner-controller (ARC) manages self-hosted runner pods in Kubernetes. Here’s how I set it up.

Cluster setup

Dedicate a node pool to CI runners using spot/preemptible instances. Taints keep production workloads off these nodes:

```yaml
# Spot node pool configuration
nodePool:
  name: ci-runners
  machineType: c6a.xlarge
  spotInstances: true
  taints:
    - key: workload-type
      value: ci
      effect: NoSchedule
  labels:
    role: ci-runner
```

ARC’s RunnerDeployment targets this pool with matching tolerations and a nodeSelector, so runners only land on spot nodes.
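A RunnerDeployment matching that pool might look like the sketch below; the repository name and replica count are placeholders, and in practice you would likely drive replicas with ARC's HorizontalRunnerAutoscaler instead of a fixed number:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners
spec:
  replicas: 4                      # placeholder; autoscaling is the usual choice
  template:
    spec:
      repository: my-org/my-repo   # placeholder
      nodeSelector:
        role: ci-runner            # matches the node pool label
      tolerations:
        - key: workload-type
          operator: Equal
          value: ci
          effect: NoSchedule       # matches the node pool taint
```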

Graceful preemption handling

Spot instances can be reclaimed with a two-minute warning. If you don’t handle this, builds get corrupted mid-run. The approach has three pieces:

  1. A termination handler DaemonSet watches the cloud provider’s metadata endpoint for interruption notices.
  2. On notice, the handler cordons the node and sends SIGTERM to the runner process.
  3. ARC’s runner reports failure gracefully, and the workflow’s retry strategy re-queues the job on a healthy node.
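Steps 1 and 2 can be sketched as a small poll loop. The metadata endpoint shown is the AWS spot interruption notice; GCP and Azure expose equivalent preemption signals, and the process name is ARC-specific, so treat this as an illustration rather than a drop-in handler:

```shell
#!/bin/sh
# Sketch of the termination handler's poll loop (AWS IMDS example endpoint).

poll_once() {
  # Succeeds (exit 0) when the spot interruption notice is present.
  curl -sf "http://169.254.169.254/latest/meta-data/spot/instance-action" \
    >/dev/null
}

drain_node() {
  kubectl cordon "$NODE_NAME"     # stop new runner pods landing here
  pkill -TERM -f actions-runner   # SIGTERM lets the runner report cleanly
}

# Main loop in the DaemonSet container:
#   while ! poll_once; do sleep 5; done
#   drain_node
```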
```yaml
# Preemption-safe job settings. Note: GitHub Actions has no native
# per-job retry key; re-queue evicted jobs by re-running failed jobs
# (e.g. `gh run rerun <run-id> --failed`) or via a retry wrapper action.
jobs:
  build:
    runs-on: self-hosted
    timeout-minutes: 30   # keep jobs short so a retry is cheap
```

Spot eviction rates for widely available compute-optimized instance families tend to sit between 3% and 8%. With retry logic, independent evictions compound away: even at an 8% eviction rate, losing the same job twice in a row happens well under 1% of the time.


The caching layer that makes it work

Spot savings are worthless if every evicted job restarts from scratch. You need persistent caching. Full stop.

Shared NVMe cache architecture

Provision a persistent volume (EBS io2 or local NVMe with a replication layer) mounted to all runner pods:
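A claim for that volume might look like this sketch. The storage class name is a placeholder, and note that mounting one volume into all runner pods requires ReadWriteMany, which plain EBS io2 (single-attach) cannot provide on its own; it needs a shared-filesystem layer on top:

```yaml
# Hypothetical shared cache claim; storageClassName is an assumption.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ci-cache
spec:
  accessModes: ["ReadWriteMany"]   # required for a cache shared across pods
  storageClassName: shared-nvme    # placeholder
  resources:
    requests:
      storage: 100Gi
```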

| Cache target | Typical size | Cold build | Warm build | Savings |
|---|---|---|---|---|
| Gradle dependencies + build cache | 2-5 GB | 8-12 min | 1-3 min | ~75% |
| Docker layer cache (BuildKit) | 5-15 GB | 6-10 min | 1-2 min | ~80% |
| Node modules (hashed) | 1-3 GB | 2-4 min | 10-20 s | ~90% |

The Gradle build cache matters most for Kotlin/Android projects. Enable caching in gradle.properties, then point the local cache at the shared volume in settings.gradle.kts (the cache directory is configured in the settings file, not in gradle.properties):

```properties
# gradle.properties
org.gradle.caching=true
```

```kotlin
// settings.gradle.kts
buildCache {
    local {
        directory = File("/mnt/ci-cache/gradle/build-cache")
    }
}
```

For Docker, point BuildKit at the shared cache:

```shell
docker buildx build \
  --cache-from type=local,src=/mnt/ci-cache/docker \
  --cache-to type=local,dest=/mnt/ci-cache/docker,mode=max \
  .
```

Cache eviction

Without eviction, caches grow forever. A daily CronJob that prunes entries older than 7 days and caps total size at a fixed threshold handles this. Simple LRU based on access time works fine.
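The CronJob's prune step can be a few lines of shell. This is a sketch of the access-time LRU described above; the 50 GB cap is an illustrative threshold, and it assumes GNU find/du and a filesystem mounted with access-time updates enabled:

```shell
#!/bin/sh
# Sketch of the daily cache prune job (GNU coreutils assumed).
prune_cache() {
  root="$1"; max_gb="$2"
  # LRU by access time: drop files untouched for 7+ days.
  find "$root" -type f -atime +7 -delete
  # If still over the size cap, delete the oldest-accessed files first.
  while [ "$(du -s --block-size=1G "$root" | cut -f1)" -gt "$max_gb" ]; do
    oldest=$(find "$root" -type f -printf '%A@ %p\n' | sort -n | head -n 1 \
      | cut -d' ' -f2-)
    [ -n "$oldest" ] || break
    rm -f "$oldest"
  done
}

# In the CronJob container: prune_cache /mnt/ci-cache 50
```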


Cost-per-build metrics

If you’re not measuring cost per build, you’re guessing. Export these from every build:

  • cost_per_build — (instance cost/min × build duration) + storage cost
  • cache_hit_rate — percentage of tasks served from cache
  • spot_eviction_rate — evictions / total jobs
  • queue_wait_time — time from trigger to runner assignment

Push them to Prometheus via a post-job hook and build Grafana dashboards. When cost-per-build trends upward, you can see exactly which cache degraded or which workflow lost parallelism. No guessing, no “I think builds got slower.”
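Such a post-job hook can be a short script that renders the metrics in Prometheus exposition format and POSTs them to a Pushgateway. The gateway URL, label names, and metric values below are placeholders, not a prescribed schema:

```shell
#!/bin/sh
# Sketch of a post-job metrics hook; the Pushgateway URL is an assumption.
render_metrics() {
  # $1 = cost in dollars, $2 = cache hit rate (0-1), $3 = queue wait seconds
  printf 'cost_per_build %s\n' "$1"
  printf 'cache_hit_rate %s\n' "$2"
  printf 'queue_wait_time_seconds %s\n' "$3"
}

push_metrics() {
  render_metrics "$@" | curl -s --data-binary @- \
    "http://pushgateway:9091/metrics/job/ci_build/workflow/${GITHUB_WORKFLOW:-unknown}"
}
```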


Where to start

Start with ARC and a spot node pool. Even a bare-bones setup with retry logic cuts costs by 60%+ with minimal reliability risk. The infrastructure payoff is immediate.

Before you scale parallelism, invest in shared caches. Adding more runners without caching just multiplies cold-build costs. Gradle build cache and Docker layer cache give you the biggest return.

And instrument cost-per-build from day one. Dashboards and alerts make optimization conversations concrete instead of vibes-based, and they keep savings durable as your team grows. Without them, costs creep back up and nobody notices until the next quarterly review.

