Zero-downtime schema migrations in production PostgreSQL
Meta description: Advisory locks, ghost table swaps, and batched backfills let you ALTER TABLE on massive PostgreSQL databases without scheduling a maintenance window.
Tags: backend, architecture, devops, cicd, saas
TL;DR
Certain ALTER TABLE operations on large PostgreSQL tables acquire an ACCESS EXCLUSIVE lock and rewrite every row, blocking all reads and writes for minutes. On a 500GB multi-tenant SaaS database, that means downtime. Tools like pg_osc and pgroll solve this with a ghost table copy-and-swap pattern coordinated by advisory locks, trigger-based row sync, and batched backfills. This post covers how the pipeline works and how to embed it into your Ktor CI/CD flow.
The problem: some ALTER TABLE operations rewrite the entire table
Not every schema change is dangerous. PostgreSQL 11+ made ADD COLUMN ... DEFAULT val a catalog-only operation for constant (non-volatile) defaults: it completes in milliseconds regardless of table size. But when you run ALTER TABLE orders ALTER COLUMN id TYPE bigint on a table with hundreds of millions of rows, PostgreSQL must rewrite every single row to widen the column. It holds an ACCESS EXCLUSIVE lock for the duration, and every concurrent SELECT, INSERT, and UPDATE queues behind it.
In my experience, the lock wait alone can cascade into connection pool exhaustion within seconds. Your API starts returning 503s, your health checks fail, and Kubernetes starts cycling pods, which only makes things worse.
| Operation | Lock Level | Blocks Reads | Blocks Writes | Notes |
|---|---|---|---|---|
| ADD COLUMN (no default) | ACCESS EXCLUSIVE | Briefly (sub-second) | Briefly (sub-second) | Catalog-only, all PG versions |
| ADD COLUMN DEFAULT val (PG 11+) | ACCESS EXCLUSIVE | Briefly (sub-second) | Briefly (sub-second) | Catalog-only since PG 11 |
| ADD COLUMN DEFAULT val (PG < 11) | ACCESS EXCLUSIVE | Yes, full rewrite | Yes, full rewrite | Rewrites every row |
| ALTER COLUMN TYPE | ACCESS EXCLUSIVE | Yes, full rewrite | Yes, full rewrite | Triggers complete table rewrite |
| VALIDATE CONSTRAINT | SHARE UPDATE EXCLUSIVE | No | No | Scans all rows without blocking DML |
| CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | No | No | Safe but slow |
| Ghost table swap (pg_osc) | Advisory lock only | No | No | Application-transparent |
The operations that genuinely hurt (column type changes, adding constraints without the NOT VALID / VALIDATE two-step, backfilling computed columns) are exactly where online schema change tooling pays off.
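If you want CI to flag dangerous migrations before they ship, the table above can be encoded as a lookup. This is an illustrative sketch only; the SchemaOp names and the needsOnlineTooling helper are my own, not part of pg_osc or pgroll:

```kotlin
// Illustrative classifier for the lock behavior summarized above.
// SchemaOp and needsOnlineTooling are hypothetical names, not a real API.
enum class SchemaOp {
    ADD_COLUMN_NO_DEFAULT,
    ADD_COLUMN_DEFAULT_PG11,
    ADD_COLUMN_DEFAULT_PRE_PG11,
    ALTER_COLUMN_TYPE,
    CREATE_INDEX_CONCURRENTLY,
}

// True when the operation rewrites the table and should go through
// a ghost-table tool instead of a raw ALTER TABLE.
fun needsOnlineTooling(op: SchemaOp): Boolean = when (op) {
    SchemaOp.ADD_COLUMN_NO_DEFAULT,
    SchemaOp.ADD_COLUMN_DEFAULT_PG11,
    SchemaOp.CREATE_INDEX_CONCURRENTLY -> false
    SchemaOp.ADD_COLUMN_DEFAULT_PRE_PG11,
    SchemaOp.ALTER_COLUMN_TYPE -> true
}
```

A pre-merge check that greps migration files and maps them onto this enum catches the worst offenders cheaply.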
The ghost table strategy: how pg_osc and pgroll work
The core pattern is straightforward:
- Create a shadow (ghost) table mirroring the original schema, plus your desired changes.
- Install a trigger on the original table that replicates every INSERT, UPDATE, and DELETE to the ghost table in real time.
- Backfill existing rows from the original to the ghost in small batches.
- Swap the tables using ALTER TABLE ... RENAME inside a brief transaction.
- Drop the old table once all connections have drained.
Advisory lock coordination
One thing I see teams get wrong: running two migration workers simultaneously against the same table will corrupt data. Both pg_osc and pgroll use PostgreSQL advisory locks (pg_advisory_lock) to guarantee single-writer semantics.
```sql
-- Worker acquires a migration lock scoped to the table OID.
-- pg_advisory_lock(int, int) takes two int4 keys; regclass must go
-- through oid before it can be folded into an int.
SELECT pg_advisory_lock(hashtext('migrations'), 'orders'::regclass::oid::int);
```
This is non-blocking to application queries. Advisory locks exist in a separate namespace from table locks. If a second migration worker attempts to start, it blocks on the advisory lock, not on the table.
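If you coordinate workers from application code rather than SQL, the same single-writer guarantee needs stable int keys that every worker computes identically. A minimal sketch, assuming app-side key derivation with CRC32 standing in for Postgres's hashtext (the advisoryKey name and the namespace string are my own):

```kotlin
import java.util.zip.CRC32

// Derive a stable 32-bit advisory-lock key from a namespace and table name.
// CRC32 is an arbitrary but deterministic choice; any stable hash works
// as long as every migration worker computes keys the same way.
fun advisoryKey(namespace: String, table: String): Int {
    val crc = CRC32()
    crc.update("$namespace:$table".toByteArray(Charsets.UTF_8))
    return crc.value.toInt() // fold the unsigned 32-bit CRC into an Int
}
```

A worker would then pass two such keys to SELECT pg_advisory_lock(?, ?) through its database driver before touching the table.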
Trigger-based row synchronization
During the backfill phase, the trigger on the source table captures concurrent writes. Below is a simplified example for a migration adding a region_code column:
```sql
-- Simplified: production tools generate this dynamically to cover
-- every column and also handle DELETE.
CREATE FUNCTION ghost_sync() RETURNS trigger AS $$
BEGIN
  INSERT INTO orders_ghost SELECT NEW.*
  ON CONFLICT (id) DO UPDATE
    SET region_code = EXCLUDED.region_code,
        updated_at  = EXCLUDED.updated_at,
        amount      = EXCLUDED.amount;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_ghost_sync
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION ghost_sync();
```
In practice, pg_osc generates this trigger dynamically to cover all columns. The ON CONFLICT ... DO UPDATE pattern ensures that a row the application writes after the backfill batch has already copied it still ends up with the latest state.
Batched backfill with progress tracking
Backfilling 500 million rows in one transaction would blow out WAL and memory. Instead, the migration processes rows in configurable batches, typically 10,000-50,000 rows per transaction, with a throttle delay between batches to limit replication lag.
```kotlin
// Ktor: migration as a coroutine job
launch(Dispatchers.IO) {
    migrationService.backfillInBatches(
        sourceTable = "orders",
        ghostTable = "orders_ghost",
        batchSize = 25_000,
        throttleMs = 100
    )
}
```
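Under the hood, a backfillInBatches like this typically walks the primary key with keyset pagination rather than OFFSET. A sketch of the batch-boundary logic (nextBatchRange is a hypothetical helper of mine, not pg_osc's actual code):

```kotlin
// Compute the next keyset-pagination window, or null when the backfill
// is done. Each window maps to one transaction, roughly:
//   INSERT INTO orders_ghost SELECT * FROM orders
//   WHERE id BETWEEN :start AND :end
//   ON CONFLICT (id) DO NOTHING;
fun nextBatchRange(lastProcessedId: Long, maxId: Long, batchSize: Long): LongRange? {
    if (lastProcessedId >= maxId) return null
    val start = lastProcessedId + 1
    val end = minOf(lastProcessedId + batchSize, maxId)
    return start..end
}
```

Note the DO NOTHING: if the trigger already copied a newer version of a row, the backfill's older snapshot must lose, not overwrite it.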
Progress tracking is as simple as recording the last processed primary key and exposing it via a /migrations/status endpoint. Your CI/CD pipeline can poll this to gate deployments on migration completion.
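The status payload itself can stay tiny. A sketch with hypothetical names (MigrationStatus and statusOf are mine, mirroring the last-processed-key idea above):

```kotlin
// Progress derived from the last processed primary key.
// Field names (table, phase, progress) are illustrative.
data class MigrationStatus(val table: String, val phase: String, val progress: Double)

fun statusOf(table: String, phase: String, lastProcessedId: Long, maxId: Long): MigrationStatus {
    // An empty table has nothing left to backfill.
    val progress = if (maxId == 0L) 1.0 else lastProcessedId.toDouble() / maxId
    return MigrationStatus(table, phase, progress)
}
```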
CI/CD integration: the Ktor pattern
The migration pipeline fits naturally into a deploy-then-migrate workflow. If you’re on Spring Boot, the same principles apply: Flyway or Liquibase can invoke pg_osc as an external process within a migration script. The Ktor-native approach uses coroutines for clean orchestration.
- Deploy new code that is backward-compatible with both old and new schemas.
- Trigger the migration via a post-deploy CI step (./gradlew migrateOnline).
- Poll progress until the swap completes.
- Deploy cleanup code that removes backward-compatibility shims.
```kotlin
// Ktor route: migration status for CI/CD polling
routing {
    get("/migrations/status") {
        val status = migrationService.currentStatus()
        call.respond(status) // { "table": "orders", "progress": 0.73, "phase": "backfill" }
    }
    post("/migrations/trigger") {
        migrationService.startOnlineMigration("orders", "V12__widen_order_id.sql")
        call.respond(HttpStatusCode.Accepted)
    }
}
```
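On the CI side, the gate that holds back the cleanup deploy is just a poll loop against that status endpoint. A sketch in plain Kotlin, with the HTTP GET abstracted away behind a function parameter (gateOnMigration and the phase strings are my own naming, not a real pipeline API):

```kotlin
// Poll until the migration reports "complete", giving up after maxAttempts.
// `poll` stands in for an HTTP GET against /migrations/status that returns
// the phase field; a real CI step would also sleep between attempts.
fun gateOnMigration(poll: () -> String, maxAttempts: Int): Boolean {
    repeat(maxAttempts) {
        when (poll()) {
            "complete" -> return true
            "failed" -> return false
            else -> Unit // "backfill" or "swap": still running, try again
        }
    }
    return false // timed out: do not deploy the cleanup code
}
```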
Automatic rollback
If the backfill encounters errors or replication lag exceeds a threshold, the pipeline drops the ghost table and releases the advisory lock. The original table is untouched. This is the real operational win: failure is always safe.
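The abort decision itself reduces to a couple of thresholds. A sketch (the shouldAbort name and the 5-second default are illustrative choices of mine, not pg_osc defaults):

```kotlin
// Abort (drop the ghost table, release the advisory lock) when the
// backfill has errored or replication lag crosses the threshold.
// The original table is never modified, so aborting is always safe.
fun shouldAbort(errorCount: Int, replicationLagMs: Long, maxLagMs: Long = 5_000): Boolean =
    errorCount > 0 || replicationLagMs > maxLagMs
```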
Operational reality
With ghost table swaps, your p99 API latency during migration stays within single-digit percentage points of normal operation. No maintenance window. No 2 AM deploy. Long-running migrations on massive tables can take hours, but your users never notice.
| Metric | Raw ALTER TABLE | Ghost table swap |
|---|---|---|
| p99 latency during migration | Timeout / 503 | +3-5% above baseline |
| Write availability | 0% (locked) | 100% |
| Rollback safety | Manual, risky | Automatic, drop ghost |
| Total wall time (500M rows) | 8-12 min (locked) | 2-4 hours (online) |
You trade wall-clock time for continuous availability. On any production system serving real users, that tradeoff isn't a close call.
What to actually do
- Know which operations require ghost-table tooling. Adding a nullable column, or a column with a constant default on PG 11+, is catalog-only and instant. Column type changes and pre-PG 11 defaults rewrite the table, and adding a constraint without NOT VALID blocks writes while it scans. Use pg_osc, pgroll, or equivalent for those.
- Gate your deployments on migration status. Expose a progress endpoint and integrate it into your CI/CD pipeline so cleanup code only deploys after the swap completes.
- Batch your backfills and throttle aggressively. A 100ms pause between 25K-row batches adds minutes to the total migration time but prevents replication lag spikes that can cascade into replica failovers.
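The throttle cost in that last point is easy to quantify. For the 500M-row example, a sketch of the napkin math (throttleOverheadMs is my own helper name):

```kotlin
// Total pause time added by throttling: one delay per batch.
fun throttleOverheadMs(totalRows: Long, batchSize: Long, throttleMs: Long): Long {
    val batches = (totalRows + batchSize - 1) / batchSize // ceiling division
    return batches * throttleMs
}
// 500M rows / 25K per batch = 20,000 batches; at 100ms each, throttling
// adds roughly 33 minutes of pure sleep to the total migration time.
```

Thirty-three minutes of extra wall-clock time on a migration nobody notices is a far better deal than a lag-induced replica failover.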