# Zero Downtime Database Migrations: A Practical Guide for PostgreSQL
## TL;DR
Schema migrations are among the most common causes of unplanned downtime in PostgreSQL-backed services. The fix isn't avoiding migrations. It's understanding which DDL operations acquire exclusive locks and restructuring your migrations to dodge them. This guide covers the patterns I use in production: concurrent index creation, safe column additions, view-based renames, and tooling with golang-migrate and Flyway.
## The problem most teams ignore
In my experience building production systems, the conversation about zero downtime usually focuses on blue-green deployments and rolling updates. But the database is where things actually break. A single ALTER TABLE ... ADD COLUMN with a volatile default on a 50M-row table can lock writes for minutes. If you’re handling thousands of requests per second, that’s a full outage.
Most teams get this wrong because they treat migrations as atomic events instead of multi-phase rollouts.
## The lock problem
PostgreSQL uses an AccessExclusiveLock for many DDL operations. This lock blocks all concurrent queries, reads and writes alike.
| Operation | Lock Type | Blocks Reads? | Blocks Writes? | Safe at Scale? |
|---|---|---|---|---|
| `ADD COLUMN` (no default) | AccessExclusive | Yes (briefly) | Yes (briefly) | Yes |
| `ADD COLUMN ... DEFAULT x` (PG 11+) | AccessExclusive | Yes (briefly) | Yes (briefly) | Yes |
| `ADD COLUMN ... DEFAULT x` (PG <11) | AccessExclusive | Yes (long) | Yes (long) | No |
| `CREATE INDEX` | Share | No | Yes | No |
| `CREATE INDEX CONCURRENTLY` | ShareUpdateExclusive | No | No | Yes |
| `DROP COLUMN` | AccessExclusive | Yes (briefly) | Yes (briefly) | Yes |
| `ALTER COLUMN TYPE` | AccessExclusive | Yes (long) | Yes (long) | No |
The pattern is straightforward: any operation that rewrites the table or scans it fully under an exclusive lock is dangerous.
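For automated migration review, the table above can be encoded as data. A minimal sketch in Python, assuming you normalize DDL to these operation labels yourself; this is not an exhaustive PostgreSQL lock matrix.

```python
# Encodes the safety column of the lock table above.
# Labels are simplified; a real linter would parse the DDL.
SAFE_AT_SCALE = {
    "ADD COLUMN": True,                    # brief AccessExclusive, catalog-only
    "ADD COLUMN DEFAULT (PG 11+)": True,   # default stored in the catalog
    "ADD COLUMN DEFAULT (PG <11)": False,  # full table rewrite
    "CREATE INDEX": False,                 # blocks writes for the whole build
    "CREATE INDEX CONCURRENTLY": True,
    "DROP COLUMN": True,                   # catalog-only, brief lock
    "ALTER COLUMN TYPE": False,            # full table rewrite
}

def is_safe(operation):
    # Unknown DDL is treated as unsafe by default: fail closed.
    return SAFE_AT_SCALE.get(operation, False)
```

Failing closed matters here: a new, unclassified operation should block the pipeline, not sail through.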
## Pattern 1: Safe column addition
PostgreSQL 11+ stores the default value in the catalog instead of rewriting every row, as long as the default is non-volatile (a volatile default like `random()` still forces a full table rewrite). This makes `ADD COLUMN ... DEFAULT` nearly instant:
```sql
-- Safe on PostgreSQL 11+ (non-volatile default)
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';
```

For PostgreSQL versions below 11, split this into multiple steps:

```sql
ALTER TABLE orders ADD COLUMN status TEXT;

-- Backfill in batches, repeating for each id range
UPDATE orders SET status = 'pending' WHERE status IS NULL AND id BETWEEN 1 AND 100000;

-- After the backfill completes. Note: SET NOT NULL scans the whole table
-- under an AccessExclusive lock; on PostgreSQL 12+ you can add a
-- CHECK (status IS NOT NULL) NOT VALID constraint, VALIDATE it, and then
-- SET NOT NULL without the long scan.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```
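The batching step above can be sketched as a generator that walks the id space. Table and column names come from the example; the batch size is an assumption to tune against your write load.

```python
def backfill_batches(table, column, value, max_id, batch_size):
    """Yield batched UPDATE statements so each one touches a bounded
    number of rows and holds row locks only briefly."""
    start = 1
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (
            f"UPDATE {table} SET {column} = '{value}' "
            f"WHERE {column} IS NULL AND id BETWEEN {start} AND {end};"
        )
        start = end + 1

# Each statement commits independently, so autovacuum can keep up and
# no single transaction holds locks across the whole table.
for stmt in backfill_batches("orders", "status", "pending", 250_000, 100_000):
    print(stmt)
```

Run each statement in its own transaction with a short pause between batches if replication lag is a concern.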
## Pattern 2: Concurrent index creation
Standard CREATE INDEX holds a lock that blocks writes for the entire build duration. On a large table, that can be minutes or hours.
```sql
-- Blocks writes for the whole build -- never use this in production
CREATE INDEX idx_orders_status ON orders(status);

-- Non-blocking -- use this instead
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);
```
One thing that trips people up: `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block. With golang-migrate, isolate the statement in its own migration file; a file containing multiple statements is sent to the server as one implicit transaction, and the index build will fail. If you use goose instead, annotate the file:

```sql
-- +goose NO TRANSACTION
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);
```

For Flyway users, set `executeInTransaction=false` for that migration, either in a per-script configuration file or via the global `flyway.executeInTransaction` setting.
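A cheap CI guard catches the transaction mistake before it reaches production. This sketch only does substring checks; `NO TRANSACTION` matches goose's annotation, so adapt the marker to whatever convention your tool uses.

```python
def check_migration(sql_text):
    """Flag migrations that would break at apply time: CREATE INDEX
    CONCURRENTLY refuses to run inside a transaction block, so the
    file must carry a no-transaction marker."""
    upper = sql_text.upper()
    uses_concurrent = "CONCURRENTLY" in upper
    non_transactional = "NO TRANSACTION" in upper
    if uses_concurrent and not non_transactional:
        return "error: CONCURRENTLY inside a transactional migration"
    return "ok"
```

Wire it into CI over every file in your migrations directory; the check is crude but fails loudly on exactly the mistake that pages people.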
## Pattern 3: Renaming columns via views
Renaming a column directly is fast (it only touches the catalog), but it instantly breaks every query referencing the old name. Instead, use an expand-and-contract approach:
```sql
-- Phase 1: Add the new column and backfill (batch in production)
ALTER TABLE users ADD COLUMN full_name TEXT;
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Keep both columns in sync with a BEFORE INSERT OR UPDATE trigger
-- (same shape as the one in Pattern 4 below)

-- Phase 2: A compatibility view for readers you cannot migrate yet.
-- List the columns explicitly (id and email are illustrative here);
-- SELECT * would collide with the still-existing name column.
CREATE VIEW users_compat AS
  SELECT id, email, full_name AS name FROM users;

-- Phase 3: Migrate application code to use full_name

-- Phase 4: Drop the old column after all consumers have migrated
ALTER TABLE users DROP COLUMN name;
```
It’s more work than a simple rename, but it decouples the schema change from the application deployment. Nobody gets paged.
## Pattern 4: Safe type changes
Changing a column type (e.g., INT to BIGINT) rewrites the entire table. The safe approach mirrors the rename pattern:
```sql
-- Step 1: Add the new column
ALTER TABLE events ADD COLUMN event_id_new BIGINT;

-- Step 2: Dual-write via trigger so new and updated rows stay in sync
CREATE OR REPLACE FUNCTION sync_event_id() RETURNS TRIGGER AS $$
BEGIN
  NEW.event_id_new := NEW.event_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires PostgreSQL 11+; use EXECUTE PROCEDURE on 10 and earlier
CREATE TRIGGER trg_sync_event_id BEFORE INSERT OR UPDATE ON events
  FOR EACH ROW EXECUTE FUNCTION sync_event_id();

-- Step 3: Backfill existing rows in batches
-- Step 4: Swap the columns inside one short transaction, then drop the old one
```
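Step 4 is short enough to generate and review as a single transaction. A sketch, assuming the column and trigger names from the example; `SET LOCAL lock_timeout` keeps the swap from queuing behind long-running queries, and the renames themselves are catalog-only.

```python
def swap_columns_sql(table, old_col, new_col, trigger):
    """Emit step 4 as one short transaction: drop the dual-write
    trigger, then rename both columns. The renames are catalog-only,
    so the AccessExclusive lock is held for milliseconds."""
    return "\n".join([
        "BEGIN;",
        "SET LOCAL lock_timeout = '5s';",
        f"DROP TRIGGER {trigger} ON {table};",
        f"ALTER TABLE {table} RENAME COLUMN {old_col} TO {old_col}_old;",
        f"ALTER TABLE {table} RENAME COLUMN {new_col} TO {old_col};",
        "COMMIT;",
    ])

print(swap_columns_sql("events", "event_id", "event_id_new", "trg_sync_event_id"))
```

Keeping the old column around as `event_id_old` for a release cycle gives you a cheap rollback path before the final `DROP COLUMN`.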
## Tooling: golang-migrate vs Flyway
| Feature | golang-migrate | Flyway |
|---|---|---|
| Language | Go | Java/JVM |
| Transaction control | One implicit transaction per file | Config flag (`executeInTransaction`) |
| Concurrent DDL support | Isolate statement in its own file | `executeInTransaction=false` |
| Rollback support | Down migrations | Undo migrations (Teams edition) |
| CI/CD integration | CLI binary, lightweight | Plugin ecosystem (Maven, Gradle) |
Both tools work. I reach for golang-migrate in Go services and Flyway in JVM-heavy stacks. The one thing that actually matters is that your tool supports non-transactional migrations. Without that, CREATE INDEX CONCURRENTLY is impossible.
## Setting lock timeouts
Always set a lock timeout to prevent a migration from waiting indefinitely for a lock:
```sql
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0;
```
If the lock can’t be acquired within 5 seconds, the statement fails cleanly instead of queuing behind long-running queries and cascading into an outage. I’ve seen a single unguarded DDL statement take down an entire API. Five seconds is a cheap insurance policy.
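The same guard can be baked into tooling. A sketch that wraps a migration's statements in a transaction with a scoped timeout; `SET LOCAL` only lasts for that transaction, so a reused connection is left clean. Don't use this wrapper for `CREATE INDEX CONCURRENTLY`, which must stay outside a transaction.

```python
def guarded_migration(ddl_statements, lock_timeout="5s"):
    """Wrap DDL in a transaction with a scoped lock_timeout. SET LOCAL
    applies only inside this transaction, so the session default is
    untouched if the migration tool reuses the connection."""
    return "\n".join(
        ["BEGIN;", f"SET LOCAL lock_timeout = '{lock_timeout}';"]
        + list(ddl_statements)
        + ["COMMIT;"]
    )

print(guarded_migration(
    ["ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0;"]
))
```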
## What to remember
- Use `CREATE INDEX CONCURRENTLY` in production and make sure your migration tool supports non-transactional execution.
- Never change column types in place. Use the expand-contract pattern: new column, trigger-based dual writes, batched backfill, final swap.
- Set `lock_timeout` on every migration statement. A 5-second ceiling prevents one DDL from becoming a system-wide outage.