MVP Factory
ai startup development

Zero Downtime Database Migrations: A Practical Guide for PostgreSQL

KW
Krystian Wiewiór · · 5 min read

TL;DR

Schema migrations are the #1 cause of unplanned downtime in PostgreSQL-backed services. The fix isn’t avoiding migrations. It’s understanding which DDL operations acquire exclusive locks and restructuring your migrations to dodge them. This guide covers the patterns I use in production: concurrent index creation, safe column additions, view-based renames, and tooling with golang-migrate and Flyway.


The problem most teams ignore

In my experience building production systems, the conversation about zero downtime usually focuses on blue-green deployments and rolling updates. But the database is where things actually break. A single ALTER TABLE ... ADD COLUMN with a volatile default on a 50M-row table can lock writes for minutes. If you’re handling thousands of requests per second, that’s a full outage.

Most teams get this wrong because they treat migrations as atomic events instead of multi-phase rollouts.

The lock problem

PostgreSQL uses an AccessExclusiveLock for many DDL operations. This lock blocks all concurrent queries, reads and writes alike.

OperationLock TypeBlocks Reads?Blocks Writes?Safe at Scale?
ADD COLUMN (no default)AccessExclusiveYes (briefly)Yes (briefly)Yes
ADD COLUMN ... DEFAULT x (PG 11+)AccessExclusiveYes (briefly)Yes (briefly)Yes
ADD COLUMN ... DEFAULT x (PG <11)AccessExclusiveYes (long)Yes (long)No
CREATE INDEXShareLockNoYesNo
CREATE INDEX CONCURRENTLYShareUpdateExclusiveNoNoYes
DROP COLUMNAccessExclusiveYes (briefly)Yes (briefly)Yes
ALTER COLUMN TYPEAccessExclusiveYes (long)Yes (long)No

The pattern is straightforward: any operation that rewrites the table or scans it fully under an exclusive lock is dangerous.

Pattern 1: Safe column addition

PostgreSQL 11+ stores the default value in the catalog instead of rewriting every row. This makes ADD COLUMN ... DEFAULT nearly instant:

-- Safe on PostgreSQL 11+
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';

For PostgreSQL versions below 11, split this into two steps:

ALTER TABLE orders ADD COLUMN status TEXT;
-- Backfill in batches
UPDATE orders SET status = 'pending' WHERE status IS NULL AND id BETWEEN 1 AND 100000;
-- After backfill completes
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';

Pattern 2: Concurrent index creation

Standard CREATE INDEX holds a lock that blocks writes for the entire build duration. On a large table, that can be minutes or hours.

-- Blocks writes -- never use this in production
CREATE INDEX idx_orders_status ON orders(status);

-- Non-blocking -- use this instead
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

One thing that trips people up: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. If you’re using golang-migrate, you need to annotate the migration file:

-- +goose NO TRANSACTION
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

For Flyway users, place the statement in a non-transactional migration callback or use the executeInTransaction=false flag in your versioned migration.

Pattern 3: Renaming columns via views

Renaming a column directly acquires an exclusive lock and breaks every query referencing the old name. Instead, use an expand-and-contract approach:

-- Phase 1: Add new column, backfill
ALTER TABLE users ADD COLUMN full_name TEXT;
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Phase 2: Create a view for backward compatibility
CREATE VIEW users_v AS SELECT *, name AS legacy_name FROM users;

-- Phase 3: Migrate application code to use full_name

-- Phase 4: Drop old column after all consumers have migrated
ALTER TABLE users DROP COLUMN name;

It’s more work than a simple rename, but it decouples the schema change from the application deployment. Nobody gets paged.

Pattern 4: Safe type changes

Changing a column type (e.g., INT to BIGINT) rewrites the entire table. The safe approach mirrors the rename pattern:

-- Step 1: Add new column
ALTER TABLE events ADD COLUMN event_id_new BIGINT;

-- Step 2: Dual-write via trigger
CREATE OR REPLACE FUNCTION sync_event_id() RETURNS TRIGGER AS $$
BEGIN
  NEW.event_id_new := NEW.event_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_event_id BEFORE INSERT OR UPDATE ON events
FOR EACH ROW EXECUTE FUNCTION sync_event_id();

-- Step 3: Backfill in batches
-- Step 4: Swap columns, drop old

Tooling: golang-migrate vs Flyway

Featuregolang-migrateFlyway
LanguageGoJava/JVM
Transaction controlPer-file annotationsConfig flag
Concurrent DDL supportManual (NO TRANSACTION)executeInTransaction=false
Rollback supportDown migrationsUndo migrations (Teams edition)
CI/CD integrationCLI binary, lightweightPlugin ecosystem (Maven, Gradle)

Both tools work. I reach for golang-migrate in Go services and Flyway in JVM-heavy stacks. The one thing that actually matters is that your tool supports non-transactional migrations. Without that, CREATE INDEX CONCURRENTLY is impossible.

Setting statement timeouts

Always set a lock timeout to prevent a migration from waiting indefinitely for a lock:

SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN priority INT DEFAULT 0;

If the lock can’t be acquired within 5 seconds, the statement fails cleanly instead of queuing behind long-running queries and cascading into an outage. I’ve seen a single unguarded DDL statement take down an entire API. Five seconds is a cheap insurance policy.

What to remember

  1. Use CREATE INDEX CONCURRENTLY in production and make sure your migration tool supports non-transactional execution.
  2. Never change column types in place. Use the expand-contract pattern: new column, trigger-based dual writes, batched backfill, final swap.
  3. Set lock_timeout on every migration statement. A 5-second ceiling prevents one DDL from becoming a system-wide outage.

Share: Twitter LinkedIn