MVP Factory
Replacing Your Message Queue with PostgreSQL: LISTEN/NOTIFY, SKIP LOCKED Queues, and When Kafka Is Overkill for Your Startup


Krystian Wiewiór · 5 min read

TL;DR

Most startups add Kafka or RabbitMQ before they need it. PostgreSQL’s LISTEN/NOTIFY handles pub/sub, FOR UPDATE SKIP LOCKED gives you a reliable worker queue, and pg_partman manages retention. All inside the database you already run. I’ve pushed this stack past 10K jobs/sec on modest hardware. Here’s the architecture, the benchmarks, and the failure modes that tell you when it’s time to graduate.


The problem: premature infrastructure

Paul Graham once wrote that the most dangerous thing you learn in school is to hack the test, to optimize for the metric rather than the goal. The same pattern shows up in backend architecture. Teams add Kafka on day one because “we might need it.”

What they don’t account for: a message broker brings operational baggage. ZooKeeper/KRaft clusters, consumer group rebalancing, schema registries, offset management. For a team of 3-8 engineers, that’s a tax on every deploy, every incident, and every on-call rotation.



The PostgreSQL queue architecture

1. Pub/sub with LISTEN/NOTIFY

PostgreSQL has a built-in pub/sub mechanism. No extensions required.

-- Publisher
NOTIFY order_events, '{"order_id": 42, "status": "paid"}';

-- Subscriber (any connected client)
LISTEN order_events;

Your application receives events asynchronously over an ordinary client connection. In Node.js with pg, it’s five lines of code; in Kotlin with Exposed or raw JDBC it’s similarly straightforward. One caveat: keep a dedicated connection for listening, since transaction-pooling proxies like PgBouncer don’t deliver notifications reliably.

One catch: messages are fire-and-forget. If no listener is connected, the message is lost, and payloads are capped at 8000 bytes by default. That’s fine for cache invalidation, real-time UI updates, and notification fanout. It is not fine for financial transactions.
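If you want events published only when the surrounding write actually commits, a trigger is a common pattern: pg_notify queues the message inside the transaction, and PostgreSQL delivers it only on commit. A minimal sketch, assuming a hypothetical orders table with id and status columns:

```sql
-- Publish an event after every insert/update on orders.
-- The notification is held until the transaction commits.
CREATE OR REPLACE FUNCTION notify_order_event() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify(
        'order_events',
        json_build_object('order_id', NEW.id, 'status', NEW.status)::text
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_events_trigger
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION notify_order_event();
```

This also means a rolled-back order never produces a phantom event, which plain application-side NOTIFY calls can’t guarantee.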

2. Reliable worker queues with SKIP LOCKED

For durable, at-least-once processing, use a jobs table with FOR UPDATE SKIP LOCKED:

CREATE TABLE job_queue (
    id         BIGSERIAL PRIMARY KEY,
    payload    JSONB NOT NULL,
    status     TEXT DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT now(),
    locked_at  TIMESTAMPTZ
);

-- Worker claims a batch
UPDATE job_queue
SET status = 'processing', locked_at = now()
WHERE id IN (
    SELECT id FROM job_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

Multiple workers compete safely with zero coordination. No advisory locks, no external broker. PostgreSQL handles the concurrency.
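The claim query is only half the lifecycle. A sketch of the rest, assuming the job_queue table above: a partial index so pending-job scans stay cheap, an ack that removes the finished row, and a periodic requeue for jobs whose worker died mid-processing (the 5-minute timeout is an arbitrary example; tune it to your job durations):

```sql
-- Index only the pending rows the claim query scans
CREATE INDEX idx_job_queue_pending
    ON job_queue (created_at)
    WHERE status = 'pending';

-- Ack: delete the row once the worker finishes successfully
DELETE FROM job_queue
WHERE id = $1 AND status = 'processing';

-- Requeue jobs claimed by workers that crashed (run on a schedule)
UPDATE job_queue
SET status = 'pending', locked_at = NULL
WHERE status = 'processing'
  AND locked_at < now() - interval '5 minutes';
```

Deleting on ack keeps the hot table small; if you need a job history, move finished rows to an archive table instead.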

3. Retention with pg_partman

Partition your queue table by time, and pg_partman drops old partitions automatically. No manual DELETE jobs hammering the table and piling up vacuum pressure; dropping a partition is a near-instant metadata operation.
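A sketch of the setup, assuming pg_partman 4.x’s create_parent signature and a job_queue declared with PARTITION BY RANGE (created_at); check the docs for the version you install, since the arguments changed in 5.0:

```sql
-- Create daily partitions managed by pg_partman
SELECT partman.create_parent(
    p_parent_table => 'public.job_queue',
    p_control      => 'created_at',
    p_type         => 'native',
    p_interval     => 'daily'
);

-- Drop partitions older than 7 days on the next maintenance run
UPDATE partman.part_config
SET retention = '7 days', retention_keep_table = false
WHERE parent_table = 'public.job_queue';
```

Pair this with pg_partman’s run_maintenance() on a cron or pg_cron schedule so new partitions are created and old ones dropped without manual intervention.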


Benchmarks

I ran these on a single PostgreSQL 16 instance (4 vCPUs, 16GB RAM, NVMe SSD) using pgbench-style harnesses with 8 concurrent workers.

| Metric | PostgreSQL SKIP LOCKED | RabbitMQ | Kafka (single broker) |
|---|---|---|---|
| Throughput (jobs/sec) | ~12,000 | ~25,000 | ~100,000+ |
| P99 latency (claim + ack) | 8 ms | 3 ms | 12 ms (batched) |
| Operational dependencies | 0 (it’s your DB) | Erlang runtime, mgmt plugin | JVM, KRaft/ZK, topic config |
| Setup time | 1 SQL migration | 2-4 hours | 4-8 hours |
| Monitoring | pg_stat_activity, existing dashboards | Separate dashboard | Separate dashboard + lag tooling |

12,000 jobs/sec covers most startups comfortably. Most process fewer than 500 jobs/sec in their first two years. You probably aren’t the exception.


When to stay on PostgreSQL

  • Your total job throughput is under 5,000-10,000/sec
  • You have fewer than 10 distinct queue/topic types
  • Your team is under 15 engineers
  • You want one fewer system to monitor at 3 AM

Failure modes that say “graduate now”

These are the signals I’ve seen in production that mean it’s time to add a dedicated broker:

| Signal | What you’ll see | Why it matters |
|---|---|---|
| WAL growth explosion | pg_wal steadily growing past 10 GB | High-throughput inserts/updates generate enormous write-ahead logs, which pressures replication lag and disk |
| Vacuum can’t keep up | n_dead_tup climbing on your queue table, autovacuum running constantly | Dead tuples from rapid claim/delete cycles bloat the table and tank query performance |
| Connection pool exhaustion | Workers holding connections waiting for SKIP LOCKED claims | Long-polling workers compete with your application’s OLTP queries for the same connection pool |
| Fan-out beyond 3-4 consumers | Multiple services needing the same event stream | PostgreSQL has no consumer group semantics; you’ll end up building ad hoc replication logic that a broker gives you for free |

When you see two or more of these concurrently, that’s your migration signal. Not before.


What to do with all this

Start with one job_queue table and SKIP LOCKED. You get durable, concurrent job processing with zero new infrastructure. Ship it in a single migration file and move on to features that actually matter.

Use LISTEN/NOTIFY for real-time, non-critical fanout. Cache invalidation, WebSocket pushes, dashboard refreshes. Pair it with your existing connection pool. Don’t add a Redis pub/sub layer you don’t need yet.

Instrument your queue table from day one. Monitor n_dead_tup, WAL size, and connection pool utilization. These three metrics will tell you exactly when PostgreSQL stops being enough, and you’ll migrate with data instead of anxiety.
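The first two of those metrics are one query each; a sketch (pg_ls_waldir requires superuser or the pg_monitor role):

```sql
-- Dead tuples and last autovacuum on the queue table
SELECT n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'job_queue';

-- Total WAL currently on disk
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```

Wire both into the dashboard you already have for PostgreSQL; no new exporter needed.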

The best infrastructure decision is the one you delay until you have evidence. PostgreSQL is already in your stack. Use it.

