MVP Factory
Replacing Your Message Queue with PostgreSQL: LISTEN/NOTIFY, SKIP LOCKED Queues, and When Kafka Is Overkill for Your Startup


Krystian Wiewiór · 5 min read

TL;DR

Most startups add Kafka or RabbitMQ before they need it. PostgreSQL’s LISTEN/NOTIFY handles pub/sub, FOR UPDATE SKIP LOCKED gives you a reliable worker queue, and pg_partman manages retention. All inside the database you already run. I’ve pushed this stack past 10K jobs/sec on modest hardware. Here’s the architecture, the benchmarks, and the failure modes that tell you when it’s time to graduate.


The problem: premature infrastructure

Paul Graham once wrote that the most dangerous thing you learn in school is to hack the test, to optimize for the metric rather than the goal. The same pattern shows up in backend architecture. Teams add Kafka on day one because “we might need it.”

What they don’t account for: a message broker brings operational baggage. ZooKeeper/KRaft clusters, consumer group rebalancing, schema registries, offset management. For a team of 3-8 engineers, that’s a tax on every deploy, every incident, and every on-call rotation.



The PostgreSQL queue architecture

1. Pub/sub with LISTEN/NOTIFY

PostgreSQL has a built-in pub/sub mechanism. No extensions required.

-- Publisher
NOTIFY order_events, '{"order_id": 42, "status": "paid"}';

-- Subscriber (any connected client)
LISTEN order_events;

Your application receives events asynchronously over an ordinary client connection. In Node.js with pg, it’s five lines of code; in Kotlin with Exposed or raw JDBC it’s similarly straightforward. One caveat: keep a dedicated connection for listening, since transaction-pooling proxies like PgBouncer don’t deliver notifications reliably.

One catch: messages are fire-and-forget. If no listener is connected, the message is lost, and payloads are capped at 8000 bytes by default. That’s fine for cache invalidation, real-time UI updates, and notification fanout. It is not fine for financial transactions.
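If you want events published only when the surrounding write actually commits, a trigger is a common pattern: pg_notify queues the message inside the transaction, and PostgreSQL delivers it only on commit. A minimal sketch, assuming a hypothetical orders table with id and status columns:

```sql
-- Publish an event after every insert/update on orders.
-- The notification is held until the transaction commits.
CREATE OR REPLACE FUNCTION notify_order_event() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify(
        'order_events',
        json_build_object('order_id', NEW.id, 'status', NEW.status)::text
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_events_trigger
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION notify_order_event();
```

This also means a rolled-back order never produces a phantom event, which plain application-side NOTIFY calls can’t guarantee.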

2. Reliable worker queues with SKIP LOCKED

For durable, at-least-once processing, use a jobs table with FOR UPDATE SKIP LOCKED:

CREATE TABLE job_queue (
    id         BIGSERIAL PRIMARY KEY,
    payload    JSONB NOT NULL,
    status     TEXT DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT now(),
    locked_at  TIMESTAMPTZ
);

-- Worker claims a batch
UPDATE job_queue
SET status = 'processing', locked_at = now()
WHERE id IN (
    SELECT id FROM job_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

Multiple workers compete safely with zero coordination. No advisory locks, no external broker. PostgreSQL handles the concurrency.
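The claim query is only half the lifecycle. A sketch of the rest, assuming the job_queue table above: a partial index so pending-job scans stay cheap, an ack that removes the finished row, and a periodic requeue for jobs whose worker died mid-processing (the 5-minute timeout is an arbitrary example; tune it to your job durations):

```sql
-- Index only the pending rows the claim query scans
CREATE INDEX idx_job_queue_pending
    ON job_queue (created_at)
    WHERE status = 'pending';

-- Ack: delete the row once the worker finishes successfully
DELETE FROM job_queue
WHERE id = $1 AND status = 'processing';

-- Requeue jobs claimed by workers that crashed (run on a schedule)
UPDATE job_queue
SET status = 'pending', locked_at = NULL
WHERE status = 'processing'
  AND locked_at < now() - interval '5 minutes';
```

Deleting on ack keeps the hot table small; if you need a job history, move finished rows to an archive table instead.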

3. Retention with pg_partman

Partition your queue table by time, and pg_partman drops old partitions automatically. No manual DELETE jobs hammering the table and piling up vacuum pressure; dropping a partition is a near-instant metadata operation.
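A sketch of the setup, assuming pg_partman 4.x’s create_parent signature and a job_queue declared with PARTITION BY RANGE (created_at); check the docs for the version you install, since the arguments changed in 5.0:

```sql
-- Create daily partitions managed by pg_partman
SELECT partman.create_parent(
    p_parent_table => 'public.job_queue',
    p_control      => 'created_at',
    p_type         => 'native',
    p_interval     => 'daily'
);

-- Drop partitions older than 7 days on the next maintenance run
UPDATE partman.part_config
SET retention = '7 days', retention_keep_table = false
WHERE parent_table = 'public.job_queue';
```

Pair this with pg_partman’s run_maintenance() on a cron or pg_cron schedule so new partitions are created and old ones dropped without manual intervention.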


Benchmarks

I ran these on a single PostgreSQL 16 instance (4 vCPUs, 16GB RAM, NVMe SSD) using pgbench-style harnesses with 8 concurrent workers.

| Metric | PostgreSQL SKIP LOCKED | RabbitMQ | Kafka (single broker) |
|---|---|---|---|
| Throughput (jobs/sec) | ~12,000 | ~25,000 | ~100,000+ |
| P99 latency (claim + ack) | 8 ms | 3 ms | 12 ms (batched) |
| Operational dependencies | 0 (it’s your DB) | Erlang runtime, mgmt plugin | JVM, KRaft/ZK, topic config |
| Setup time | 1 SQL migration | 2-4 hours | 4-8 hours |
| Monitoring | pg_stat_activity, existing dashboards | Separate dashboard | Separate dashboard + lag tooling |

12,000 jobs/sec covers most startups comfortably. Most process fewer than 500 jobs/sec in their first two years. You probably aren’t the exception.


When to stay on PostgreSQL

  • Your total job throughput is under 5,000-10,000/sec
  • You have fewer than 10 distinct queue/topic types
  • Your team is under 15 engineers
  • You want one fewer system to monitor at 3 AM

Failure modes that say “graduate now”

These are the signals I’ve seen in production that mean it’s time to add a dedicated broker:

| Signal | What you’ll see | Why it matters |
|---|---|---|
| WAL growth explosion | pg_wal steadily growing past 10 GB | High-throughput inserts/updates generate enormous write-ahead logs, which pressures replication lag and disk |
| Vacuum can’t keep up | n_dead_tup climbing on your queue table, autovacuum running constantly | Dead tuples from rapid claim/delete cycles bloat the table and tank query performance |
| Connection pool exhaustion | Workers holding connections waiting for SKIP LOCKED claims | Long-polling workers compete with your application’s OLTP queries for the same connection pool |
| Fan-out beyond 3-4 consumers | Multiple services needing the same event stream | PostgreSQL has no consumer group semantics; you’ll end up building ad hoc replication logic that a broker gives you for free |

When you see two or more of these concurrently, that’s your migration signal. Not before.


What to do with all this

Start with one job_queue table and SKIP LOCKED. You get durable, concurrent job processing with zero new infrastructure. Ship it in a single migration file and move on to features that actually matter.

Use LISTEN/NOTIFY for real-time, non-critical fanout. Cache invalidation, WebSocket pushes, dashboard refreshes. Pair it with your existing connection pool. Don’t add a Redis pub/sub layer you don’t need yet.

Instrument your queue table from day one. Monitor n_dead_tup, WAL size, and connection pool utilization. These three metrics will tell you exactly when PostgreSQL stops being enough, and you’ll migrate with data instead of anxiety.
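The first two of those metrics are one query each; a sketch (pg_ls_waldir requires superuser or the pg_monitor role):

```sql
-- Dead tuples and last autovacuum on the queue table
SELECT n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'job_queue';

-- Total WAL currently on disk
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```

Wire both into the dashboard you already have for PostgreSQL; no new exporter needed.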

The best infrastructure decision is the one you delay until you have evidence. PostgreSQL is already in your stack. Use it.

