Event sourcing with CQRS for mobile backends
Meta description: How to implement event sourcing with CQRS using PostgreSQL and Kafka for mobile app backends, with concrete schema designs and failure analysis.
TL;DR
Event sourcing paired with CQRS gives your mobile backend a complete audit trail, temporal queries, and the ability to rebuild read models from scratch — but it demands disciplined schema design and a clear snapshotting strategy. This post walks through an order system using PostgreSQL as the event store, Kafka for async command processing, and concrete failure mode analysis so you ship this pattern without the usual landmines.
Why event sourcing pays off non-linearly
Paul Graham writes about superlinear returns — how certain investments compound non-linearly over time. Event sourcing is one of those bets. Early on, it feels like overhead. Six months in, when compliance asks for a full audit trail, when a PM wants to know “what did the cart look like before the user removed item X,” or when you need to rebuild an entire read model after a schema migration — that upfront investment pays for itself many times over.
I’ve built a few production systems for mobile-first products, and the order domain is where event sourcing fits most naturally. Orders are inherently state machines with high regulatory and business visibility.
The event store: PostgreSQL schema design
Resist the temptation to reach for a specialized event store database. PostgreSQL handles this workload well up to tens of billions of events with proper partitioning.
```sql
CREATE TABLE event_store (
    event_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id   UUID NOT NULL,
    aggregate_type VARCHAR(128) NOT NULL,
    event_type     VARCHAR(256) NOT NULL,
    event_data     JSONB NOT NULL,
    metadata       JSONB DEFAULT '{}',
    version        INTEGER NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (aggregate_id, version)
);

-- The UNIQUE constraint already backs lookups by (aggregate_id, version),
-- so no separate index on those columns is needed.
CREATE INDEX idx_event_store_type_time ON event_store (event_type, created_at);
```
The UNIQUE (aggregate_id, version) constraint is your optimistic concurrency control. Two concurrent commands on the same aggregate will conflict at the database level — no distributed locks required.
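In application code, that conflict surfaces as a unique-violation error. Here is a minimal sketch of the append path over plain JDBC; the function and exception names are illustrative, not from any particular framework:

```kotlin
import java.sql.Connection
import java.sql.SQLException
import java.util.UUID

class ConcurrentModification(message: String) : RuntimeException(message)

// Appends one event, letting UNIQUE (aggregate_id, version) reject
// concurrent writers. expectedVersion is the version the command handler
// observed when it loaded the aggregate.
fun appendEvent(
    conn: Connection,
    aggregateId: UUID,
    aggregateType: String,
    eventType: String,
    eventDataJson: String,
    expectedVersion: Int,
) {
    val sql = """
        INSERT INTO event_store (aggregate_id, aggregate_type, event_type, event_data, version)
        VALUES (?, ?, ?, ?::jsonb, ?)
    """.trimIndent()
    try {
        conn.prepareStatement(sql).use { st ->
            st.setObject(1, aggregateId)
            st.setString(2, aggregateType)
            st.setString(3, eventType)
            st.setString(4, eventDataJson)
            st.setInt(5, expectedVersion + 1)
            st.executeUpdate()
        }
    } catch (e: SQLException) {
        // 23505 = unique_violation: another command won the race on this aggregate
        if (e.sqlState == "23505") {
            throw ConcurrentModification("Aggregate $aggregateId changed; reload and retry")
        }
        throw e
    }
}
```

A common policy is to reload the aggregate, re-validate, and retry the command a bounded number of times server-side, returning 409 Conflict to the mobile client only if the command still fails.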
Event payload example
```json
{
  "event_type": "OrderItemAdded",
  "aggregate_id": "ord-881a-...",
  "version": 3,
  "event_data": {
    "sku": "WIDGET-42",
    "quantity": 2,
    "unit_price_cents": 1499,
    "added_by": "user-7f3c..."
  },
  "metadata": {
    "correlation_id": "req-a1b2...",
    "device": "iOS/17.4",
    "ip": "redacted"
  }
}
```
CQRS: separating writes from reads
Most teams get CQRS wrong in the same way: they commit to “eventual consistency” for their read model without defining what “eventually” actually means for their mobile clients. Pin it down. For us, it was under 200ms.
| Concern | Command side | Query side |
|---|---|---|
| Storage | event_store (append-only) | order_projections (mutable, denormalized) |
| Consistency | Strongly consistent per aggregate | Eventually consistent (typically <200ms) |
| Scaling | Vertical (single writer per aggregate) | Horizontal (read replicas, caching) |
| Schema changes | Never — events are immutable | Rebuild from event log anytime |
| Mobile API pattern | POST commands, return command ID | GET queries with If-None-Match / ETag |
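That consistency budget only means something if you measure it. Here is a simple lag probe against the schemas in this post (the projection table is defined just below); the function name and metric shape are mine:

```kotlin
import java.sql.Connection

// Age in milliseconds of the oldest event not yet applied to its
// projection; 0 when every projection is caught up. Feed this into your
// metrics pipeline and alert when it breaches the SLO.
fun maxProjectionLagMs(conn: Connection): Long {
    val sql = """
        SELECT coalesce(extract(epoch FROM max(now() - e.created_at)) * 1000, 0)::bigint AS lag_ms
          FROM event_store e
          JOIN order_projections p ON p.order_id = e.aggregate_id
         WHERE e.version > p.last_event_version
    """.trimIndent()
    return conn.prepareStatement(sql).use { st ->
        st.executeQuery().use { rs ->
            rs.next()
            rs.getLong("lag_ms")
        }
    }
}
```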
The projection table is purpose-built for your mobile API’s read patterns:
```sql
CREATE TABLE order_projections (
    order_id           UUID PRIMARY KEY,
    user_id            UUID NOT NULL,
    status             VARCHAR(32) NOT NULL,
    total_cents        BIGINT NOT NULL,
    item_count         INTEGER NOT NULL,
    last_event_version INTEGER NOT NULL,  -- high-water mark; enables idempotent projection
    projected_at       TIMESTAMPTZ NOT NULL,
    snapshot           JSONB NOT NULL     -- denormalized order payload served to clients
);
```
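Because Kafka delivers at least once, the projector maintaining this table should be idempotent from day one. A sketch for OrderItemAdded, assuming the fields from the payload example above (the snapshot column update is omitted for brevity); the version guard in the WHERE clause turns duplicates and out-of-order deliveries into no-ops:

```kotlin
import java.sql.Connection
import java.util.UUID

// Applies OrderItemAdded to the projection. The guard
// last_event_version = version - 1 enforces strictly in-order application:
// a duplicate or out-of-order event updates zero rows and can be
// acknowledged or retried safely.
fun projectOrderItemAdded(
    conn: Connection,
    orderId: UUID,
    version: Int,
    quantity: Int,
    unitPriceCents: Long,
): Boolean {
    val sql = """
        UPDATE order_projections
           SET total_cents        = total_cents + ? * ?,
               item_count         = item_count + ?,
               last_event_version = ?,
               projected_at       = now()
         WHERE order_id = ?
           AND last_event_version = ? - 1
    """.trimIndent()
    return conn.prepareStatement(sql).use { st ->
        st.setLong(1, unitPriceCents)
        st.setInt(2, quantity)
        st.setInt(3, quantity)
        st.setInt(4, version)
        st.setObject(5, orderId)
        st.setInt(6, version)
        st.executeUpdate() == 1  // false: duplicate or out-of-order delivery
    }
}
```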
Kafka integration: async command processing
Commands arrive via your mobile API, get validated, and publish domain events to Kafka. Projectors consume these events to update read models.
```
Mobile App → API Gateway → Command Handler → event_store (PG)
                                           → Kafka topic: order.events
                                                    ↓
                                  Projector → order_projections (PG)
                                  Notifier  → Push Notification Service
```
Write to PostgreSQL first, then publish to Kafka. Use the transactional outbox pattern: write events and an outbox row in a single PG transaction, then a poller or CDC (Debezium) pushes to Kafka. This avoids dual-write inconsistencies, which is the number one failure mode I’ve seen in production event-sourced systems. Do not skip this step.
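A minimal sketch of that transaction, extending the append example from earlier with the outbox row. It assumes a hypothetical outbox table with (id, topic, payload) columns that a relay drains into Kafka; that table isn't shown above, so treat its shape as illustrative:

```kotlin
import java.sql.Connection
import java.util.UUID

// Event and outbox row commit atomically. If the process dies between
// commit and Kafka publish, the relay picks the row up later; Kafka may
// then deliver it twice, which is why projectors must be idempotent.
fun appendEventWithOutbox(
    conn: Connection,
    aggregateId: UUID,
    eventType: String,
    eventDataJson: String,
    newVersion: Int,
) {
    conn.autoCommit = false
    try {
        conn.prepareStatement(
            """INSERT INTO event_store (aggregate_id, aggregate_type, event_type, event_data, version)
               VALUES (?, 'Order', ?, ?::jsonb, ?)"""
        ).use { st ->
            st.setObject(1, aggregateId)
            st.setString(2, eventType)
            st.setString(3, eventDataJson)
            st.setInt(4, newVersion)
            st.executeUpdate()
        }
        conn.prepareStatement(
            """INSERT INTO outbox (id, topic, payload)
               VALUES (gen_random_uuid(), 'order.events', ?::jsonb)"""
        ).use { st ->
            st.setString(1, eventDataJson)
            st.executeUpdate()
        }
        conn.commit()  // poller or Debezium relays outbox rows to Kafka
    } catch (e: Exception) {
        conn.rollback()
        throw e
    }
}
```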
Snapshotting for read-model performance
Without snapshots, rebuilding an aggregate with 500 events means replaying all 500 on every command. The numbers speak for themselves:
| Events per aggregate | Replay without snapshot | Replay with snapshot (every 50) |
|---|---|---|
| 50 | ~4ms | ~4ms |
| 200 | ~16ms | ~4ms |
| 1,000 | ~82ms | ~5ms |
| 10,000 | ~810ms | ~6ms |
Benchmarked on PostgreSQL 16, m6i.xlarge, JSONB deserialization in Kotlin/JVM.
```sql
CREATE TABLE aggregate_snapshots (
    aggregate_id   UUID NOT NULL,
    aggregate_type VARCHAR(128) NOT NULL,
    version        INTEGER NOT NULL,
    state          JSONB NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (aggregate_id, version)
);
```
Snapshot every N events (50 is a solid default). On aggregate load: fetch the latest snapshot, then replay only the events after it.
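Here is what that load path can look like, with a deliberately minimal OrderState standing in for your real aggregate; fromJson and apply are placeholders for your own serialization and event-application logic:

```kotlin
import java.sql.Connection
import java.util.UUID

// Minimal stand-in for the real aggregate root.
data class OrderState(val orderId: UUID, val version: Int, val itemCount: Int) {
    fun apply(eventType: String, eventDataJson: String): OrderState = when (eventType) {
        "OrderItemAdded" -> copy(version = version + 1, itemCount = itemCount + 1)
        else -> copy(version = version + 1)
    }
    companion object {
        fun empty(orderId: UUID) = OrderState(orderId, 0, 0)
        fun fromJson(json: String): OrderState = TODO("deserialize snapshot state")
    }
}

// Latest snapshot first, then replay only the events past it.
fun loadOrder(conn: Connection, orderId: UUID): OrderState {
    var state = OrderState.empty(orderId)

    conn.prepareStatement(
        """SELECT version, state FROM aggregate_snapshots
           WHERE aggregate_id = ? ORDER BY version DESC LIMIT 1"""
    ).use { st ->
        st.setObject(1, orderId)
        st.executeQuery().use { rs ->
            if (rs.next()) state = OrderState.fromJson(rs.getString("state"))
        }
    }

    conn.prepareStatement(
        """SELECT event_type, event_data FROM event_store
           WHERE aggregate_id = ? AND version > ? ORDER BY version"""
    ).use { st ->
        st.setObject(1, orderId)
        st.setInt(2, state.version)  // snapshot version is the replay floor
        st.executeQuery().use { rs ->
            while (rs.next()) {
                state = state.apply(rs.getString("event_type"), rs.getString("event_data"))
            }
        }
    }
    return state
}
```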
Failure mode analysis
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker down | Projections stall, reads serve stale data | Outbox pattern + retry; mobile client shows cached state with staleness indicator |
| Duplicate event publish | Projector processes event twice | Idempotent projectors keyed on (aggregate_id, version) |
| Projection bug ships | Corrupted read model | Rebuild projection from event log — this is the whole reason you did event sourcing |
| Schema evolution | Old events lack new fields | Upcasters transform events at read time (sketch after this table); never mutate stored events |
| Snapshot corruption | Aggregate loads incorrect state | Fallback to full replay; validate snapshot checksums |
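To make the schema-evolution row concrete: an upcaster is a pure function from the stored payload to the current shape, applied after deserialization and before the event is handled. A hypothetical v1-to-v2 rename, where v1 events called the field price_cents:

```kotlin
// Upcasts a stored OrderItemAdded payload from v1 to the current shape.
// v1 used "price_cents"; current consumers expect "unit_price_cents".
// The stored event row is never touched.
fun upcastOrderItemAdded(raw: Map<String, Any?>): Map<String, Any?> =
    if ("price_cents" in raw) {
        raw - "price_cents" + ("unit_price_cents" to raw["price_cents"])
    } else {
        raw
    }
```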
What actually matters
Start with the outbox pattern, not direct Kafka writes. Dual-write bugs are subtle and devastating. A single PostgreSQL transaction for events + outbox row eliminates an entire class of consistency failures. I learned this the hard way.
Snapshot aggressively but lazily. Set a threshold (50 events), generate snapshots asynchronously after command processing, and always keep the full event log as your source of truth. The ability to rebuild is worth more than the storage cost.
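Concretely, the threshold check can run right after the command commits, against the aggregate_snapshots table above. SNAPSHOT_EVERY and the function name are mine; ON CONFLICT keeps concurrent snapshotters harmless:

```kotlin
import java.sql.Connection
import java.util.UUID

const val SNAPSHOT_EVERY = 50

// Fire-and-forget after the command commits; never on the request path.
// ON CONFLICT makes it safe if two workers snapshot the same version.
fun maybeSnapshot(conn: Connection, aggregateId: UUID, version: Int, stateJson: String) {
    if (version % SNAPSHOT_EVERY != 0) return
    conn.prepareStatement(
        """INSERT INTO aggregate_snapshots (aggregate_id, aggregate_type, version, state)
           VALUES (?, 'Order', ?, ?::jsonb)
           ON CONFLICT (aggregate_id, version) DO NOTHING"""
    ).use { st ->
        st.setObject(1, aggregateId)
        st.setInt(2, version)
        st.setString(3, stateJson)
        st.executeUpdate()
    }
}
```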
Design projections for your mobile API, not your domain model. Each screen or endpoint should have a purpose-built projection. The whole point of CQRS is that reads and writes have different shapes — lean into it.
Event sourcing is not free. It adds real complexity to your system, and I wouldn’t use it for a simple CRUD app. But in domains with audit requirements, complex state transitions, and mobile clients that need fast denormalized reads, it has paid off for me every time. The cost is upfront; the value compounds.
TAGS: architecture, backend, kotlin, api, microservices