Deterministic replay testing for Kafka microservices

Meta description: Learn how deterministic replay testing catches breaking schema changes in Kafka consumers before deployment using production stream snapshots and output hashing.

Tags: microservices, backend, architecture, cicd, devops

TL;DR

Schema evolution in event-driven systems is a silent deployment killer. This post covers how to build a deterministic replay test harness that snapshots production Kafka streams, replays them against new consumer versions, and uses content-addressable output hashing to detect semantic regressions. I’ve used this approach to catch breaking changes in CI that would have otherwise shipped as silent data corruption.

The bug that doesn’t throw an exception

As organizations restructure engineering operations, event-driven architectures become the glue between increasingly distributed teams. More teams producing and consuming events means more schema evolution, more implicit contracts, and more ways for things to break quietly.

The most dangerous bugs in Kafka-based systems aren’t the ones that throw exceptions. They’re the ones where a consumer silently produces different output after a schema change: a renamed field, a changed default, a new enum variant that gets swallowed by a fallback branch.

Unit tests won’t catch these. Integration tests with synthetic data won’t catch these. You need production traffic replay with deterministic output comparison.

Architecture of a replay test harness

Most teams get this wrong by trying to mock Kafka. Don’t. Capture real streams and replay them deterministically.

Core components

Component	Responsibility	Storage
Stream Snapshotter	Captures topic partitions with headers, timestamps, keys	Content-addressable blob store (S3/GCS)
Schema Registry Mirror	Pins exact schema versions at capture time	Alongside snapshots as manifest
Replay Engine	Feeds records into consumer with controlled time	Embedded Kafka (Testcontainers)
Output Hasher	Produces deterministic hash of consumer side-effects	CI artifact store
Baseline Comparator	Diffs output hashes against known-good baseline	CI pipeline gate

The snapshot format

Each snapshot is a self-contained replay unit:

data class TopicSnapshot(
    val topicName: String,
    val partitionSnapshots: List<PartitionSnapshot>,
    val schemaManifest: Map<Int, SchemaInfo>, // schema ID -> avro/protobuf definition
    val capturedAt: Instant,
    val contentHash: String // SHA-256 of ordered record bytes
)

data class PartitionSnapshot(
    val partition: Int,
    val records: List<CapturedRecord>,
    val startOffset: Long,
    val endOffset: Long
)

data class CapturedRecord(
    val key: ByteArray,
    val value: ByteArray,
    val headers: Map<String, ByteArray>,
    val timestamp: Long,
    val offset: Long
)

One thing I want to stress: headers and timestamps must be preserved. Many consumers branch on header metadata or use event-time semantics. Strip those and your replay tells you nothing.

Exactly-once replay semantics

Determinism means controlling three things: record order, wall-clock time, and external dependencies.

Record order is straightforward. Replay partition-by-partition at captured offsets. Time is harder. For windowed aggregations, you need a TimeProvider abstraction injected into your consumer:

interface TimeProvider {
    fun now(): Instant
}

class ReplayTimeProvider(
    private val records: Iterator<CapturedRecord>
) : TimeProvider {
    private var current: Instant = Instant.EPOCH
    
    fun advanceTo(record: CapturedRecord) {
        current = Instant.ofEpochMilli(record.timestamp)
    }
    
    override fun now(): Instant = current
}

For windowed aggregations (tumbling, hopping, session windows), the replay engine advances time only when the next record’s timestamp crosses a window boundary. This guarantees that window triggers fire identically to production.

External dependencies like database lookups and API calls get replaced with captured responses stored in the snapshot manifest. This is the one place where recording overhead matters. A snapshot with externalized dependency responses adds roughly 15-30% storage overhead, but that’s a worthwhile trade. External state is the single largest source of replay non-determinism, and you want it gone.

Content-addressable output hashing

After replay completes, every side-effect the consumer produced gets hashed:

Output type	Hashing strategy
Produced records (to output topics)	SHA-256 of ordered (key, value, headers) tuples
Database writes	Deterministic serialization of SQL statements + params
HTTP calls (captured)	SHA-256 of (method, path, body) tuples
State store snapshots	Sorted key-value iteration hash

The composite hash becomes the baseline fingerprint for that consumer version against that snapshot. Store it as a CI artifact.

CI integration that blocks deploys

The pipeline stage is simple, and it has to be a hard gate:

replay-regression-test:
  stage: verify
  script:
    - ./replay-harness run \
        --snapshot s3://snapshots/orders-topic/2026-06-10 \
        --consumer-image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA \
        --baseline-hash $(cat baseline/orders-consumer.sha256)
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  allow_failure: false

When the output hash diverges, the harness produces a structured diff showing exactly which records produced different output, which fields changed, and the schema versions involved. Engineers review the diff and either update the baseline (intentional change) or fix the regression.

Handling schema evolution during replay

The replay engine resolves schemas through the mirrored registry at capture time, not the live registry. When testing a new consumer version that expects Schema v3 against a snapshot captured under Schema v2, the harness applies the registry’s compatibility rules (BACKWARD, FORWARD, FULL) to verify the evolution is legal, then lets Avro/Protobuf deserialization handle the actual field resolution.

This is where most breaking changes surface. A field marked optional in v3 that was required in v2 suddenly gets null values during replay that never existed in production — yet. I find this class of bug genuinely unnerving because it means your schema registry says “compatible” while your consumer logic says “crash” (or worse, “silently wrong”).

What to take from this

Capture real production streams, not synthetic data. Content-addressable snapshots with full header and timestamp metadata are the only reliable replay source. Budget 5-10GB per critical topic snapshot and rotate weekly.

Inject time as a dependency. If your consumers call Instant.now() or System.currentTimeMillis() directly, deterministic replay of windowed aggregations is impossible. Abstract time from the start. Retrofitting it later is painful — I’ve done it, and it touches more code than you’d expect.

Make baseline hash comparison a deploy gate, not a report. Advisory-only replay tests get ignored. Wire the output hash comparison into your CI pipeline as a hard block on merge, with an explicit approval workflow for intentional baseline updates. If it doesn’t block the build, it doesn’t exist.