How to avoid duplicate data in your Kafka streaming application

Trisha
Mar 22, 2024

Recently, I had to solve an interesting issue at work in our Kafka streaming application: de-duplication.

A part of the pipeline reads messages from Kafka, decodes them, aggregates them, and uploads the result to object storage:

1) Read protobuf messages from Kafka in batches
2) Decode the messages in batches
3) Apply some logic to the messages
4) Write the results to local files
5) Upload the local files to object storage
6) Commit offsets on the Kafka consumer

Problem:

As you can see, our application cannot tolerate data loss, so we replaced auto-commit with a manual commit that happens only after the data has been fully processed and uploaded; in other words, the data is processed at least once. Whenever the pipeline was unstable, we saw more data than expected. For example, if the process dies during step 5, before the commit in step 6, then on restart the records that were already uploaded get processed again, resulting in duplicate data.
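To make that failure window concrete, here is a minimal sketch of the loop described above, using kafka-python with auto-commit disabled. The group id, topic name, poll settings, and the process_and_upload helper are placeholders rather than our production code.

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers='hostname:port',
    group_id='aggregator',           # placeholder group id
    enable_auto_commit=False,        # commit manually, only after upload succeeds
)
consumer.subscribe(topics=['events'])  # placeholder topic name

while True:
    # Steps 1-2: read a batch of messages (decoding happens downstream)
    batch = consumer.poll(timeout_ms=1000, max_records=500)
    if not batch:
        continue

    records = [msg for msgs in batch.values() for msg in msgs]

    # Steps 3-5: aggregate, write locally, upload to object storage
    # (process_and_upload is a hypothetical helper standing in for that work)
    process_and_upload(records)

    # Step 6: commit only after the upload succeeded.
    # If the process dies between the upload and this commit, the same
    # records are consumed and uploaded again on restart.
    consumer.commit()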

Cause of the issue: dynamic partition assignment via subscribe

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers='hostname:port',
    group_id='my-consumer-group',  # group management is what triggers rebalances
)
# list_of_topic_names is a list of topic name strings defined elsewhere
consumer.subscribe(topics=list_of_topic_names)

Initially, we had n consumers running on Kubernetes, connected to the Kafka broker via subscribe(). With subscribe(), partition assignment is managed dynamically by the consumer group coordinator, so partitions can move between consumers whenever group membership changes. During a Kubernetes upgrade (canary deployment), pods were restarted and upgraded one by one, which caused the Kafka broker to repeatedly rebalance the partitions across the consumers. Which let the topic…
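One common way to shrink the duplicate window during such rebalances is to finish and commit in-flight work before partitions are taken away. Below is a minimal sketch using kafka-python's ConsumerRebalanceListener; flush_pending_work is a hypothetical helper, and this is an illustration of the general technique rather than the exact fix we shipped.

from kafka import KafkaConsumer, ConsumerRebalanceListener

class CommitOnRevoke(ConsumerRebalanceListener):
    def __init__(self, consumer):
        self.consumer = consumer

    def on_partitions_revoked(self, revoked):
        # Finish processing/uploading whatever is in flight for the revoked
        # partitions, then commit, so whichever consumer picks these
        # partitions up does not re-read the same records.
        flush_pending_work(revoked)  # hypothetical helper
        self.consumer.commit()

    def on_partitions_assigned(self, assigned):
        # Nothing extra needed for this example.
        pass

list_of_topic_names = ['events']  # placeholder

consumer = KafkaConsumer(
    bootstrap_servers='hostname:port',
    group_id='my-consumer-group',
    enable_auto_commit=False,
)
consumer.subscribe(topics=list_of_topic_names, listener=CommitOnRevoke(consumer))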
