Apache Kafka
A distributed event streaming platform built for high-throughput, fault-tolerant, real-time data pipelines. This guide covers every concept from fundamentals to production failure scenarios.
Event-Driven Architecture (EDA)
OrderPlaced, PaymentCompleted, UserSignedUp). Producers emit events to a broker without knowing or caring who (if anyone) will consume them, and consumers subscribe to whichever events they care about. This completely decouples services in time, space, and knowledge — the producer and consumer don't need to be online at the same time, don't need to know each other's network address, and don't share types or APIs beyond the event schema.- Asynchronous: producer fires an event and immediately moves on — no waiting for consumers.
- Loose coupling: producer has zero knowledge of consumers; consumers can be added or removed without producer changes.
- Eventual consistency: state converges across services over milliseconds to seconds, not instantly.
- Replay-ability (log-based): events are persisted so new consumers can rebuild state from the beginning.
- Scalability: each service scales independently based on its own throughput — no "slowest link" bottleneck.
- Push-based Pub/Sub: broker pushes messages to active subscribers. If consumer is down, the message is lost (or dead-lettered). Examples:
RabbitMQ,AWS SNS,Google Pub/Sub,Redis Pub/Sub. - Pull-based Streaming (log-based): events are appended to a durable log and consumers pull at their own pace from any offset. Examples:
Apache Kafka,AWS Kinesis,Apache Pulsar,Redpanda.
Jaeger, OpenTelemetry). Schema evolution becomes critical: a breaking change to an event payload can silently corrupt downstream consumers. Exactly-once delivery requires careful idempotency design.Services don't call each other directly. Instead, a service emits events and other services react to those events.
What is an Event?
- Immutable — once created, it can never be changed. It's a fact that happened.
- Happened in the past — an event represents something that already occurred (OrderPlaced, PaymentCompleted).
- Self-contained — carries all the data needed to understand what happened. No need to query another service.
REST Microservices vs EDA Microservices
REST (Synchronous) Challenges
- Availability — all services must be up. If any one service is down, the entire API call fails.
- Latency — total latency = sum of all service latencies in the chain.
- Cascading Failure — if Service 3 is slow or failing, Service 2 backs up, then Service 1 goes down too.
- Tight Coupling — services know about each other, changes ripple across teams.
- Scaling Issue — scaling Service 3 has no benefit if Service 1 and 2 can only handle the same request rate.
EDA (Asynchronous) Advantages
- Loose Coupling — producer doesn't know or care about consumers.
- Independent Scaling — each service scales on its own based on its throughput.
- Resilience — temporary failures don't break the whole system. Events wait in the broker.
- Replay — you can replay old events for recovery, debugging, or new consumer onboarding.
- Latency Improvement — fire-and-forget pattern. Producer doesn't block waiting for downstream.
Two Types of EDA
Push-Based (Pub/Sub)
- Broker pushes events to active consumers.
- Only cares about delivery.
- You cannot replay old events.
- Events published only to currently active subscribers.
- Examples: RabbitMQ, AWS SNS, Google Pub/Sub.
Pull-Based (Streaming / Kafka)
- Consumer pulls events at its own pace.
- Events are appended to logs — persisted on disk.
- You can replay events from any offset.
- Retention-based: events stay forever or until configured TTL.
- Examples: Apache Kafka, Amazon Kinesis, Redpanda.
What is Apache Kafka?
- Pub/Sub messaging: publish events, subscribe with one or many consumer groups.
- Stream processing: via
Kafka StreamsandksqlDBfor joining, aggregating, and transforming streams in real time. - Storage: durable, replicated, configurable retention (hours, days, forever).
- Integration:
Kafka Connectprovides 100+ prebuilt source/sink connectors (JDBC, S3, Elasticsearch, MongoDB, etc.). - Exactly-once semantics: transactional writes across multiple partitions and topics since Kafka 0.11.
- vs RabbitMQ / ActiveMQ: Those are queue-based brokers — messages are deleted after consumption. Kafka is log-based — messages stay for the entire retention window and can be replayed. RabbitMQ excels at complex routing (exchanges, headers, topics); Kafka excels at raw throughput and replay.
- vs AWS SQS: SQS is a fully managed distributed queue with at-least-once delivery and no replay. Kafka gives you ordered partitions, replay, and far higher throughput, but you manage infrastructure (or use MSK/Confluent Cloud).
- vs AWS Kinesis: Kinesis is Kafka's closest AWS-native competitor — also partitioned, also log-based — but it's managed, has lower throughput per shard, charges per shard-hour, and has 7-day max retention (24 hours by default).
- vs Apache Pulsar: Pulsar separates compute (brokers) from storage (BookKeeper), offers tiered storage natively, and supports geo-replication out of the box. Kafka's ecosystem and tooling are more mature.
- vs NATS / Redis Streams: NATS is lighter and faster for small messages but lacks Kafka's durability guarantees at scale. Redis Streams is simpler but not designed for multi-terabyte workloads.
A distributed event streaming platform that allows you to publish, store, and subscribe to streams of events in real-time.
In simple words: Kafka is like a super-fast, never-forgetting diary. Applications write events (things that happened) into this diary, and other applications can read those events whenever they want — even days later. The diary can handle millions of entries per second and never loses a page.
Publish Events
Producers write events (messages) to Kafka topics. Events are key-value pairs with optional headers and a timestamp. Producers choose which topic and optionally which partition.
Store Events
Events are durably stored on disk across multiple brokers with configurable replication. Kafka can retain events for hours, days, or forever. Storage is sequential and immutable.
Subscribe to Events
Consumers pull events from topics at their own pace. Multiple consumer groups can independently read the same data. Each group tracks its own position (offset) in the log.
Producer
Java (official), with high-quality native clients in librdkafka (C/C++), confluent-kafka-go, kafkajs (Node.js), aiokafka (Python), and more.acks:0(fire-and-forget),1(leader ack),all(all in-sync replicas ack — strongest durability).linger.ms: how long to wait to build a batch before sending (higher = better throughput, higher latency).batch.size: max batch size in bytes per partition (default 16 KB).compression.type:none,gzip,snappy,lz4,zstd— compresses entire batches, reducing network and disk cost by 3–10×.enable.idempotence: deduplicates retries using PID + sequence numbers, ensures exactly-once per partition.max.in.flight.requests.per.connection: affects ordering guarantees with retries.
- vs RabbitMQ publishers: RabbitMQ publishers send to an exchange which routes to queues based on routing keys and bindings. Kafka producers write directly to topic partitions — no routing logic in the broker.
- vs SQS SendMessage: SQS is one-message-at-a-time HTTP calls (or batches of 10); Kafka producers batch thousands of records in a single TCP request.
- vs Kinesis PutRecord: Kinesis has HTTP-based PutRecords with 500 records/batch limit and 1 MB/sec per shard write cap. Kafka has no hard per-partition write cap beyond what the broker can handle.
- vs Pulsar producers: Very similar API but Pulsar producers can optionally use broker-side deduplication without the client tracking sequence numbers.
producer.flush() or producer.close() before shutdown drops unsent batches. Using acks=1 in production can lose data on leader failure. acks=all without min.insync.replicas=2 is a trap — a single replica still "works" but gives no real durability. Sending without a key gives round-robin partitioning, destroying ordering for related events.An application that publishes events to Kafka. The producer does not care who will consume the event. It's fire-and-forward.
In simple words: A producer is like a news reporter. They write the news story and send it to the newspaper office (Kafka). They don't know or care who reads the newspaper — that's not their job. Their job is just to report what happened.
Producer Record Structure
Producer ACK Configurations
The producer can control how many acknowledgements it waits for before considering the write successful.
acks = 0 (Fire & Forget)
Producer sends the record and does NOT wait for any response. Fastest but data can be lost if broker crashes before writing. Use for metrics/logs where loss is acceptable.
acks = 1 (Leader Only)
Producer waits for the leader broker to write to its local log and acknowledge. Balanced between speed and safety. Data can be lost if leader crashes before followers replicate.
acks = all (All ISR)
Producer waits for the leader AND all in-sync replicas (ISR) to write the event. Slowest but safest. No data loss as long as at least one ISR is alive. Production recommended.
Topic
- Name: A string identifier (e.g.,
orders.placed.v1,user.signups). - Partition count: Fixed at creation (can be increased later but not decreased). Determines max consumer parallelism.
- Replication factor: How many brokers hold a copy of each partition — typically
3in production. - Retention: Time-based (
retention.ms) or size-based (retention.bytes), or compacted (keep only the latest value per key). - Cleanup policy:
delete(default) orcompact.
- vs RabbitMQ queue: A RabbitMQ queue is consumed destructively — once a consumer reads a message, it's gone. A Kafka topic is a durable log — all consumers can read all messages independently until retention expires.
- vs RabbitMQ exchange/topic: RabbitMQ's "topic exchange" does pattern-based routing to bound queues. Kafka topics have no routing logic — the topic name IS the destination.
- vs SQS queue: SQS has a single consumer group per queue (roughly). Kafka topics support unlimited independent consumer groups, each with their own offset.
- vs Kinesis stream: Near-identical concept — Kinesis streams are partitioned logs too, but Kinesis caps throughput per shard and has lower max retention.
- vs Pulsar topic: Pulsar topics are more flexible — they can be non-partitioned or partitioned, and support per-topic subscription modes (exclusive, shared, failover).
<domain>.<entity>.<event>.v<version>, e.g., billing.invoice.created.v2. Avoid generic names like events or data. Version the topic name so breaking schema changes can coexist with old consumers. Use dots, hyphens, or underscores consistently.auto.create.topics.enable=true is a production footgun — disable it and use explicit topic creation via Admin API or IaC.ActivityEvent, PageViewEvent. Uber has topics per city/region for dispatch events, totaling tens of thousands across the fleet.A Topic is a logical category or folder to which events are published. Think of it as a named stream. Many producers can write to the same topic, and many consumer groups can read from the same topic independently.
In simple words: A topic is like a TV channel. Channel "order-events" shows all order-related news. Channel "payment-events" shows all payment-related news. Producers broadcast to a channel, and consumers tune into whichever channels they care about.
Topic Properties
| Property | Description | Default |
|---|---|---|
| num.partitions | Number of partitions when topic is created | 1 |
| replication.factor | How many copies of each partition across brokers | 1 (set to 3 in prod) |
| retention.ms | How long to keep events before deletion | 604800000 (7 days) |
| retention.bytes | Max size of partition log before old segments are deleted | -1 (unlimited) |
| cleanup.policy | Whether to delete or compact old events | delete |
| min.insync.replicas | Minimum ISR count for acks=all writes to succeed | 1 (set to 2 in prod) |
| segment.bytes | Size of each segment file in the partition log | 1073741824 (1 GB) |
Internal Topics (created by Kafka itself)
| Topic Name | Partitions | Purpose |
|---|---|---|
| __consumer_offsets | 50 (default) | Stores committed offsets for every consumer group. Keyed by (groupId, topic, partition). |
| __transaction_state | 50 (default) | Stores transaction metadata for exactly-once semantics. |
| _cluster_metadata.log | 1 | KRaft cluster metadata log (replicated across controllers). |
Partitions
N partitions can be consumed in parallel by up to N consumer instances in a group.- Partition count = max consumer parallelism per consumer group.
- Leader-follower replication: one broker is the leader for a partition; followers replicate from it.
- Offsets are per-partition, not per-topic — each consumer tracks (topic, partition, offset) tuples.
- Keyed writes: records with the same key always land in the same partition (via hash partitioning by default), preserving per-key ordering.
- Partitions cannot be decreased — only added. And adding partitions breaks existing key-to-partition mapping.
- vs RabbitMQ queue sharding: RabbitMQ can shard queues via the sharding plugin, but ordering and consumer group semantics are not first-class.
- vs SQS: SQS has no partitions — FIFO queues use "message group IDs" to give per-group ordering but throughput is capped at ~3000 msg/sec per group.
- vs Kinesis shards: Kinesis shards are the direct analog — same idea, same constraints, but each shard caps at 1 MB/sec writes and 2 MB/sec reads, whereas Kafka partitions have no hard cap.
- vs Pulsar: Pulsar partitions are similar but can be dynamically added without the broker doing heavy rebalancing since storage is in BookKeeper.
A partition is a physical, ordered, append-only log that is part of a topic. It's where events are actually stored on disk.
In simple words: If a topic is a book, then partitions are the chapters. The book "order-events" might have 3 chapters (partitions). Each chapter is stored on a different shelf (broker). Events are written at the end of a chapter — you never insert pages in the middle. This makes writing super fast.
Three Key Properties
Physical
Events are actually stored on disk inside the partition as log files. Each partition maps to a directory on the broker's filesystem containing segment files. Example: /data/order-events-0/
Ordered
Events are written sequentially and assigned an incrementing offset (starting from 0). Consumers read in offset order. Within a single partition, order is guaranteed. Kafka never re-orders events inside a partition.
Append-Only
Events are only appended to the end of the log. No insert in the middle. No updates. No deletes of individual records. The log only grows forward. This is what makes Kafka's sequential disk I/O so fast.
Partitioning Strategies
Partitioner class before the record ever reaches the broker. The default Java partitioner uses murmur2 hash of the key modulo partition count. If the key is null, Kafka 2.4+ uses "sticky partitioning" — round-robin in batches for better batching efficiency. Kafka lets you implement custom partitioners by implementing the Partitioner interface.- Key-hash (default when key is set):
partition = hash(key) % num_partitions. Guarantees same key → same partition → preserves ordering. - Round-robin (pre-2.4, null key): spreads records evenly across partitions but creates many small batches.
- Sticky partitioner (2.4+, null key): fills one partition's batch fully, then switches — much better batching throughput.
- Manual: producer explicitly passes a partition number, bypassing the partitioner entirely.
- Custom: implement your own logic — e.g., geographic partitioning, tenant-based, or time-bucketed.
- vs RabbitMQ routing keys: RabbitMQ routes via exchange bindings and pattern matching (server-side). Kafka partitioning is deterministic hash on the client.
- vs SQS FIFO message group ID: SQS groups messages by "message group ID" for per-group ordering — similar intent, but SQS caps per-group throughput.
- vs Kinesis partition key: Essentially identical — both use a hash function on a client-supplied key.
- vs Pulsar: Pulsar has similar partitioning but also supports "key_shared" subscription mode where consumers get consistent per-key routing automatically.
userId as the key when you need per-user ordering (e.g., account updates). Use orderId for order event streams. Use no key (null) for fire-and-forget logs where ordering doesn't matter and you want maximum throughput.Producer chooses which partition of a topic the event goes to. This decision impacts ordering guarantees and load distribution.
Key-Based Partitioning
partition = hash(key) % num_partitions
All events with the same key go to the same partition. This guarantees ordering for related events. Example: all events for orderId "123" always go to the same partition.
Ordering Guaranteed Possible Hotspots
Round Robin (No Key)
When key = null, events are distributed in round-robin across partitions.
E0 → P0, E1 → P1, E2 → P2, E3 → P0 ... Even distribution but ordering is NOT guaranteed. Related events may land in different partitions.
Even Distribution No Ordering
Custom Partitioner
You implement your own Partitioner interface with business logic.
Example: country == "India" → P0, country == "US" → P1. Gives full control over data locality and processing affinity.
Full Control Custom Logic
Segments & Indexing
.log file containing the actual record bytes, plus two index files: .index (offset → file position) and .timeindex (timestamp → offset). When a segment fills up or ages out, Kafka "rolls" to a new segment and the old one becomes immutable and eligible for deletion or compaction. Segmenting is what makes Kafka's log storage practical — you can delete old data by just unlinking files, and you can memory-map individual segments for efficient reads.00000000000000000000.log: the actual records, append-only.00000000000000000000.index: sparse offset index for binary search (one entry every ~4 KB by default).00000000000000000000.timeindex: sparse timestamp index enabling offset-by-time lookups..snapshot(newer versions): producer state snapshot for idempotence.
- vs RabbitMQ mnesia: RabbitMQ's classic queues store messages in memory-backed mnesia with disk spillover — not a flat log file.
- vs database storage: Databases use B-trees or LSM trees with random writes. Kafka is pure sequential append — vastly faster on spinning disk and still very fast on SSD.
- vs Pulsar/BookKeeper: Pulsar stores data in BookKeeper ledgers which are also segmented logs, but with erasure coding and tiered storage built in.
- vs Kinesis: Kinesis abstracts storage entirely — you never see segment files; AWS manages them.
index.interval.bytes (default 4 KB) of log data. To find an exact offset, Kafka does a binary search in the index to find the nearest entry, then scans forward in the log. This keeps indexes small (a few MB per GB of log) and still gives O(log n) lookups.segment.bytes=10MB on a high-traffic topic you create millions of files, exhausting file descriptors. Too-large segments: with segment.bytes=10GB you can't delete expired data until the whole segment rolls. Time-based retention only kicks in after segment roll — so a stale segment can stick around past its TTL.Each partition log file is further divided into segments. Instead of one massive log file, Kafka splits it into smaller, manageable pieces for faster reads and efficient cleanup.
In simple words: Imagine writing a very long diary. Instead of one 10,000-page book, you split it into volumes: Volume 1 (pages 1-500), Volume 2 (pages 501-1000), etc. When someone asks for page 750, you grab Volume 2 directly instead of flipping through the entire diary. That's what segments do for Kafka.
Why Segments?
If a consumer says "read from offset 550 to 800 of Partition 0", Kafka can skip directly to the right segment file instead of scanning from the beginning. Also, retention and compaction operate at the segment level — Kafka deletes or compacts entire segment files.
Sparse Index (.index files)
Even with segments, a 1 GB segment file would still need sequential scanning. So Kafka maintains a sparse index per segment. Not every offset is indexed. Instead, it creates one index entry after every N bytes written (configured by log.index.interval.bytes, default 4096 bytes).
Segment Log (00000000.log)
| Offset | Event Size | File Position |
|---|---|---|
| 0 | 300 bytes | 0 |
| 1 | 500 bytes | 300 |
| 2 | 1000 bytes | 800 |
| 3 | 2000 bytes | 1800 |
| 4 | 500 bytes | 3800 |
Total bytes: 300+500+1000+2000+500 = 4300 bytes
Sparse Index (00000000.index)
| Offset | Position (bytes) |
|---|---|
| 4 | 3800 |
Index interval = 4096 bytes. After 4300 bytes written (> 4096), Kafka records offset 4 at position 3800. To find offset 3, Kafka looks at index, finds nearest lower entry, then scans sequentially from there.
Broker
broker.id. A Kafka cluster is simply a group of brokers that know about each other via ZooKeeper (legacy) or KRaft quorum (modern). Brokers are stateful — they own data on local disk — so they're typically deployed on dedicated machines with fast SSDs.- Host partition replicas: each broker stores some partitions as leader and others as follower.
- Serve producer requests: accept writes to partitions it leads, append to the log, replicate to followers.
- Serve consumer fetch requests: return records from the log (possibly via zero-copy
sendfile()). - Handle replication: followers fetch from leaders continuously.
- Participate in leader election and ISR management via the controller.
- Expose JMX metrics for monitoring (under-replicated partitions, request rate, disk I/O, etc.).
- vs RabbitMQ node: RabbitMQ nodes also hold queues and form a cluster, but don't natively partition a queue across nodes — you need sharding plugins or mirrored queues.
- vs SQS: SQS has no user-visible brokers — AWS fully manages the backend.
- vs Kinesis: Kinesis shards are stored on AWS-managed infrastructure; there's no broker concept for customers.
- vs Pulsar broker: Pulsar brokers are stateless — storage lives in BookKeeper bookies. This separation lets Pulsar scale compute and storage independently.
replication.factor=3 survives one broker failure with zero data loss. More brokers = more partitions = more parallelism. A typical production cluster runs 5–50 brokers; the largest known clusters (LinkedIn, Netflix) run 100s of brokers per cluster.kafka-reassign-partitions.sh or Cruise Control. Disk I/O: Kafka is sequential-write-heavy — use XFS or ext4 on SSD. Page cache: Kafka relies on the OS page cache for read performance — give brokers lots of RAM (JVM heap 6 GB, rest to page cache). JVM GC tuning: use G1GC with small pause targets.A Broker is a single Kafka server instance. It's the one that actually stores data on disk and serves client requests (both producers and consumers).
In simple words: A broker is like a warehouse worker. Each worker (broker) manages some shelves (partitions). Producers give packages (events) to the right worker, and consumers ask the right worker for their packages. Multiple workers together form a team (cluster).
Broker Responsibilities
- Store partition data — persist events to disk in segment files.
- Serve producer writes — accept new events and write to leader partitions.
- Serve consumer reads — return events from leader partitions based on offset.
- Replicate data — if hosting a follower partition, continuously fetch from the leader to stay in sync.
- Send heartbeats — report liveness to the controller.
- Serve metadata requests — tell clients which broker hosts which partition leader.
Broker Networking
Kafka uses a custom binary protocol over TCP. Default port is 9092. Each broker has a unique broker.id. Clients connect to any broker (bootstrap) to discover the full cluster topology, then connect directly to the partition leaders they need.
Kafka Cluster
bootstrap.servers list) and then fetching metadata to learn about all brokers and partition leaders.- Brokers: the data plane — store and serve partitions.
- Controller: one elected broker that manages partition leadership and metadata changes.
- Metadata quorum (KRaft) or ZooKeeper ensemble: coordinates cluster state.
- Bootstrap servers: initial contact points for clients.
- Listeners: network endpoints (internal, external, SASL) brokers advertise.
- vs RabbitMQ cluster: RabbitMQ clusters share metadata via Erlang distribution, and high availability requires mirrored queues or quorum queues. Kafka has replication as a first-class primitive.
- vs Kinesis stream: A Kinesis stream is the closest AWS-native analog to a Kafka cluster, but it's bounded by shard limits and AWS-region scope.
- vs Pulsar cluster: Pulsar clusters separate brokers (stateless) from BookKeeper bookies (storage), making independent scaling easier.
- vs Redis Cluster: Redis Cluster is for key-value sharding, not durable event logs.
replication.factor=3. Scale out when you hit CPU, disk I/O, or network bottlenecks. Rule of thumb: one broker can handle ~100 MB/s of combined produce+consume with typical workloads. Cross-AZ deployment is mandatory for durability in cloud environments (3 brokers across 3 AZs).broker.rack) must be set to ensure replicas land on different AZs. Bootstrap servers should list multiple brokers — if you list only one and it's down, clients can't connect.A Kafka Cluster is a group of brokers working together to provide scalability, fault tolerance, and high availability.
Scalability
Distribute load across multiple servers. More brokers = more partitions = more throughput. Producers and consumers talk to different brokers in parallel.
Fault Tolerance
If a broker goes down, its partitions have replicas on other brokers. Data is never lost (with replication factor >= 2). Automatic failover via leader election.
High Availability
No single point of failure. Controller has standby nodes. Every partition has replicas. Clients automatically reconnect to new leaders. System continues even if minority of brokers fail.
Leader-Follower Partition Replication
replication.factor: number of copies per partition (typically 3).- Leader: handles all client requests. There's exactly one leader per partition at a time.
- Followers: passively replicate via
FetchRequestRPCs to the leader. - High watermark (HW): the highest offset all ISR have replicated — consumers can only read up to HW.
- Log end offset (LEO): the offset of the next record to be written on each replica.
- vs RabbitMQ mirrored queues: RabbitMQ had master-slave queue mirroring (deprecated); its modern alternative is quorum queues (Raft-based). Kafka's replication is more mature and proven at scale.
- vs database replication: Very similar to PostgreSQL streaming replication, except each partition is replicated independently and leaders are distributed across brokers for load balancing.
- vs Kinesis: Kinesis shards are replicated across AZs by AWS transparently — no user-visible leader concept.
- vs Pulsar: Pulsar uses BookKeeper's quorum writes — every write goes to multiple bookies in parallel rather than leader→follower chained replication.
replication.factor=3 and min.insync.replicas=2, Kafka tolerates one broker failure with zero data loss and continues serving reads/writes. Cross-AZ replication tolerates an entire AZ outage.unclean.leader.election.enable=true) trades data loss for availability — avoid in production. Replication throttles are needed when adding brokers or reassigning partitions to avoid saturating the network.replication.factor=3 across AZs for all production topics. LinkedIn uses higher replication factors (4–5) for critical topics. Confluent recommends at minimum RF=3, MinISR=2 for production.For each partition, one broker is the Leader and the rest are Followers. This is the core of Kafka's fault tolerance.
In simple words: Imagine a group project. One person (leader) does the original work and everyone (producers/consumers) talks to them. The other members (followers) silently make copies of the leader's work. If the leader gets sick, one of the followers who has a complete copy can immediately take over. No work is lost.
Leader Responsibilities
- Handle ALL producer writes for that partition.
- Handle ALL consumer reads for that partition.
- Maintain partition log (segments, indexes).
- Coordinate with followers — track their replication progress.
- Manage ISR list — add/remove followers based on sync status.
Follower Responsibilities
- Continuously fetch (pull) data from the leader.
- Replicate data to stay in sync.
- Ready to become leader if current leader fails.
- Do NOT serve any client (producer/consumer) requests directly.
- Acknowledge successful replication back to leader (for acks=all).
ISR (In-Sync Replicas)
replica.lag.time.max.ms (default 30 seconds) of the leader's. Only ISR members are eligible for leader election on a leader failure. When a producer uses acks=all, the write is only considered committed once all ISR members have replicated it, so the ISR defines the durability boundary. If a follower falls too far behind, it's removed from the ISR; when it catches up, it rejoins.min.insync.replicas: minimum ISR size required foracks=allwrites to succeed. Typical production value:2.replica.lag.time.max.ms: how long a follower can lag before being kicked out of ISR.unclean.leader.election.enable: iftrue, allows non-ISR replicas to become leader (data loss risk).
- vs Raft quorum (etcd, Consul): Raft requires a majority (quorum) for writes. Kafka's ISR is "flexible quorum" — the leader decides which replicas count based on recent liveness, not a fixed majority.
- vs RabbitMQ quorum queues: Quorum queues use strict Raft — a write needs majority acks. Kafka ISR can shrink or grow dynamically.
- vs Pulsar/BookKeeper: BookKeeper uses an "ensemble" concept with write/ack quorums, more like strict Raft.
acks=all and min.insync.replicas=2, every committed write is guaranteed to survive on at least 2 replicas before the producer gets an ack — giving you one broker failure tolerance without data loss. Lower min.insync.replicas means more availability but less durability.min.insync.replicas, producers get NotEnoughReplicasException — availability degrades in exchange for durability. Always alert on UnderMinIsrPartitionCount.min.insync.replicas=2 for every topic. Log aggregation clusters sometimes accept min.insync.replicas=1 to keep writing during degraded conditions. Cloudflare famously published details about ISR tuning for their event pipeline.ISR is the list of replicas (including the leader) that are fully caught up with the leader's log. This is cluster metadata maintained by the Controller.
In simple words: Imagine you're a teacher (leader) and you have 2 students (followers) copying your notes. ISR is the list of students who are up-to-date with your notes. If a student falls behind (didn't copy the last 5 pages), they're temporarily removed from the ISR until they catch up.
How Does the Leader Decide a Follower is Out-of-Sync?
- A follower is "in-sync" if its replica offset is close enough to the leader's log-end offset.
- Specifically, if the follower hasn't fetched within replica.lag.time.max.ms (default 30 seconds), it's removed from ISR.
- If a follower stops fetching entirely, or falls too far behind, the leader requests the Controller to shrink the ISR.
- When the follower catches up again, it's added back to the ISR.
Why ISR Matters
When producer uses acks=all, the leader waits for ALL replicas in the ISR (not all replicas, just the in-sync ones) to acknowledge the write. This means:
- If ISR = {Broker1, Broker2, Broker3} and acks=all, leader waits for all 3.
- If Broker3 is removed from ISR, leader only waits for Broker1 + Broker2.
- This prevents a slow follower from blocking all writes.
Controller
- Leader election: when a broker fails, pick a new leader from the ISR of each affected partition.
- ISR updates: track which replicas are in sync and update ZK/KRaft.
- Topic/partition lifecycle: handle create/delete/alter topic requests.
- Partition reassignment: move partitions between brokers during rebalancing.
- Metadata propagation: push
LeaderAndIsrandUpdateMetadataRPCs to brokers.
- vs database cluster master: Similar role to the coordinator in etcd, Consul, or a PostgreSQL failover manager — elected, manages failover.
- vs RabbitMQ: RabbitMQ doesn't have a single controller; cluster state is maintained via Erlang distribution and mnesia replication.
- vs Pulsar: Pulsar has "bundle ownership" managed by ZooKeeper/etcd rather than a dedicated controller node.
- vs Kinesis: AWS hides this layer; customers never see a controller.
ActiveControllerCount — should always be exactly 1 across the cluster.The Controller is a special broker responsible for managing all cluster metadata. It's the brain of the Kafka cluster.
In simple words: The controller is like the manager of the warehouse. Workers (brokers) store packages, but the manager decides which worker handles which shelf, who takes over when a worker is sick, and keeps track of everything. There's only one active manager at a time, but backup managers are ready to step in.
Controller Responsibilities
Topic Management
Creates/deletes topics. Decides how partitions are distributed across brokers when a new topic is created.
Partition Leader Election
Elects leader and follower replicas for every partition. Re-elects leaders when a broker fails.
Failure Detection
Monitors broker heartbeats. Detects when brokers go down and triggers recovery actions.
Metadata Distribution
Notifies all brokers about changes — new topics, leader changes, ISR updates. Brokers cache this metadata to serve client requests.
Controller can have dual roles
- Dual responsibility: Normal Broker + Controller role (common in smaller clusters).
- Dedicated Controller: Only handles controller duties, no partition storage (recommended for large clusters).
KRaft (Kafka Raft Consensus)
__cluster_metadata topic. The rest of the brokers are regular data nodes that subscribe to metadata updates from the quorum. This eliminates the operational burden of running a separate ZooKeeper ensemble and dramatically improves scalability — KRaft can handle 10× the partition count of ZK-based clusters.- Single system to operate: no separate ZooKeeper cluster to install, monitor, upgrade, secure.
- Millions of partitions: scales to much larger clusters than ZK-based Kafka.
- Fast controller failover: seconds instead of minutes.
- Simpler security model: one set of credentials, one TLS config, one ACL system.
- Faster startup and shutdown of controllers.
- vs ZooKeeper mode: ZK uses ZAB (ZooKeeper Atomic Broadcast) protocol, runs as a separate process, has its own client protocol and ACLs. KRaft is embedded, uses Raft, and removes an entire failure mode.
- vs etcd/Consul: etcd and Consul are general-purpose Raft-based KV stores used by Kubernetes, Nomad, etc. KRaft is Raft specialized for Kafka's metadata model.
- vs Pulsar: Pulsar still depends on ZooKeeper for metadata (although Pulsar is working on alternatives).
- vs Redpanda: Redpanda has always used Raft natively (no ZK dependency) — it's one of Redpanda's core design differences from historical Kafka.
__cluster_metadata is a compacted topic — don't mess with its settings.KRaft is the built-in consensus mechanism that replaced ZooKeeper (deprecated in Kafka >= 3.x). It uses the Raft consensus algorithm to coordinate multiple controllers.
In simple words: Imagine 3 managers in a company. Only 1 can be the CEO at a time. When the current CEO leaves, the remaining managers vote to pick a new CEO. That voting process is "consensus." KRaft is Kafka's built-in voting system. Before KRaft, Kafka used an external voting system called ZooKeeper (like hiring an outside agency to run the election).
Why Do We Need Consensus?
- A single controller is a bottleneck — if it dies, the whole cluster is stuck.
- So we run multiple controllers (typically 3 or 5).
- But who decides which controller is the leader? We can't use another controller (infinite loop).
- Answer: the controllers elect a leader among themselves using the Raft consensus algorithm.
KRaft vs ZooKeeper
ZooKeeper (Legacy)
- External distributed system.
- Deployed separately.
- Monitored separately.
- Maintained separately from Kafka.
- Uses ZAB consensus protocol.
- Additional operational overhead.
- Deprecated in Kafka 3.x+.
KRaft (Modern)
- Built-in to Kafka.
- No external dependency.
- Single system to deploy and monitor.
- Uses Raft consensus protocol.
- Less operational overhead.
- Faster failover.
- Default since Kafka 3.3+.
Quorum
Quorum means majority of nodes must agree before any decision is finalized.
| Controller Nodes | Quorum (N/2 + 1) | Can Tolerate Failures |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
KRaft Metadata Commit Flow
Producer or Admin CLI sends a request (e.g., create topic) to any broker, which forwards it to the active controller.
Creates the topic/partitions, decides leader/followers, creates a metadata record with the next offset (e.g., offset 100). Writes it to its local _cluster_metadata.log but marks it as NOT committed yet.
Active controller sends the new metadata record to standby controllers via heartbeats.
Each standby appends the record to their local _cluster_metadata.log (not committed yet) and sends ACK back.
Once a majority (quorum) of controllers have written successfully, the active controller marks the record as committed. Updates last committed offset to 100.
Active controller sends heartbeats with the last committed offset to all brokers. Brokers fetch and apply the new metadata.
When standby controllers receive the next heartbeat with the committed offset, they also mark the record as committed in their local log.
Consumer & Consumer Groups
group.id and collectively consume a topic such that each partition is read by exactly one consumer in the group. This gives you horizontal scalability: more consumers = more parallelism, up to the partition count. Multiple groups on the same topic are independent — each group reads the full stream independently, enabling a fanout pattern (analytics group, billing group, ML training group all consuming the same events).- group.id: the group identifier. All consumers sharing it cooperate.
- Partition assignment: Kafka assigns partitions to consumers via strategies like
RangeAssignor,RoundRobinAssignor,StickyAssignor, orCooperativeStickyAssignor. - Offsets: each group tracks its own committed offset per (topic, partition) in the internal
__consumer_offsetstopic. - Rebalancing: when consumers join/leave, partitions are redistributed — potentially pausing the group briefly.
- Heartbeats: consumers send heartbeats via
heartbeat.interval.msto stay in the group.
- vs RabbitMQ competing consumers: RabbitMQ supports multiple consumers on one queue with load distribution, but no partition-based ordering and no replay.
- vs SQS consumers: SQS consumers compete for messages from a single queue. There's no concept of partition assignment or per-consumer offset.
- vs Kinesis Client Library (KCL): KCL is the closest analog — assigns shards to workers, tracks checkpoints in DynamoDB.
- vs Pulsar subscription modes: Pulsar offers Exclusive, Shared, Failover, and Key_Shared — more flexible than Kafka's single consumer-per-partition model.
max.poll.interval.ms to process a batch, it gets kicked out of the group, triggering rebalancing. Rebalance storms: frequent consumer restarts can cause constant reassignment — use static membership (group.instance.id) to mitigate. Offset commit ordering: commit AFTER processing, never before.A Consumer is an application that reads (pulls) events from Kafka. Kafka uses a PULL model — the consumer asks for data, Kafka never pushes.
In simple words: A consumer is like a newspaper reader. Kafka doesn't throw the newspaper at your door — YOU go and pick it up when you're ready. A Consumer Group is a team of readers sharing the workload. If there are 3 newspaper sections, 3 readers can each take 1 section and read in parallel. But two readers from the same team never read the same section.
Consumer Groups
Every consumer belongs to a group.id. Within a group, work is divided. Across different groups, the same data is read independently.
Partition Assignment Scenarios
Consumers == Partitions (Ideal)
Topic: 3 partitions, Group: 3 consumers
C2 → P1
C3 → P2
Perfect Balance
Consumers < Partitions
Topic: 6 partitions, Group: 3 consumers
C2 → P1, P4
C3 → P2, P5
Each consumer handles more load
Consumers > Partitions
Topic: 3 partitions, Group: 5 consumers
C2 → P1
C3 → P2
C4 → IDLE
C5 → IDLE
Wasted consumers
Multiple Consumer Groups
The same partition CAN be read by multiple consumers that are in different groups. Each group independently tracks its own offset.
Offset Management
__consumer_offsets, keyed by (group.id, topic, partition). The committed offset represents the next record to read, not the last processed — so committing offset 42 means "I processed 0–41, start me at 42 next time".- Current offset: the position the consumer is currently reading from.
- Committed offset: the persisted position that will be used on restart.
- Log end offset (LEO): the offset of the next record to be written to the partition.
- Consumer lag: LEO − committed offset = how many records behind the consumer is.
- auto.offset.reset: what to do when there's no committed offset —
earliest,latest, ornone.
- vs RabbitMQ ack: RabbitMQ uses per-message acks — once acked, the message is deleted. Kafka uses cumulative offsets — one commit covers everything up to that offset.
- vs SQS visibility timeout: SQS messages become invisible while being processed; if not deleted in time, they reappear. Kafka has no per-message visibility — it's all offset-based.
- vs Kinesis checkpoints: KCL stores checkpoints in DynamoDB per shard — conceptually identical to Kafka committed offsets.
- vs Pulsar cursors: Pulsar has server-side cursors per subscription — similar to Kafka offsets but stored in BookKeeper.
__consumer_offsets. Committing too rarely means more duplicates on restart. Resetting offsets manually via kafka-consumer-groups.sh --reset-offsets can cause consumers to reprocess or skip data. Offset expiration: unused consumer groups have their offsets deleted after offsets.retention.minutes (default 7 days).Each consumer group maintains its partition-wise committed offset independently. This tells Kafka "I've processed up to this point."
In simple words: An offset is like a bookmark in a book. It tells you "I've read up to page 105." If you put the book down and come back later, you start from page 106. Each reader (consumer group) has their own bookmark — so one reader might be on page 105 while another is on page 60.
Offset Tracking Structure
| Consumer Group | Topic | Partition | Committed Offset |
|---|---|---|---|
| notification | order-events | 0 | 105 |
| notification | order-events | 1 | 98 |
| notification | order-events | 2 | 210 |
| analytics | order-events | 0 | 60 |
| analytics | order-events | 1 | 45 |
Where Are Offsets Stored?
Offsets are stored in the __consumer_offsets internal topic (50 partitions by default). The partition is determined by:
Offset Commit Strategies
- Auto-commit:
enable.auto.commit=true— consumer commits everyauto.commit.interval.ms(default 5s) in the background. Simplest, but can commit before processing completes → data loss risk. - Manual sync commit: call
commitSync()after processing a batch. Strong guarantees, blocks until broker confirms, lower throughput. - Manual async commit: call
commitAsync()— non-blocking, higher throughput, but no retry on failure (unless you write the retry yourself). - Hybrid (best of both):
commitAsync()during normal operation,commitSync()on shutdown and rebalance. - Transactional commit: commit offsets as part of a transaction alongside output writes (for exactly-once).
- vs RabbitMQ acks: RabbitMQ has per-message manual or auto-ack. No batched cumulative commits like Kafka.
- vs SQS DeleteMessage: SQS requires deleting each processed message individually (or in batches of 10).
- vs Kinesis checkpoints: KCL checkpoints are typically done manually after processing, similar to Kafka sync commits.
- vs Pulsar acks: Pulsar supports both individual and cumulative acks natively.
ConsumerRebalanceListener). For exactly-once: use Kafka transactions or idempotent downstream writes (database upserts, idempotent REST calls with request IDs).How and when the consumer tells Kafka "I've successfully processed up to this offset" — this determines your delivery guarantee.
Strategy 1: Auto-Commit (Risky)
Events 51-99 are LOST. At-most-once delivery.
Strategy 2: Manual Commit (Safe)
If crash before commit: events reprocessed (duplicates). At-least-once delivery.
Complete Producer Write Flow
producer.send() to being durably committed on disk across multiple brokers. Understanding this flow is critical for tuning throughput, latency, and durability. The path involves serialization, partitioning, batching, compression, network send, broker-side append, replication, and ack. Each stage has its own configuration knobs and failure modes.- 1. Serialize: convert key and value objects to bytes via configured serializers (StringSerializer, Avro, Protobuf, JSON, etc.).
- 2. Partition: the partitioner picks a target partition based on the key hash (or sticky partitioning for null keys).
- 3. Batch: the record accumulator buffers records per partition, waiting for
linger.msor untilbatch.sizeis hit. - 4. Compress: the entire batch is compressed (gzip, snappy, lz4, zstd).
- 5. Send: sender thread pulls batches, wraps them in a
ProduceRequest, sends to the partition leader broker. - 6. Broker append: leader writes to its local log file, updates LEO.
- 7. Replicate: followers fetch the new records and append to their logs.
- 8. Ack: depending on
acks, the leader responds after 0, 1, or ISR-wide ack. - 9. Client callback: the producer invokes the user's callback with success or error.
- vs RabbitMQ publish: RabbitMQ routes via exchanges and bindings — more logic in the broker, less in the client.
- vs SQS SendMessage: SQS is an HTTP API call per message (or 10 messages at most) — no client-side batching buffer like Kafka.
- vs Kinesis: similar client-side batching in the KPL but with stricter shard limits.
linger.ms = low latency but small batches = more RPCs = lower throughput. High linger.ms (20–100ms) = bigger batches = better compression + fewer RPCs = higher throughput but added latency. Production pipelines typically use linger.ms=10-50.send() blocks for max.block.ms — monitor buffer-exhausted-rate. Silent data loss with acks=0 or 1: leader failure can drop records. Retries with in-flight >1: can reorder records unless idempotence is enabled.linger.ms=10 and batch.size=262144 for their ~1M msg/sec producers. Uber uses aggressive batching with compression.type=zstd for their dispatch events.End-to-end flow from event creation to acknowledgement.
{ topic: "order-events", key: "order-123", value: orderJson, acks: all }
Connects to any broker (bootstrap server). Broker either has cached metadata or fetches from the active controller. Returns: which broker is leader for each partition of each topic.
partition = hash("order-123") % 3 = 1
From metadata: Partition 1 leader = Broker 2
No intermediary. Direct TCP connection to the partition leader.
Appends to active segment file. Assigns next offset (e.g., offset 101). Writes to OS page cache (async flush to disk).
Follower brokers continuously poll the leader. They fetch the new event, write to their own partition logs, and ACK back to the leader.
Once all in-sync replicas confirm the write, the leader sends a success response back to the producer.
Complete Consumer Read Flow
consumer.poll(). Kafka consumers pull records — they don't receive push notifications. On each poll() call, the consumer sends a FetchRequest to the partition leaders it's been assigned, receives batches of records (possibly still compressed), deserializes them, and hands them back to the app. Behind the scenes, the consumer coordinates with the group coordinator for heartbeats, partition assignments, and offset commits.- 1. Join group: consumer contacts the group coordinator, triggers rebalance, gets partition assignment.
- 2. Fetch position: for each assigned partition, determine starting offset (from committed offset or
auto.offset.reset). - 3. Send FetchRequest: one request per leader broker, bundling all assigned partitions on that broker.
- 4. Broker serves fetch: reads from log (often via zero-copy
sendfile()), returns up tofetch.max.bytesof data. - 5. Deserialize & decompress: consumer unpacks the batch and deserializes each record.
- 6. Return to app:
poll()returns aConsumerRecordsbatch. - 7. Process & commit: app processes records, then commits offsets (sync or async).
- 8. Heartbeat: background thread sends heartbeats to the coordinator.
- vs RabbitMQ consumer: RabbitMQ pushes messages to consumers via the channel. Kafka's pull model gives consumers control over pace.
- vs SQS ReceiveMessage: SQS is long-poll pull with visibility timeouts — no partitions, no group coordination.
- vs Kinesis: KCL polls shards at a fixed interval (
GetRecords), tracks checkpoints in DynamoDB.
max.poll.interval.ms, the consumer is kicked out → rebalance. Solution: reduce max.poll.records or use separate processing threads. Large fetch sizes: can cause OOM if memory isn't tuned. Fetch latency: tune fetch.min.bytes and fetch.max.wait.ms to balance throughput and latency.End-to-end flow from consumer startup to continuous event processing.
group.id = "notification-service"
hash("notification-service-group-id") % 50 = 23 → Partition 23 of __consumer_offsets
Asks any broker: "Who is the leader of __consumer_offsets Partition 23?" Answer: Broker 3.
Sends JoinGroup request. Broker 3 manages membership, waits for all group members, then assigns partitions. Response: "You handle Partition 2 of order-events."
The assignment is written to __consumer_offsets Partition 23 and replicated to all followers. Internally acts like acks=all (no config option for this).
Asks Group Coordinator: "What's my last offset for order-events Partition 2?" Coordinator looks up key "notificationGroupId_order-events_Partition-2" and returns offset 100.
Checks metadata: leader of order-events Partition 2 = Broker 2. Sends FetchRequest: "Give me events from offset 101, max 200 bytes." Broker 2 returns events 101-501.
Application logic runs on each event sequentially.
Sends OffsetCommit to Group Coordinator: "I've processed up to offset 501 for order-events Partition 2." Coordinator writes to __consumer_offsets.
Consumer repeats from step 7. This is the consumer's poll loop — continuously fetching, processing, and committing.
Log Compaction & Retention
delete (the default) and compact. With delete, Kafka removes entire segments that exceed retention.ms or retention.bytes. With compact, Kafka keeps the latest value per key forever, garbage-collecting older duplicates — essentially turning the topic into a materialized view keyed by the record key. You can even combine them (compact,delete) to compact old records but eventually delete tombstones after a grace period.- Time-based retention:
retention.ms=604800000(7 days) — Kafka deletes segments older than 7 days. - Size-based retention:
retention.bytes=1073741824— keeps up to 1 GB per partition, deletes oldest first. - Log compaction: for each unique key, retains only the latest value. Old values of the same key are removed by a background "cleaner" thread.
- Tombstones: a record with
nullvalue marks a key as deleted — afterdelete.retention.ms, the tombstone itself is removed.
- vs RabbitMQ TTL: RabbitMQ TTLs are per-message or per-queue and drop messages, not compact them by key.
- vs SQS retention: SQS max retention is 14 days, no compaction.
- vs Kinesis: Kinesis retention is 24 hours to 365 days (paid tiers), no compaction feature.
- vs Pulsar: Pulsar supports topic compaction very similarly to Kafka.
min.cleanable.dirty.ratio, min.compaction.lag.ms, and max.compaction.lag.ms control when compaction runs. Tombstones that never delete: if delete.retention.ms is too long, deleted keys hang around forever.Kafka can't keep data forever (disk fills up). Two strategies control how old data is cleaned up.
In simple words: Think of your phone gallery. Delete policy = delete all photos older than 30 days. Compact policy = for each person, keep only the most recent photo and delete older ones. Both free up space, but in different ways.
Policy 1: Delete (Default)
Deletes entire old segment files based on:
- Time-based: Segment created 8 days ago → DELETED. Segment created 6 days ago → KEPT.
- Size-based: If total partition log exceeds retention.bytes, oldest segments are deleted until under the limit.
Deletion is async — does not block writes. Runs periodically in background.
Policy 2: Compact
Keeps only the latest value for each key. Old duplicates are removed.
| Before | After Compaction | |
|---|---|---|
| off:100 user1 → v1 | → | removed |
| off:101 user2 → v1 | → | removed |
| off:102 user1 → v2 | → | KEPT |
| off:103 user3 → v1 | → | KEPT |
| off:104 user2 → v2 | → | KEPT |
Used for: state stores, CDC, config topics. Compaction is async, per-segment, does not block writes.
Why Kafka is Fast (Despite Using Disk)
- Sequential disk writes: append-only log files are faster than random writes — even on spinning disks, sequential writes hit ~600 MB/s, competitive with memory random access.
- Page cache: Kafka doesn't maintain its own buffer cache — reads go through the OS page cache, which is shared across processes and well-optimized.
- Zero-copy (sendfile): consumer fetches use the Linux
sendfile()syscall to transfer data directly from page cache to socket, bypassing user-space copies. - Batching: records are batched on both produce and fetch — amortizes per-request overhead.
- Binary protocol: compact length-prefixed binary format, no JSON or XML parsing.
- Compression: entire batches are compressed together (not per-record), yielding 3–10× reduction.
- vs RabbitMQ: RabbitMQ's Erlang-based broker does more per-message work (routing, acking), capping it at ~50k msg/sec per queue.
- vs ActiveMQ: ActiveMQ uses traditional JMS semantics with persistent stores that don't exploit sequential I/O as aggressively.
- vs Redis Streams: Redis is in-memory so faster for tiny workloads, but caps out at instance memory and has no native sequential disk persistence model.
- vs Pulsar: Pulsar achieves similar throughput via BookKeeper journal + ledger architecture, also using sequential I/O.
Kafka reads and writes to DISK. Disk I/O is slow, right? Then how is Kafka so fast? Two fundamental OS-level optimizations.
Page Cache (Write Optimization)
- Kafka trusts the OS to cache log files efficiently.
- Write goes to RAM (page cache), returns immediately.
- OS flushes to disk asynchronously in the background.
- Because Kafka is append-only, writes are always sequential.
- No insert in middle, no random access, no updates — disk head keeps moving forward.
- Sequential disk writes can be as fast as RAM in many cases.
Zero-Copy (Read Optimization)
- Kafka uses sendfile() system call.
- Data goes directly from disk/page cache to the network socket.
- No copy to Kafka JVM heap. No serialization/deserialization.
- Saves CPU cycles, memory, and time.
- This is why Kafka can serve consumers at network speed, not application speed.
Additional Performance Factors
Batching
Producers batch multiple events into a single network request. Configurable via batch.size and linger.ms. Reduces network overhead dramatically.
Compression
Entire batches can be compressed (gzip, snappy, lz4, zstd). Reduces network bandwidth and disk usage. Decompressed only at the consumer.
Partitioned Parallelism
Multiple partitions allow multiple consumers to read in parallel. More partitions = more parallelism = more throughput. Each partition is an independent log.
Edge Cases & Failure Scenarios
acks, min.insync.replicas), retries, and idempotence. These scenarios drive the design of production-grade producers and consumers.- Leader failure: leader crashes before replicating to followers — with
acks=1, data is lost. Withacks=all+min.isr=2, it's preserved. - Network partition: producer loses connection after sending but before ack — retry may duplicate records (unless idempotence is on).
- Broker disk full: writes fail; existing segments are protected by retention.
- Consumer rebalance during processing: offsets committed for records still in-flight can lead to duplicates.
- Zombie producer: a paused-then-resumed producer can try to send stale records — transactions prevent this via epoch fencing.
- Slow ISR: followers lag and get kicked out, shrinking durability.
- Unclean leader election: if ISR is empty and you allow it, a stale replica becomes leader → data loss.
- vs RabbitMQ: RabbitMQ has similar issues (node failure, split brain) but uses mirrored/quorum queues — different recovery semantics.
- vs SQS: SQS hides most failure modes; you only see "message might arrive more than once" and visibility timeout expirations.
- vs database transactions: databases use strict 2PC/Paxos for consistency — Kafka uses eventual ISR convergence and idempotence for similar but weaker guarantees.
acks=all + min.insync.replicas=2 + replication.factor=3 for durable topics. Enable enable.idempotence=true on producers. Use transactions for cross-partition atomicity. Monitor under-replicated partitions. Use rebalance listeners to commit before losing partitions. Use dead-letter queues for poison pills.What actually happens when things go wrong in a Kafka cluster. These are common interview questions.
1. Active Controller Fails
2. Leader Partition Broker Fails
3. Follower Partition Broker Fails
4. What if a "Topic Fails"?
5. Consumer Fails Before Committing Offset
6. Consumer Processing Takes Too Long
Solutions
- Increase max.poll.interval.ms
- Reduce max.poll.records (process smaller batches)
- Optimize processing logic (faster processing)
- Use a separate thread pool for processing (decouple poll from process)
7. Entire Broker Fails
8. What if ISR List Becomes Empty?
EDA Challenges & When to Use Kafka
- Eventual consistency: reads after writes may return stale data across services.
- Debugging: tracing a user action across 10 async services requires correlation IDs and distributed tracing.
- Schema evolution: breaking changes can silently corrupt consumers; tooling (Schema Registry) is mandatory.
- Ordering: guaranteed only within a partition — cross-partition ordering requires careful key design.
- Idempotency: consumers must handle duplicate delivery for at-least-once semantics.
- Operational cost: running Kafka well needs dedicated SRE/infra expertise.
- Simple task queue: use Celery, Sidekiq, BullMQ, or SQS.
- Request/response RPC: use gRPC or REST — Kafka is async.
- Low-throughput apps: the ops overhead of running Kafka outweighs the benefits.
- Real-time chat/pub-sub with short retention: NATS, Redis Pub/Sub, or WebSockets are lighter.
- Large file transfer: Kafka's max message size is typically 1 MB — use object storage for big payloads.
Challenges in Event-Driven Architecture
Eventually Consistent
Data is not instantly synchronized across services. After an event is published, there's a delay before all consumers process it. Not suitable when strict real-time consistency is required.
Duplicate Events
At-least-once delivery means events can be processed more than once. Consumers must be idempotent — processing the same event twice should produce the same result.
Ordering Problem
Across partitions, ordering is not guaranteed. The 2nd event can be processed before the 1st if they're in different partitions. Use key-based partitioning for related events.
Schema Evolution
If the producer changes the event schema (adds/removes fields), all consumers may break. Use a Schema Registry (Avro, Protobuf) with backward/forward compatibility rules.
Debugging Complexity
No single request-response trace. Events flow through multiple services asynchronously. Requires distributed tracing (correlation IDs, OpenTelemetry) for end-to-end visibility.
Poison Messages
One bad message can block an entire partition's consumer. If processing throws an unhandled exception, the consumer retries infinitely. Solution: Dead Letter Queue (DLQ) for failed events.
Operational Overhead
Must monitor: consumer lag, throughput, partition balance, broker health, disk usage. More infrastructure to manage compared to simple REST calls.
Exactly-Once is Hard
Achieving exactly-once processing requires idempotent producers + transactions + consumer-side deduplication. Kafka supports it but adds complexity and latency overhead.
When to Use Kafka
Long-running workflows
Processes that take minutes/hours. Decouple the trigger from the execution.
Eventual consistency is acceptable
When millisecond consistency isn't required between services.
Real-time analytics & monitoring
Stream processing, dashboards, alerting, metrics aggregation.
Audit logs & event sourcing
Immutable log of everything that happened. Replay for debugging or rebuilding state.
Decoupled microservices
Services communicate without knowing about each other. Independent deployment and scaling.
High-throughput data pipelines
ETL, CDC (Change Data Capture), log aggregation, clickstream processing.
Idempotent Producer
enable.idempotence=true, which guarantees that even if the producer retries a failed send, the record will only appear in the log exactly once. This solves the classic "ack lost, retry duplicated" problem. Under the hood, the broker assigns each producer a Producer ID (PID) and the producer tags each record with a monotonic sequence number per partition. The broker tracks the highest sequence number seen for each (PID, partition) and rejects records with old or out-of-order sequences, preventing duplicates and reordering.- Producer ID (PID): assigned by the broker on first connection. Uniquely identifies a producer session.
- Sequence numbers: monotonic per (PID, partition). Broker checks
expected = last_seq + 1. - Broker deduplication: duplicates (sequence already seen) are silently discarded; out-of-order reject with
OutOfOrderSequenceException. - Required settings:
acks=all,retries>0,max.in.flight.requests.per.connection<=5.
- vs RabbitMQ publisher confirms: RabbitMQ confirms don't deduplicate — a retry after a lost ack still creates a duplicate. Application-side idempotency is your responsibility.
- vs SQS deduplication ID: FIFO SQS queues support deduplication via
MessageDeduplicationId, but with a 5-minute window. - vs Kinesis: Kinesis doesn't deduplicate — clients must handle it.
- vs Pulsar: Pulsar supports broker-side deduplication via
brokerDeduplicationEnabledwith application-supplied sequence IDs.
transactional.id. Not end-to-end: idempotence protects producer → broker only. Consumer side still needs offset management + idempotent processing.What happens when a producer sends an event but the network fails before it gets the ACK? The producer retries — and now Kafka has the same event twice. Idempotent producer solves this.
The Problem — In Simple Words
Imagine you're ordering food online. You click "Place Order" but the page freezes. You're not sure if the order went through, so you click again. Now the restaurant has 2 orders for the same meal. That's exactly what happens with Kafka producers without idempotency.
The Solution — Idempotent Producer
Turn on one config and Kafka handles it for you. Each producer gets a unique ID, and every event gets a sequence number. If the broker sees the same sequence number twice, it ignores the duplicate.
What Configs Change Automatically?
| Config | Value When Idempotence = true | Why |
|---|---|---|
| acks | all | Must confirm write on all ISR replicas |
| retries | Integer.MAX_VALUE | Keep retrying until success |
| max.in.flight.requests.per.connection | 5 (max) | Allows up to 5 unacknowledged requests but maintains order using sequence numbers |
Transactions & Exactly-Once Semantics (EOS)
- transactional.id: a stable, unique ID for the producer. Allows fencing zombies across restarts.
- Producer epoch: each new incarnation of a
transactional.idgets a higher epoch; old epochs are fenced. - beginTransaction / sendOffsetsToTransaction / commitTransaction: the API. Offsets and writes commit atomically.
- Isolation level: consumers set
isolation.level=read_committedto see only committed transactional writes, skipping aborted ones. - Transaction coordinator: a broker-side component that tracks transaction state in
__transaction_statetopic.
- vs database transactions: Kafka transactions don't give you SERIALIZABLE isolation — they just ensure atomicity and "all committed writes visible together". No row locks, no rollback of arbitrary side effects.
- vs 2PC/XA: Kafka transactions are not distributed transactions across Kafka and other systems. For cross-system atomicity, use patterns like Outbox.
- vs RabbitMQ transactions: RabbitMQ has publisher transactions but they're slow and rarely used. Publisher confirms are preferred.
- vs Pulsar transactions: Pulsar has similar transaction support since Pulsar 2.7.
processing.guarantee=exactly_once_v2. For external side effects (DB writes, HTTP calls), use the Outbox pattern: write to DB and to an outbox table in one DB transaction, then a CDC process streams the outbox to Kafka.Idempotent producer prevents duplicates within one partition. But what if you need to write to multiple partitions or topics as ONE atomic operation? That's where Transactions come in.
In Simple Words
Think of a bank transfer: debit from Account A and credit to Account B. Both must happen together — either BOTH succeed or BOTH fail. You can't debit A and then crash before crediting B. Kafka Transactions work the same way — all writes in a transaction either ALL go through or NONE of them do.
When Do You Need Transactions?
Read-Process-Write Pattern
Read from Topic A, process the data, write to Topic B, and commit the offset for Topic A — all as ONE atomic operation. If any step fails, everything rolls back.
Multi-Partition Writes
Writing events to 3 different partitions at once. Either all 3 partitions get the event, or none of them do. No partial writes.
Multi-Topic Writes
Writing an order event to "orders" topic AND a payment event to "payments" topic. Both must succeed together.
How It Works (Step by Step)
Consumer Side — Isolation Levels
read_uncommitted (Default)
- Consumer sees ALL events, even from incomplete transactions.
- Faster but may read data that later gets rolled back.
- Use when you don't care about transactional guarantees.
read_committed
- Consumer only sees events from committed transactions.
- Slightly slower (waits for commit markers).
- Use when you need exactly-once guarantees.
Consumer Rebalancing
- RangeAssignor (default pre-2.4): assigns contiguous ranges of partitions per topic — can be uneven.
- RoundRobinAssignor: distributes partitions round-robin across consumers.
- StickyAssignor: tries to keep existing assignments stable during rebalance, minimizing movement.
- CooperativeStickyAssignor (2.4+): incremental rebalancing — only moves the partitions that need to change, others keep processing.
- vs RabbitMQ: RabbitMQ competing consumers don't rebalance — messages are pushed to whichever consumer has capacity.
- vs Kinesis KCL: KCL uses DynamoDB leases per shard — similar concept but different implementation.
- vs Pulsar: Pulsar's shared subscription mode distributes messages dynamically without rebalancing.
group.instance.id) are the two biggest tools to minimize rebalance impact.max.poll.interval.ms, smaller batches, or separating processing from polling. Lost progress: if you don't commit in onPartitionsRevoked, you may reprocess records. Static group membership (Kafka 2.3+) helps avoid unnecessary rebalances during rolling restarts.When consumers join or leave a group, Kafka redistributes partitions among the remaining consumers. This process is called rebalancing.
In Simple Words
Imagine 3 workers splitting 6 tasks equally (2 each). If one worker goes home sick, the remaining 2 workers must split all 6 tasks (3 each). If a new worker joins, tasks are redistributed again. That's rebalancing.
What Triggers a Rebalance?
New Consumer Joins
A new consumer starts with the same group.id. Partitions are redistributed to include the new member.
Consumer Leaves or Crashes
A consumer calls close() or stops sending heartbeats. Its partitions must be given to someone else.
Topic Changes
New partitions are added to a subscribed topic, or the consumer subscribes to a new topic.
Rebalancing Process (Step by Step)
Group Coordinator detects a change (new member, missed heartbeat, etc.).
Every consumer in the group stops reading. They commit their current offsets and release their partitions.
All consumers send a JoinGroup request to the Group Coordinator. The first consumer to join becomes the Group Leader (not the same as partition leader).
The Group Leader runs the partition assignment strategy (Range, RoundRobin, or Sticky) and sends the new assignment back to the Coordinator.
Coordinator distributes the new partition assignment to all consumers. Each consumer starts reading from their newly assigned partitions.
Assignment Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Range | Divides partitions of each topic into contiguous ranges per consumer. Consumer 1 gets P0-P1, Consumer 2 gets P2-P3. | When you need co-partitioning (same consumer reads same partition number from different topics). |
| RoundRobin | Distributes all partitions one-by-one across consumers. P0 → C1, P1 → C2, P2 → C3, P3 → C1... | When you want even distribution across consumers. |
| Sticky | Like RoundRobin but tries to keep previous assignments. Only moves partitions that must move. | Reducing rebalance disruption. Recommended for most cases. |
| CooperativeSticky | Incremental rebalancing — only revokes partitions that need to move, others keep reading. | Minimizing downtime during rebalancing. Best choice. |
Dead Letter Queue (DLQ)
- Retry topic(s): optional intermediate topics with delay for retryable errors.
- DLQ topic: where terminally failed records end up.
- Headers: enrich the DLQ record with exception, stack trace, attempt count, source metadata.
- DLQ consumer / inspector: tools to browse, replay, or discard DLQ messages.
- vs RabbitMQ DLX: RabbitMQ has native Dead Letter Exchanges — messages are automatically routed to a DLX on nack/reject/TTL. Kafka doesn't have this — you implement it in consumer code.
- vs SQS DLQ: SQS DLQs are configured at the queue level with
maxReceiveCount— messages automatically move after N failed delivery attempts. Cleaner than Kafka's application-level approach. - vs Pulsar DLQ: Pulsar has a
DeadLetterPolicyon subscriptions, similar to SQS.
errors.tolerance=all and errors.deadletterqueue.topic.name. Spring Cloud Stream provides DLQ abstractions. Uber's uReplicator and many internal pipelines use the retry → DLQ pattern.What happens when a consumer can't process an event? A bad event (poison message) can block the entire partition. DLQ is the solution.
In Simple Words
Imagine a post office. Most letters are delivered fine. But some letters have wrong addresses. Instead of throwing them away or blocking all other mail, the post office puts them in a separate "undeliverable" pile. Someone later checks that pile and fixes the addresses. A Dead Letter Queue is that "undeliverable" pile for Kafka events.
The Problem
The Solution — DLQ Pattern
How to Implement DLQ
Schema Registry
- Schema storage: versioned per subject (typically
<topic>-valueand<topic>-key). - Compatibility rules:
BACKWARD,FORWARD,FULL,NONE— enforced at registration time. - REST API: for schema registration, lookup, and management.
- Wire format: 5 bytes (magic + 4-byte schema ID) prefixed to serialized payload.
- Supported formats: Avro (most common), Protobuf, JSON Schema.
- vs raw JSON: JSON has no schema enforcement — consumers can break silently. Schema Registry catches breaking changes at produce time.
- vs ProtoBuf alone: Protobuf has
.protofiles but no central registry — teams duplicate definitions. - vs AWS Glue Schema Registry: Glue is AWS-native, integrates with MSK, Kinesis, Lambda. Confluent's is Kafka-native, more mature.
- vs GraphQL schemas: GraphQL is for request/response; Schema Registry is for streams.
TopicNameStrategy (default), RecordNameStrategy, TopicRecordNameStrategy — pick one and stick with it. Breaking changes by accident: a "nullable" added too late can break backward compatibility.Kafka stores events as raw bytes. It doesn't know or care what's inside the event. But producers and consumers need to agree on the format. Schema Registry solves this.
In Simple Words
Imagine two people writing letters to each other. If person A starts writing in English and person B expects Hindi, they can't understand each other. Schema Registry is like a shared dictionary that both people agree to use. If person A wants to add a new word, the dictionary checks if person B can still understand the letter.
The Problem Without Schema Registry
How Schema Registry Works
When a producer starts, it registers the event schema (Avro, Protobuf, or JSON Schema) with the Schema Registry. The registry assigns a Schema ID (e.g., ID=1).
Instead of sending raw JSON, the producer sends: [Magic Byte][Schema ID=1][Serialized Data]. The data is compact (Avro binary, not JSON text).
Consumer reads the Schema ID from the event, fetches the schema from the Registry, and deserializes the data correctly. Schema is cached locally after the first fetch.
When the producer wants to change the schema, the Registry checks if the new version is compatible with older versions before allowing it.
Compatibility Modes
| Mode | What You Can Do | Simple Explanation |
|---|---|---|
| BACKWARD | Delete fields, add fields with defaults | New consumers can read old events. "I can understand old letters." |
| FORWARD | Add fields, delete fields with defaults | Old consumers can read new events. "Old readers can understand new letters." |
| FULL | Add/delete fields only with defaults | Both old and new consumers can read both old and new events. Safest. |
| NONE | Anything goes | No checks. Dangerous. Can break consumers at any time. |
Kafka Connect
- Distributed runtime: connectors run as tasks across worker nodes with automatic rebalancing.
- REST API: manage connectors via HTTP calls.
- Offset tracking: stored in Kafka topics, giving exactly-once-ish guarantees.
- Transformations (SMTs): simple per-record transforms (rename, drop, mask) inline.
- Converters: Avro, Protobuf, JSON Schema, or raw bytes.
- Dead letter queues: built-in for handling bad records.
- vs writing custom producers/consumers: Connect eliminates boilerplate for common integrations — just install and configure.
- vs Apache NiFi: NiFi is a drag-and-drop data flow tool with a GUI. Kafka Connect is simpler, Kafka-specific, and API-driven.
- vs Airflow: Airflow is for orchestrating batch jobs on a schedule. Kafka Connect is for continuous streaming ingestion/egress.
- vs Fivetran/Stitch: Fivetran is a managed SaaS for ELT to data warehouses. Kafka Connect is self-managed and multi-destination.
- vs AWS DMS: DMS does CDC to AWS-native targets. Debezium (on Kafka Connect) is the OSS equivalent.
A framework for moving data between Kafka and other systems without writing any code. Just configure and run.
In Simple Words
You want to move data from your MySQL database into Kafka. Or from Kafka into Elasticsearch. You could write a custom producer/consumer for each system. But that's a lot of repeated work. Kafka Connect gives you ready-made "connectors" — like USB adapters that plug different systems into Kafka.
Two Types of Connectors
Source Connector (Data INTO Kafka)
- Reads data from external systems.
- Converts it into Kafka events.
- Publishes to Kafka topics.
- Example: Every new row in MySQL automatically becomes a Kafka event.
Sink Connector (Data OUT OF Kafka)
- Reads events from Kafka topics.
- Writes data to external systems.
- Example: Every Kafka event automatically gets indexed in Elasticsearch.
Popular Connectors
| Connector | Type | What It Does |
|---|---|---|
| JDBC Source | Source | Reads rows from any SQL database (MySQL, PostgreSQL, Oracle) into Kafka |
| Debezium | Source | CDC (Change Data Capture) — captures every INSERT/UPDATE/DELETE from databases in real-time |
| S3 Sink | Sink | Writes Kafka events to Amazon S3 as files (Parquet, JSON, Avro) |
| Elasticsearch Sink | Sink | Indexes Kafka events into Elasticsearch for search |
| HDFS Sink | Sink | Writes Kafka events to Hadoop HDFS for batch analytics |
| MongoDB Source/Sink | Both | Reads from and writes to MongoDB collections |
Kafka Streams
- KStream: an unbounded stream of records (events).
- KTable: a changelog stream interpreted as an evolving table (latest value per key).
- GlobalKTable: a fully replicated lookup table across all instances.
- State stores: RocksDB-backed local state, backed up by compacted Kafka topics for fault tolerance.
- Windowing: tumbling, hopping, sliding, session windows for time-based aggregation.
- Joins: stream-stream, stream-table, table-table joins, all natively supported.
- vs Apache Flink: Flink is a full distributed cluster with richer stream processing semantics, better for very complex pipelines. Kafka Streams is lighter and tightly integrated with Kafka.
- vs Spark Structured Streaming: Spark is micro-batch based (latency in hundreds of ms to seconds); Kafka Streams is record-at-a-time with single-digit ms latency.
- vs ksqlDB: ksqlDB is a SQL interface built on top of Kafka Streams.
- vs Apache Storm/Samza: older stream processors; Streams has largely supplanted them for Kafka-native use cases.
groupByKey/join on a non-key column creates intermediate topics.A Java library for processing data in real-time directly from Kafka topics. No separate cluster needed — it runs inside your application.
In Simple Words
Normally, you read from Kafka, process data in your app, and write results back to Kafka. Kafka Streams makes this easier by giving you a powerful API to filter, transform, join, and aggregate events — all in real-time, as events flow through. Think of it like Excel formulas but for streaming data.
Why Not Just Use a Consumer?
Plain Consumer
- You write all processing logic yourself.
- You handle state management yourself.
- You handle fault tolerance yourself.
- You handle time windowing yourself.
- Joining two topics? Build it from scratch.
Kafka Streams
- Built-in operators: filter, map, join, aggregate.
- Automatic state management (RocksDB).
- Automatic fault tolerance and recovery.
- Built-in windowing (tumbling, hopping, sliding).
- Joining topics is one line of code.
Common Operations
Filter
Keep only events that match a condition. Example: only keep orders above $100.
Map / Transform
Change the event shape. Example: extract just the customer ID and amount.
Group & Aggregate
Group events by key and compute running totals. Example: total spending per customer.
Windowed Aggregation
Aggregate events within a time window. Example: count orders per 5-minute window.
Kafka Streams vs Other Stream Processors
| Feature | Kafka Streams | Apache Flink | Apache Spark Streaming |
|---|---|---|---|
| Deployment | Library (runs in your app) | Separate cluster | Separate cluster |
| Complexity | Simple — just add a JAR | Medium — needs cluster setup | Medium — needs cluster setup |
| Scaling | Add more app instances | Add more TaskManagers | Add more executors |
| Best For | Kafka-to-Kafka processing | Complex event processing, multiple sources | Batch + streaming hybrid |
| Language | Java / Kotlin only | Java, Python, SQL | Java, Python, Scala, SQL |
Kafka Security
- TLS/SSL: encrypts traffic between clients ↔ brokers, broker ↔ broker, and (in ZK mode) broker ↔ ZK.
- SASL mechanisms:
PLAIN,SCRAM-SHA-256/512,GSSAPI(Kerberos),OAUTHBEARER— authenticate clients. - mTLS: client certificate authentication — often combined with TLS.
- ACLs: authorize operations (Read/Write/Create/Delete/Describe) on resources (Topic/Group/Cluster) to principals.
- RBAC (Confluent, MSK): role-based permissions on top of ACLs.
- Encryption at rest: typically delegated to the underlying disk (EBS, LUKS) or provided by cloud vendors.
- vs RabbitMQ: RabbitMQ has built-in user/virtual-host-based permissions and supports TLS. Kafka's ACL model is more granular per-resource.
- vs SQS/Kinesis: AWS uses IAM for auth — simpler to manage but tied to AWS.
- vs Pulsar: Pulsar has namespace-level auth and supports similar SASL/TLS/token mechanisms.
kafka.authorizer.logger.allow.everyone.if.no.acl.found=false is critical. Overly broad ACLs: one app with Cluster:All is a silent security hole. Certificate rotation: forgotten rotation schedules cause outages when certs expire.By default, Kafka has no security at all. Anyone who can reach the broker can read/write any topic. In production, you need authentication, authorization, and encryption.
In Simple Words
Think of Kafka as a building. Without security: no doors, no locks, no ID checks. Anyone can walk in, read any document, and write on any whiteboard. Security adds: a front door with a lock (encryption), ID cards for entry (authentication), and rules about who can access which room (authorization).
Three Layers of Security
Encryption (SSL/TLS)
What: Encrypts data in transit between clients and brokers, and between brokers.
Simple: Like sending a letter in a sealed envelope instead of a postcard. Nobody can read the data while it's moving over the network.
Port 9093 (default for SSL)
Authentication (SASL)
What: Verifies WHO is connecting. Clients must prove their identity.
Options:
- SASL/PLAIN — Username + password (simple, use with SSL).
- SASL/SCRAM — Salted password (more secure than PLAIN).
- SASL/GSSAPI — Kerberos (enterprise, Active Directory).
- SASL/OAUTHBEARER — OAuth 2.0 tokens.
Authorization (ACLs)
What: Controls WHO can do WHAT on WHICH resource.
Simple: Even after proving your identity, you can only access what you're allowed to. The payment service can read "payments" topic but not "user-profiles" topic.
Kafka vs Other Message Systems
- vs RabbitMQ: RabbitMQ = flexible routing (exchanges, headers, topics), complex workflows, per-message ack, low-to-mid throughput. Kafka = raw throughput, replay, simple routing.
- vs ActiveMQ/JMS: ActiveMQ is a classic JMS broker — great for legacy Java apps. Kafka's log model and throughput eclipse it.
- vs AWS SQS: SQS = fully managed, zero ops, simple queue semantics, at-least-once, 14-day max retention, low-to-mid throughput. Kafka = you run it, massive throughput, replay.
- vs AWS Kinesis: Kinesis = Kafka-like but managed and bounded per shard. Better for AWS-first shops that don't want to run Kafka.
- vs Google Pub/Sub: Fully managed, push or pull, global, integrates with Dataflow. Kafka has better replay and ordering guarantees.
- vs Apache Pulsar: Pulsar has compute/storage separation, tiered storage, and geo-replication built in. Kafka has a larger ecosystem and more mature tooling.
- vs NATS: NATS is lightweight, simple, perfect for edge/IoT. NATS JetStream adds log-style persistence. Kafka is heavyweight but far more scalable.
- vs Redis Streams: Redis Streams is in-memory with optional persistence — perfect for small workloads in existing Redis stacks.
- vs Redpanda: Redpanda is wire-compatible with Kafka, written in C++ with thread-per-core design, no ZK/KRaft quorum needed, lower tail latency. Trades ecosystem maturity.
Kafka is not the only messaging system. Here's how it compares to other popular choices and when to pick which one.
Comparison Table
| Feature | Apache Kafka | RabbitMQ | AWS SQS | AWS SNS |
|---|---|---|---|---|
| Model | Pull-based log | Push-based queue | Pull-based queue | Push-based pub/sub |
| Message Retention | Configurable (days/forever) | Until consumed | Up to 14 days | No retention |
| Replay | Yes (any offset) | No | No | No |
| Ordering | Per partition | Per queue (with limits) | FIFO queues only | No ordering |
| Throughput | Millions/sec | Tens of thousands/sec | Nearly unlimited (managed) | Nearly unlimited (managed) |
| Consumer Groups | Built-in | Manual setup | One consumer per message | Fan-out to subscribers |
| Operational Cost | High (self-managed) / Medium (managed) | Medium | Low (fully managed) | Low (fully managed) |
| Best For | Event streaming, high throughput, replay needed | Task queues, routing logic, request-reply | Simple job queues, decoupling | Fan-out notifications |
When to Pick What?
Pick Kafka When
- You need to replay old events (debugging, new consumers).
- You need very high throughput (millions of events/sec).
- Multiple services need to read the same data independently.
- You need event sourcing or an immutable audit log.
- You need stream processing (real-time transformations).
Pick RabbitMQ When
- You need complex routing (topic exchanges, header-based routing).
- You need request-reply pattern (RPC over messages).
- You need message priority (urgent messages first).
- Your throughput is moderate (thousands/sec, not millions).
- You want simpler setup for basic task queues.
Pick SQS/SNS When
- You're on AWS and want fully managed (zero ops).
- You need a simple job queue (no streaming features needed).
- You want fan-out notifications (SNS → email, SMS, Lambda).
- You don't need replay, ordering, or consumer groups.
- You want pay-per-use pricing with no cluster management.
Real-World Use Cases
- Activity tracking: clickstreams, page views, user interactions. LinkedIn's original use case.
- Log aggregation: collect logs from thousands of servers into a central pipeline. ELK, Splunk, Datadog all ingest via Kafka.
- Metrics pipeline: time-series metrics streamed into Prometheus, Graphite, Druid.
- CDC (Change Data Capture): stream DB changes via Debezium for cache invalidation, search indexing, data warehousing.
- Event sourcing: store application state as an immutable log of events, rebuild state by replay.
- Stream processing: real-time fraud detection, recommendations, ML feature pipelines.
- Microservices backbone: decoupled async communication between services.
- Messaging: replace RabbitMQ/ActiveMQ for high-throughput async messaging.
- IoT ingestion: high-volume sensor data from millions of devices.
- LinkedIn: activity streams, metrics, logs, CDC from Espresso/Oracle/MySQL. 7+ trillion msg/day.
- Netflix: Keystone pipeline ingests 1+ trillion events/day for analytics, personalization, billing.
- Uber: dispatch events, surge pricing, trip state, fraud detection. 4+ trillion msg/day.
- Airbnb: search indexing, pricing, review fanout, payment events.
- Goldman Sachs: trade events, regulatory reporting pipelines.
- Shopify: order events, inventory updates, cross-service eventing.
- Spotify: play events, recommendation signals, data warehouse loading.
- Pinterest: pin events, notifications, ML features.
- Robinhood: market data and trade events.
- Walmart: inventory updates, supply chain events.
How big companies actually use Apache Kafka in production. These examples help you understand where Kafka fits in real systems.
E-Commerce Order Flow
Problem: When a customer places an order, many things must happen: payment processing, inventory update, shipping notification, email confirmation, analytics tracking. These should not be tightly coupled.
Each service processes the same order event independently. If the email service is slow, it doesn't affect payments.
Real-Time Analytics / Dashboards
Problem: You want a live dashboard showing website activity: page views, clicks, sign-ups — all updated in real-time.
Website events (clicks, views) → Kafka topic → Kafka Streams aggregates (counts per minute) → Results written to another Kafka topic → Dashboard reads from that topic.
Kafka Streams Real-time
Change Data Capture (CDC)
Problem: Your main database is MySQL. You want every change (INSERT/UPDATE/DELETE) to be available for search in Elasticsearch and analytics in Snowflake.
MySQL → Debezium (Kafka Connect Source) → Kafka topics → Elasticsearch Sink Connector → Elasticsearch. No custom code needed.
Kafka Connect Debezium
Log Aggregation
Problem: You have 500 servers generating logs. You need all logs in one place for monitoring and debugging.
Each server sends logs to a "logs" Kafka topic. A central service reads from the topic and stores them in Elasticsearch/Loki/S3. Kafka acts as a buffer so log spikes don't crash your storage system.
High Throughput Buffering
Fraud Detection
Problem: You need to detect suspicious transactions within seconds, not hours.
Transaction events → Kafka topic → Kafka Streams / Flink compares each transaction against rules (unusual amount, unknown location, rapid frequency) → Suspicious events written to "fraud-alerts" topic → Alert service blocks the transaction.
Low Latency Stream Processing
Microservices Communication
Problem: 20+ microservices need to share data without creating a tangled web of REST calls.
Each service publishes events about its domain (UserCreated, OrderPlaced, PaymentCompleted). Other services subscribe only to the events they care about. Services are fully decoupled — you can add, remove, or update services without affecting others.
Loose Coupling Scalable
Companies Using Kafka
| Company | How They Use Kafka | Scale |
|---|---|---|
| Created Kafka. Uses it for activity tracking, metrics, data pipelines. | 7+ trillion messages/day | |
| Netflix | Real-time monitoring, event sourcing, data pipeline between microservices. | 700+ billion messages/day |
| Uber | Real-time pricing, trip matching, driver tracking, analytics. | Trillions of messages/day |
| Spotify | Event delivery, logging, real-time recommendations. | Hundreds of billions/day |
| Walmart | Inventory management, order processing, real-time supply chain. | Billions of messages/day |
Setup: Running Kafka (Docker & Cloud)
- Docker / Docker Compose: spin up a 1-broker or 3-broker cluster in seconds with
confluentinc/cp-kafkaorbitnami/kafkaimages. - Bare metal / VM: unzip Kafka binary, edit
server.properties, runbin/kafka-server-start.sh. - Kubernetes (Strimzi): open-source operator that manages Kafka clusters via CRDs — most popular OSS production option.
- Confluent Cloud: fully managed Kafka from the Kafka creators. Serverless tiers available.
- AWS MSK: managed Kafka on AWS. MSK Serverless abstracts clusters entirely.
- Aiven / Upstash / Instaclustr: alternative managed Kafka providers with multi-cloud support.
- Azure Event Hubs (Kafka endpoint): Azure's Kafka-compatible managed service.
- Self-hosted: full control, lowest per-GB cost at scale, but you own upgrades, patching, scaling, monitoring.
- Managed: zero-ops, pay premium, constrained in tuning options and features.
- vs managed alternatives like SQS/Kinesis: Kinesis/SQS have lower per-unit cost for small workloads; Kafka becomes cheaper at high throughput.
docker compose up with a KRaft single-broker setup for learning. For multi-broker testing, Bitnami's image + Compose file gives you a 3-broker KRaft cluster out of the box. Redpanda is also a great local option — a single binary, Kafka-compatible, boots in seconds. Tools: kcat (formerly kafkacat), kafka-ui, Kowl/Redpanda Console.advertised.listeners must match what clients can actually reach. Port conflicts: 9092 (Kafka), 2181 (ZK), 9093 (TLS). Volume persistence: forgetting to mount data volumes loses your data on container restart. Host resolution: localhost inside the container is not the same as outside.Before writing code, you need a running Kafka cluster. You have two choices: run it locally with Docker, or use a managed cloud service.
Option 1: Cloud Kafka (Zero Setup)
If you don't want to install anything, use a managed Kafka service. They give you a Kafka cluster in the cloud — you just get a connection URL and start coding.
Confluent Cloud
Made by the creators of Kafka. Most popular managed Kafka. Free tier available (no credit card needed).
Free Tier Best Kafka Support
Link: confluent.cloud
AWS MSK (Managed Streaming for Kafka)
Amazon's managed Kafka. Good if you're already on AWS. Supports MSK Serverless (pay per use).
AWS Ecosystem Serverless Option
Link: aws.amazon.com/msk
Upstash Kafka
Serverless Kafka with a REST API. Generous free tier (10,000 messages/day). Perfect for learning and small projects.
Free Tier REST API
Link: upstash.com/kafka
Redpanda Cloud
Kafka-compatible but faster. Drop-in replacement — same Kafka client code works. Free tier available.
Free Tier Fastest
Link: redpanda.com/cloud
Option 2: Docker (Local Setup — Recommended for Learning)
This runs a full Kafka cluster on your machine. You need Docker Desktop installed.
Step 1: Create docker-compose.yml
Create a file called docker-compose.yml in your project root:
Step 2: Start Kafka
Step 3: Verify Kafka is Working
Project Setup (Node.js + TypeScript)
kafkajs: pure JS Kafka client with native TypeScript support.typescript,ts-node,@types/node: TypeScript toolchain.dotenv: for environment-based configuration.avscor@kafkajs/confluent-schema-registry: for Avro + Schema Registry.pinoorwinston: structured logging.
- vs Java: Java has the official Apache Kafka client — most feature-complete, used as the reference. Kafka Streams and Schema Registry have first-class Java support.
- vs Go:
confluent-kafka-go(librdkafka-based) is the gold standard for Go. Sarama is a pure-Go alternative. - vs Python:
confluent-kafka-pythonandaiokafka(async). - vs Rust:
rdkafkacrate provides bindings to librdkafka. - KafkaJS limitation: lacks some advanced features (transactions are available but less mature than Java's).
src/config/kafka.ts (client factory), src/producers/*.ts, src/consumers/*.ts, src/schemas/*.ts (event types), src/index.ts (bootstrap). Use tsconfig.json with strict: true, target: es2020, moduleResolution: node.SIGTERM handlers). Missing await on send: async producer sends need await or promise chains. Running multiple consumers in one process: each consumer should have its own instance. Connection errors: KafkaJS retries by default, but log retry attempts for observability.Setting up a TypeScript project with the KafkaJS library — the most popular Kafka client for Node.js.
Step 1: Initialize the Project
Update tsconfig.json
Step 2: Create the Kafka Client (Shared Config)
Create src/client.ts — This file creates a reusable Kafka connection that all other files will import.
Project Structure (What We'll Build)
Add Run Scripts to package.json
Admin: Create & Manage Topics
AdminClient, and command-line tools like kafka-topics.sh, kafka-configs.sh, kafka-consumer-groups.sh wrap the same RPCs. In production, topic creation should be explicit via Admin API or Infrastructure-as-Code (Terraform, Strimzi CRDs) — never rely on auto-create.- Create topic: specify name, partitions, replication factor, config overrides (retention, cleanup policy).
- Delete topic: asynchronously marks the topic for deletion.
- Alter topic configs: change retention, compression, min.insync.replicas on the fly.
- List/describe topics: metadata discovery.
- List/describe consumer groups: see members, assignments, offsets, lag.
- Reset offsets: rewind a consumer group to earliest, latest, or a specific offset.
- vs RabbitMQ: RabbitMQ has a management HTTP API and CLI (
rabbitmqctl) — similar scope. - vs SQS/Kinesis: AWS management is via AWS SDK / CLI / CloudFormation / Terraform.
- vs database DDL: Kafka's Admin API is like a DDL for streams: create/alter/drop topics, grant ACLs, etc.
auto.create.topics.enable in production. Use Terraform (with the Kafka provider), Strimzi KafkaTopic CRDs, or Confluent Cloud's topic management to declaratively manage topics. Separate dev/staging/prod cluster access. Never delete topics from scripts — it's usually irreversible and data is lost.delete.topic.enable=true on all brokers, and only removes data asynchronously. Alter partition count: can only increase, and breaks key-based hashing. Config override precedence: topic-level configs override broker defaults. ACL changes: must be applied via kafka-acls.sh separately.Before producing or consuming, you need to create topics. The Admin API lets you create, delete, and inspect topics programmatically.
Run It
Build a Producer
send() with (topic, key, value) tuples, handle callbacks or promises for success/failure, and disconnect gracefully on shutdown. Real production producers add serialization (JSON/Avro/Protobuf), error handling, retry logic, metrics, tracing headers, and graceful shutdown. This section walks through building a producer using KafkaJS with TypeScript.- Kafka client factory: a singleton that holds the
Kafkaclient and produces producer instances. - Serializer: JSON.stringify for quick start; Avro/Protobuf for production.
- Keying strategy: pick keys to control partitioning and preserve per-entity ordering.
- Error handling: catch send failures, log with context, retry or DLQ.
- Graceful shutdown: SIGTERM handler that calls
producer.disconnect(). - Metrics hooks: instrument with Prometheus or OpenTelemetry.
- vs REST client: no request/response —
send()returns when the broker acks, but the app doesn't wait for a response from consumers. - vs RabbitMQ publisher: RabbitMQ publishers typically use publisher confirms; Kafka uses acks levels.
- vs SQS SendMessage: SQS is an HTTP call per message (or batches of 10). Kafka batches automatically.
acks: 'all', enable idempotence, set a reasonable retry count (5+), use structured logs with correlation IDs, implement graceful shutdown, wrap send errors for upstream handling, and never create multiple producer instances per process — share one.send().catch(log)) where appropriate.The producer sends events to Kafka topics. We'll build it step-by-step, explaining every line and every option.
Run It
Build a Consumer
run()/eachMessage() (or eachBatch() for batch processing), process each record, and let Kafka handle offset commits — or commit manually for tighter control. A production-grade consumer adds deserialization, idempotency checks, error handling with retry/DLQ, graceful shutdown, and metrics/tracing.- Consumer group configuration: pick a stable, meaningful
groupId. - Subscription: subscribe to one or more topics.
- Message handler: your business logic, ideally idempotent.
- Offset strategy: auto-commit for simple pipelines, manual for stronger guarantees.
- Error handling: retry transient errors, DLQ permanent ones, never let exceptions kill the loop.
- Graceful shutdown: stop the consumer on SIGTERM, commit offsets, then disconnect.
- vs RabbitMQ consumer: RabbitMQ pushes messages; Kafka pulls. Kafka has native consumer groups; RabbitMQ uses competing consumers.
- vs SQS consumer: SQS has long polling + visibility timeout per message. Kafka polls batches.
- vs Kinesis consumer: Kinesis KCL tracks shard leases in DynamoDB — heavier setup than Kafka consumer groups.
eachMessage without catching — it can halt the consumer. Use autoCommit: false with explicit commits for critical pipelines.max.poll.interval.ms or use background worker pattern. Double processing on rebalance: commit before losing partitions via rebalance listeners. Consumer lag growing silently: always monitor lag per partition.The consumer reads events from Kafka. It joins a consumer group, gets partitions assigned, and processes events in a loop.
Run It
Advanced Patterns
- Outbox pattern: write to DB and outbox table in one transaction; CDC streams the outbox to Kafka. Guarantees DB and Kafka stay in sync.
- Transactional produce-consume-commit: Kafka Streams-style exactly-once processing.
- Saga: long-running distributed transactions using a series of events + compensating events.
- Event sourcing: state is derived from a log of all events — rebuild by replay.
- CQRS: separate write model (commands) from read models (materialized views fed by events).
- Dead Letter Queue: isolate poison pills.
- Retry topic: delayed retries with backoff.
- Fanout: one topic feeding many consumer groups independently.
- vs REST-based distributed transactions: REST requires 2PC or Saga across services. Kafka + Outbox is simpler and more reliable.
- vs RabbitMQ: many of these patterns are harder in RabbitMQ because of the lack of replay and the per-message ack model.
- vs Temporal/Zeebe: Temporal/Zeebe are workflow engines with durable state machines. Kafka Saga is more lightweight but requires more manual coordination.
Production-ready patterns: batch sending, Dead Letter Queue, and transactions.
1. Batch Producer (High Throughput)
src/advanced/producer-batch.ts — Send many events at once for better performance.
2. Consumer with Dead Letter Queue
src/advanced/consumer-dlq.ts — Retry failed events and move permanently failed ones to a DLQ topic.
3. Transactions (Exactly-Once Processing)
src/advanced/transaction.ts — Read from one topic, process, write to another topic, and commit offset — all as ONE atomic operation.
Quick Reference: Run Commands
| Command | What It Does | Run Order |
|---|---|---|
| docker compose up -d | Start Kafka + UI containers | First (once) |
| npm run admin | Create topics (order-events, DLQ, confirmations) | Second (once) |
| npm run consumer | Start consumer (keeps running) | Third (Terminal 1) |
| npm run producer | Send order events | Fourth (Terminal 2) |
| npm run batch | Send 100 events at once | Anytime |
| npm run dlq | Consumer with retry + DLQ | Anytime |
| npm run txn | Exactly-once transaction consumer | Anytime |
Multi-Broker Docker Setup (3 Brokers)
cp-kafka. Three brokers is the minimum for replication.factor=3 + min.insync.replicas=2, which gives you real fault-tolerance testing.- 3 broker services in Docker Compose, each with unique
broker.idand advertised listeners. - KRaft controller quorum: in combined mode, all 3 nodes are controller+broker; for production, separate them.
- Internal listeners: for broker-to-broker communication (
INTERNAL://kafka-1:29092). - External listeners: for host access from your laptop (
EXTERNAL://localhost:9092). - Named volumes: for persistent data across restarts.
- Optional: Kafka UI (Redpanda Console, Kowl, kafka-ui, AKHQ) for visualization.
- vs single-broker: single-broker setups can't test replication, ISR behavior, or broker failover.
- vs Kubernetes (Strimzi): Strimzi provides a full operator-driven Kafka on K8s — production-grade, but heavier to run locally.
- vs Confluent Cloud / MSK: managed options skip local Docker entirely but cost money and need internet.
PLAINTEXT://localhost:9092, PLAINTEXT://localhost:9093, etc., for host-visible ports. Host memory: 3 JVM brokers + app can be heavy; tune heap sizes. Clock skew between containers is rare but matters for time-based retention.A single broker is fine for learning, but production needs at least 3 brokers for fault tolerance. If one broker dies, the other two keep serving data with zero downtime.
In Simple Words
One broker = one copy of your data. If it dies, everything is gone. Three brokers = three copies of your data on three different machines. Even if one machine catches fire, the other two have a full copy and keep working. That's why production always uses 3+ brokers.
docker-compose.yml (3 Brokers + KRaft + Kafka UI)
Start the 3-Broker Cluster
Update client.ts for Multi-Broker
The only change in your code — list all 3 broker addresses:
Test Fault Tolerance — Kill a Broker
Production-Grade Producer
acks: 'all'— wait for all ISR.enable.idempotence: true— deduplicate retries.retries: high (10+), with exponential backoff.max.in.flight.requests.per.connection: ≤5 with idempotence.compression.type:zstdorlz4.linger.ms: 10–50 for better batching.batch.size: tune based on record size.request.timeout.msanddelivery.timeout.ms: bound total delivery time.
- Singleton producer: one instance per process, shared across handlers.
- Structured logging: include trace IDs, topic, partition, offset.
- Prometheus metrics: throughput, latency, error rate, buffer usage.
- Distributed tracing: inject trace context into record headers (OpenTelemetry, W3C Trace Context).
- TLS + SASL: encrypted + authenticated connections.
- Graceful shutdown: SIGTERM handler flushes pending sends.
- Circuit breakers: stop sending if Kafka is persistently unreachable.
- vs dev producer: production producers prioritize correctness and durability over simplicity.
- vs RabbitMQ production publisher: RabbitMQ uses publisher confirms and mandatory routing flags — different failure modes.
buffer.memory fills up, sends block or throw. Missing graceful shutdown: loses unsent batches on deploys. Ignoring send errors: fire-and-forget is fine for logs but dangerous for business events. Not monitoring producer metrics: latency regressions go unnoticed.kafka-ruby.The basic producer works, but production needs retry handling, error callbacks, graceful shutdown, and proper batching configs. Here's a production-ready producer.
Production-Grade Consumer
- Manual offset commits: commit after processing, not before.
- Idempotent processing: handle duplicate delivery gracefully.
- Retry + DLQ strategy: distinguish transient from permanent errors.
- Rebalance listeners: commit on partition revocation to avoid double-processing.
- Backpressure: don't fetch faster than you can process.
- Graceful shutdown: finish in-flight work, commit offsets, then disconnect.
- Static membership: reduce rebalancing during rolling deploys.
- Lag monitoring: alert on per-partition lag thresholds.
group.id: stable, descriptive.enable.auto.commit: false: take manual control.max.poll.interval.ms: longer than worst-case processing time.max.poll.records: tune based on per-record processing cost.session.timeout.msandheartbeat.interval.ms: for liveness detection.isolation.level: read_committed: skip aborted transactional records.partition.assignment.strategy: CooperativeStickyAssignor for incremental rebalance.
- vs dev consumer: handles real failure modes instead of happy path.
- vs RabbitMQ consumer: RabbitMQ has per-message acks and no rebalancing — different operational model.
- vs SQS consumer: SQS has visibility timeouts and DLQs at the queue level, more prescriptive.
A production consumer needs manual commits, rebalance listeners, error handling, DLQ integration, and graceful shutdown.
Monitoring & Health Checks
- UnderReplicatedPartitions: should be 0. Non-zero = durability compromised.
- OfflinePartitionsCount: should be 0. Non-zero = unavailable data.
- ActiveControllerCount: should be exactly 1 cluster-wide.
- UnderMinIsrPartitionCount: should be 0 — producers with acks=all will block.
- Consumer lag (per group): alert if lag > threshold or growing.
- Request handler pool utilization: >80% = broker is overloaded.
- Disk usage: alert at 70%, page at 85%.
- JVM GC pauses: long G1 pauses cause heartbeat timeouts.
- Prometheus + Grafana: industry standard. Use
jmx_exporteror the newerkafka-exporter. - Burrow: LinkedIn's consumer lag monitor — de-facto standard for lag.
- Cruise Control: LinkedIn's automated rebalancing + capacity monitoring.
- Confluent Control Center: commercial dashboard for Confluent Platform.
- AKHQ, Kowl, Redpanda Console, kafka-ui: open-source UIs for topic browsing and lag inspection.
- Datadog, New Relic, Splunk: commercial APMs with Kafka integrations.
- vs RabbitMQ monitoring: RabbitMQ has a built-in management plugin with UI and metrics. Kafka relies on external scrapers.
- vs SQS/Kinesis: AWS provides CloudWatch metrics out of the box.
In production, you need to monitor consumer lag, broker health, and throughput. Here's how to build monitoring into your Node.js app.
In Simple Words
Imagine a factory assembly line. "Consumer lag" is how many boxes are piled up on the belt, waiting to be picked up. If the pile grows, your workers (consumers) are too slow. You need to either speed them up or add more workers.
Check Consumer Lag Programmatically
Health Check Endpoint
Add a simple HTTP health check so Kubernetes knows your consumer is alive:
Key Metrics to Monitor
| Metric | What It Means | Alert When |
|---|---|---|
| Consumer Lag | Events waiting to be processed | Lag > 10,000 or growing continuously |
| Rebalance Rate | How often partitions are reassigned | More than 5 rebalances / hour |
| Fetch Latency | Time to fetch a batch from Kafka | > 5 seconds consistently |
| Commit Latency | Time to commit an offset | > 3 seconds consistently |
| Under-Replicated Partitions | Partitions where ISR < replication factor | Any value > 0 |
| Offline Partitions | Partitions with no active leader | Any value > 0 (critical!) |
| DLQ Message Count | Failed events in the Dead Letter Queue | Any increase (investigate immediately) |
| Consumer Group State | Stable, Rebalancing, Empty, Dead | Anything other than "Stable" |
Production Deployment Checklist
- Replication factor ≥ 3 for all production topics.
- min.insync.replicas = 2 for durable topics.
- unclean.leader.election.enable = false.
- Rack awareness (
broker.rack) across AZs. - Disable auto.create.topics.enable; manage topics via IaC.
- KRaft mode (for new clusters) with dedicated controller quorum of 3 or 5.
- TLS + SASL/mTLS for all listeners.
- ACLs defined per service account.
- Monitoring + alerts on URP, controller count, lag, disk.
- Backup/disaster recovery strategy: MirrorMaker 2 or Confluent Replicator for cross-region.
- Producers:
acks=all, idempotence enabled, sensible retries, graceful shutdown. - Consumers: manual commits, idempotent processing, DLQ strategy, rebalance listeners.
- Singletons: one producer/consumer instance per process.
- Structured logging with correlation IDs.
- Metrics scraped and dashboards wired up.
- Distributed tracing integrated (headers propagation).
- Runbooks: for broker failure, partition reassignment, consumer stalls, disk full.
- On-call rotation with clear escalation paths.
- Capacity planning: headroom for 2× peak traffic.
- Chaos testing: kill a broker in staging regularly.
- Schema Registry with compatibility enforcement.
- Regular upgrades on supported Kafka versions.
Everything you need to verify before going to production. This is the summary of everything we've learned, as a checklist.
Cluster Configuration
| Config | Dev Value | Production Value | Why |
|---|---|---|---|
| Broker Count | 1 | 3+ (odd number) | Fault tolerance. Odd for clean quorum majority. |
| replication.factor | 1 | 3 | Survive 2 broker failures without data loss. |
| min.insync.replicas | 1 | 2 | With RF=3, at least 2 must confirm writes. |
| auto.create.topics.enable | true | false | Topics must be created deliberately with correct configs. |
| Controller Nodes | 1 (combined) | 3 dedicated | Dedicated controllers = faster metadata operations. |
| log.retention.hours | 168 (7 days) | Based on need | Balance between replay ability and disk cost. |
Producer Configuration
| Config | Dev Value | Production Value | Why |
|---|---|---|---|
| acks | 1 | -1 (all) | All ISR replicas must confirm. No data loss. |
| idempotent | false | true | Prevents duplicate events on retry. |
| compression | none | GZIP or ZSTD | Reduces network bandwidth and disk usage 60-80%. |
| retries | 0 | MAX_INT | Auto-set by idempotent=true. Never give up. |
| maxInFlightRequests | 5 | 5 | Max for idempotent. Sequence numbers maintain order. |
| Message Key | null | Business key | orderId, userId, etc. Guarantees ordering per key. |
| Graceful Shutdown | None | SIGTERM handler | Flush pending events before exit. |
Consumer Configuration
| Config | Dev Value | Production Value | Why |
|---|---|---|---|
| autoCommit | true | false | Manual commit after processing. Prevents data loss. |
| fromBeginning | true | false | Only read new events. Don't reprocess millions of old events. |
| sessionTimeout | 30s | 30s | Balance between detection speed and false alarms. |
| heartbeatInterval | 3s | 3s (1/3 of session) | Regular heartbeats prevent false death detection. |
| Instances Count | 1 | = partition count | 1 consumer per partition = max parallelism. |
| DLQ | None | topic.DLQ | Failed events go to DLQ, not blocking the queue. |
| Health Check | None | /health + /ready | Kubernetes needs HTTP endpoints to manage pods. |
| Graceful Shutdown | None | SIGTERM handler | Commit offsets and leave group cleanly on deploy. |
Operational Checklist
Before Launch
- All topics created manually with correct partition count and RF=3.
- SSL/TLS enabled for all connections.
- SASL authentication configured per service.
- ACLs restrict topic access (service A can't read service B's topics).
- Schema Registry configured with BACKWARD compatibility.
- DLQ topics created for every consumer topic.
- Monitoring dashboards set up (consumer lag, broker health).
- Alerts configured (lag > threshold, offline partitions, DLQ growth).
During Operation
- Monitor consumer lag trends (growing = problem).
- Watch rebalance frequency (too many = consumers are unstable).
- Review DLQ events daily (fix root causes, replay fixed events).
- Check under-replicated partitions (ISR < RF = broker may be failing).
- Track disk usage on brokers (set up alerts at 80% capacity).
- Test disaster recovery: periodically kill a broker, verify auto-recovery.
Scaling Decisions
- Consumer lag growing? Add more consumer instances (up to partition count).
- Need more parallelism? Add more partitions (can't reduce later!).
- Disk full? Add more brokers or reduce retention period.
- High latency? Check compression, batch size, fetch size configs.
- Too many rebalances? Use CooperativeSticky assignor, tune timeouts.
- Network bottleneck? Enable compression (ZSTD is fastest with good ratio).