Apache Kafka

A distributed event streaming platform built for high-throughput, fault-tolerant, real-time data pipelines. This guide covers every concept from fundamentals to production failure scenarios.

Distributed Event Streaming Fault Tolerant High Throughput Horizontally Scalable Durable Storage
Chapter 01

Event-Driven Architecture (EDA)

What is itEvent-Driven Architecture (EDA) is a software design paradigm where services communicate by producing and reacting to events instead of calling each other synchronously via REST or RPC. An event is an immutable fact that something happened in the past (OrderPlaced, PaymentCompleted, UserSignedUp). Producers emit events to a broker without knowing or caring who (if anyone) will consume them, and consumers subscribe to whichever events they care about. This completely decouples services in time, space, and knowledge — the producer and consumer don't need to be online at the same time, don't need to know each other's network address, and don't share types or APIs beyond the event schema.
Key characteristics
  • Asynchronous: producer fires an event and immediately moves on — no waiting for consumers.
  • Loose coupling: producer has zero knowledge of consumers; consumers can be added or removed without producer changes.
  • Eventual consistency: state converges across services over milliseconds to seconds, not instantly.
  • Replay-ability (log-based): events are persisted so new consumers can rebuild state from the beginning.
  • Scalability: each service scales independently based on its own throughput — no "slowest link" bottleneck.
Two flavors
  • Push-based Pub/Sub: broker pushes messages to active subscribers. If consumer is down, the message is lost (or dead-lettered). Examples: RabbitMQ, AWS SNS, Google Pub/Sub, Redis Pub/Sub.
  • Pull-based Streaming (log-based): events are appended to a durable log and consumers pull at their own pace from any offset. Examples: Apache Kafka, AWS Kinesis, Apache Pulsar, Redpanda.
How it differs from REST microservicesIn a REST microservice chain, Service A calls B calls C calls D — latency adds up, any failure cascades, and scaling is bottlenecked by the slowest service. In EDA, Service A emits an event and B, C, D react independently. Total throughput is decoupled from individual service latency, temporary outages don't break the chain (events wait in the broker), and you can add a new consumer for analytics/audit without touching the producer.
Why use itEDA shines when you need high throughput, independent team velocity, audit trails, replay capability, and resilience to partial failures. It's the de-facto architecture for modern e-commerce (order flow), banking (transaction processing), IoT (sensor ingestion), analytics pipelines, and CQRS/Event Sourcing systems.
Common gotchasEventual consistency is hard to reason about — users may see stale data for seconds. Debugging distributed event flows requires correlation IDs and distributed tracing (Jaeger, OpenTelemetry). Schema evolution becomes critical: a breaking change to an event payload can silently corrupt downstream consumers. Exactly-once delivery requires careful idempotency design.
Real-world examplesUber uses EDA for ride dispatch and surge pricing. Netflix processes 1+ trillion events/day through Kafka for personalization and A/B testing. LinkedIn (Kafka's birthplace) uses EDA for activity feeds and messaging. Airbnb uses it for search indexing and pricing updates. Goldman Sachs runs trade processing pipelines on Kafka.

Services don't call each other directly. Instead, a service emits events and other services react to those events.

What is an Event?

REST Microservices vs EDA Microservices

REST (Synchronous) Challenges
  • Availability — all services must be up. If any one service is down, the entire API call fails.
  • Latency — total latency = sum of all service latencies in the chain.
  • Cascading Failure — if Service 3 is slow or failing, Service 2 backs up, then Service 1 goes down too.
  • Tight Coupling — services know about each other, changes ripple across teams.
  • Scaling Issue — scaling Service 3 has no benefit if Service 1 and 2 can only handle the same request rate.
EDA (Asynchronous) Advantages
  • Loose Coupling — producer doesn't know or care about consumers.
  • Independent Scaling — each service scales on its own based on its throughput.
  • Resilience — temporary failures don't break the whole system. Events wait in the broker.
  • Replay — you can replay old events for recovery, debugging, or new consumer onboarding.
  • Latency Improvement — fire-and-forget pattern. Producer doesn't block waiting for downstream.

Two Types of EDA

Push-Based (Pub/Sub)
  • Broker pushes events to active consumers.
  • Only cares about delivery.
  • You cannot replay old events.
  • Events published only to currently active subscribers.
  • Examples: RabbitMQ, AWS SNS, Google Pub/Sub.
Pull-Based (Streaming / Kafka)
  • Consumer pulls events at its own pace.
  • Events are appended to logs — persisted on disk.
  • You can replay events from any offset.
  • Retention-based: events stay forever or until configured TTL.
  • Examples: Apache Kafka, Amazon Kinesis, Redpanda.
Chapter 02

What is Apache Kafka?

What is itApache Kafka is a distributed, partitioned, replicated, log-based streaming platform — essentially a horizontally scalable append-only commit log that applications read from and write to. Originally built at LinkedIn in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao to handle LinkedIn's activity stream firehose, it was open-sourced via the Apache Software Foundation in 2011. At its core, Kafka stores streams of records in topics, which are split into partitions, which are append-only log files on disk. Producers write to the end; consumers read from any offset they choose. This "dumb broker, smart consumer" model is what enables Kafka to hit millions of messages per second per broker with single-digit millisecond latency.
Key capabilities
  • Pub/Sub messaging: publish events, subscribe with one or many consumer groups.
  • Stream processing: via Kafka Streams and ksqlDB for joining, aggregating, and transforming streams in real time.
  • Storage: durable, replicated, configurable retention (hours, days, forever).
  • Integration: Kafka Connect provides 100+ prebuilt source/sink connectors (JDBC, S3, Elasticsearch, MongoDB, etc.).
  • Exactly-once semantics: transactional writes across multiple partitions and topics since Kafka 0.11.
How it differs from other brokers
  • vs RabbitMQ / ActiveMQ: Those are queue-based brokers — messages are deleted after consumption. Kafka is log-based — messages stay for the entire retention window and can be replayed. RabbitMQ excels at complex routing (exchanges, headers, topics); Kafka excels at raw throughput and replay.
  • vs AWS SQS: SQS is a fully managed distributed queue with at-least-once delivery and no replay. Kafka gives you ordered partitions, replay, and far higher throughput, but you manage infrastructure (or use MSK/Confluent Cloud).
  • vs AWS Kinesis: Kinesis is Kafka's closest AWS-native competitor — also partitioned, also log-based — but it's managed, has lower throughput per shard, charges per shard-hour, and has 7-day max retention (24 hours by default).
  • vs Apache Pulsar: Pulsar separates compute (brokers) from storage (BookKeeper), offers tiered storage natively, and supports geo-replication out of the box. Kafka's ecosystem and tooling are more mature.
  • vs NATS / Redis Streams: NATS is lighter and faster for small messages but lacks Kafka's durability guarantees at scale. Redis Streams is simpler but not designed for multi-terabyte workloads.
Why use itUse Kafka when you need high-throughput event ingestion, durable replayable streams, and a central nervous system for your microservices. Typical workloads: activity tracking, metrics aggregation, log aggregation, stream processing, event sourcing, CDC (Change Data Capture), and real-time ML pipelines.
Common gotchasRunning Kafka yourself is non-trivial — ZooKeeper (pre-3.0) or KRaft quorum management, partition rebalancing, broker tuning, and monitoring all require expertise. Ordering is guaranteed only within a single partition. Consumer lag monitoring is essential. Message size defaults to 1 MB — sending large payloads requires config tuning or external storage.
Real-world scaleLinkedIn processes 7+ trillion messages/day across 100+ clusters. Netflix handles 1+ trillion events/day. Uber runs 4+ trillion messages/day for trip events and surge pricing. Walmart, Cisco, Goldman Sachs, Twitter, Spotify, Pinterest, PayPal, Target, and Box all run Kafka at petabyte scale.

A distributed event streaming platform that allows you to publish, store, and subscribe to streams of events in real-time.

In simple words: Kafka is like a super-fast, never-forgetting diary. Applications write events (things that happened) into this diary, and other applications can read those events whenever they want — even days later. The diary can handle millions of entries per second and never loses a page.

P
Publish Events

Producers write events (messages) to Kafka topics. Events are key-value pairs with optional headers and a timestamp. Producers choose which topic and optionally which partition.

S
Store Events

Events are durably stored on disk across multiple brokers with configurable replication. Kafka can retain events for hours, days, or forever. Storage is sequential and immutable.

C
Subscribe to Events

Consumers pull events from topics at their own pace. Multiple consumer groups can independently read the same data. Each group tracks its own position (offset) in the log.

High-Level Kafka Architecture
Producer 1 ──┐ ┌── Consumer Group 1 Producer 2 ──┤ ┌─────────────────────────────┐ ├── Consumer Group 2 Producer 3 ──┼───▶│ KAFKA CLUSTER │──────▶├── Consumer Group 3 Producer N ──┤ │ Broker 1 Broker 2 Broker 3│ ├── Consumer Group N │ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │ └───▶│ │Topic │ │Topic │ │Topic │ │───────┘ │ │P0 P1 │ │P0 P1 │ │P0 P1 │ │ │ └──────┘ └──────┘ └──────┘ │ └─────────────────────────────┘
Millions
Events/sec throughput
ms
End-to-end latency
PB
Scale (Petabytes of data)
99.99%
Availability with replication
Chapter 03

Producer

What is itA Kafka Producer is any client application that publishes (writes) records to Kafka topics. Producers are the "write side" of Kafka — they batch records in memory, serialize them, pick a partition (via the configured partitioner), compress them, and ship them to the partition leader broker over TCP. Kafka producers are thread-safe, reusable, and designed to be long-lived — you create one producer per application and share it across request handlers. The producer library is implemented in Java (official), with high-quality native clients in librdkafka (C/C++), confluent-kafka-go, kafkajs (Node.js), aiokafka (Python), and more.
Key settings
  • acks: 0 (fire-and-forget), 1 (leader ack), all (all in-sync replicas ack — strongest durability).
  • linger.ms: how long to wait to build a batch before sending (higher = better throughput, higher latency).
  • batch.size: max batch size in bytes per partition (default 16 KB).
  • compression.type: none, gzip, snappy, lz4, zstd — compresses entire batches, reducing network and disk cost by 3–10×.
  • enable.idempotence: deduplicates retries using PID + sequence numbers, ensures exactly-once per partition.
  • max.in.flight.requests.per.connection: affects ordering guarantees with retries.
How it differs from other brokers' producers
  • vs RabbitMQ publishers: RabbitMQ publishers send to an exchange which routes to queues based on routing keys and bindings. Kafka producers write directly to topic partitions — no routing logic in the broker.
  • vs SQS SendMessage: SQS is one-message-at-a-time HTTP calls (or batches of 10); Kafka producers batch thousands of records in a single TCP request.
  • vs Kinesis PutRecord: Kinesis has HTTP-based PutRecords with 500 records/batch limit and 1 MB/sec per shard write cap. Kafka has no hard per-partition write cap beyond what the broker can handle.
  • vs Pulsar producers: Very similar API but Pulsar producers can optionally use broker-side deduplication without the client tracking sequence numbers.
Why batch and bufferThe producer's buffering model is the single biggest throughput lever in Kafka. By accumulating records in per-partition batches and compressing them together, a producer can push hundreds of thousands of records/sec per instance. Without batching, the overhead of TCP round trips and broker I/O dominates.
Common gotchasForgetting to call producer.flush() or producer.close() before shutdown drops unsent batches. Using acks=1 in production can lose data on leader failure. acks=all without min.insync.replicas=2 is a trap — a single replica still "works" but gives no real durability. Sending without a key gives round-robin partitioning, destroying ordering for related events.
Real-world examplesUber's dispatch service uses Kafka producers to emit driver location pings. Netflix's keystone pipeline has producers on every microservice emitting 1M+ events/sec. Shopify uses producers to emit order events to feed inventory and shipping services.

An application that publishes events to Kafka. The producer does not care who will consume the event. It's fire-and-forward.

In simple words: A producer is like a news reporter. They write the news story and send it to the newspaper office (Kafka). They don't know or care who reads the newspaper — that's not their job. Their job is just to report what happened.

Producer Record Structure

// A Kafka Producer Record contains: { "topic": "order-events", // Required: which topic "key": "order-123", // Optional: determines partition "value": "{ orderJson }", // Required: the actual event data "headers": { "source": "api-gw" }, // Optional: metadata key-value pairs "timestamp": 1709424000000 // Optional: event time or create time }

Producer ACK Configurations

The producer can control how many acknowledgements it waits for before considering the write successful.

0
acks = 0 (Fire & Forget)

Producer sends the record and does NOT wait for any response. Fastest but data can be lost if broker crashes before writing. Use for metrics/logs where loss is acceptable.

1
acks = 1 (Leader Only)

Producer waits for the leader broker to write to its local log and acknowledge. Balanced between speed and safety. Data can be lost if leader crashes before followers replicate.

all
acks = all (All ISR)

Producer waits for the leader AND all in-sync replicas (ISR) to write the event. Slowest but safest. No data loss as long as at least one ISR is alive. Production recommended.

Chapter 04

Topic

What is itA Topic in Kafka is a named category of events — conceptually similar to a table name in a database or a channel name in Pub/Sub. It's the primary unit of organization: producers publish records to a topic, consumers subscribe to a topic, and Kafka keeps those records durable on disk. Topics are multi-producer, multi-subscriber, and append-only. Importantly, a topic is a logical construct — physically, it's split into one or more partitions, each of which is an append-only log file replicated across brokers.
Key properties
  • Name: A string identifier (e.g., orders.placed.v1, user.signups).
  • Partition count: Fixed at creation (can be increased later but not decreased). Determines max consumer parallelism.
  • Replication factor: How many brokers hold a copy of each partition — typically 3 in production.
  • Retention: Time-based (retention.ms) or size-based (retention.bytes), or compacted (keep only the latest value per key).
  • Cleanup policy: delete (default) or compact.
How it differs
  • vs RabbitMQ queue: A RabbitMQ queue is consumed destructively — once a consumer reads a message, it's gone. A Kafka topic is a durable log — all consumers can read all messages independently until retention expires.
  • vs RabbitMQ exchange/topic: RabbitMQ's "topic exchange" does pattern-based routing to bound queues. Kafka topics have no routing logic — the topic name IS the destination.
  • vs SQS queue: SQS has a single consumer group per queue (roughly). Kafka topics support unlimited independent consumer groups, each with their own offset.
  • vs Kinesis stream: Near-identical concept — Kinesis streams are partitioned logs too, but Kinesis caps throughput per shard and has lower max retention.
  • vs Pulsar topic: Pulsar topics are more flexible — they can be non-partitioned or partitioned, and support per-topic subscription modes (exclusive, shared, failover).
Naming conventionsGood topic names follow a hierarchy: <domain>.<entity>.<event>.v<version>, e.g., billing.invoice.created.v2. Avoid generic names like events or data. Version the topic name so breaking schema changes can coexist with old consumers. Use dots, hyphens, or underscores consistently.
Common gotchasYou cannot delete messages from a topic selectively — only by advancing retention. Partition count can only be increased, and adding partitions breaks ordering for existing key-based routing. Creating topics on-the-fly via auto.create.topics.enable=true is a production footgun — disable it and use explicit topic creation via Admin API or IaC.
Real-world examplesNetflix has 4000+ topics for different event types in its Keystone pipeline. LinkedIn uses topics like ActivityEvent, PageViewEvent. Uber has topics per city/region for dispatch events, totaling tens of thousands across the fleet.

A Topic is a logical category or folder to which events are published. Think of it as a named stream. Many producers can write to the same topic, and many consumer groups can read from the same topic independently.

In simple words: A topic is like a TV channel. Channel "order-events" shows all order-related news. Channel "payment-events" shows all payment-related news. Producers broadcast to a channel, and consumers tune into whichever channels they care about.

Key Points A topic is a logical concept — it does NOT reside on a single machine. Its partitions are distributed across multiple brokers. A topic never "fails" because it's not a physical entity. What fails are the partitions (and they have replicas for recovery).

Topic Properties

PropertyDescriptionDefault
num.partitionsNumber of partitions when topic is created1
replication.factorHow many copies of each partition across brokers1 (set to 3 in prod)
retention.msHow long to keep events before deletion604800000 (7 days)
retention.bytesMax size of partition log before old segments are deleted-1 (unlimited)
cleanup.policyWhether to delete or compact old eventsdelete
min.insync.replicasMinimum ISR count for acks=all writes to succeed1 (set to 2 in prod)
segment.bytesSize of each segment file in the partition log1073741824 (1 GB)

Internal Topics (created by Kafka itself)

Topic NamePartitionsPurpose
__consumer_offsets50 (default)Stores committed offsets for every consumer group. Keyed by (groupId, topic, partition).
__transaction_state50 (default)Stores transaction metadata for exactly-once semantics.
_cluster_metadata.log1KRaft cluster metadata log (replicated across controllers).
Chapter 05

Partitions

What is itA partition is the fundamental unit of parallelism, ordering, and storage in Kafka. Every topic is split into one or more partitions, and each partition is a strictly-ordered, immutable, append-only log file stored on a broker's disk. Records within a partition get monotonically increasing offsets (0, 1, 2, 3, ...) as they're appended. Ordering is guaranteed within a partition but NOT across partitions — this is the single most important constraint in Kafka. A topic with N partitions can be consumed in parallel by up to N consumer instances in a group.
Key properties
  • Partition count = max consumer parallelism per consumer group.
  • Leader-follower replication: one broker is the leader for a partition; followers replicate from it.
  • Offsets are per-partition, not per-topic — each consumer tracks (topic, partition, offset) tuples.
  • Keyed writes: records with the same key always land in the same partition (via hash partitioning by default), preserving per-key ordering.
  • Partitions cannot be decreased — only added. And adding partitions breaks existing key-to-partition mapping.
How it differs
  • vs RabbitMQ queue sharding: RabbitMQ can shard queues via the sharding plugin, but ordering and consumer group semantics are not first-class.
  • vs SQS: SQS has no partitions — FIFO queues use "message group IDs" to give per-group ordering but throughput is capped at ~3000 msg/sec per group.
  • vs Kinesis shards: Kinesis shards are the direct analog — same idea, same constraints, but each shard caps at 1 MB/sec writes and 2 MB/sec reads, whereas Kafka partitions have no hard cap.
  • vs Pulsar: Pulsar partitions are similar but can be dynamically added without the broker doing heavy rebalancing since storage is in BookKeeper.
How to pick partition countRule of thumb: target throughput / per-consumer throughput. If you need to consume 500 MB/s and one consumer handles 50 MB/s, use at least 10 partitions. Add headroom (2–4×) for future growth. Don't go crazy — too many partitions (10,000+) per broker hurts recovery time, increases end-to-end latency, and consumes file handles and memory.
Common gotchasHot partitions: one key gets 80% of traffic (e.g., a celebrity user's events), saturating one broker. Fix with composite keys or random suffixes. Re-partitioning: changing partition count breaks key-based routing — plan ahead. Too few partitions: limits consumer parallelism permanently. Too many: each partition adds metadata overhead, slows rebalancing, and wastes disk I/O on small batches.
Real-world examplesLinkedIn's biggest topics run with 100s of partitions. Netflix topics for trace events run 1000+ partitions. Confluent recommends 4000 partitions per broker as a soft ceiling.

A partition is a physical, ordered, append-only log that is part of a topic. It's where events are actually stored on disk.

In simple words: If a topic is a book, then partitions are the chapters. The book "order-events" might have 3 chapters (partitions). Each chapter is stored on a different shelf (broker). Events are written at the end of a chapter — you never insert pages in the middle. This makes writing super fast.

Three Key Properties

P
Physical

Events are actually stored on disk inside the partition as log files. Each partition maps to a directory on the broker's filesystem containing segment files. Example: /data/order-events-0/

O
Ordered

Events are written sequentially and assigned an incrementing offset (starting from 0). Consumers read in offset order. Within a single partition, order is guaranteed. Kafka never re-orders events inside a partition.

A
Append-Only

Events are only appended to the end of the log. No insert in the middle. No updates. No deletes of individual records. The log only grows forward. This is what makes Kafka's sequential disk I/O so fast.

Topic: order-events — 3 Partitions
Partition 0 (order-events-0)
E0
off:0
E3
off:1
E6
off:2
E9
off:3
next
Partition 1 (order-events-1)
E1
off:0
E4
off:1
E7
off:2
next
Partition 2 (order-events-2)
E2
off:0
E5
off:1
E8
off:2
next
Ordering Guarantee Order is guaranteed within a single partition. In P0, E3 definitely happened after E0. But you cannot compare across partitions — E4 in P1 may have happened before or after E3 in P0. Offsets are per-partition, not global.
Chapter 06

Partitioning Strategies

What is itPartitioning strategy is how the producer decides which partition each record goes to. The strategy is a client-side decision made by the producer's Partitioner class before the record ever reaches the broker. The default Java partitioner uses murmur2 hash of the key modulo partition count. If the key is null, Kafka 2.4+ uses "sticky partitioning" — round-robin in batches for better batching efficiency. Kafka lets you implement custom partitioners by implementing the Partitioner interface.
Built-in strategies
  • Key-hash (default when key is set): partition = hash(key) % num_partitions. Guarantees same key → same partition → preserves ordering.
  • Round-robin (pre-2.4, null key): spreads records evenly across partitions but creates many small batches.
  • Sticky partitioner (2.4+, null key): fills one partition's batch fully, then switches — much better batching throughput.
  • Manual: producer explicitly passes a partition number, bypassing the partitioner entirely.
  • Custom: implement your own logic — e.g., geographic partitioning, tenant-based, or time-bucketed.
How it differs
  • vs RabbitMQ routing keys: RabbitMQ routes via exchange bindings and pattern matching (server-side). Kafka partitioning is deterministic hash on the client.
  • vs SQS FIFO message group ID: SQS groups messages by "message group ID" for per-group ordering — similar intent, but SQS caps per-group throughput.
  • vs Kinesis partition key: Essentially identical — both use a hash function on a client-supplied key.
  • vs Pulsar: Pulsar has similar partitioning but also supports "key_shared" subscription mode where consumers get consistent per-key routing automatically.
Why it mattersThe partition strategy determines ordering guarantees, data locality, and load distribution. Use userId as the key when you need per-user ordering (e.g., account updates). Use orderId for order event streams. Use no key (null) for fire-and-forget logs where ordering doesn't matter and you want maximum throughput.
Common gotchasKey skew: if 1% of your keys account for 90% of records, you get hot partitions. Adding partitions breaks hashing: after adding partitions, the same key may now land on a different partition, breaking per-key ordering. Null-key round-robin means no ordering — be explicit when you need it.
Real-world examplesUber uses rider/driver IDs as partition keys for dispatch events. Stripe partitions by account ID to preserve per-account ordering of ledger events. Twitter has used user ID-based partitioning for timeline fanout.

Producer chooses which partition of a topic the event goes to. This decision impacts ordering guarantees and load distribution.

#
Key-Based Partitioning

partition = hash(key) % num_partitions

All events with the same key go to the same partition. This guarantees ordering for related events. Example: all events for orderId "123" always go to the same partition.

Ordering Guaranteed Possible Hotspots

R
Round Robin (No Key)

When key = null, events are distributed in round-robin across partitions.

E0 → P0, E1 → P1, E2 → P2, E3 → P0 ... Even distribution but ordering is NOT guaranteed. Related events may land in different partitions.

Even Distribution No Ordering

C
Custom Partitioner

You implement your own Partitioner interface with business logic.

Example: country == "India" → P0, country == "US" → P1. Gives full control over data locality and processing affinity.

Full Control Custom Logic

Key-Based Partitioning Example
// Topic: order-events, 3 partitions Publish: { topic: "order-events", key: "order-123", value: orderJson } partition = hash("order-123") % 3 = 1 → always goes to Partition 1 E0: hash("123") % 3 = 1 → Partition 1 E1: hash("456") % 3 = 2 → Partition 2 E2: hash("789") % 3 = 0 → Partition 0 E3: hash("123") % 3 = 1 → Partition 1 (same key = same partition)
Chapter 07

Segments & Indexing

What is itA Kafka partition is not a single giant file — it's a sequence of segment files on disk. Each segment is a fixed-size (default 1 GB or 7 days) .log file containing the actual record bytes, plus two index files: .index (offset → file position) and .timeindex (timestamp → offset). When a segment fills up or ages out, Kafka "rolls" to a new segment and the old one becomes immutable and eligible for deletion or compaction. Segmenting is what makes Kafka's log storage practical — you can delete old data by just unlinking files, and you can memory-map individual segments for efficient reads.
Key files per segment
  • 00000000000000000000.log: the actual records, append-only.
  • 00000000000000000000.index: sparse offset index for binary search (one entry every ~4 KB by default).
  • 00000000000000000000.timeindex: sparse timestamp index enabling offset-by-time lookups.
  • .snapshot (newer versions): producer state snapshot for idempotence.
How it differs
  • vs RabbitMQ mnesia: RabbitMQ's classic queues store messages in memory-backed mnesia with disk spillover — not a flat log file.
  • vs database storage: Databases use B-trees or LSM trees with random writes. Kafka is pure sequential append — vastly faster on spinning disk and still very fast on SSD.
  • vs Pulsar/BookKeeper: Pulsar stores data in BookKeeper ledgers which are also segmented logs, but with erasure coding and tiered storage built in.
  • vs Kinesis: Kinesis abstracts storage entirely — you never see segment files; AWS manages them.
Why sparse indexesA full offset-to-position index on a billion-record partition would be huge. Kafka uses a sparse index — one entry every index.interval.bytes (default 4 KB) of log data. To find an exact offset, Kafka does a binary search in the index to find the nearest entry, then scans forward in the log. This keeps indexes small (a few MB per GB of log) and still gives O(log n) lookups.
Common gotchasToo-small segments: with segment.bytes=10MB on a high-traffic topic you create millions of files, exhausting file descriptors. Too-large segments: with segment.bytes=10GB you can't delete expired data until the whole segment rolls. Time-based retention only kicks in after segment roll — so a stale segment can stick around past its TTL.
Real-world examplesConfluent recommends 1 GB segments for most workloads. LinkedIn tunes segment size per topic based on retention and throughput. Uber uses time-based segment rolls for easier retention management on archival topics.

Each partition log file is further divided into segments. Instead of one massive log file, Kafka splits it into smaller, manageable pieces for faster reads and efficient cleanup.

In simple words: Imagine writing a very long diary. Instead of one 10,000-page book, you split it into volumes: Volume 1 (pages 1-500), Volume 2 (pages 501-1000), etc. When someone asks for page 750, you grab Volume 2 directly instead of flipping through the entire diary. That's what segments do for Kafka.

Why Segments?

If a consumer says "read from offset 550 to 800 of Partition 0", Kafka can skip directly to the right segment file instead of scanning from the beginning. Also, retention and compaction operate at the segment level — Kafka deletes or compacts entire segment files.

Partition 0 → Segment Files
Topic: order-events └── Partition 0: order-events-0/ ├── 00000000000000000000.log ← offsets 0-499 ├── 00000000000000000000.index ← sparse index for segment 1 ├── 00000000000000000000.timeindex ├── 00000000000000000500.log ← offsets 500-999 ├── 00000000000000000500.index ← sparse index for segment 2 ├── 00000000000000000500.timeindex ├── 00000000000000001000.log ← offsets 1000+ (active) ├── 00000000000000001000.index └── 00000000000000001000.timeindex File name = first offset in that segment

Sparse Index (.index files)

Even with segments, a 1 GB segment file would still need sequential scanning. So Kafka maintains a sparse index per segment. Not every offset is indexed. Instead, it creates one index entry after every N bytes written (configured by log.index.interval.bytes, default 4096 bytes).

Index Lookup Example

Segment Log (00000000.log)

OffsetEvent SizeFile Position
0300 bytes0
1500 bytes300
21000 bytes800
32000 bytes1800
4500 bytes3800

Total bytes: 300+500+1000+2000+500 = 4300 bytes

Sparse Index (00000000.index)

OffsetPosition (bytes)
43800

Index interval = 4096 bytes. After 4300 bytes written (> 4096), Kafka records offset 4 at position 3800. To find offset 3, Kafka looks at index, finds nearest lower entry, then scans sequentially from there.

How Offset Lookup Works 1. Identify the correct segment file by comparing the target offset with segment file names.   2. Binary search the .index file to find the nearest lower indexed offset.   3. Sequential scan from that position in the .log file to the exact offset.   This gives O(1) segment lookup + O(log n) index lookup + small sequential scan.
Chapter 08

Broker

What is itA Kafka broker is a single Kafka server — one JVM process running on a machine (physical, VM, or container) that hosts partitions, serves producer writes, serves consumer reads, and participates in replication. Each broker has a unique integer broker.id. A Kafka cluster is simply a group of brokers that know about each other via ZooKeeper (legacy) or KRaft quorum (modern). Brokers are stateful — they own data on local disk — so they're typically deployed on dedicated machines with fast SSDs.
Responsibilities
  • Host partition replicas: each broker stores some partitions as leader and others as follower.
  • Serve producer requests: accept writes to partitions it leads, append to the log, replicate to followers.
  • Serve consumer fetch requests: return records from the log (possibly via zero-copy sendfile()).
  • Handle replication: followers fetch from leaders continuously.
  • Participate in leader election and ISR management via the controller.
  • Expose JMX metrics for monitoring (under-replicated partitions, request rate, disk I/O, etc.).
How it differs
  • vs RabbitMQ node: RabbitMQ nodes also hold queues and form a cluster, but don't natively partition a queue across nodes — you need sharding plugins or mirrored queues.
  • vs SQS: SQS has no user-visible brokers — AWS fully manages the backend.
  • vs Kinesis: Kinesis shards are stored on AWS-managed infrastructure; there's no broker concept for customers.
  • vs Pulsar broker: Pulsar brokers are stateless — storage lives in BookKeeper bookies. This separation lets Pulsar scale compute and storage independently.
Why use multiple brokersYou run multiple brokers for horizontal scalability, fault tolerance, and throughput. A 3-broker cluster with replication.factor=3 survives one broker failure with zero data loss. More brokers = more partitions = more parallelism. A typical production cluster runs 5–50 brokers; the largest known clusters (LinkedIn, Netflix) run 100s of brokers per cluster.
Common gotchasUneven partition distribution: if partitions aren't balanced, some brokers get hot. Use kafka-reassign-partitions.sh or Cruise Control. Disk I/O: Kafka is sequential-write-heavy — use XFS or ext4 on SSD. Page cache: Kafka relies on the OS page cache for read performance — give brokers lots of RAM (JVM heap 6 GB, rest to page cache). JVM GC tuning: use G1GC with small pause targets.
Real-world examplesLinkedIn runs some clusters with 300+ brokers. Netflix operates multiple 100+ broker clusters. Confluent Cloud abstracts brokers away behind a serverless API.

A Broker is a single Kafka server instance. It's the one that actually stores data on disk and serves client requests (both producers and consumers).

In simple words: A broker is like a warehouse worker. Each worker (broker) manages some shelves (partitions). Producers give packages (events) to the right worker, and consumers ask the right worker for their packages. Multiple workers together form a team (cluster).

Critical Understanding Topics and partitions are distributed across multiple brokers. A single broker does NOT hold all topics. It also does NOT hold all partitions of the same topic. Each broker stores some partitions of some topics.

Broker Responsibilities

Broker Networking

Kafka uses a custom binary protocol over TCP. Default port is 9092. Each broker has a unique broker.id. Clients connect to any broker (bootstrap) to discover the full cluster topology, then connect directly to the partition leaders they need.

Chapter 09

Kafka Cluster

What is itA Kafka cluster is a group of Kafka brokers working together to provide a single logical message system. Brokers in a cluster share metadata (topic list, partition assignments, ISR state) via either ZooKeeper (pre-Kafka 3.0) or KRaft (Kafka 3.0+ native quorum). One broker acts as the controller, responsible for leader election and partition state management. Clients (producers/consumers) discover the cluster by connecting to any subset of brokers (the bootstrap.servers list) and then fetching metadata to learn about all brokers and partition leaders.
Key components
  • Brokers: the data plane — store and serve partitions.
  • Controller: one elected broker that manages partition leadership and metadata changes.
  • Metadata quorum (KRaft) or ZooKeeper ensemble: coordinates cluster state.
  • Bootstrap servers: initial contact points for clients.
  • Listeners: network endpoints (internal, external, SASL) brokers advertise.
How it differs
  • vs RabbitMQ cluster: RabbitMQ clusters share metadata via Erlang distribution, and high availability requires mirrored queues or quorum queues. Kafka has replication as a first-class primitive.
  • vs Kinesis stream: A Kinesis stream is the closest AWS-native analog to a Kafka cluster, but it's bounded by shard limits and AWS-region scope.
  • vs Pulsar cluster: Pulsar clusters separate brokers (stateless) from BookKeeper bookies (storage), making independent scaling easier.
  • vs Redis Cluster: Redis Cluster is for key-value sharding, not durable event logs.
Cluster sizingStart with at least 3 brokers for fault tolerance with replication.factor=3. Scale out when you hit CPU, disk I/O, or network bottlenecks. Rule of thumb: one broker can handle ~100 MB/s of combined produce+consume with typical workloads. Cross-AZ deployment is mandatory for durability in cloud environments (3 brokers across 3 AZs).
Common gotchasSplit brain can happen if the controller loses ZooKeeper/KRaft connectivity — modern Kafka handles this, but misconfigured clusters can get stuck. Rack awareness (broker.rack) must be set to ensure replicas land on different AZs. Bootstrap servers should list multiple brokers — if you list only one and it's down, clients can't connect.
Real-world examplesLinkedIn runs 100+ Kafka clusters. Uber operates multi-region Kafka clusters with MirrorMaker 2 for cross-region replication. Confluent Cloud, AWS MSK, Aiven, and Redpanda Cloud offer managed Kafka-compatible clusters.

A Kafka Cluster is a group of brokers working together to provide scalability, fault tolerance, and high availability.

S
Scalability

Distribute load across multiple servers. More brokers = more partitions = more throughput. Producers and consumers talk to different brokers in parallel.

F
Fault Tolerance

If a broker goes down, its partitions have replicas on other brokers. Data is never lost (with replication factor >= 2). Automatic failover via leader election.

H
High Availability

No single point of failure. Controller has standby nodes. Every partition has replicas. Clients automatically reconnect to new leaders. System continues even if minority of brokers fail.

3-Broker Cluster with Replication Factor = 2
Broker 1
P0 Leader
P1 Follower
Broker 2
P1 Leader
P2 Follower
Broker 3
P2 Leader
P0 Follower
Chapter 10

Leader-Follower Partition Replication

What is itKafka replicates each partition across multiple brokers for durability. For each partition, one broker is the leader and the others are followers. All reads and writes go through the leader — producers write to the leader, followers continuously fetch from the leader to keep their copies in sync. If the leader crashes, one of the in-sync followers is promoted to leader. This is classic single-leader replication, similar to PostgreSQL streaming replication but applied per-partition, so different partitions have different leaders spread across brokers.
Key mechanics
  • replication.factor: number of copies per partition (typically 3).
  • Leader: handles all client requests. There's exactly one leader per partition at a time.
  • Followers: passively replicate via FetchRequest RPCs to the leader.
  • High watermark (HW): the highest offset all ISR have replicated — consumers can only read up to HW.
  • Log end offset (LEO): the offset of the next record to be written on each replica.
How it differs
  • vs RabbitMQ mirrored queues: RabbitMQ had master-slave queue mirroring (deprecated); its modern alternative is quorum queues (Raft-based). Kafka's replication is more mature and proven at scale.
  • vs database replication: Very similar to PostgreSQL streaming replication, except each partition is replicated independently and leaders are distributed across brokers for load balancing.
  • vs Kinesis: Kinesis shards are replicated across AZs by AWS transparently — no user-visible leader concept.
  • vs Pulsar: Pulsar uses BookKeeper's quorum writes — every write goes to multiple bookies in parallel rather than leader→follower chained replication.
Why use replicationWithout replication, losing one broker means losing all partitions on its disk. With replication.factor=3 and min.insync.replicas=2, Kafka tolerates one broker failure with zero data loss and continues serving reads/writes. Cross-AZ replication tolerates an entire AZ outage.
Common gotchasUnder-replicated partitions: if followers fall behind, they drop out of ISR and durability guarantees weaken — monitor this closely. Unclean leader election (unclean.leader.election.enable=true) trades data loss for availability — avoid in production. Replication throttles are needed when adding brokers or reassigning partitions to avoid saturating the network.
Real-world examplesNetflix runs replication.factor=3 across AZs for all production topics. LinkedIn uses higher replication factors (4–5) for critical topics. Confluent recommends at minimum RF=3, MinISR=2 for production.

For each partition, one broker is the Leader and the rest are Followers. This is the core of Kafka's fault tolerance.

In simple words: Imagine a group project. One person (leader) does the original work and everyone (producers/consumers) talks to them. The other members (followers) silently make copies of the leader's work. If the leader gets sick, one of the followers who has a complete copy can immediately take over. No work is lost.

Leader Responsibilities
  • Handle ALL producer writes for that partition.
  • Handle ALL consumer reads for that partition.
  • Maintain partition log (segments, indexes).
  • Coordinate with followers — track their replication progress.
  • Manage ISR list — add/remove followers based on sync status.
Follower Responsibilities
  • Continuously fetch (pull) data from the leader.
  • Replicate data to stay in sync.
  • Ready to become leader if current leader fails.
  • Do NOT serve any client (producer/consumer) requests directly.
  • Acknowledge successful replication back to leader (for acks=all).
Who Decides Leader/Follower? The Controller (a special broker) decides which broker hosts which partition replica, which replica is the leader, and which are followers. When a leader fails, the Controller elects a new leader from the ISR list.
Chapter 11

ISR (In-Sync Replicas)

What is itThe In-Sync Replica set (ISR) is the set of replicas for a partition that are currently caught up to the leader — both the leader itself and all followers whose log end offset is within replica.lag.time.max.ms (default 30 seconds) of the leader's. Only ISR members are eligible for leader election on a leader failure. When a producer uses acks=all, the write is only considered committed once all ISR members have replicated it, so the ISR defines the durability boundary. If a follower falls too far behind, it's removed from the ISR; when it catches up, it rejoins.
Key settings
  • min.insync.replicas: minimum ISR size required for acks=all writes to succeed. Typical production value: 2.
  • replica.lag.time.max.ms: how long a follower can lag before being kicked out of ISR.
  • unclean.leader.election.enable: if true, allows non-ISR replicas to become leader (data loss risk).
How it differs
  • vs Raft quorum (etcd, Consul): Raft requires a majority (quorum) for writes. Kafka's ISR is "flexible quorum" — the leader decides which replicas count based on recent liveness, not a fixed majority.
  • vs RabbitMQ quorum queues: Quorum queues use strict Raft — a write needs majority acks. Kafka ISR can shrink or grow dynamically.
  • vs Pulsar/BookKeeper: BookKeeper uses an "ensemble" concept with write/ack quorums, more like strict Raft.
Why it mattersThe ISR is the key to Kafka's durability model. With acks=all and min.insync.replicas=2, every committed write is guaranteed to survive on at least 2 replicas before the producer gets an ack — giving you one broker failure tolerance without data loss. Lower min.insync.replicas means more availability but less durability.
Common gotchasISR shrinking during load spikes: a slow follower can get kicked out, reducing effective replication. Write unavailability: if ISR shrinks below min.insync.replicas, producers get NotEnoughReplicasException — availability degrades in exchange for durability. Always alert on UnderMinIsrPartitionCount.
Real-world examplesFinancial services running Kafka (Goldman Sachs, Capital One) prefer min.insync.replicas=2 for every topic. Log aggregation clusters sometimes accept min.insync.replicas=1 to keep writing during degraded conditions. Cloudflare famously published details about ISR tuning for their event pipeline.

ISR is the list of replicas (including the leader) that are fully caught up with the leader's log. This is cluster metadata maintained by the Controller.

In simple words: Imagine you're a teacher (leader) and you have 2 students (followers) copying your notes. ISR is the list of students who are up-to-date with your notes. If a student falls behind (didn't copy the last 5 pages), they're temporarily removed from the ISR until they catch up.

ISR Example — All In Sync
Partition 1 (Replication Factor = 3) Broker1 (Leader): [0][1][2][3][4][5][6][7][8][9] ← Latest offset: 9 Broker2 (Follower): [0][1][2][3][4][5][6][7][8][9] ← In-Sync ✓ Broker3 (Follower): [0][1][2][3][4][5][6][7][8][9] ← In-Sync ✓ ISR = {Broker1, Broker2, Broker3}
ISR Example — Broker 3 Lagging
Broker1 (Leader): [0][1][2][3][4][5][6][7][8][9] ← Latest offset: 9 Broker2 (Follower): [0][1][2][3][4][5][6][7][8][9] ← In-Sync ✓ Broker3 (Follower): [0][1][2][3][4][5][6] ← Out-of-Sync ✗ (lagging) ISR = {Broker1, Broker2} ← Broker3 temporarily REMOVED from ISR

How Does the Leader Decide a Follower is Out-of-Sync?

Why ISR Matters

When producer uses acks=all, the leader waits for ALL replicas in the ISR (not all replicas, just the in-sync ones) to acknowledge the write. This means:

min.insync.replicas This config sets the minimum number of replicas that must be in the ISR for a write (with acks=all) to succeed. If ISR count drops below this, the partition becomes UNAVAILABLE for writes. Typical production setting: min.insync.replicas=2 with replication factor 3.
Chapter 12

Controller

What is itThe Controller is a special role held by exactly one broker in a Kafka cluster at any given time. The controller is responsible for managing cluster-wide metadata and coordination: detecting broker failures via ZooKeeper/KRaft heartbeats, electing new leaders when a broker fails, managing the ISR (adding/removing replicas), handling partition reassignment, and propagating metadata changes to all brokers. Only one controller exists per cluster — if the controller itself fails, a new one is elected almost instantly (tens of milliseconds in KRaft mode).
Responsibilities
  • Leader election: when a broker fails, pick a new leader from the ISR of each affected partition.
  • ISR updates: track which replicas are in sync and update ZK/KRaft.
  • Topic/partition lifecycle: handle create/delete/alter topic requests.
  • Partition reassignment: move partitions between brokers during rebalancing.
  • Metadata propagation: push LeaderAndIsr and UpdateMetadata RPCs to brokers.
How it differs
  • vs database cluster master: Similar role to the coordinator in etcd, Consul, or a PostgreSQL failover manager — elected, manages failover.
  • vs RabbitMQ: RabbitMQ doesn't have a single controller; cluster state is maintained via Erlang distribution and mnesia replication.
  • vs Pulsar: Pulsar has "bundle ownership" managed by ZooKeeper/etcd rather than a dedicated controller node.
  • vs Kinesis: AWS hides this layer; customers never see a controller.
Why it mattersThe controller is the brain of the cluster. A slow controller means slow failovers — in older Kafka (pre-2.4) with tens of thousands of partitions, controller failovers could take minutes. KRaft (Kafka 3.0+) drastically improves this, bringing failover times down to seconds or less even on massive clusters.
Common gotchasController overload: in ZK mode, the controller does a lot of serial work — huge clusters sometimes have a broker permanently pinned as "controller-only". Split-brain on network partitions was possible in older versions; KRaft's Raft-based controller quorum eliminates this. Metric to watch: ActiveControllerCount — should always be exactly 1 across the cluster.
Real-world examplesLinkedIn and Confluent drove the KRaft rewrite specifically to scale controller performance for 100,000+ partition clusters. Uber published case studies on controller tuning for large clusters.

The Controller is a special broker responsible for managing all cluster metadata. It's the brain of the Kafka cluster.

In simple words: The controller is like the manager of the warehouse. Workers (brokers) store packages, but the manager decides which worker handles which shelf, who takes over when a worker is sick, and keeps track of everything. There's only one active manager at a time, but backup managers are ready to step in.

Controller Responsibilities

Topic Management

Creates/deletes topics. Decides how partitions are distributed across brokers when a new topic is created.

Partition Leader Election

Elects leader and follower replicas for every partition. Re-elects leaders when a broker fails.

Failure Detection

Monitors broker heartbeats. Detects when brokers go down and triggers recovery actions.

Metadata Distribution

Notifies all brokers about changes — new topics, leader changes, ISR updates. Brokers cache this metadata to serve client requests.

Single Point of Failure? Only 1 controller is active (leader) at any time. Others are standby (followers). If the active controller fails, a new one is elected using consensus. This is where KRaft comes in.

Controller can have dual roles

Chapter 13

KRaft (Kafka Raft Consensus)

What is itKRaft (short for "Kafka Raft") is Kafka's native consensus protocol introduced in Kafka 2.8 (preview) and made production-ready in 3.3, fully replacing ZooKeeper as the metadata store. Under KRaft, a small set of brokers (typically 3 or 5) run as controller nodes that form a Raft quorum, storing cluster metadata in an internal __cluster_metadata topic. The rest of the brokers are regular data nodes that subscribe to metadata updates from the quorum. This eliminates the operational burden of running a separate ZooKeeper ensemble and dramatically improves scalability — KRaft can handle 10× the partition count of ZK-based clusters.
Key benefits
  • Single system to operate: no separate ZooKeeper cluster to install, monitor, upgrade, secure.
  • Millions of partitions: scales to much larger clusters than ZK-based Kafka.
  • Fast controller failover: seconds instead of minutes.
  • Simpler security model: one set of credentials, one TLS config, one ACL system.
  • Faster startup and shutdown of controllers.
How it differs
  • vs ZooKeeper mode: ZK uses ZAB (ZooKeeper Atomic Broadcast) protocol, runs as a separate process, has its own client protocol and ACLs. KRaft is embedded, uses Raft, and removes an entire failure mode.
  • vs etcd/Consul: etcd and Consul are general-purpose Raft-based KV stores used by Kubernetes, Nomad, etc. KRaft is Raft specialized for Kafka's metadata model.
  • vs Pulsar: Pulsar still depends on ZooKeeper for metadata (although Pulsar is working on alternatives).
  • vs Redpanda: Redpanda has always used Raft natively (no ZK dependency) — it's one of Redpanda's core design differences from historical Kafka.
Migration pathKafka 3.5 added ZK-to-KRaft migration tooling. Kafka 4.0 (2025) removes ZooKeeper entirely — KRaft is the only supported mode. New deployments should always use KRaft. Tools like Strimzi, Confluent Operator, and MSK now default to KRaft.
Common gotchasController quorum sizing: use 3 or 5 controllers (never even numbers). Dedicated vs combined mode: production should use dedicated controllers; combined mode (controller + broker on same node) is for dev/test only. Metadata topic retention: __cluster_metadata is a compacted topic — don't mess with its settings.
Real-world examplesConfluent Cloud migrated to KRaft first. AWS MSK now supports KRaft clusters. Strimzi (Kubernetes operator) supports KRaft deployments. LinkedIn drove much of the KRaft design to scale past ZK-imposed limits.

KRaft is the built-in consensus mechanism that replaced ZooKeeper (deprecated in Kafka >= 3.x). It uses the Raft consensus algorithm to coordinate multiple controllers.

In simple words: Imagine 3 managers in a company. Only 1 can be the CEO at a time. When the current CEO leaves, the remaining managers vote to pick a new CEO. That voting process is "consensus." KRaft is Kafka's built-in voting system. Before KRaft, Kafka used an external voting system called ZooKeeper (like hiring an outside agency to run the election).

Why Do We Need Consensus?

KRaft vs ZooKeeper

ZooKeeper (Legacy)
  • External distributed system.
  • Deployed separately.
  • Monitored separately.
  • Maintained separately from Kafka.
  • Uses ZAB consensus protocol.
  • Additional operational overhead.
  • Deprecated in Kafka 3.x+.
KRaft (Modern)
  • Built-in to Kafka.
  • No external dependency.
  • Single system to deploy and monitor.
  • Uses Raft consensus protocol.
  • Less operational overhead.
  • Faster failover.
  • Default since Kafka 3.3+.

Quorum

Quorum means majority of nodes must agree before any decision is finalized.

Controller NodesQuorum (N/2 + 1)Can Tolerate Failures
321
532
743

KRaft Metadata Commit Flow

1
Request arrives

Producer or Admin CLI sends a request (e.g., create topic) to any broker, which forwards it to the active controller.

2
Active Controller processes

Creates the topic/partitions, decides leader/followers, creates a metadata record with the next offset (e.g., offset 100). Writes it to its local _cluster_metadata.log but marks it as NOT committed yet.

3
Initiates Quorum

Active controller sends the new metadata record to standby controllers via heartbeats.

4
Standby Controllers Write

Each standby appends the record to their local _cluster_metadata.log (not committed yet) and sends ACK back.

5
Majority ACK = Commit

Once a majority (quorum) of controllers have written successfully, the active controller marks the record as committed. Updates last committed offset to 100.

6
Propagate to Brokers

Active controller sends heartbeats with the last committed offset to all brokers. Brokers fetch and apply the new metadata.

7
Standby Controllers Commit

When standby controllers receive the next heartbeat with the committed offset, they also mark the record as committed in their local log.

Chapter 14

Consumer & Consumer Groups

What is itA Kafka Consumer is a client application that reads records from one or more topics by pulling them from brokers (unlike push-based systems). A Consumer Group is a set of consumer instances that share a common group.id and collectively consume a topic such that each partition is read by exactly one consumer in the group. This gives you horizontal scalability: more consumers = more parallelism, up to the partition count. Multiple groups on the same topic are independent — each group reads the full stream independently, enabling a fanout pattern (analytics group, billing group, ML training group all consuming the same events).
Key concepts
  • group.id: the group identifier. All consumers sharing it cooperate.
  • Partition assignment: Kafka assigns partitions to consumers via strategies like RangeAssignor, RoundRobinAssignor, StickyAssignor, or CooperativeStickyAssignor.
  • Offsets: each group tracks its own committed offset per (topic, partition) in the internal __consumer_offsets topic.
  • Rebalancing: when consumers join/leave, partitions are redistributed — potentially pausing the group briefly.
  • Heartbeats: consumers send heartbeats via heartbeat.interval.ms to stay in the group.
How it differs
  • vs RabbitMQ competing consumers: RabbitMQ supports multiple consumers on one queue with load distribution, but no partition-based ordering and no replay.
  • vs SQS consumers: SQS consumers compete for messages from a single queue. There's no concept of partition assignment or per-consumer offset.
  • vs Kinesis Client Library (KCL): KCL is the closest analog — assigns shards to workers, tracks checkpoints in DynamoDB.
  • vs Pulsar subscription modes: Pulsar offers Exclusive, Shared, Failover, and Key_Shared — more flexible than Kafka's single consumer-per-partition model.
Why consumer groupsConsumer groups give you load balancing + fault tolerance + parallelism in a single abstraction. If one consumer crashes, its partitions get reassigned to others. If you need more throughput, just add more consumers (up to partition count). And unlike queues, multiple applications can read the same topic independently.
Common gotchasConsumer > partitions: extra consumers sit idle — a common waste. Long processing: if a consumer takes longer than max.poll.interval.ms to process a batch, it gets kicked out of the group, triggering rebalancing. Rebalance storms: frequent consumer restarts can cause constant reassignment — use static membership (group.instance.id) to mitigate. Offset commit ordering: commit AFTER processing, never before.
Real-world examplesUber runs consumer groups with hundreds of instances for dispatch event processing. Netflix uses consumer groups for real-time ML feature extraction. Airbnb uses multiple consumer groups per topic for independent search indexing, pricing, and analytics pipelines.

A Consumer is an application that reads (pulls) events from Kafka. Kafka uses a PULL model — the consumer asks for data, Kafka never pushes.

In simple words: A consumer is like a newspaper reader. Kafka doesn't throw the newspaper at your door — YOU go and pick it up when you're ready. A Consumer Group is a team of readers sharing the workload. If there are 3 newspaper sections, 3 readers can each take 1 section and read in parallel. But two readers from the same team never read the same section.

Pull Model Consumer says: "Give me data from topic X, partition Y, starting from offset Z, max N bytes." This gives consumers full control over their read rate and position.

Consumer Groups

Every consumer belongs to a group.id. Within a group, work is divided. Across different groups, the same data is read independently.

Golden Rule Inside one consumer group: the same partition is NEVER read by multiple consumers. This guarantees that events in a partition are processed in order by exactly one consumer within the group.

Partition Assignment Scenarios

Consumers == Partitions (Ideal)

Topic: 3 partitions, Group: 3 consumers

C1 → P0
C2 → P1
C3 → P2

Perfect Balance

Consumers < Partitions

Topic: 6 partitions, Group: 3 consumers

C1 → P0, P3
C2 → P1, P4
C3 → P2, P5

Each consumer handles more load

Consumers > Partitions

Topic: 3 partitions, Group: 5 consumers

C1 → P0
C2 → P1
C3 → P2
C4 → IDLE
C5 → IDLE

Wasted consumers

Multiple Consumer Groups

The same partition CAN be read by multiple consumers that are in different groups. Each group independently tracks its own offset.

Same Partition, Multiple Groups
Topic: order-events, Partition 0 Consumer Group 1: notification-service └── Consumer 1 → reads from Partition 0 (offset: 105) Consumer Group 2: analytics-service └── Consumer 2 → reads from Partition 0 (offset: 60) Consumer Group 3: audit-service └── Consumer 3 → reads from Partition 0 (offset: 200) Each group has its own committed offset — completely independent.
Chapter 15

Offset Management

What is itAn offset is a monotonically increasing integer assigned to each record within a partition (0, 1, 2, ...). Offset management is how Kafka consumers keep track of "where they are" in each partition so they can resume after restart or rebalance. Committed offsets are stored in the internal compacted topic __consumer_offsets, keyed by (group.id, topic, partition). The committed offset represents the next record to read, not the last processed — so committing offset 42 means "I processed 0–41, start me at 42 next time".
Key concepts
  • Current offset: the position the consumer is currently reading from.
  • Committed offset: the persisted position that will be used on restart.
  • Log end offset (LEO): the offset of the next record to be written to the partition.
  • Consumer lag: LEO − committed offset = how many records behind the consumer is.
  • auto.offset.reset: what to do when there's no committed offset — earliest, latest, or none.
How it differs
  • vs RabbitMQ ack: RabbitMQ uses per-message acks — once acked, the message is deleted. Kafka uses cumulative offsets — one commit covers everything up to that offset.
  • vs SQS visibility timeout: SQS messages become invisible while being processed; if not deleted in time, they reappear. Kafka has no per-message visibility — it's all offset-based.
  • vs Kinesis checkpoints: KCL stores checkpoints in DynamoDB per shard — conceptually identical to Kafka committed offsets.
  • vs Pulsar cursors: Pulsar has server-side cursors per subscription — similar to Kafka offsets but stored in BookKeeper.
Why it mattersOffsets are the foundation of Kafka's delivery semantics. Commit before processing = at-most-once (can lose messages). Commit after processing = at-least-once (can duplicate on retry). Combine with idempotent writes or transactional processing = exactly-once.
Common gotchasCommitting too often floods __consumer_offsets. Committing too rarely means more duplicates on restart. Resetting offsets manually via kafka-consumer-groups.sh --reset-offsets can cause consumers to reprocess or skip data. Offset expiration: unused consumer groups have their offsets deleted after offsets.retention.minutes (default 7 days).
Real-world examplesStream processing apps (Kafka Streams, Flink) use offsets as part of their checkpoint state. Burrow (by LinkedIn) is a popular consumer lag monitoring tool. Prometheus exporters for Kafka scrape lag metrics per group per partition.

Each consumer group maintains its partition-wise committed offset independently. This tells Kafka "I've processed up to this point."

In simple words: An offset is like a bookmark in a book. It tells you "I've read up to page 105." If you put the book down and come back later, you start from page 106. Each reader (consumer group) has their own bookmark — so one reader might be on page 105 while another is on page 60.

Offset Tracking Structure

Consumer GroupTopicPartitionCommitted Offset
notificationorder-events0105
notificationorder-events198
notificationorder-events2210
analyticsorder-events060
analyticsorder-events145

Where Are Offsets Stored?

Offsets are stored in the __consumer_offsets internal topic (50 partitions by default). The partition is determined by:

partition = hash(group_id) % 50 // Example: hash("notification-service") % 50 = 23 // All offset updates for "notification-service" go to Partition 23 of __consumer_offsets // The broker that is LEADER of this partition becomes the Group Coordinator
Group Coordinator The broker that is leader of the relevant __consumer_offsets partition becomes the Group Coordinator for that consumer group. It handles: group membership, partition assignment, and offset commits. This is NOT cluster metadata — it's stored like normal topic data, NOT on controller nodes.
Chapter 16

Offset Commit Strategies

What is itOffset commit strategy is the pattern you use to persist consumer progress. There are three fundamental choices: auto-commit, sync commit, and async commit. Each has different tradeoffs between latency, throughput, and delivery guarantees. The choice interacts deeply with how you handle processing failures, idempotency, and rebalancing.
Strategies
  • Auto-commit: enable.auto.commit=true — consumer commits every auto.commit.interval.ms (default 5s) in the background. Simplest, but can commit before processing completes → data loss risk.
  • Manual sync commit: call commitSync() after processing a batch. Strong guarantees, blocks until broker confirms, lower throughput.
  • Manual async commit: call commitAsync() — non-blocking, higher throughput, but no retry on failure (unless you write the retry yourself).
  • Hybrid (best of both): commitAsync() during normal operation, commitSync() on shutdown and rebalance.
  • Transactional commit: commit offsets as part of a transaction alongside output writes (for exactly-once).
How it differs
  • vs RabbitMQ acks: RabbitMQ has per-message manual or auto-ack. No batched cumulative commits like Kafka.
  • vs SQS DeleteMessage: SQS requires deleting each processed message individually (or in batches of 10).
  • vs Kinesis checkpoints: KCL checkpoints are typically done manually after processing, similar to Kafka sync commits.
  • vs Pulsar acks: Pulsar supports both individual and cumulative acks natively.
Best practicesFor at-least-once: process first, then commit. Use async commits inside the loop and sync commit on shutdown/rebalance (via ConsumerRebalanceListener). For exactly-once: use Kafka transactions or idempotent downstream writes (database upserts, idempotent REST calls with request IDs).
Common gotchasAuto-commit in production: fine for read-only analytics but dangerous for anything that writes to a downstream system. Committing out of order: async commits that fail can silently rewind progress — always use sync on shutdown. Forgetting to commit on rebalance: without the rebalance listener, you lose the in-flight progress.
Real-world examplesKafka Streams commits offsets alongside state store changelogs atomically via transactions. Flink Kafka connector commits offsets only at successful checkpoints. Airflow, dbt Cloud, and Fivetran use after-processing sync commits for ELT pipelines.

How and when the consumer tells Kafka "I've successfully processed up to this offset" — this determines your delivery guarantee.

Strategy 1: Auto-Commit (Risky)
enable.auto.commit=true auto.commit.interval.ms=5000
T=0s: Consumer polls, gets events 0-99
T=1s: Consumer processing events...
T=5s: Auto-commit triggers → commits offset 99
T=6s: Consumer CRASHES (only processed 50 events)
T=7s: Consumer restarts from offset 100

Events 51-99 are LOST. At-most-once delivery.

Strategy 2: Manual Commit (Safe)
enable.auto.commit=false // Consumer commits explicitly after processing
Step 1: Poll events 0-99
Step 2: Process ALL events
Step 3: Commit offset 99 (wait for ACK)
Step 4: Kafka confirms commit
Step 5: Continue to next batch

If crash before commit: events reprocessed (duplicates). At-least-once delivery.

Chapter 17

Complete Producer Write Flow

What is itThe producer write flow is the end-to-end path a record takes from producer.send() to being durably committed on disk across multiple brokers. Understanding this flow is critical for tuning throughput, latency, and durability. The path involves serialization, partitioning, batching, compression, network send, broker-side append, replication, and ack. Each stage has its own configuration knobs and failure modes.
Stages
  • 1. Serialize: convert key and value objects to bytes via configured serializers (StringSerializer, Avro, Protobuf, JSON, etc.).
  • 2. Partition: the partitioner picks a target partition based on the key hash (or sticky partitioning for null keys).
  • 3. Batch: the record accumulator buffers records per partition, waiting for linger.ms or until batch.size is hit.
  • 4. Compress: the entire batch is compressed (gzip, snappy, lz4, zstd).
  • 5. Send: sender thread pulls batches, wraps them in a ProduceRequest, sends to the partition leader broker.
  • 6. Broker append: leader writes to its local log file, updates LEO.
  • 7. Replicate: followers fetch the new records and append to their logs.
  • 8. Ack: depending on acks, the leader responds after 0, 1, or ISR-wide ack.
  • 9. Client callback: the producer invokes the user's callback with success or error.
How it differs
  • vs RabbitMQ publish: RabbitMQ routes via exchanges and bindings — more logic in the broker, less in the client.
  • vs SQS SendMessage: SQS is an HTTP API call per message (or 10 messages at most) — no client-side batching buffer like Kafka.
  • vs Kinesis: similar client-side batching in the KPL but with stricter shard limits.
Latency vs throughputLow linger.ms = low latency but small batches = more RPCs = lower throughput. High linger.ms (20–100ms) = bigger batches = better compression + fewer RPCs = higher throughput but added latency. Production pipelines typically use linger.ms=10-50.
Common gotchasBuffer full: if the producer accumulator fills up, send() blocks for max.block.ms — monitor buffer-exhausted-rate. Silent data loss with acks=0 or 1: leader failure can drop records. Retries with in-flight >1: can reorder records unless idempotence is enabled.
Real-world examplesNetflix's Keystone pipeline tunes linger.ms=10 and batch.size=262144 for their ~1M msg/sec producers. Uber uses aggressive batching with compression.type=zstd for their dispatch events.

End-to-end flow from event creation to acknowledgement.

1
Producer creates event

{ topic: "order-events", key: "order-123", value: orderJson, acks: all }

2
Producer requests metadata (first time or periodically)

Connects to any broker (bootstrap server). Broker either has cached metadata or fetches from the active controller. Returns: which broker is leader for each partition of each topic.

3
Producer calculates partition

partition = hash("order-123") % 3 = 1

4
Looks up leader broker for Partition 1

From metadata: Partition 1 leader = Broker 2

5
Producer sends event directly to Broker 2

No intermediary. Direct TCP connection to the partition leader.

6
Broker 2 writes event to Partition 1 log

Appends to active segment file. Assigns next offset (e.g., offset 101). Writes to OS page cache (async flush to disk).

7
Followers replicate (since acks=all)

Follower brokers continuously poll the leader. They fetch the new event, write to their own partition logs, and ACK back to the leader.

8
All ISR acknowledged

Once all in-sync replicas confirm the write, the leader sends a success response back to the producer.

Chapter 18

Complete Consumer Read Flow

What is itThe consumer read flow is the end-to-end path records take from the broker log back to application code via consumer.poll(). Kafka consumers pull records — they don't receive push notifications. On each poll() call, the consumer sends a FetchRequest to the partition leaders it's been assigned, receives batches of records (possibly still compressed), deserializes them, and hands them back to the app. Behind the scenes, the consumer coordinates with the group coordinator for heartbeats, partition assignments, and offset commits.
Stages
  • 1. Join group: consumer contacts the group coordinator, triggers rebalance, gets partition assignment.
  • 2. Fetch position: for each assigned partition, determine starting offset (from committed offset or auto.offset.reset).
  • 3. Send FetchRequest: one request per leader broker, bundling all assigned partitions on that broker.
  • 4. Broker serves fetch: reads from log (often via zero-copy sendfile()), returns up to fetch.max.bytes of data.
  • 5. Deserialize & decompress: consumer unpacks the batch and deserializes each record.
  • 6. Return to app: poll() returns a ConsumerRecords batch.
  • 7. Process & commit: app processes records, then commits offsets (sync or async).
  • 8. Heartbeat: background thread sends heartbeats to the coordinator.
How it differs
  • vs RabbitMQ consumer: RabbitMQ pushes messages to consumers via the channel. Kafka's pull model gives consumers control over pace.
  • vs SQS ReceiveMessage: SQS is long-poll pull with visibility timeouts — no partitions, no group coordination.
  • vs Kinesis: KCL polls shards at a fixed interval (GetRecords), tracks checkpoints in DynamoDB.
Why pullPull-based consumption lets each consumer control its own consumption rate, handle backpressure naturally, and fetch batches efficiently. Slow consumers don't overwhelm themselves — they just fall behind on offsets. Push-based systems have to implement explicit flow control to avoid overwhelming consumers.
Common gotchaspoll() not called often enough: if processing one batch takes longer than max.poll.interval.ms, the consumer is kicked out → rebalance. Solution: reduce max.poll.records or use separate processing threads. Large fetch sizes: can cause OOM if memory isn't tuned. Fetch latency: tune fetch.min.bytes and fetch.max.wait.ms to balance throughput and latency.
Real-world examplesKafka Streams processes records in a tight poll-process-commit loop. Apache Flink uses Kafka consumer with checkpointed offsets. Debezium CDC uses Kafka Connect which internally uses consumers to read source topics.

End-to-end flow from consumer startup to continuous event processing.

1
Consumer starts, wants to join group

group.id = "notification-service"

2
Finds the Group Coordinator partition

hash("notification-service-group-id") % 50 = 23 → Partition 23 of __consumer_offsets

3
Consumer requests metadata

Asks any broker: "Who is the leader of __consumer_offsets Partition 23?" Answer: Broker 3.

4
Invokes Broker 3 (Group Coordinator)

Sends JoinGroup request. Broker 3 manages membership, waits for all group members, then assigns partitions. Response: "You handle Partition 2 of order-events."

5
Group Coordinator replicates assignment

The assignment is written to __consumer_offsets Partition 23 and replicated to all followers. Internally acts like acks=all (no config option for this).

6
Consumer fetches last committed offset

Asks Group Coordinator: "What's my last offset for order-events Partition 2?" Coordinator looks up key "notificationGroupId_order-events_Partition-2" and returns offset 100.

7
Consumer fetches events from partition leader

Checks metadata: leader of order-events Partition 2 = Broker 2. Sends FetchRequest: "Give me events from offset 101, max 200 bytes." Broker 2 returns events 101-501.

8
Consumer processes events

Application logic runs on each event sequentially.

9
Consumer commits offset (manual batch)

Sends OffsetCommit to Group Coordinator: "I've processed up to offset 501 for order-events Partition 2." Coordinator writes to __consumer_offsets.

Continuous Polling Loop

Consumer repeats from step 7. This is the consumer's poll loop — continuously fetching, processing, and committing.

Chapter 19

Log Compaction & Retention

What is itKafka offers two cleanup policies for topics: delete (the default) and compact. With delete, Kafka removes entire segments that exceed retention.ms or retention.bytes. With compact, Kafka keeps the latest value per key forever, garbage-collecting older duplicates — essentially turning the topic into a materialized view keyed by the record key. You can even combine them (compact,delete) to compact old records but eventually delete tombstones after a grace period.
Policies explained
  • Time-based retention: retention.ms=604800000 (7 days) — Kafka deletes segments older than 7 days.
  • Size-based retention: retention.bytes=1073741824 — keeps up to 1 GB per partition, deletes oldest first.
  • Log compaction: for each unique key, retains only the latest value. Old values of the same key are removed by a background "cleaner" thread.
  • Tombstones: a record with null value marks a key as deleted — after delete.retention.ms, the tombstone itself is removed.
How it differs
  • vs RabbitMQ TTL: RabbitMQ TTLs are per-message or per-queue and drop messages, not compact them by key.
  • vs SQS retention: SQS max retention is 14 days, no compaction.
  • vs Kinesis: Kinesis retention is 24 hours to 365 days (paid tiers), no compaction feature.
  • vs Pulsar: Pulsar supports topic compaction very similarly to Kafka.
Why compaction mattersCompaction is how Kafka supports changelog topics and event sourcing. Kafka Streams state stores are backed by compacted topics — the latest value per key is enough to rebuild the store. Kafka Connect uses compacted topics for source connector offsets. CDC pipelines use compacted topics so downstream consumers see the latest row state even if they just joined.
Common gotchasCompaction lag: the cleaner thread runs periodically and may leave duplicate old values for hours. Tuning: min.cleanable.dirty.ratio, min.compaction.lag.ms, and max.compaction.lag.ms control when compaction runs. Tombstones that never delete: if delete.retention.ms is too long, deleted keys hang around forever.
Real-world examples__consumer_offsets and __cluster_metadata are compacted topics. Debezium CDC topics are typically compacted so downstream systems always see the latest row state. Kafka Streams KTables are backed by compacted changelog topics.

Kafka can't keep data forever (disk fills up). Two strategies control how old data is cleaned up.

In simple words: Think of your phone gallery. Delete policy = delete all photos older than 30 days. Compact policy = for each person, keep only the most recent photo and delete older ones. Both free up space, but in different ways.

Policy 1: Delete (Default)
cleanup.policy=delete retention.ms=604800000 // 7 days retention.bytes=1073741824 // 1 GB

Deletes entire old segment files based on:

  • Time-based: Segment created 8 days ago → DELETED. Segment created 6 days ago → KEPT.
  • Size-based: If total partition log exceeds retention.bytes, oldest segments are deleted until under the limit.

Deletion is async — does not block writes. Runs periodically in background.

Policy 2: Compact
cleanup.policy=compact

Keeps only the latest value for each key. Old duplicates are removed.

BeforeAfter Compaction
off:100 user1 → v1removed
off:101 user2 → v1removed
off:102 user1 → v2KEPT
off:103 user3 → v1KEPT
off:104 user2 → v2KEPT

Used for: state stores, CDC, config topics. Compaction is async, per-segment, does not block writes.

Chapter 20

Why Kafka is Fast (Despite Using Disk)

What is itKafka is famous for being blazingly fast (millions of msg/sec per broker) despite persisting everything to disk. The "magic" isn't magic — it's a collection of carefully chosen low-level optimizations: sequential I/O, page cache reliance, zero-copy transfers, batching everywhere, and avoiding unnecessary object allocation. Kafka's designers deliberately chose to let the OS do most of the work (page cache, sendfile) instead of building an in-JVM buffer pool.
Key optimizations
  • Sequential disk writes: append-only log files are faster than random writes — even on spinning disks, sequential writes hit ~600 MB/s, competitive with memory random access.
  • Page cache: Kafka doesn't maintain its own buffer cache — reads go through the OS page cache, which is shared across processes and well-optimized.
  • Zero-copy (sendfile): consumer fetches use the Linux sendfile() syscall to transfer data directly from page cache to socket, bypassing user-space copies.
  • Batching: records are batched on both produce and fetch — amortizes per-request overhead.
  • Binary protocol: compact length-prefixed binary format, no JSON or XML parsing.
  • Compression: entire batches are compressed together (not per-record), yielding 3–10× reduction.
How it differs
  • vs RabbitMQ: RabbitMQ's Erlang-based broker does more per-message work (routing, acking), capping it at ~50k msg/sec per queue.
  • vs ActiveMQ: ActiveMQ uses traditional JMS semantics with persistent stores that don't exploit sequential I/O as aggressively.
  • vs Redis Streams: Redis is in-memory so faster for tiny workloads, but caps out at instance memory and has no native sequential disk persistence model.
  • vs Pulsar: Pulsar achieves similar throughput via BookKeeper journal + ledger architecture, also using sequential I/O.
Why page cacheThe page cache is the OS's file system cache. By relying on it, Kafka gets: automatic eviction, no duplicate data in JVM heap + OS buffers, no GC pressure from caching, and zero-copy on reads. Modern Linux boxes with 64+ GB RAM can cache multiple TB of recent Kafka data essentially for free.
Common gotchasTLS/SASL disables zero-copy: encrypted connections break sendfile because the kernel can't encrypt page cache directly — throughput drops 2–3×. JVM heap too large: leaves less room for page cache — keep heap at ~6 GB. Random reads (consumers far behind): cause disk seeks and kill performance.
Real-world examplesLinkedIn benchmarks show 2M+ msg/sec per broker on commodity hardware. Cloudflare published benchmarks running Kafka on NVMe with custom kernel tuning hitting multi-GB/s throughput. Confluent regularly demos Kafka running on Kubernetes with SSD-backed persistent volumes at similar rates.

Kafka reads and writes to DISK. Disk I/O is slow, right? Then how is Kafka so fast? Two fundamental OS-level optimizations.

Page Cache (Write Optimization)
Producer Event │ ▼ Kafka Broker "writes to disk" │ ▼ OS Page Cache (RAM) ← instant return │ ▼ (async) Physical Disk ← no blocking wait
  • Kafka trusts the OS to cache log files efficiently.
  • Write goes to RAM (page cache), returns immediately.
  • OS flushes to disk asynchronously in the background.
  • Because Kafka is append-only, writes are always sequential.
  • No insert in middle, no random access, no updates — disk head keeps moving forward.
  • Sequential disk writes can be as fast as RAM in many cases.
Zero-Copy (Read Optimization)
Normal Read: Disk → Kernel → User Space → Kernel → Network (4 copies, 4 context switches) Kafka Zero-Copy (sendfile syscall): Disk → Kernel Page Cache → Network (0 copies to user space, 2 context switches)
  • Kafka uses sendfile() system call.
  • Data goes directly from disk/page cache to the network socket.
  • No copy to Kafka JVM heap. No serialization/deserialization.
  • Saves CPU cycles, memory, and time.
  • This is why Kafka can serve consumers at network speed, not application speed.

Additional Performance Factors

Batching

Producers batch multiple events into a single network request. Configurable via batch.size and linger.ms. Reduces network overhead dramatically.

Compression

Entire batches can be compressed (gzip, snappy, lz4, zstd). Reduces network bandwidth and disk usage. Decompressed only at the consumer.

Partitioned Parallelism

Multiple partitions allow multiple consumers to read in parallel. More partitions = more parallelism = more throughput. Each partition is an independent log.

Chapter 21

Edge Cases & Failure Scenarios

What is itKafka is a distributed system — which means it inherits distributed systems problems: network partitions, broker crashes, disk failures, slow replicas, split brain, client timeouts, and out-of-order delivery. Understanding failure modes is essential because the difference between "Kafka guarantees" and "what actually happens" comes down to how you configure durability (acks, min.insync.replicas), retries, and idempotence. These scenarios drive the design of production-grade producers and consumers.
Key failure modes
  • Leader failure: leader crashes before replicating to followers — with acks=1, data is lost. With acks=all + min.isr=2, it's preserved.
  • Network partition: producer loses connection after sending but before ack — retry may duplicate records (unless idempotence is on).
  • Broker disk full: writes fail; existing segments are protected by retention.
  • Consumer rebalance during processing: offsets committed for records still in-flight can lead to duplicates.
  • Zombie producer: a paused-then-resumed producer can try to send stale records — transactions prevent this via epoch fencing.
  • Slow ISR: followers lag and get kicked out, shrinking durability.
  • Unclean leader election: if ISR is empty and you allow it, a stale replica becomes leader → data loss.
How it differs
  • vs RabbitMQ: RabbitMQ has similar issues (node failure, split brain) but uses mirrored/quorum queues — different recovery semantics.
  • vs SQS: SQS hides most failure modes; you only see "message might arrive more than once" and visibility timeout expirations.
  • vs database transactions: databases use strict 2PC/Paxos for consistency — Kafka uses eventual ISR convergence and idempotence for similar but weaker guarantees.
Mitigation patternsAlways use acks=all + min.insync.replicas=2 + replication.factor=3 for durable topics. Enable enable.idempotence=true on producers. Use transactions for cross-partition atomicity. Monitor under-replicated partitions. Use rebalance listeners to commit before losing partitions. Use dead-letter queues for poison pills.
Common gotchasacks=1 in production: silent data loss on leader failover. Forgetting idempotence: retries cause duplicates. Not monitoring URP: under-replicated partitions silently erode durability. Consumer poison pills: one bad record can halt the whole consumer group indefinitely.
Real-world examplesConfluent's post-mortems document real production incidents around ISR shrinkage and controller failover. Jepsen (Kyle Kingsbury) has analyzed Kafka's consistency claims under network partitions. Cloudflare's blog describes how they handle Kafka failures in their logging pipeline.

What actually happens when things go wrong in a Kafka cluster. These are common interview questions.

1. Active Controller Fails

Initial State: Controller1 = Leader, Controller2 = Follower, Controller3 = Follower
Controller1 crashes
Controller2 & 3: No heartbeat from leader detected.
Election starts: Controller2 requests votes from Controller3.
Controller3 votes for Controller2.
Controller2 wins with 2/3 majority (quorum met). Becomes new active controller.
Impact: New leader elected. Cluster continues. Brief metadata unavailability during election (typically milliseconds to seconds).

2. Leader Partition Broker Fails

Setup: Partition 0 (RF=3) — Leader: Broker1, Followers: Broker2, Broker3
Broker 1 (leader) crashes
Controller detects no heartbeat from Broker 1.
Controller triggers leader election. Candidates: Broker 2, Broker 3 (both in ISR).
New Leader: Broker 2 (selected from ISR). ISR = {Broker 2, Broker 3}.
Clients receive metadata update. Producers and consumers reconnect to Broker 2.
Impact: NO data loss (new leader has all committed data). Automatic recovery.

3. Follower Partition Broker Fails

Setup: Partition 0 (RF=3) — Leader: Broker1, Followers: Broker2, Broker3
Broker 3 crashes
Leader (Broker 1) detects: no timely fetch calls from Broker 3.
Controller removes Broker 3 from ISR. ISR: {B1, B2, B3} → {B1, B2}
System continues normally. Producers still write (acks=all waits for 2 replicas). Consumers still read. NO downtime.
Later: Broker 3 recovers, catches up with leader, added back to ISR. {B1, B2} → {B1, B2, B3}
Impact: NO data loss. NO downtime.

4. What if a "Topic Fails"?

Interview Answer A topic never fails. A topic is a logical category/folder. It's the partitions that fail (and they have replicas). The topic remains as long as at least one replica of each partition is alive.

5. Consumer Fails Before Committing Offset

Consumer polls events (offsets 100-199)
Consumer processes events 100-150
Consumer CRASHES (before committing offset)
Consumer restarts. Asks Kafka: "What's my last committed offset?" Kafka: "99."
Consumer reprocesses events 100-150 (DUPLICATES)
Result: At-least-once delivery. No data loss but duplicates possible. Make processing idempotent to handle this.

6. Consumer Processing Takes Too Long

Consumer polls events
Consumer starts processing (takes 6 minutes)
max.poll.interval.ms (5 min) exceeded. Group Coordinator thinks consumer is DEAD (no heartbeat during busy processing).
Rebalance triggered. Partitions reassigned to other consumers.
Original consumer finishes processing, tries to commit offset.
Group Coordinator REJECTS commit — consumer is no longer in the group.
Result: Wasted processing.

Solutions

7. Entire Broker Fails

Setup: 3 brokers, Topic: order-events (6 partitions, RF=3)
Broker 1 crashes
Controller detects failure (no heartbeat).
For each partition where Broker 1 was LEADER: new leader elected from ISR, metadata updated.
For each partition where Broker 1 was FOLLOWER: removed from ISR, partition continues with remaining replicas.
Clients receive metadata update. Reconnect to new leaders.
Impact: NO data loss. Automatic recovery.

8. What if ISR List Becomes Empty?

min.insync.replicas Kafka maintains min.insync.replicas as a safety net. If the ISR count drops below this threshold, the partition becomes UNAVAILABLE for writes (producers get NotEnoughReplicasException). This prevents data loss by refusing to accept writes that can't be sufficiently replicated. Reads may still work from the remaining leader.
Chapter 22

EDA Challenges & When to Use Kafka

What is itEvent-driven architectures are powerful but introduce a new class of problems that synchronous REST systems don't have: eventual consistency, debugging difficulty, schema evolution, ordering guarantees, idempotency, and operational complexity. This section is about honestly weighing those tradeoffs to decide whether Kafka (or any EDA) is the right tool. Kafka is not a drop-in replacement for REST, a database, a cache, or a simple task queue — it's a long-lived commit to a new operational and architectural model.
Challenges
  • Eventual consistency: reads after writes may return stale data across services.
  • Debugging: tracing a user action across 10 async services requires correlation IDs and distributed tracing.
  • Schema evolution: breaking changes can silently corrupt consumers; tooling (Schema Registry) is mandatory.
  • Ordering: guaranteed only within a partition — cross-partition ordering requires careful key design.
  • Idempotency: consumers must handle duplicate delivery for at-least-once semantics.
  • Operational cost: running Kafka well needs dedicated SRE/infra expertise.
When to use KafkaUse Kafka when you have: (1) high event throughput (>10k msg/sec sustained), (2) multiple consumers of the same data (fanout), (3) need for replay/rewind, (4) stream processing requirements (joins, aggregations), (5) CDC or log aggregation use cases, or (6) you're building a long-lived event backbone for microservices.
When NOT to use Kafka
  • Simple task queue: use Celery, Sidekiq, BullMQ, or SQS.
  • Request/response RPC: use gRPC or REST — Kafka is async.
  • Low-throughput apps: the ops overhead of running Kafka outweighs the benefits.
  • Real-time chat/pub-sub with short retention: NATS, Redis Pub/Sub, or WebSockets are lighter.
  • Large file transfer: Kafka's max message size is typically 1 MB — use object storage for big payloads.
AlternativesRabbitMQ for complex routing and low-latency task queues. NATS/NATS JetStream for lightweight pub/sub and edge scenarios. AWS SQS/SNS for fully managed simple queuing. AWS Kinesis for Kafka-like streaming in AWS-first shops. Apache Pulsar for multi-tenant and geo-replication needs. Redpanda for Kafka-wire-compatible with better latency on smaller clusters.
Real-world examplesShopify uses Kafka for cross-domain eventing between microservices. Robinhood uses Kafka for trade events and market data. Slack uses Kafka for search indexing and metrics. But many teams have also successfully replaced Kafka with RabbitMQ or SQS when they didn't actually need replay or stream processing.

Challenges in Event-Driven Architecture

Eventually Consistent

Data is not instantly synchronized across services. After an event is published, there's a delay before all consumers process it. Not suitable when strict real-time consistency is required.

Duplicate Events

At-least-once delivery means events can be processed more than once. Consumers must be idempotent — processing the same event twice should produce the same result.

Ordering Problem

Across partitions, ordering is not guaranteed. The 2nd event can be processed before the 1st if they're in different partitions. Use key-based partitioning for related events.

Schema Evolution

If the producer changes the event schema (adds/removes fields), all consumers may break. Use a Schema Registry (Avro, Protobuf) with backward/forward compatibility rules.

Debugging Complexity

No single request-response trace. Events flow through multiple services asynchronously. Requires distributed tracing (correlation IDs, OpenTelemetry) for end-to-end visibility.

Poison Messages

One bad message can block an entire partition's consumer. If processing throws an unhandled exception, the consumer retries infinitely. Solution: Dead Letter Queue (DLQ) for failed events.

Operational Overhead

Must monitor: consumer lag, throughput, partition balance, broker health, disk usage. More infrastructure to manage compared to simple REST calls.

Exactly-Once is Hard

Achieving exactly-once processing requires idempotent producers + transactions + consumer-side deduplication. Kafka supports it but adds complexity and latency overhead.

When to Use Kafka

Long-running workflows

Processes that take minutes/hours. Decouple the trigger from the execution.

Eventual consistency is acceptable

When millisecond consistency isn't required between services.

Real-time analytics & monitoring

Stream processing, dashboards, alerting, metrics aggregation.

Audit logs & event sourcing

Immutable log of everything that happened. Replay for debugging or rebuilding state.

Decoupled microservices

Services communicate without knowing about each other. Independent deployment and scaling.

High-throughput data pipelines

ETL, CDC (Change Data Capture), log aggregation, clickstream processing.

Chapter 23

Idempotent Producer

What is itAn idempotent producer is a Kafka producer with enable.idempotence=true, which guarantees that even if the producer retries a failed send, the record will only appear in the log exactly once. This solves the classic "ack lost, retry duplicated" problem. Under the hood, the broker assigns each producer a Producer ID (PID) and the producer tags each record with a monotonic sequence number per partition. The broker tracks the highest sequence number seen for each (PID, partition) and rejects records with old or out-of-order sequences, preventing duplicates and reordering.
Key mechanics
  • Producer ID (PID): assigned by the broker on first connection. Uniquely identifies a producer session.
  • Sequence numbers: monotonic per (PID, partition). Broker checks expected = last_seq + 1.
  • Broker deduplication: duplicates (sequence already seen) are silently discarded; out-of-order reject with OutOfOrderSequenceException.
  • Required settings: acks=all, retries>0, max.in.flight.requests.per.connection<=5.
How it differs
  • vs RabbitMQ publisher confirms: RabbitMQ confirms don't deduplicate — a retry after a lost ack still creates a duplicate. Application-side idempotency is your responsibility.
  • vs SQS deduplication ID: FIFO SQS queues support deduplication via MessageDeduplicationId, but with a 5-minute window.
  • vs Kinesis: Kinesis doesn't deduplicate — clients must handle it.
  • vs Pulsar: Pulsar supports broker-side deduplication via brokerDeduplicationEnabled with application-supplied sequence IDs.
Why enable itWithout idempotence, any network blip that causes a retry can duplicate records in the topic — consumers have to handle this downstream, which is often error-prone. With idempotence, you get exactly-once semantics per (producer session, partition) "for free" with minimal performance cost. Since Kafka 3.0, idempotence is enabled by default.
Common gotchasIdempotence is per-session: if the producer restarts, it gets a new PID — a record retried across restarts can still duplicate. Solution: use transactions with transactional.id. Not end-to-end: idempotence protects producer → broker only. Consumer side still needs offset management + idempotent processing.
Real-world examplesVirtually every modern production Kafka deployment enables idempotence. Confluent documentation recommends it as the baseline. Kafka Streams and Kafka Connect use idempotent producers internally.

What happens when a producer sends an event but the network fails before it gets the ACK? The producer retries — and now Kafka has the same event twice. Idempotent producer solves this.

The Problem — In Simple Words

Imagine you're ordering food online. You click "Place Order" but the page freezes. You're not sure if the order went through, so you click again. Now the restaurant has 2 orders for the same meal. That's exactly what happens with Kafka producers without idempotency.

Duplicate Problem Without Idempotency
Producer Broker (Leader) │ │ │── Send Event "order-123" ──▶ │ ← Broker writes it (offset 50) │ │ │ ✗ ACK lost (network) ✗ │ │ │ │── RETRY same event ────────▶ │ ← Broker writes AGAIN (offset 51) │ │ │◀──── ACK (offset 51) ────────│ │ │ Result: Same event stored TWICE at offset 50 and 51!

The Solution — Idempotent Producer

Turn on one config and Kafka handles it for you. Each producer gets a unique ID, and every event gets a sequence number. If the broker sees the same sequence number twice, it ignores the duplicate.

// Just one config change: enable.idempotence=true // What happens behind the scenes: // 1. Kafka assigns this producer a unique Producer ID (PID) // 2. Each event gets a Sequence Number (per partition) // 3. Broker tracks: PID + Partition + Sequence Number // 4. If it sees the same combo again → silently ignores (no duplicate)
With Idempotent Producer
Producer (PID=5) Broker (Leader) │ │ │── Event (PID=5, Seq=0) ────▶ │ ← Writes it (offset 50) ✓ │ │ │ ✗ ACK lost ✗ │ │ │ │── RETRY (PID=5, Seq=0) ───▶ │ ← Sees PID=5 Seq=0 already exists │ │ SKIPS the write, returns ACK │◀──── ACK (offset 50) ────────│ │ │ Result: Event stored ONCE. No duplicates!
Simple Rule Idempotent producer = no duplicates within a single partition. It does NOT prevent duplicates across partitions or across different producer instances. For that, you need Transactions (next chapter).

What Configs Change Automatically?

ConfigValue When Idempotence = trueWhy
acksallMust confirm write on all ISR replicas
retriesInteger.MAX_VALUEKeep retrying until success
max.in.flight.requests.per.connection5 (max)Allows up to 5 unacknowledged requests but maintains order using sequence numbers
Chapter 24

Transactions & Exactly-Once Semantics (EOS)

What is itKafka transactions allow a producer to atomically write to multiple partitions (possibly across multiple topics) and commit consumer offsets in the same atomic unit. This is what enables exactly-once semantics (EOS) in stream-processing applications: a record is read from an input topic, processed, written to an output topic, and its source offset committed — all or nothing. Transactions were added in Kafka 0.11 (2017) and are the foundation of Kafka Streams' exactly-once guarantees.
Key mechanics
  • transactional.id: a stable, unique ID for the producer. Allows fencing zombies across restarts.
  • Producer epoch: each new incarnation of a transactional.id gets a higher epoch; old epochs are fenced.
  • beginTransaction / sendOffsetsToTransaction / commitTransaction: the API. Offsets and writes commit atomically.
  • Isolation level: consumers set isolation.level=read_committed to see only committed transactional writes, skipping aborted ones.
  • Transaction coordinator: a broker-side component that tracks transaction state in __transaction_state topic.
How it differs
  • vs database transactions: Kafka transactions don't give you SERIALIZABLE isolation — they just ensure atomicity and "all committed writes visible together". No row locks, no rollback of arbitrary side effects.
  • vs 2PC/XA: Kafka transactions are not distributed transactions across Kafka and other systems. For cross-system atomicity, use patterns like Outbox.
  • vs RabbitMQ transactions: RabbitMQ has publisher transactions but they're slow and rarely used. Publisher confirms are preferred.
  • vs Pulsar transactions: Pulsar has similar transaction support since Pulsar 2.7.
EOS patternsThe canonical EOS pattern is consume-transform-produce: read from topic A, process, write to topic B, commit offset, all in one transaction. Kafka Streams supports this transparently with processing.guarantee=exactly_once_v2. For external side effects (DB writes, HTTP calls), use the Outbox pattern: write to DB and to an outbox table in one DB transaction, then a CDC process streams the outbox to Kafka.
Common gotchasPerformance cost: transactions add ~10–20% latency overhead. Consumer must use read_committed: otherwise it reads aborted records. Transactional.id stability: must be the same across restarts for fencing to work. Not for side effects: writing to databases or calling HTTP APIs inside a transaction does NOT roll back on abort.
Real-world examplesKafka Streams uses transactions for exactly-once state store updates. ksqlDB inherits this from Streams. Financial systems at Goldman Sachs and Capital One use Kafka transactions for trade/transaction processing where duplicates would be catastrophic.

Idempotent producer prevents duplicates within one partition. But what if you need to write to multiple partitions or topics as ONE atomic operation? That's where Transactions come in.

In Simple Words

Think of a bank transfer: debit from Account A and credit to Account B. Both must happen together — either BOTH succeed or BOTH fail. You can't debit A and then crash before crediting B. Kafka Transactions work the same way — all writes in a transaction either ALL go through or NONE of them do.

When Do You Need Transactions?

1
Read-Process-Write Pattern

Read from Topic A, process the data, write to Topic B, and commit the offset for Topic A — all as ONE atomic operation. If any step fails, everything rolls back.

2
Multi-Partition Writes

Writing events to 3 different partitions at once. Either all 3 partitions get the event, or none of them do. No partial writes.

3
Multi-Topic Writes

Writing an order event to "orders" topic AND a payment event to "payments" topic. Both must succeed together.

How It Works (Step by Step)

// Producer config enable.idempotence=true transactional.id="order-processor-1" // unique ID for this transactional producer // Code flow: producer.initTransactions(); // one-time setup producer.beginTransaction(); // START transaction producer.send(record1); // write to topic-A partition-0 producer.send(record2); // write to topic-B partition-1 producer.sendOffsetsToTransaction(...) // commit consumer offsets too producer.commitTransaction(); // COMMIT — all or nothing // If anything fails: producer.abortTransaction(); // ROLLBACK everything
Transaction Flow
Producer Transaction Coordinator Brokers │ │ │ │── initTransactions() ────▶ │ │ │◀── PID assigned ──────────│ │ │ │ │ │── beginTransaction() ────▶ │ │ │ │ │ │── send(record1) ────────────────────────────────────▶ │ P0 │── send(record2) ────────────────────────────────────▶ │ P1 │ │ │ │── commitTransaction() ──▶ │── write COMMIT marker ─▶ │ │ │ │ │◀── success ────────────────│ │ Consumer (isolation.level=read_committed): Only sees record1 and record2 AFTER the COMMIT marker is written. If aborted, consumer never sees them.

Consumer Side — Isolation Levels

read_uncommitted (Default)
  • Consumer sees ALL events, even from incomplete transactions.
  • Faster but may read data that later gets rolled back.
  • Use when you don't care about transactional guarantees.
read_committed
  • Consumer only sees events from committed transactions.
  • Slightly slower (waits for commit markers).
  • Use when you need exactly-once guarantees.
Exactly-Once = Idempotent Producer + Transactions + read_committed Consumer All three pieces together give you exactly-once semantics. The producer won't create duplicates (idempotent), writes are atomic (transactions), and the consumer only reads committed data (read_committed). This is the strongest guarantee Kafka offers.
Chapter 25

Consumer Rebalancing

What is itRebalancing is the process of redistributing partition assignments among consumers in a group whenever group membership or topic metadata changes — e.g., a consumer joins, leaves, crashes, or new partitions are added. During a classic "stop-the-world" rebalance, all consumers stop processing, rejoin the group, receive a new assignment, and resume. Rebalancing is coordinated by the group coordinator — a broker that tracks group membership and assignments.
Assignment strategies
  • RangeAssignor (default pre-2.4): assigns contiguous ranges of partitions per topic — can be uneven.
  • RoundRobinAssignor: distributes partitions round-robin across consumers.
  • StickyAssignor: tries to keep existing assignments stable during rebalance, minimizing movement.
  • CooperativeStickyAssignor (2.4+): incremental rebalancing — only moves the partitions that need to change, others keep processing.
How it differs
  • vs RabbitMQ: RabbitMQ competing consumers don't rebalance — messages are pushed to whichever consumer has capacity.
  • vs Kinesis KCL: KCL uses DynamoDB leases per shard — similar concept but different implementation.
  • vs Pulsar: Pulsar's shared subscription mode distributes messages dynamically without rebalancing.
Why it mattersRebalancing is when consumer groups are most fragile. A long rebalance means consumer lag grows, downstream latency spikes, and potential duplicate processing if offsets weren't committed beforehand. Cooperative rebalancing and static membership (group.instance.id) are the two biggest tools to minimize rebalance impact.
Common gotchasRebalance storms: consumers that repeatedly leave and rejoin (e.g., due to long processing or GC pauses) trigger constant rebalances. Fix with max.poll.interval.ms, smaller batches, or separating processing from polling. Lost progress: if you don't commit in onPartitionsRevoked, you may reprocess records. Static group membership (Kafka 2.3+) helps avoid unnecessary rebalances during rolling restarts.
Real-world examplesConfluent extensively documented "incremental cooperative rebalancing" as a major improvement in 2.4+. Uber tuned static membership and rebalance timeouts for their dispatch consumer groups to minimize downtime. Netflix uses custom partition assignors for specialized workloads.

When consumers join or leave a group, Kafka redistributes partitions among the remaining consumers. This process is called rebalancing.

In Simple Words

Imagine 3 workers splitting 6 tasks equally (2 each). If one worker goes home sick, the remaining 2 workers must split all 6 tasks (3 each). If a new worker joins, tasks are redistributed again. That's rebalancing.

What Triggers a Rebalance?

+
New Consumer Joins

A new consumer starts with the same group.id. Partitions are redistributed to include the new member.

-
Consumer Leaves or Crashes

A consumer calls close() or stops sending heartbeats. Its partitions must be given to someone else.

T
Topic Changes

New partitions are added to a subscribed topic, or the consumer subscribes to a new topic.

Rebalancing Process (Step by Step)

1
Trigger detected

Group Coordinator detects a change (new member, missed heartbeat, etc.).

2
All consumers revoke current partitions

Every consumer in the group stops reading. They commit their current offsets and release their partitions.

3
JoinGroup phase

All consumers send a JoinGroup request to the Group Coordinator. The first consumer to join becomes the Group Leader (not the same as partition leader).

4
Group Leader assigns partitions

The Group Leader runs the partition assignment strategy (Range, RoundRobin, or Sticky) and sends the new assignment back to the Coordinator.

5
SyncGroup phase

Coordinator distributes the new partition assignment to all consumers. Each consumer starts reading from their newly assigned partitions.

Assignment Strategies

StrategyHow It WorksBest For
RangeDivides partitions of each topic into contiguous ranges per consumer. Consumer 1 gets P0-P1, Consumer 2 gets P2-P3.When you need co-partitioning (same consumer reads same partition number from different topics).
RoundRobinDistributes all partitions one-by-one across consumers. P0 → C1, P1 → C2, P2 → C3, P3 → C1...When you want even distribution across consumers.
StickyLike RoundRobin but tries to keep previous assignments. Only moves partitions that must move.Reducing rebalance disruption. Recommended for most cases.
CooperativeStickyIncremental rebalancing — only revokes partitions that need to move, others keep reading.Minimizing downtime during rebalancing. Best choice.
Rebalancing is Expensive! During a rebalance, all consumers in the group stop reading (with Eager protocol). This causes a processing pause. For a group with 100 consumers and 500 partitions, a rebalance can take seconds to minutes. Use CooperativeSticky assignor and tune session.timeout.ms and heartbeat.interval.ms to minimize unnecessary rebalances.
Chapter 26

Dead Letter Queue (DLQ)

What is itA Dead Letter Queue (DLQ) is a special topic where unprocessable records are sent after repeated processing failures. This prevents "poison pill" records — malformed JSON, schema mismatches, missing dependencies, or logic errors — from blocking the entire consumer group indefinitely. The pattern: consumer tries to process a record, catches the error, writes the record (plus metadata like the exception, retry count, original topic/partition/offset) to a DLQ topic, then commits the offset and moves on. DLQs can later be replayed manually or programmatically after bugs are fixed.
Key components
  • Retry topic(s): optional intermediate topics with delay for retryable errors.
  • DLQ topic: where terminally failed records end up.
  • Headers: enrich the DLQ record with exception, stack trace, attempt count, source metadata.
  • DLQ consumer / inspector: tools to browse, replay, or discard DLQ messages.
How it differs
  • vs RabbitMQ DLX: RabbitMQ has native Dead Letter Exchanges — messages are automatically routed to a DLX on nack/reject/TTL. Kafka doesn't have this — you implement it in consumer code.
  • vs SQS DLQ: SQS DLQs are configured at the queue level with maxReceiveCount — messages automatically move after N failed delivery attempts. Cleaner than Kafka's application-level approach.
  • vs Pulsar DLQ: Pulsar has a DeadLetterPolicy on subscriptions, similar to SQS.
Why use DLQWithout a DLQ, a single bad message can halt your entire pipeline — or worse, cause infinite retry loops. DLQs give you a graceful degradation pattern: the pipeline keeps moving, bad records are isolated, and you can investigate later without emergency interventions.
Common gotchasSilently losing data: if nobody monitors the DLQ, bugs pile up unnoticed. Always alert on DLQ non-zero size. Infinite DLQ loops: if the DLQ writer itself fails, you need fallback (log + skip). Retry vs DLQ boundary: distinguish transient (retry) from permanent (DLQ) errors — schema mismatch → DLQ, network timeout → retry.
Real-world examplesKafka Connect has built-in DLQ support via errors.tolerance=all and errors.deadletterqueue.topic.name. Spring Cloud Stream provides DLQ abstractions. Uber's uReplicator and many internal pipelines use the retry → DLQ pattern.

What happens when a consumer can't process an event? A bad event (poison message) can block the entire partition. DLQ is the solution.

In Simple Words

Imagine a post office. Most letters are delivered fine. But some letters have wrong addresses. Instead of throwing them away or blocking all other mail, the post office puts them in a separate "undeliverable" pile. Someone later checks that pile and fixes the addresses. A Dead Letter Queue is that "undeliverable" pile for Kafka events.

The Problem

Without DLQ — One Bad Event Blocks Everything
Partition 0: [E1] [E2] [E3-BAD] [E4] [E5] [E6] ▲ Consumer reads E3 → Processing FAILS → Retry → FAILS → Retry... E4, E5, E6 are STUCK waiting. Consumer is blocked forever.

The Solution — DLQ Pattern

With DLQ — Bad Events Move Aside
Main Topic: [E1] [E2] [E3-BAD] [E4] [E5] [E6] │ Consumer reads E3 → FAILS │ Retry 1 → FAILS │ Retry 2 → FAILS │ Max retries reached → Move E3 ──┘ ▼ DLQ Topic: [E3-BAD] ← stored with error details Consumer moves on → E4 → E5 → E6 ← no longer blocked!

How to Implement DLQ

// Pseudo-code for DLQ pattern: for each event in poll(): retries = 0 while retries < MAX_RETRIES: try: process(event) break // success, move to next event catch Exception: retries += 1 if retries == MAX_RETRIES: // Send failed event to DLQ topic with error info producer.send( topic: "order-events.DLQ", key: event.key, value: event.value, headers: { "error": exception.message, "original-topic": "order-events", "original-partition": event.partition, "original-offset": event.offset, "retry-count": MAX_RETRIES } ) commit_offset()
DLQ Best Practices 1. Name your DLQ topic clearly: {original-topic}.DLQ.   2. Include the original topic, partition, offset, and error reason in the DLQ event headers.   3. Set up monitoring/alerts when events land in the DLQ.   4. Build a tool or process to review DLQ events, fix them, and replay them back to the original topic.
Chapter 27

Schema Registry

What is itThe Confluent Schema Registry (and alternatives like Apicurio, AWS Glue Schema Registry, and Karapace) is a centralized service that stores and versions schemas for Kafka records, typically in Avro, Protobuf, or JSON Schema. Producers register their schema once, the registry returns a schema ID, and the producer only ships the schema ID + payload bytes — not the full schema — on every record. Consumers look up the schema by ID and deserialize. This gives you space-efficient wire format, schema validation, compatibility checks, and a central catalog of all event types in your organization.
Key features
  • Schema storage: versioned per subject (typically <topic>-value and <topic>-key).
  • Compatibility rules: BACKWARD, FORWARD, FULL, NONE — enforced at registration time.
  • REST API: for schema registration, lookup, and management.
  • Wire format: 5 bytes (magic + 4-byte schema ID) prefixed to serialized payload.
  • Supported formats: Avro (most common), Protobuf, JSON Schema.
How it differs
  • vs raw JSON: JSON has no schema enforcement — consumers can break silently. Schema Registry catches breaking changes at produce time.
  • vs ProtoBuf alone: Protobuf has .proto files but no central registry — teams duplicate definitions.
  • vs AWS Glue Schema Registry: Glue is AWS-native, integrates with MSK, Kinesis, Lambda. Confluent's is Kafka-native, more mature.
  • vs GraphQL schemas: GraphQL is for request/response; Schema Registry is for streams.
Why use itWithout a schema registry, every team reinvents schemas, makes breaking changes, and downstream consumers crash. With it, you get contract-first event design, automatic compatibility checks, and a single source of truth. It's essential once you have more than a handful of event topics and multiple teams.
Common gotchasRegistry downtime: makes your producers and consumers fail — run it highly available. Wrong subject naming strategy: TopicNameStrategy (default), RecordNameStrategy, TopicRecordNameStrategy — pick one and stick with it. Breaking changes by accident: a "nullable" added too late can break backward compatibility.
Real-world examplesConfluent Cloud includes a managed Schema Registry. Airbnb runs a schema registry for all of its internal Kafka topics. Netflix, Uber, LinkedIn all run variants of schema registries to enforce event contracts at scale.

Kafka stores events as raw bytes. It doesn't know or care what's inside the event. But producers and consumers need to agree on the format. Schema Registry solves this.

In Simple Words

Imagine two people writing letters to each other. If person A starts writing in English and person B expects Hindi, they can't understand each other. Schema Registry is like a shared dictionary that both people agree to use. If person A wants to add a new word, the dictionary checks if person B can still understand the letter.

The Problem Without Schema Registry

Schema Mismatch Breaks Consumers
Producer (v1): { "orderId": "123", "amount": 500 } │ Kafka Topic │ Consumer expects: { "orderId": "123", "amount": 500 } ← Works fine ✓ --- Later, producer changes schema --- Producer (v2): { "order_id": "123", "total": 500, "currency": "INR" } │ Kafka Topic │ Consumer expects: { "orderId": "123", "amount": 500 } ← CRASHES! ✗ "orderId" is now "order_id", "amount" is now "total"

How Schema Registry Works

1
Register the schema

When a producer starts, it registers the event schema (Avro, Protobuf, or JSON Schema) with the Schema Registry. The registry assigns a Schema ID (e.g., ID=1).

2
Producer sends data with Schema ID

Instead of sending raw JSON, the producer sends: [Magic Byte][Schema ID=1][Serialized Data]. The data is compact (Avro binary, not JSON text).

3
Consumer reads and looks up schema

Consumer reads the Schema ID from the event, fetches the schema from the Registry, and deserializes the data correctly. Schema is cached locally after the first fetch.

4
Schema evolution with compatibility checks

When the producer wants to change the schema, the Registry checks if the new version is compatible with older versions before allowing it.

Compatibility Modes

ModeWhat You Can DoSimple Explanation
BACKWARDDelete fields, add fields with defaultsNew consumers can read old events. "I can understand old letters."
FORWARDAdd fields, delete fields with defaultsOld consumers can read new events. "Old readers can understand new letters."
FULLAdd/delete fields only with defaultsBoth old and new consumers can read both old and new events. Safest.
NONEAnything goesNo checks. Dangerous. Can break consumers at any time.
Serialization Formats Avro is the most popular choice with Kafka. It's compact (binary), has a built-in schema, and works great with Schema Registry. Protobuf is also good, especially if you already use gRPC. JSON Schema is human-readable but less efficient. Avoid plain JSON in production — it's wasteful (field names repeated in every event) and has no schema enforcement.
Chapter 28

Kafka Connect

What is itKafka Connect is a framework and runtime for moving data into and out of Kafka without writing any producer/consumer code. It provides a pluggable architecture of source connectors (pull data from external systems into Kafka) and sink connectors (push data from Kafka into external systems), plus a distributed runtime that handles scaling, fault tolerance, offset tracking, and configuration. Hundreds of prebuilt connectors exist for databases (PostgreSQL, MySQL via Debezium), object storage (S3, GCS), search (Elasticsearch, OpenSearch), warehouses (Snowflake, BigQuery, Redshift), and SaaS APIs (Salesforce, Zendesk).
Key features
  • Distributed runtime: connectors run as tasks across worker nodes with automatic rebalancing.
  • REST API: manage connectors via HTTP calls.
  • Offset tracking: stored in Kafka topics, giving exactly-once-ish guarantees.
  • Transformations (SMTs): simple per-record transforms (rename, drop, mask) inline.
  • Converters: Avro, Protobuf, JSON Schema, or raw bytes.
  • Dead letter queues: built-in for handling bad records.
How it differs
  • vs writing custom producers/consumers: Connect eliminates boilerplate for common integrations — just install and configure.
  • vs Apache NiFi: NiFi is a drag-and-drop data flow tool with a GUI. Kafka Connect is simpler, Kafka-specific, and API-driven.
  • vs Airflow: Airflow is for orchestrating batch jobs on a schedule. Kafka Connect is for continuous streaming ingestion/egress.
  • vs Fivetran/Stitch: Fivetran is a managed SaaS for ELT to data warehouses. Kafka Connect is self-managed and multi-destination.
  • vs AWS DMS: DMS does CDC to AWS-native targets. Debezium (on Kafka Connect) is the OSS equivalent.
Why use itYou use Connect when you want to ingest database changes (CDC via Debezium), archive topics to S3/GCS, stream Kafka into Elasticsearch or data warehouses, or sync SaaS systems. It's the bridge between Kafka and the rest of your data stack.
Common gotchasConnector quality varies wildly: some community connectors are buggy or unmaintained. Schema evolution: source schema changes can break downstream sinks — use Schema Registry. Scaling: task count is bounded by source partitions/tables — more workers doesn't always mean more throughput. Monitoring: Connect has its own REST status endpoint and JMX metrics — monitor failed tasks.
Real-world examplesDebezium is the most popular Kafka Connect CDC source for PostgreSQL/MySQL/MongoDB/SQL Server. Confluent Hub lists 200+ connectors. Shopify uses Debezium to stream MySQL changes into Kafka. WePay (now Chase) pioneered Debezium in production.

A framework for moving data between Kafka and other systems without writing any code. Just configure and run.

In Simple Words

You want to move data from your MySQL database into Kafka. Or from Kafka into Elasticsearch. You could write a custom producer/consumer for each system. But that's a lot of repeated work. Kafka Connect gives you ready-made "connectors" — like USB adapters that plug different systems into Kafka.

Two Types of Connectors

Source Connector (Data INTO Kafka)
MySQL ──────┐ PostgreSQL ─┤ MongoDB ────┤── Source ──▶ Kafka Topics S3 Files ───┤ Connectors REST API ───┘
  • Reads data from external systems.
  • Converts it into Kafka events.
  • Publishes to Kafka topics.
  • Example: Every new row in MySQL automatically becomes a Kafka event.
Sink Connector (Data OUT OF Kafka)
┌── Elasticsearch ├── HDFS / S3 Kafka Topics ──▶ ├── PostgreSQL Sink Connectors ├── Redis └── Snowflake
  • Reads events from Kafka topics.
  • Writes data to external systems.
  • Example: Every Kafka event automatically gets indexed in Elasticsearch.

Popular Connectors

ConnectorTypeWhat It Does
JDBC SourceSourceReads rows from any SQL database (MySQL, PostgreSQL, Oracle) into Kafka
DebeziumSourceCDC (Change Data Capture) — captures every INSERT/UPDATE/DELETE from databases in real-time
S3 SinkSinkWrites Kafka events to Amazon S3 as files (Parquet, JSON, Avro)
Elasticsearch SinkSinkIndexes Kafka events into Elasticsearch for search
HDFS SinkSinkWrites Kafka events to Hadoop HDFS for batch analytics
MongoDB Source/SinkBothReads from and writes to MongoDB collections
Key Advantage Kafka Connect handles fault tolerance, offset tracking, scaling, and retries for you. You just write a JSON config file. No custom producer/consumer code needed. It also supports distributed mode where multiple Connect workers share the load.
Chapter 29

Kafka Streams

What is itKafka Streams is a Java library (no separate cluster required) for building stream-processing applications on top of Kafka. Your application reads from input topics, processes records using a fluent DSL or a low-level Processor API, and writes to output topics — all while Kafka Streams handles state management, fault tolerance, exactly-once semantics, rebalancing, and scaling. Unlike Spark Streaming or Flink, Kafka Streams has no separate cluster — it's just a JAR you add to your app, making deployment identical to any other JVM microservice.
Key abstractions
  • KStream: an unbounded stream of records (events).
  • KTable: a changelog stream interpreted as an evolving table (latest value per key).
  • GlobalKTable: a fully replicated lookup table across all instances.
  • State stores: RocksDB-backed local state, backed up by compacted Kafka topics for fault tolerance.
  • Windowing: tumbling, hopping, sliding, session windows for time-based aggregation.
  • Joins: stream-stream, stream-table, table-table joins, all natively supported.
How it differs
  • vs Apache Flink: Flink is a full distributed cluster with richer stream processing semantics, better for very complex pipelines. Kafka Streams is lighter and tightly integrated with Kafka.
  • vs Spark Structured Streaming: Spark is micro-batch based (latency in hundreds of ms to seconds); Kafka Streams is record-at-a-time with single-digit ms latency.
  • vs ksqlDB: ksqlDB is a SQL interface built on top of Kafka Streams.
  • vs Apache Storm/Samza: older stream processors; Streams has largely supplanted them for Kafka-native use cases.
Why use itUse Kafka Streams when you need stateful stream processing with exactly-once semantics in a Kafka-first architecture and want to avoid operating a separate stream-processing cluster. Common use cases: real-time aggregations, joining multiple event streams, enriching events with reference data, anomaly detection, and materialized views.
Common gotchasJVM only: no native Python, Go, or Node clients — use Faust (Python) or KSQL/Flink alternatives. State store restoration: after a rebalance, new instances must restore state from changelog topics — can be slow for large state. Punctuators and time semantics: event-time vs processing-time is tricky. Repartitioning cost: groupByKey/join on a non-key column creates intermediate topics.
Real-world examplesThe New York Times uses Kafka Streams for their content platform. Pinterest uses it for real-time analytics. Trivago uses it for hotel search event processing. Walmart uses it for real-time inventory updates.

A Java library for processing data in real-time directly from Kafka topics. No separate cluster needed — it runs inside your application.

In Simple Words

Normally, you read from Kafka, process data in your app, and write results back to Kafka. Kafka Streams makes this easier by giving you a powerful API to filter, transform, join, and aggregate events — all in real-time, as events flow through. Think of it like Excel formulas but for streaming data.

Why Not Just Use a Consumer?

Plain Consumer
  • You write all processing logic yourself.
  • You handle state management yourself.
  • You handle fault tolerance yourself.
  • You handle time windowing yourself.
  • Joining two topics? Build it from scratch.
Kafka Streams
  • Built-in operators: filter, map, join, aggregate.
  • Automatic state management (RocksDB).
  • Automatic fault tolerance and recovery.
  • Built-in windowing (tumbling, hopping, sliding).
  • Joining topics is one line of code.

Common Operations

Filter

Keep only events that match a condition. Example: only keep orders above $100.

orders.filter((key, order) -> order.amount > 100)
Map / Transform

Change the event shape. Example: extract just the customer ID and amount.

orders.mapValues(order -> new Summary(order.customerId, order.amount))
Group & Aggregate

Group events by key and compute running totals. Example: total spending per customer.

orders.groupByKey() .reduce((a, b) -> a.amount + b.amount)
Windowed Aggregation

Aggregate events within a time window. Example: count orders per 5-minute window.

orders.groupByKey() .windowedBy(TimeWindows .ofSizeWithNoGrace( Duration.ofMinutes(5))) .count()

Kafka Streams vs Other Stream Processors

FeatureKafka StreamsApache FlinkApache Spark Streaming
DeploymentLibrary (runs in your app)Separate clusterSeparate cluster
ComplexitySimple — just add a JARMedium — needs cluster setupMedium — needs cluster setup
ScalingAdd more app instancesAdd more TaskManagersAdd more executors
Best ForKafka-to-Kafka processingComplex event processing, multiple sourcesBatch + streaming hybrid
LanguageJava / Kotlin onlyJava, Python, SQLJava, Python, Scala, SQL
Chapter 30

Kafka Security

What is itKafka security covers three pillars: encryption (in transit and at rest), authentication (who are you), and authorization (what can you do). Out of the box, Kafka offers nothing — by default, any client that can TCP-connect can read/write any topic. Production Kafka deployments enable a combination of TLS for encryption, SASL or mTLS for authentication, and ACLs for authorization. Confluent and cloud providers also add role-based access control (RBAC), audit logs, and managed encryption at rest.
Security layers
  • TLS/SSL: encrypts traffic between clients ↔ brokers, broker ↔ broker, and (in ZK mode) broker ↔ ZK.
  • SASL mechanisms: PLAIN, SCRAM-SHA-256/512, GSSAPI (Kerberos), OAUTHBEARER — authenticate clients.
  • mTLS: client certificate authentication — often combined with TLS.
  • ACLs: authorize operations (Read/Write/Create/Delete/Describe) on resources (Topic/Group/Cluster) to principals.
  • RBAC (Confluent, MSK): role-based permissions on top of ACLs.
  • Encryption at rest: typically delegated to the underlying disk (EBS, LUKS) or provided by cloud vendors.
How it differs
  • vs RabbitMQ: RabbitMQ has built-in user/virtual-host-based permissions and supports TLS. Kafka's ACL model is more granular per-resource.
  • vs SQS/Kinesis: AWS uses IAM for auth — simpler to manage but tied to AWS.
  • vs Pulsar: Pulsar has namespace-level auth and supports similar SASL/TLS/token mechanisms.
Best practicesAlways enable TLS for inter-broker and client-broker traffic. Use SCRAM-SHA-512 or mTLS in production — avoid SASL/PLAIN unless used over TLS. Define ACLs per application via dedicated service accounts. Rotate credentials regularly. Audit broker access with kafka.authorizer.logger.
Common gotchasTLS kills zero-copy: throughput drops 2–3× — budget for it. Default ACLs deny: setting allow.everyone.if.no.acl.found=false is critical. Overly broad ACLs: one app with Cluster:All is a silent security hole. Certificate rotation: forgotten rotation schedules cause outages when certs expire.
Real-world examplesCapital One uses mTLS + ACLs per-application. Goldman Sachs built custom tooling on top of Confluent RBAC. AWS MSK supports IAM-based auth for Kafka clients. Confluent Cloud provides managed RBAC and audit logs.

By default, Kafka has no security at all. Anyone who can reach the broker can read/write any topic. In production, you need authentication, authorization, and encryption.

In Simple Words

Think of Kafka as a building. Without security: no doors, no locks, no ID checks. Anyone can walk in, read any document, and write on any whiteboard. Security adds: a front door with a lock (encryption), ID cards for entry (authentication), and rules about who can access which room (authorization).

Three Layers of Security

E
Encryption (SSL/TLS)

What: Encrypts data in transit between clients and brokers, and between brokers.

Simple: Like sending a letter in a sealed envelope instead of a postcard. Nobody can read the data while it's moving over the network.

Port 9093 (default for SSL)

A
Authentication (SASL)

What: Verifies WHO is connecting. Clients must prove their identity.

Options:

  • SASL/PLAIN — Username + password (simple, use with SSL).
  • SASL/SCRAM — Salted password (more secure than PLAIN).
  • SASL/GSSAPI — Kerberos (enterprise, Active Directory).
  • SASL/OAUTHBEARER — OAuth 2.0 tokens.
Z
Authorization (ACLs)

What: Controls WHO can do WHAT on WHICH resource.

Simple: Even after proving your identity, you can only access what you're allowed to. The payment service can read "payments" topic but not "user-profiles" topic.

// Allow user "order-svc" to write to "orders" topic kafka-acls --add --allow-principal User:order-svc \ --operation Write --topic orders // Allow user "analytics" to read from "orders" topic kafka-acls --add --allow-principal User:analytics \ --operation Read --topic orders --group analytics-group
Production Checklist 1. Enable SSL/TLS for all client-broker and broker-broker communication.   2. Enable SASL authentication (SCRAM-SHA-256 or SCRAM-SHA-512 recommended).   3. Set up ACLs to restrict topic access per service.   4. Disable anonymous access.   5. Use separate credentials for each service (not one shared password for everything).
Chapter 31

Kafka vs Other Message Systems

What is itKafka is often lumped together with other "message brokers", but it's meaningfully different from most. The messaging/streaming landscape can be divided roughly into queues (destructive reads), pub/sub (push-based), and logs (pull-based, replay-capable). Kafka is a log-based system with pub/sub semantics layered on top. Picking the right tool for your workload depends on throughput requirements, ordering needs, retention, operational comfort, and ecosystem fit.
Comparison matrix
  • vs RabbitMQ: RabbitMQ = flexible routing (exchanges, headers, topics), complex workflows, per-message ack, low-to-mid throughput. Kafka = raw throughput, replay, simple routing.
  • vs ActiveMQ/JMS: ActiveMQ is a classic JMS broker — great for legacy Java apps. Kafka's log model and throughput eclipse it.
  • vs AWS SQS: SQS = fully managed, zero ops, simple queue semantics, at-least-once, 14-day max retention, low-to-mid throughput. Kafka = you run it, massive throughput, replay.
  • vs AWS Kinesis: Kinesis = Kafka-like but managed and bounded per shard. Better for AWS-first shops that don't want to run Kafka.
  • vs Google Pub/Sub: Fully managed, push or pull, global, integrates with Dataflow. Kafka has better replay and ordering guarantees.
  • vs Apache Pulsar: Pulsar has compute/storage separation, tiered storage, and geo-replication built in. Kafka has a larger ecosystem and more mature tooling.
  • vs NATS: NATS is lightweight, simple, perfect for edge/IoT. NATS JetStream adds log-style persistence. Kafka is heavyweight but far more scalable.
  • vs Redis Streams: Redis Streams is in-memory with optional persistence — perfect for small workloads in existing Redis stacks.
  • vs Redpanda: Redpanda is wire-compatible with Kafka, written in C++ with thread-per-core design, no ZK/KRaft quorum needed, lower tail latency. Trades ecosystem maturity.
When Kafka winsHigh throughput (>100k msg/sec), replay capability, fanout to many consumers, event sourcing, stream processing, CDC pipelines, massive retention (days/weeks/months).
When alternatives winSimple task queues → RabbitMQ, SQS. Request/response RPC → gRPC. Edge/IoT → NATS. Fully managed, AWS-native → SQS or Kinesis. Existing Redis stack → Redis Streams. Multi-region geo-replication out of the box → Pulsar.
Common gotchasForcing Kafka onto every workload is the #1 antipattern — small apps don't need the ops burden. Comparing throughput numbers naively ignores durability tradeoffs.
Real-world examplesShopify runs Kafka for core events but uses RabbitMQ for workflow tasks. Stripe used both Kinesis and Kafka historically. Many startups begin with SQS and migrate to Kafka only when throughput/replay demands grow.

Kafka is not the only messaging system. Here's how it compares to other popular choices and when to pick which one.

Comparison Table

FeatureApache KafkaRabbitMQAWS SQSAWS SNS
ModelPull-based logPush-based queuePull-based queuePush-based pub/sub
Message RetentionConfigurable (days/forever)Until consumedUp to 14 daysNo retention
ReplayYes (any offset)NoNoNo
OrderingPer partitionPer queue (with limits)FIFO queues onlyNo ordering
ThroughputMillions/secTens of thousands/secNearly unlimited (managed)Nearly unlimited (managed)
Consumer GroupsBuilt-inManual setupOne consumer per messageFan-out to subscribers
Operational CostHigh (self-managed) / Medium (managed)MediumLow (fully managed)Low (fully managed)
Best ForEvent streaming, high throughput, replay neededTask queues, routing logic, request-replySimple job queues, decouplingFan-out notifications

When to Pick What?

Pick Kafka When
  • You need to replay old events (debugging, new consumers).
  • You need very high throughput (millions of events/sec).
  • Multiple services need to read the same data independently.
  • You need event sourcing or an immutable audit log.
  • You need stream processing (real-time transformations).
Pick RabbitMQ When
  • You need complex routing (topic exchanges, header-based routing).
  • You need request-reply pattern (RPC over messages).
  • You need message priority (urgent messages first).
  • Your throughput is moderate (thousands/sec, not millions).
  • You want simpler setup for basic task queues.
Pick SQS/SNS When
  • You're on AWS and want fully managed (zero ops).
  • You need a simple job queue (no streaming features needed).
  • You want fan-out notifications (SNS → email, SMS, Lambda).
  • You don't need replay, ordering, or consumer groups.
  • You want pay-per-use pricing with no cluster management.
Chapter 32

Real-World Use Cases

What is itKafka's flexibility means it shows up in an enormous variety of real-world applications, from activity tracking to bank ledgers. This section surveys the most common use case categories and the specific patterns companies apply. Understanding the "shapes" of Kafka deployments helps you pick the right architecture for your own problem — and avoid using Kafka where a simpler tool would do.
Core use case categories
  • Activity tracking: clickstreams, page views, user interactions. LinkedIn's original use case.
  • Log aggregation: collect logs from thousands of servers into a central pipeline. ELK, Splunk, Datadog all ingest via Kafka.
  • Metrics pipeline: time-series metrics streamed into Prometheus, Graphite, Druid.
  • CDC (Change Data Capture): stream DB changes via Debezium for cache invalidation, search indexing, data warehousing.
  • Event sourcing: store application state as an immutable log of events, rebuild state by replay.
  • Stream processing: real-time fraud detection, recommendations, ML feature pipelines.
  • Microservices backbone: decoupled async communication between services.
  • Messaging: replace RabbitMQ/ActiveMQ for high-throughput async messaging.
  • IoT ingestion: high-volume sensor data from millions of devices.
Specific company examples
  • LinkedIn: activity streams, metrics, logs, CDC from Espresso/Oracle/MySQL. 7+ trillion msg/day.
  • Netflix: Keystone pipeline ingests 1+ trillion events/day for analytics, personalization, billing.
  • Uber: dispatch events, surge pricing, trip state, fraud detection. 4+ trillion msg/day.
  • Airbnb: search indexing, pricing, review fanout, payment events.
  • Goldman Sachs: trade events, regulatory reporting pipelines.
  • Shopify: order events, inventory updates, cross-service eventing.
  • Spotify: play events, recommendation signals, data warehouse loading.
  • Pinterest: pin events, notifications, ML features.
  • Robinhood: market data and trade events.
  • Walmart: inventory updates, supply chain events.
How it differsUnlike traditional message queues limited to task distribution, Kafka's log+replay model unlocks use cases that other brokers can't handle: event sourcing, CDC with replay, stream-table joins, and multi-consumer fanout.
Common gotchasChoosing Kafka for the wrong reason: sometimes a simple queue, a database, or a webhook is the right answer. Over-engineering: event sourcing sounds cool but adds significant complexity to your codebase. Assuming "Kafka = unlimited throughput": real clusters still need capacity planning.
Learning resourcesConfluent blog and Current conference talks have in-depth company case studies. Jay Kreps' "The Log" article is foundational. Designing Data-Intensive Applications (Martin Kleppmann) covers the theory.

How big companies actually use Apache Kafka in production. These examples help you understand where Kafka fits in real systems.

1
E-Commerce Order Flow

Problem: When a customer places an order, many things must happen: payment processing, inventory update, shipping notification, email confirmation, analytics tracking. These should not be tightly coupled.

User places order │ Order Service ──▶ "order-events" topic │ ┌───────────────┼───────────────┐ ▼ ▼ ▼ Payment Service Inventory Service Email Service (Consumer Grp 1) (Consumer Grp 2) (Consumer Grp 3)

Each service processes the same order event independently. If the email service is slow, it doesn't affect payments.

2
Real-Time Analytics / Dashboards

Problem: You want a live dashboard showing website activity: page views, clicks, sign-ups — all updated in real-time.

Website events (clicks, views) → Kafka topic → Kafka Streams aggregates (counts per minute) → Results written to another Kafka topic → Dashboard reads from that topic.

Kafka Streams Real-time

3
Change Data Capture (CDC)

Problem: Your main database is MySQL. You want every change (INSERT/UPDATE/DELETE) to be available for search in Elasticsearch and analytics in Snowflake.

MySQL → Debezium (Kafka Connect Source) → Kafka topics → Elasticsearch Sink Connector → Elasticsearch. No custom code needed.

Kafka Connect Debezium

4
Log Aggregation

Problem: You have 500 servers generating logs. You need all logs in one place for monitoring and debugging.

Each server sends logs to a "logs" Kafka topic. A central service reads from the topic and stores them in Elasticsearch/Loki/S3. Kafka acts as a buffer so log spikes don't crash your storage system.

High Throughput Buffering

5
Fraud Detection

Problem: You need to detect suspicious transactions within seconds, not hours.

Transaction events → Kafka topic → Kafka Streams / Flink compares each transaction against rules (unusual amount, unknown location, rapid frequency) → Suspicious events written to "fraud-alerts" topic → Alert service blocks the transaction.

Low Latency Stream Processing

6
Microservices Communication

Problem: 20+ microservices need to share data without creating a tangled web of REST calls.

Each service publishes events about its domain (UserCreated, OrderPlaced, PaymentCompleted). Other services subscribe only to the events they care about. Services are fully decoupled — you can add, remove, or update services without affecting others.

Loose Coupling Scalable

Companies Using Kafka

CompanyHow They Use KafkaScale
LinkedInCreated Kafka. Uses it for activity tracking, metrics, data pipelines.7+ trillion messages/day
NetflixReal-time monitoring, event sourcing, data pipeline between microservices.700+ billion messages/day
UberReal-time pricing, trip matching, driver tracking, analytics.Trillions of messages/day
SpotifyEvent delivery, logging, real-time recommendations.Hundreds of billions/day
WalmartInventory management, order processing, real-time supply chain.Billions of messages/day
Key Takeaway Kafka is not just for "big companies." If your system produces events and multiple services need to react to them independently — Kafka is a great fit. Start small with 3 brokers and scale as needed. Managed services like Confluent Cloud, AWS MSK, and Redpanda Cloud make it even easier to get started.
Chapter 33

Setup: Running Kafka (Docker & Cloud)

What is itThere are many ways to run Kafka for development and production: local Docker, Docker Compose (single broker or multi-broker with KRaft), bare-metal/VM installs, Kubernetes via Strimzi or Confluent Operator, and managed cloud services (Confluent Cloud, AWS MSK, Aiven, Azure Event Hubs for Kafka, Redpanda Cloud). The right choice depends on scale, compliance, operational expertise, and budget. For learning and local development, Docker Compose with KRaft mode is the fastest path.
Deployment options
  • Docker / Docker Compose: spin up a 1-broker or 3-broker cluster in seconds with confluentinc/cp-kafka or bitnami/kafka images.
  • Bare metal / VM: unzip Kafka binary, edit server.properties, run bin/kafka-server-start.sh.
  • Kubernetes (Strimzi): open-source operator that manages Kafka clusters via CRDs — most popular OSS production option.
  • Confluent Cloud: fully managed Kafka from the Kafka creators. Serverless tiers available.
  • AWS MSK: managed Kafka on AWS. MSK Serverless abstracts clusters entirely.
  • Aiven / Upstash / Instaclustr: alternative managed Kafka providers with multi-cloud support.
  • Azure Event Hubs (Kafka endpoint): Azure's Kafka-compatible managed service.
How it differs
  • Self-hosted: full control, lowest per-GB cost at scale, but you own upgrades, patching, scaling, monitoring.
  • Managed: zero-ops, pay premium, constrained in tuning options and features.
  • vs managed alternatives like SQS/Kinesis: Kinesis/SQS have lower per-unit cost for small workloads; Kafka becomes cheaper at high throughput.
Local dev recommendationsUse docker compose up with a KRaft single-broker setup for learning. For multi-broker testing, Bitnami's image + Compose file gives you a 3-broker KRaft cluster out of the box. Redpanda is also a great local option — a single binary, Kafka-compatible, boots in seconds. Tools: kcat (formerly kafkacat), kafka-ui, Kowl/Redpanda Console.
Common gotchasAdvertised listeners: the #1 local setup mistake — listeners and advertised.listeners must match what clients can actually reach. Port conflicts: 9092 (Kafka), 2181 (ZK), 9093 (TLS). Volume persistence: forgetting to mount data volumes loses your data on container restart. Host resolution: localhost inside the container is not the same as outside.
Real-world examplesStrimzi powers Kafka on Kubernetes for Red Hat OpenShift customers. Confluent Cloud is used by Walmart, Expedia, Bosch. AWS MSK is used by Goldman Sachs, NCR, Wix. Aiven serves thousands of smaller Kafka users across multiple clouds.

Before writing code, you need a running Kafka cluster. You have two choices: run it locally with Docker, or use a managed cloud service.

Option 1: Cloud Kafka (Zero Setup)

If you don't want to install anything, use a managed Kafka service. They give you a Kafka cluster in the cloud — you just get a connection URL and start coding.

C
Confluent Cloud

Made by the creators of Kafka. Most popular managed Kafka. Free tier available (no credit card needed).

Free Tier Best Kafka Support

Link: confluent.cloud

A
AWS MSK (Managed Streaming for Kafka)

Amazon's managed Kafka. Good if you're already on AWS. Supports MSK Serverless (pay per use).

AWS Ecosystem Serverless Option

Link: aws.amazon.com/msk

U
Upstash Kafka

Serverless Kafka with a REST API. Generous free tier (10,000 messages/day). Perfect for learning and small projects.

Free Tier REST API

Link: upstash.com/kafka

R
Redpanda Cloud

Kafka-compatible but faster. Drop-in replacement — same Kafka client code works. Free tier available.

Free Tier Fastest

Link: redpanda.com/cloud

Option 2: Docker (Local Setup — Recommended for Learning)

This runs a full Kafka cluster on your machine. You need Docker Desktop installed.

Step 1: Create docker-compose.yml

Create a file called docker-compose.yml in your project root:

# docker-compose.yml # This file tells Docker to start Kafka and its dependencies version: "3.8" services: # Kafka broker using KRaft mode (no ZooKeeper needed!) kafka: image: apache/kafka:3.7.0 # Official Apache Kafka Docker image container_name: kafka ports: - "9092:9092" # Port your app connects to environment: # ---------- KRaft Mode (no ZooKeeper) ---------- KAFKA_NODE_ID: 1 # Unique ID for this broker KAFKA_PROCESS_ROLES: broker,controller # This node is BOTH broker + controller KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093 # List of controller voters # ---------- Listeners ---------- # WHY: Kafka needs separate listeners for inside-Docker and outside-Docker traffic KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092 KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,EXTERNAL://localhost:9092 KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT # ---------- Topic Defaults ---------- KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 # 1 because single broker KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1 KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1 KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false" # We'll create topics manually KAFKA_LOG_RETENTION_HOURS: 168 # Keep data for 7 days # Kafka UI — visual dashboard to see topics, messages, consumers kafka-ui: image: provectuslabs/kafka-ui:latest container_name: kafka-ui ports: - "8080:8080" # Open localhost:8080 to see the UI environment: KAFKA_CLUSTERS_0_NAME: local KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092 depends_on: - kafka
Why These Configs? KAFKA_PROCESS_ROLES: broker,controller — In production, you'd separate these. For local dev, one node does both.   KAFKA_LISTENERS vs KAFKA_ADVERTISED_LISTENERS — Listeners define what Kafka listens on. Advertised listeners tell clients how to reach it. Since Docker networking is different from your host machine, we need both.   KAFKA_AUTO_CREATE_TOPICS_ENABLE: false — We disable auto-create so you learn to create topics explicitly. In production, you should always create topics manually with proper configs.

Step 2: Start Kafka

# Start everything in the background docker compose up -d # Check if containers are running docker compose ps # You should see: # kafka running 0.0.0.0:9092->9092/tcp # kafka-ui running 0.0.0.0:8080->8080/tcp # Open Kafka UI in browser: # http://localhost:8080

Step 3: Verify Kafka is Working

# Create a test topic using the Kafka CLI inside the container docker exec -it kafka /opt/kafka/bin/kafka-topics.sh \ --bootstrap-server localhost:9092 \ --create \ --topic test-topic \ --partitions 3 \ --replication-factor 1 # List all topics docker exec -it kafka /opt/kafka/bin/kafka-topics.sh \ --bootstrap-server localhost:9092 \ --list # Output: test-topic
You're Ready! Kafka is running on localhost:9092. Kafka UI is on localhost:8080. Now let's write some code.
Chapter 34

Project Setup (Node.js + TypeScript)

What is itSetting up a Node.js + TypeScript project for Kafka development involves picking a client library, configuring TypeScript, and structuring your producer/consumer code. The main Node.js options are KafkaJS (pure JS, most popular), node-rdkafka (native bindings to librdkafka, faster, more features), and @confluentinc/kafka-javascript (official Confluent client, also librdkafka-based). KafkaJS is the easiest to start with and has excellent TypeScript support.
Key dependencies
  • kafkajs: pure JS Kafka client with native TypeScript support.
  • typescript, ts-node, @types/node: TypeScript toolchain.
  • dotenv: for environment-based configuration.
  • avsc or @kafkajs/confluent-schema-registry: for Avro + Schema Registry.
  • pino or winston: structured logging.
How it differs from other languages
  • vs Java: Java has the official Apache Kafka client — most feature-complete, used as the reference. Kafka Streams and Schema Registry have first-class Java support.
  • vs Go: confluent-kafka-go (librdkafka-based) is the gold standard for Go. Sarama is a pure-Go alternative.
  • vs Python: confluent-kafka-python and aiokafka (async).
  • vs Rust: rdkafka crate provides bindings to librdkafka.
  • KafkaJS limitation: lacks some advanced features (transactions are available but less mature than Java's).
Project structureTypical layout: src/config/kafka.ts (client factory), src/producers/*.ts, src/consumers/*.ts, src/schemas/*.ts (event types), src/index.ts (bootstrap). Use tsconfig.json with strict: true, target: es2020, moduleResolution: node.
Common gotchasForgetting to disconnect: producers and consumers must be explicitly disconnected on shutdown (SIGTERM handlers). Missing await on send: async producer sends need await or promise chains. Running multiple consumers in one process: each consumer should have its own instance. Connection errors: KafkaJS retries by default, but log retry attempts for observability.
Real-world examplesShopify uses KafkaJS in some Node.js microservices. Segment (now Twilio) historically built event pipelines with Node + Kafka. Uber's mobile backend teams use Node for event ingestion endpoints.

Setting up a TypeScript project with the KafkaJS library — the most popular Kafka client for Node.js.

Step 1: Initialize the Project

# Create project folder mkdir kafka-nodejs-demo && cd kafka-nodejs-demo # Initialize Node.js project npm init -y # Install dependencies npm install kafkajs # Kafka client for Node.js npm install -D typescript tsx @types/node # TypeScript + runner # Initialize TypeScript config npx tsc --init
Why KafkaJS? kafkajs is a pure JavaScript implementation — no native C/C++ dependencies (unlike node-rdkafka). This means no build issues, works everywhere, and is easier to install. It supports all Kafka features: producers, consumers, transactions, admin API, and more.

Update tsconfig.json

{ "compilerOptions": { "target": "ES2020", // Modern JS features like async/await "module": "commonjs", // Node.js module system "strict": true, // Enable all strict type checks "esModuleInterop": true, // Allows default imports from CommonJS "outDir": "./dist", // Compiled JS output folder "rootDir": "./src" // Source TypeScript files }, "include": ["src/**/*"] }

Step 2: Create the Kafka Client (Shared Config)

Create src/client.ts — This file creates a reusable Kafka connection that all other files will import.

// src/client.ts // This is the "entry point" — it creates one Kafka instance that // producer, consumer, and admin will all share. import { Kafka, logLevel } from "kafkajs"; // ── Create the Kafka instance ──────────────────────────────────── const kafka = new Kafka({ // clientId: A human-readable name for this application. // WHY: Kafka uses this to identify your app in logs and monitoring. // Shows up in broker logs as "order-service connected". // OTHER OPTIONS: Any string. Use your service name. clientId: "order-service", // brokers: List of Kafka broker addresses to connect to. // WHY: The client connects to ANY of these to discover the full cluster. // Called "bootstrap servers" — you don't need to list ALL brokers, // just enough to find at least one alive. // OTHER OPTIONS: ["broker1:9092", "broker2:9092", "broker3:9092"] brokers: ["localhost:9092"], // logLevel: Controls how much KafkaJS prints to console. // WHY: WARN is a good default — shows errors and warnings, not every request. // OTHER OPTIONS: // logLevel.NOTHING — silence everything // logLevel.ERROR — only errors // logLevel.WARN — errors + warnings // logLevel.INFO — errors + warnings + info // logLevel.DEBUG — everything (very verbose) logLevel: logLevel.WARN, // ── Connection options (optional) ── // connectionTimeout: 3000, // ms to wait for initial connection (default: 1000) // requestTimeout: 30000, // ms to wait for a response (default: 30000) // retry: { // retries: 5, // how many times to retry failed requests // initialRetryTime: 300, // ms before first retry // maxRetryTime: 30000, // max ms between retries // }, // ── SSL (for cloud Kafka like Confluent/AWS MSK) ── // ssl: true, // sasl: { // mechanism: "plain", // or "scram-sha-256", "scram-sha-512" // username: "your-api-key", // password: "your-api-secret", // }, }); export default kafka;

Project Structure (What We'll Build)

File Structure
kafka-nodejs-demo/ ├── docker-compose.yml ← Kafka + UI containers ├── package.json ├── tsconfig.json └── src/ ├── client.ts ← Shared Kafka connection (Step 2 above) ├── admin.ts ← Create/delete topics ├── producer.ts ← Send events to Kafka ├── consumer.ts ← Read events from Kafka └── advanced/ ├── producer-batch.ts ← Batch sending + idempotent producer ├── consumer-dlq.ts ← Consumer with Dead Letter Queue └── transaction.ts ← Exactly-once transactions

Add Run Scripts to package.json

// Add to your package.json "scripts" section: { "scripts": { "admin": "npx tsx src/admin.ts", "producer": "npx tsx src/producer.ts", "consumer": "npx tsx src/consumer.ts", "batch": "npx tsx src/advanced/producer-batch.ts", "dlq": "npx tsx src/advanced/consumer-dlq.ts", "txn": "npx tsx src/advanced/transaction.ts" } }
Chapter 35

Admin: Create & Manage Topics

What is itThe Admin API lets you programmatically manage Kafka metadata — creating topics, deleting topics, altering configs, listing consumer groups, describing cluster state, and resetting offsets. Every Kafka client library provides an AdminClient, and command-line tools like kafka-topics.sh, kafka-configs.sh, kafka-consumer-groups.sh wrap the same RPCs. In production, topic creation should be explicit via Admin API or Infrastructure-as-Code (Terraform, Strimzi CRDs) — never rely on auto-create.
Common operations
  • Create topic: specify name, partitions, replication factor, config overrides (retention, cleanup policy).
  • Delete topic: asynchronously marks the topic for deletion.
  • Alter topic configs: change retention, compression, min.insync.replicas on the fly.
  • List/describe topics: metadata discovery.
  • List/describe consumer groups: see members, assignments, offsets, lag.
  • Reset offsets: rewind a consumer group to earliest, latest, or a specific offset.
How it differs
  • vs RabbitMQ: RabbitMQ has a management HTTP API and CLI (rabbitmqctl) — similar scope.
  • vs SQS/Kinesis: AWS management is via AWS SDK / CLI / CloudFormation / Terraform.
  • vs database DDL: Kafka's Admin API is like a DDL for streams: create/alter/drop topics, grant ACLs, etc.
Best practicesDisable auto.create.topics.enable in production. Use Terraform (with the Kafka provider), Strimzi KafkaTopic CRDs, or Confluent Cloud's topic management to declaratively manage topics. Separate dev/staging/prod cluster access. Never delete topics from scripts — it's usually irreversible and data is lost.
Common gotchasTopic deletion: requires delete.topic.enable=true on all brokers, and only removes data asynchronously. Alter partition count: can only increase, and breaks key-based hashing. Config override precedence: topic-level configs override broker defaults. ACL changes: must be applied via kafka-acls.sh separately.
Real-world examplesConfluent Cloud's Terraform provider is the cleanest way to manage topics as code. Strimzi KafkaTopic CRDs do the same on Kubernetes. Netflix's internal tooling wraps the Admin API for self-service topic creation by teams.

Before producing or consuming, you need to create topics. The Admin API lets you create, delete, and inspect topics programmatically.

// src/admin.ts // This script creates the topics our application needs. // Run once before starting producers/consumers. import kafka from "./client"; async function main() { // ── Create an Admin instance ────────────────────────────────── // WHY: Admin API is separate from producer/consumer because // topic management is an administrative task, not a data task. const admin = kafka.admin(); // ── Connect to the Kafka cluster ───────────────────────────── // WHY: Every Kafka operation requires an active connection first. // This establishes a TCP connection to one of the brokers. console.log("Connecting to Kafka..."); await admin.connect(); console.log("Connected!"); // ── Create topics ──────────────────────────────────────────── const topicsCreated = await admin.createTopics({ // waitForLeaders: Wait until leader is elected for each partition. // WHY: If you start producing immediately after creating a topic, // it might fail because partitions don't have leaders yet. // Setting this to true ensures the topic is fully ready. // OTHER OPTIONS: false (return immediately, don't wait) waitForLeaders: true, // topics: Array of topic configurations to create. topics: [ { topic: "order-events", // numPartitions: How many partitions this topic will have. // WHY: More partitions = more parallelism = more consumers can // read simultaneously. But too many partitions adds overhead. // RULE OF THUMB: Start with partitions = expected consumer count. // OTHER OPTIONS: Any positive integer. Common: 3, 6, 12, 24. numPartitions: 3, // replicationFactor: How many copies of each partition. // WHY: 1 = no redundancy (fine for local dev). // 3 = production standard (survives 2 broker failures). // RULE: Must be <= number of brokers in cluster. // OTHER OPTIONS: 1 (dev), 2 (staging), 3 (production) replicationFactor: 1, // configEntries: Override default topic-level configs. // WHY: Each topic can have different retention, cleanup, etc. configEntries: [ // How long to keep events before auto-deleting. // 604800000 ms = 7 days. Set to "-1" for infinite retention. { name: "retention.ms", value: "604800000" }, // cleanup.policy: "delete" (remove old) or "compact" (keep latest per key) { name: "cleanup.policy", value: "delete" }, // min.insync.replicas: Minimum replicas for acks=all to succeed. // Set to 2 in production with replication-factor=3. { name: "min.insync.replicas", value: "1" }, ], }, // DLQ topic for failed events { topic: "order-events.DLQ", numPartitions: 1, // DLQ doesn't need high parallelism replicationFactor: 1, }, // Topic for processed results { topic: "order-confirmations", numPartitions: 3, replicationFactor: 1, }, ], }); // createTopics returns true if topics were created, false if they already exist console.log(`Topics created: ${topicsCreated}`); // ── List all topics ────────────────────────────────────────── const topics = await admin.listTopics(); console.log("All topics:", topics); // ── Get detailed topic metadata ────────────────────────────── // WHY: Shows partition count, leader broker, replica brokers, ISR const metadata = await admin.fetchTopicMetadata({ topics: ["order-events"], }); console.log("Topic metadata:", JSON.stringify(metadata, null, 2)); // ── Disconnect (always clean up!) ──────────────────────────── await admin.disconnect(); console.log("Disconnected."); } main().catch(console.error);

Run It

npm run admin # Output: # Connecting to Kafka... # Connected! # Topics created: true # All topics: [ "order-events", "order-events.DLQ", "order-confirmations" ] # Topic metadata: { topics: [{ name: "order-events", partitions: [...] }] } # Disconnected.
Other Admin Operations You Can Do admin.deleteTopics({ topics: ["old-topic"] }) — Delete topics.   admin.createPartitions({ topicPartitions: [{ topic: "x", count: 6 }] }) — Add partitions (can't reduce).   admin.describeGroups(["group-id"]) — See consumer group details.   admin.listGroups() — List all consumer groups.   admin.fetchTopicOffsets("topic") — See current offset for each partition.
Chapter 36

Build a Producer

What is itBuilding a Kafka producer means writing an application that publishes events to Kafka topics. The basics are simple: create a producer instance, connect, call send() with (topic, key, value) tuples, handle callbacks or promises for success/failure, and disconnect gracefully on shutdown. Real production producers add serialization (JSON/Avro/Protobuf), error handling, retry logic, metrics, tracing headers, and graceful shutdown. This section walks through building a producer using KafkaJS with TypeScript.
Key building blocks
  • Kafka client factory: a singleton that holds the Kafka client and produces producer instances.
  • Serializer: JSON.stringify for quick start; Avro/Protobuf for production.
  • Keying strategy: pick keys to control partitioning and preserve per-entity ordering.
  • Error handling: catch send failures, log with context, retry or DLQ.
  • Graceful shutdown: SIGTERM handler that calls producer.disconnect().
  • Metrics hooks: instrument with Prometheus or OpenTelemetry.
How it differs
  • vs REST client: no request/response — send() returns when the broker acks, but the app doesn't wait for a response from consumers.
  • vs RabbitMQ publisher: RabbitMQ publishers typically use publisher confirms; Kafka uses acks levels.
  • vs SQS SendMessage: SQS is an HTTP call per message (or batches of 10). Kafka batches automatically.
Production essentialsAlways set acks: 'all', enable idempotence, set a reasonable retry count (5+), use structured logs with correlation IDs, implement graceful shutdown, wrap send errors for upstream handling, and never create multiple producer instances per process — share one.
Common gotchasNot awaiting send: losing records on shutdown. Creating a producer per request: massive overhead — use a singleton. Forgetting disconnect: connection leaks and lost buffered records. Blocking on send in hot paths: use fire-and-forget (send().catch(log)) where appropriate.
Real-world examplesMost Node.js backends that talk to Kafka follow a similar pattern: a singleton KafkaJS client, producer initialized at startup, sending events from request handlers and background jobs. Shopify, Medium, HubSpot all have open or documented Kafka producer patterns.

The producer sends events to Kafka topics. We'll build it step-by-step, explaining every line and every option.

// src/producer.ts // A complete Kafka producer that sends order events. // Run: npm run producer import kafka from "./client"; import { Partitioners, CompressionTypes } from "kafkajs"; // ── Define the shape of our events (TypeScript interface) ──── // WHY: This ensures we always send events with the correct structure. // TypeScript will catch mistakes at compile time. interface OrderEvent { orderId: string; customerId: string; product: string; quantity: number; price: number; status: "PLACED" | "CONFIRMED" | "SHIPPED" | "DELIVERED"; timestamp: string; } async function main() { // ── Create a Producer instance ──────────────────────────────── const producer = kafka.producer({ // idempotent: Prevents duplicate events on retry. // WHY: If the network drops AFTER Kafka writes but BEFORE you get the ACK, // the producer retries. Without idempotence, Kafka stores the event twice. // With idempotence, Kafka detects the duplicate and ignores it. // WHAT IT CHANGES: Automatically sets acks=all, retries=MAX, maxInFlightRequests=5 // OTHER OPTIONS: false (default — duplicates possible on retry) idempotent: true, // maxInFlightRequests: How many unacknowledged requests can be "in flight". // WHY: Higher = faster (more parallel requests), but risks out-of-order delivery. // With idempotent=true, max is 5 (Kafka tracks sequence numbers to reorder). // OTHER OPTIONS: 1 (strict ordering, slower), 5 (max for idempotent) maxInFlightRequests: 5, // allowAutoTopicCreation: Should the producer auto-create topics if they don't exist? // WHY: Set to false in production — topics should be created deliberately with // proper partition count and replication. Auto-creation uses bad defaults. // OTHER OPTIONS: true (Kafka auto-creates with default configs) allowAutoTopicCreation: false, // createPartitioner: Strategy for choosing which partition an event goes to. // WHY: DefaultPartitioner uses the event key to determine partition. // Same key → same partition → ordering guarantee for that key. // OTHER OPTIONS: // Partitioners.LegacyPartitioner — older algorithm (KafkaJS v1 behavior) // Partitioners.JavaCompatiblePartitioner — matches Java client's partitioning // Custom function — (topic, key, message) => partitionNumber createPartitioner: Partitioners.DefaultPartitioner, }); // ── Connect ────────────────────────────────────────────────── console.log("Connecting producer..."); await producer.connect(); console.log("Producer connected!"); // ── Send a Single Event ────────────────────────────────────── const order: OrderEvent = { orderId: "ORD-001", customerId: "CUST-42", product: "Mechanical Keyboard", quantity: 1, price: 2499, status: "PLACED", timestamp: new Date().toISOString(), }; const result = await producer.send({ // topic: Which topic to send to. topic: "order-events", // acks: How many brokers must confirm the write. // WHY: -1 (all) = safest. Leader + all ISR replicas must confirm. // OTHER OPTIONS: // 0 — fire-and-forget (fastest, data can be lost) // 1 — only leader confirms (balanced) // -1 — all ISR confirm (safest, required for idempotent producer) acks: -1, // timeout: How long the broker waits for ISR replicas to acknowledge. // WHY: If ISR replicas are slow, the write times out instead of hanging forever. // OTHER OPTIONS: Any ms value. Default is 30000 (30 seconds). timeout: 30000, // compression: Compress the batch before sending over the network. // WHY: Reduces network bandwidth and disk usage. Especially good for large events. // OTHER OPTIONS: // CompressionTypes.None — no compression (default) // CompressionTypes.GZIP — best compression ratio, slower // CompressionTypes.Snappy — fast compression, moderate ratio // CompressionTypes.LZ4 — fastest compression // CompressionTypes.ZSTD — best balance of speed and ratio compression: CompressionTypes.GZIP, // messages: Array of events to send (even for one event, it's an array). messages: [ { // key: Determines which partition this event goes to. // WHY: All events with key "ORD-001" go to the SAME partition. // This guarantees ordering for this order's lifecycle events. // (PLACED → CONFIRMED → SHIPPED → DELIVERED stay in order) // OTHER OPTIONS: // null — round-robin across partitions (no ordering guarantee) // "CUST-42" — partition by customer (all customer events together) key: order.orderId, // value: The actual event data. Must be a string or Buffer. // WHY: Kafka stores raw bytes. We JSON.stringify our TypeScript object. // OTHER OPTIONS: Buffer.from(avroEncodedBytes) for Avro/Protobuf value: JSON.stringify(order), // headers: Optional metadata attached to the event. // WHY: Headers let you add metadata WITHOUT changing the event body. // Useful for routing, tracing, versioning, filtering. // OTHER OPTIONS: Any key-value pairs. Values must be strings or Buffers. headers: { "event-type": "ORDER_PLACED", // what kind of event "source": "order-service", // which service sent it "correlation-id": "req-abc-123", // for distributed tracing "schema-version": "1", // event schema version }, // timestamp: Event time in ms since epoch (optional). // WHY: Kafka records when the event HAPPENED (not when it was stored). // Useful for time-based queries and windowed aggregations. // OTHER OPTIONS: omit (Kafka uses broker's clock), or any epoch ms value timestamp: Date.now().toString(), // partition: Force a specific partition (optional). // WHY: Overrides the key-based partitioning. Rarely used. // OTHER OPTIONS: 0, 1, 2, ... (any valid partition index) // partition: 0, }, ], }); // ── Result ──────────────────────────────────────────────────── // result is an array (one entry per topic-partition) with: // topicName, partition, errorCode, baseOffset, logAppendTime, logStartOffset console.log("Event sent!", result); // Output: [{ topicName: "order-events", partition: 1, errorCode: 0, baseOffset: "0" }] // ── Send Multiple Events (Simulating Order Lifecycle) ──────── const statuses: OrderEvent["status"][] = ["CONFIRMED", "SHIPPED", "DELIVERED"]; for (const status of statuses) { await producer.send({ topic: "order-events", acks: -1, messages: [ { key: "ORD-001", // Same key = same partition = ordered! value: JSON.stringify({ ...order, status, timestamp: new Date().toISOString() }), headers: { "event-type": `ORDER_${status}` }, }, ], }); console.log(`Sent: ORDER_${status}`); } // ── Disconnect ──────────────────────────────────────────────── await producer.disconnect(); console.log("Producer disconnected."); } main().catch(console.error);

Run It

npm run producer # Output: # Connecting producer... # Producer connected! # Event sent! [{ topicName: "order-events", partition: 2, baseOffset: "0" }] # Sent: ORDER_CONFIRMED # Sent: ORDER_SHIPPED # Sent: ORDER_DELIVERED # Producer disconnected.
Check Kafka UI Open http://localhost:8080 → Click "order-events" topic → Click "Messages" tab. You'll see all 4 events. Notice they're all in the same partition (because same key "ORD-001").
Chapter 37

Build a Consumer

What is itBuilding a Kafka consumer means writing an app that subscribes to topics and processes records. You create a consumer instance, join a consumer group, subscribe to topics, call run()/eachMessage() (or eachBatch() for batch processing), process each record, and let Kafka handle offset commits — or commit manually for tighter control. A production-grade consumer adds deserialization, idempotency checks, error handling with retry/DLQ, graceful shutdown, and metrics/tracing.
Key building blocks
  • Consumer group configuration: pick a stable, meaningful groupId.
  • Subscription: subscribe to one or more topics.
  • Message handler: your business logic, ideally idempotent.
  • Offset strategy: auto-commit for simple pipelines, manual for stronger guarantees.
  • Error handling: retry transient errors, DLQ permanent ones, never let exceptions kill the loop.
  • Graceful shutdown: stop the consumer on SIGTERM, commit offsets, then disconnect.
How it differs
  • vs RabbitMQ consumer: RabbitMQ pushes messages; Kafka pulls. Kafka has native consumer groups; RabbitMQ uses competing consumers.
  • vs SQS consumer: SQS has long polling + visibility timeout per message. Kafka polls batches.
  • vs Kinesis consumer: Kinesis KCL tracks shard leases in DynamoDB — heavier setup than Kafka consumer groups.
Production essentialsAlways make processing idempotent (handle duplicates gracefully). Use structured error handling with retry classification. Instrument lag per partition. Never throw from eachMessage without catching — it can halt the consumer. Use autoCommit: false with explicit commits for critical pipelines.
Common gotchasPoison pill halting the group: one bad record blocks all progress — use DLQ. Long processing triggers rebalance: tune max.poll.interval.ms or use background worker pattern. Double processing on rebalance: commit before losing partitions via rebalance listeners. Consumer lag growing silently: always monitor lag per partition.
Real-world examplesNetflix's Mantis and Keystone routers are massive Kafka consumer deployments. Uber's consumers process driver events at millions of messages/sec. Shopify's order fulfillment uses Kafka consumers for downstream inventory and shipping updates.

The consumer reads events from Kafka. It joins a consumer group, gets partitions assigned, and processes events in a loop.

// src/consumer.ts // A complete Kafka consumer that reads order events. // Run: npm run consumer import kafka from "./client"; // ── Define the event type (same as producer) ───────────────── interface OrderEvent { orderId: string; customerId: string; product: string; quantity: number; price: number; status: string; timestamp: string; } async function main() { // ── Create a Consumer instance ──────────────────────────────── const consumer = kafka.consumer({ // groupId: The consumer group this consumer belongs to. // WHY: All consumers with the same groupId share the work. // Kafka assigns partitions among group members so each partition // is read by exactly one consumer in the group. // OTHER OPTIONS: Any string. Use your service name. // "notification-service", "analytics-service", etc. // Different groupId = independent reading (each group gets all events) groupId: "notification-service", // sessionTimeout: How long before Kafka considers this consumer dead. // WHY: If the consumer doesn't send a heartbeat within this time, // Kafka triggers a rebalance and gives its partitions to others. // OTHER OPTIONS: 6000-300000 ms. Default: 30000 (30s). // Lower = faster failure detection but more false alarms. // Higher = fewer false alarms but slower failure detection. sessionTimeout: 30000, // heartbeatInterval: How often the consumer sends "I'm alive" signals. // WHY: Must be less than sessionTimeout (typically 1/3 of it). // Kafka uses heartbeats to detect dead consumers. // OTHER OPTIONS: Any ms value. Default: 3000 (3s). heartbeatInterval: 3000, // maxWaitTimeInMs: How long the broker waits before returning empty if no new events. // WHY: Longer = fewer network requests but higher latency for first event. // OTHER OPTIONS: Any ms value. Default: 5000 (5s). Set to 100 for low-latency. maxWaitTimeInMs: 5000, // maxBytesPerPartition: Max bytes to fetch per partition in one request. // WHY: Controls memory usage. Larger = fewer requests but more memory. // OTHER OPTIONS: Default: 1048576 (1MB). Increase for high-throughput. maxBytesPerPartition: 1048576, // retry: Retry configuration for consumer operations. retry: { retries: 10, // how many times to retry (default: 5) }, }); // ── Connect ────────────────────────────────────────────────── console.log("Connecting consumer..."); await consumer.connect(); console.log("Consumer connected!"); // ── Subscribe to topics ────────────────────────────────────── await consumer.subscribe({ // topic: Which topic to read from. // OTHER OPTIONS: // topics: ["order-events", "payment-events"] — subscribe to multiple // topic: /order-.*/ — regex pattern (all topics matching the pattern) topic: "order-events", // fromBeginning: Where to start reading if this consumer group has no committed offset. // WHY: true = read ALL events from the very start (offset 0). // false = only read NEW events arriving after this consumer starts. // NOTE: This ONLY matters the FIRST time this groupId reads this topic. // After that, Kafka uses the committed offset. // OTHER OPTIONS: false (default — skip old events) fromBeginning: true, }); // ── Process events ─────────────────────────────────────────── // consumer.run() starts the poll loop: fetch → process → commit → repeat await consumer.run({ // autoCommit: Should offsets be committed automatically? // WHY: true = Kafka auto-commits every autoCommitInterval ms. // Simpler but risky — events may be "committed" before fully processed. // false = YOU decide when to commit (safer, more control). // OTHER OPTIONS: false (for manual commit control) autoCommit: true, // autoCommitInterval: How often auto-commit runs (only if autoCommit=true). // OTHER OPTIONS: Default: 5000 (5s). Lower = less data loss risk on crash. autoCommitInterval: 5000, // eachMessage: Called ONCE for EACH event. This is where your business logic goes. // WHY: eachMessage gives you fine-grained control — process one event at a time. // OTHER OPTIONS: // eachBatch: ({ batch, resolveOffset, heartbeat }) => {} // — called once per BATCH of events (more efficient for high throughput) // — you manually call resolveOffset() and heartbeat() eachMessage: async ({ topic, partition, message, heartbeat, pause }) => { // ── Parameters explained ── // topic: which topic this event came from // partition: which partition (number) // message: the actual Kafka message object // heartbeat: call this during long processing to prevent session timeout // pause: call this to temporarily stop reading from this topic-partition // ── Extract event data ──────────────────────────────────── const offset = message.offset; const key = message.key?.toString(); // Buffer → string const value: OrderEvent = JSON.parse(message.value!.toString()); // Buffer → JSON const eventType = message.headers?.["event-type"]?.toString(); const timestamp = message.timestamp; // event creation time // ── Process the event ───────────────────────────────────── console.log(` ┌───────────────────────────────────────────────── │ Topic: ${topic} | Partition: ${partition} | Offset: ${offset} │ Key: ${key} | Event: ${eventType} │ Order: ${value.orderId} — ${value.product} │ Status: ${value.status} | Price: ₹${value.price} │ Time: ${value.timestamp} └─────────────────────────────────────────────────`); // ── If processing takes long, send heartbeat ───────────── // WHY: If your processing takes > heartbeatInterval (3s), // Kafka might think the consumer is dead and trigger a rebalance. // Calling heartbeat() tells Kafka "I'm still alive, just busy." await heartbeat(); // ── Example: Pause reading if an external service is down ─ // const resume = pause(); // setTimeout(() => resume(), 30000); // resume after 30 seconds }, }); // ── Graceful Shutdown ──────────────────────────────────────── // WHY: When your app is stopped (Ctrl+C), you should disconnect cleanly // so Kafka knows immediately (instead of waiting for session timeout). // This triggers a rebalance right away instead of after 30 seconds. const shutdown = async () => { console.log("\nShutting down consumer..."); await consumer.disconnect(); console.log("Consumer disconnected."); process.exit(0); }; process.on("SIGINT", shutdown); // Ctrl+C process.on("SIGTERM", shutdown); // Docker stop / kill } main().catch(console.error);

Run It

# Terminal 1: Start the consumer (it keeps running, waiting for events) npm run consumer # Terminal 2: Send events from the producer npm run producer # Terminal 1 output: # ┌───────────────────────────────────────────────── # │ Topic: order-events | Partition: 2 | Offset: 0 # │ Key: ORD-001 | Event: ORDER_PLACED # │ Order: ORD-001 — Mechanical Keyboard # │ Status: PLACED | Price: ₹2499 # └───────────────────────────────────────────────── # ... (CONFIRMED, SHIPPED, DELIVERED follow in order)
Manual Commit Example For safer processing, disable autoCommit and commit manually:
// Manual commit — commit AFTER processing succeeds await consumer.run({ autoCommit: false, // disable auto-commit eachMessage: async ({ topic, partition, message }) => { // Process the event... const value = JSON.parse(message.value!.toString()); console.log(`Processing: ${value.orderId}`); // Only AFTER successful processing, commit this offset. // WHY: If the consumer crashes before this line, Kafka will // re-deliver this event (at-least-once guarantee). await consumer.commitOffsets([ { topic, partition, offset: (parseInt(message.offset) + 1).toString(), // WHY offset + 1? The committed offset means // "I've processed everything UP TO this offset." // So if you processed offset 5, commit 6 // (meaning "next time, start from 6"). }, ]); }, });
Chapter 38

Advanced Patterns

What is itBeyond basic produce/consume, Kafka unlocks a whole catalog of advanced architectural patterns that solve real distributed-system problems: reliable database-to-Kafka writes (Outbox), exactly-once processing (transactional produce+commit), dead-letter handling, fan-out to multiple consumers, Saga orchestration for distributed transactions, and Command Query Responsibility Segregation (CQRS) with event-sourced state. These patterns are what turn Kafka from "just a message broker" into an event backbone for entire systems.
Key patterns
  • Outbox pattern: write to DB and outbox table in one transaction; CDC streams the outbox to Kafka. Guarantees DB and Kafka stay in sync.
  • Transactional produce-consume-commit: Kafka Streams-style exactly-once processing.
  • Saga: long-running distributed transactions using a series of events + compensating events.
  • Event sourcing: state is derived from a log of all events — rebuild by replay.
  • CQRS: separate write model (commands) from read models (materialized views fed by events).
  • Dead Letter Queue: isolate poison pills.
  • Retry topic: delayed retries with backoff.
  • Fanout: one topic feeding many consumer groups independently.
How it differs
  • vs REST-based distributed transactions: REST requires 2PC or Saga across services. Kafka + Outbox is simpler and more reliable.
  • vs RabbitMQ: many of these patterns are harder in RabbitMQ because of the lack of replay and the per-message ack model.
  • vs Temporal/Zeebe: Temporal/Zeebe are workflow engines with durable state machines. Kafka Saga is more lightweight but requires more manual coordination.
Why learn theseThese patterns are the difference between "we use Kafka" and "we use Kafka correctly at scale". They solve the hardest problems in distributed systems: consistency, failure recovery, ordering, and idempotency.
Common gotchasOutbox polling lag: if using polling instead of CDC, events are delayed. Saga compensation complexity: compensations are hard to write correctly for all failure modes. Event sourcing schema evolution: old events must remain deserializable forever — be careful with breaking changes. CQRS read model lag: reads are eventually consistent with writes.
Real-world examplesShopify uses the Outbox pattern with Debezium for cross-service eventing. Netflix uses CQRS with Kafka for personalization read models. Eventuate and Axon Framework are frameworks that implement these patterns on Kafka.

Production-ready patterns: batch sending, Dead Letter Queue, and transactions.

1. Batch Producer (High Throughput)

src/advanced/producer-batch.ts — Send many events at once for better performance.

// src/advanced/producer-batch.ts // Sends 100 events in batches — much faster than one-by-one. import kafka from "../client"; import { CompressionTypes } from "kafkajs"; async function main() { const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5 }); await producer.connect(); // ── Build a batch of 100 events ────────────────────────────── const messages = Array.from({ length: 100 }, (_, i) => ({ key: `ORD-${String(i).padStart(4, "0")}`, value: JSON.stringify({ orderId: `ORD-${String(i).padStart(4, "0")}`, customerId: `CUST-${i % 10}`, // 10 unique customers product: "Laptop", quantity: 1, price: 49999, status: "PLACED", timestamp: new Date().toISOString(), }), headers: { "event-type": "ORDER_PLACED" }, })); // ── Send all 100 events in ONE network request ────────────── // WHY: Batching reduces network round-trips dramatically. // 100 individual sends = 100 network calls. // 1 batched send = 1 network call (100x faster!). console.time("batch-send"); await producer.send({ topic: "order-events", acks: -1, compression: CompressionTypes.GZIP, // Compress the entire batch messages, // all 100 events go in one request }); console.timeEnd("batch-send"); // Typically < 50ms for 100 events! // ── sendBatch: Send to MULTIPLE topics in one call ─────────── // WHY: If you need to write to different topics atomically, // sendBatch groups them into one network operation. await producer.sendBatch({ topicMessages: [ { topic: "order-events", messages: [{ key: "ORD-999", value: '{"orderId":"ORD-999","status":"PLACED"}' }], }, { topic: "order-confirmations", messages: [{ key: "ORD-999", value: '{"orderId":"ORD-999","confirmed":true}' }], }, ], }); console.log("Batch + multi-topic send complete."); await producer.disconnect(); } main().catch(console.error);

2. Consumer with Dead Letter Queue

src/advanced/consumer-dlq.ts — Retry failed events and move permanently failed ones to a DLQ topic.

// src/advanced/consumer-dlq.ts // Consumer with retry logic and Dead Letter Queue. import kafka from "../client"; import { KafkaMessage } from "kafkajs"; const MAX_RETRIES = 3; // ── DLQ Producer (sends failed events to the DLQ topic) ────── const dlqProducer = kafka.producer(); // ── Simulated processing that sometimes fails ──────────────── async function processEvent(value: any): Promise<void> { // Simulate: 30% chance of failure if (Math.random() < 0.3) { throw new Error(`Payment gateway timeout for ${value.orderId}`); } console.log(` ✓ Processed: ${value.orderId} — ${value.status}`); } // ── Send failed event to DLQ topic ─────────────────────────── async function sendToDLQ( message: KafkaMessage, topic: string, partition: number, error: Error, retryCount: number ) { await dlqProducer.send({ topic: "order-events.DLQ", messages: [ { // Keep the same key so DLQ events can be correlated key: message.key, // Keep the original event data value: message.value, // Add metadata about the failure headers: { ...message.headers, // preserve original headers "dlq-original-topic": topic, "dlq-original-partition": partition.toString(), "dlq-original-offset": message.offset, "dlq-error-message": error.message, "dlq-retry-count": retryCount.toString(), "dlq-timestamp": Date.now().toString(), }, }, ], }); console.log(` ✗ Sent to DLQ: ${message.key?.toString()} — ${error.message}`); } async function main() { const consumer = kafka.consumer({ groupId: "notification-service-dlq" }); await dlqProducer.connect(); await consumer.connect(); await consumer.subscribe({ topic: "order-events", fromBeginning: true }); await consumer.run({ autoCommit: false, // Manual commit for maximum safety eachMessage: async ({ topic, partition, message }) => { const value = JSON.parse(message.value!.toString()); let retries = 0; // ── Retry loop ─────────────────────────────────────────── while (retries < MAX_RETRIES) { try { await processEvent(value); break; // Success! Exit retry loop } catch (err) { retries++; console.log(` ↻ Retry ${retries}/${MAX_RETRIES}: ${(err as Error).message}`); if (retries === MAX_RETRIES) { // All retries exhausted → send to DLQ await sendToDLQ(message, topic, partition, err as Error, retries); } else { // Wait before retrying (exponential backoff) await new Promise((r) => setTimeout(r, 1000 * retries)); } } } // ── Commit offset (whether processed successfully or sent to DLQ) ─ // WHY: We commit after DLQ too, so the consumer moves forward. // The failed event is now in the DLQ — no need to block here. await consumer.commitOffsets([{ topic, partition, offset: (parseInt(message.offset) + 1).toString(), }]); }, }); const shutdown = async () => { await consumer.disconnect(); await dlqProducer.disconnect(); process.exit(0); }; process.on("SIGINT", shutdown); process.on("SIGTERM", shutdown); } main().catch(console.error);

3. Transactions (Exactly-Once Processing)

src/advanced/transaction.ts — Read from one topic, process, write to another topic, and commit offset — all as ONE atomic operation.

// src/advanced/transaction.ts // Exactly-once: Read → Process → Write → Commit as ONE transaction. // If any step fails, everything rolls back. import kafka from "../client"; async function main() { // ── Transactional Producer ──────────────────────────────────── // WHY: A transactional producer can group multiple writes + offset commits // into one atomic operation. Either ALL succeed or NONE do. const producer = kafka.producer({ idempotent: true, // required for transactions maxInFlightRequests: 1, // required for transactions transactionalId: "order-processor-txn-1", // unique ID for this transactional producer // WHY transactionalId: Kafka uses this to track the transaction state. // If the producer crashes and restarts with the same ID, // Kafka can abort any incomplete transactions from the old instance. // RULE: Each producer instance must have a UNIQUE transactionalId. }); const consumer = kafka.consumer({ groupId: "order-processor", // WHY: With transactions, the consumer reads events and the transactional // producer commits the consumer's offsets as part of the transaction. }); await producer.connect(); await consumer.connect(); await consumer.subscribe({ topic: "order-events", fromBeginning: true }); await consumer.run({ autoCommit: false, // MUST be false — transactions handle commits eachMessage: async ({ topic, partition, message }) => { const order = JSON.parse(message.value!.toString()); // ── Start a transaction ────────────────────────────────── const txn = await producer.transaction(); // WHY: Everything between transaction() and commit()/abort() // is part of the same atomic operation. try { // Step 1: Process the order (your business logic) const confirmation = { orderId: order.orderId, status: "CONFIRMED", confirmedAt: new Date().toISOString(), estimatedDelivery: "3-5 business days", }; // Step 2: Write the result to another topic (inside the transaction) await txn.send({ topic: "order-confirmations", messages: [ { key: order.orderId, value: JSON.stringify(confirmation), }, ], }); // Step 3: Commit the consumer offset (inside the transaction) // WHY: This ties the offset commit to the write above. // If the write to "order-confirmations" fails, the offset is NOT committed. // So the event will be re-read and re-processed. await txn.sendOffsets({ consumerGroupId: "order-processor", topics: [ { topic, partitions: [ { partition, offset: (parseInt(message.offset) + 1).toString(), }, ], }, ], }); // Step 4: COMMIT the transaction // WHY: This makes all the writes and offset commit visible atomically. // Consumers with isolation.level=read_committed will now see them. await txn.commit(); console.log(`✓ Transaction committed: ${order.orderId}`); } catch (err) { // Something went wrong → ABORT the entire transaction // WHY: This undoes everything — writes and offset commits. // The event will be re-delivered on next poll. await txn.abort(); console.error(`✗ Transaction aborted: ${order.orderId} — ${(err as Error).message}`); } }, }); const shutdown = async () => { await consumer.disconnect(); await producer.disconnect(); process.exit(0); }; process.on("SIGINT", shutdown); process.on("SIGTERM", shutdown); } main().catch(console.error);
Complete Implementation Summary You've now built a full Kafka application with Node.js + TypeScript covering:   1. Admin API — create topics with custom configs   2. Producer — send events with keys, headers, compression   3. Consumer — read events, auto-commit and manual commit   4. Batch Producer — high-throughput multi-event sending   5. DLQ Consumer — retry + dead letter queue for failed events   6. Transactions — exactly-once read-process-write pattern   Every line is explained. Every option has alternatives documented. You're production-ready!

Quick Reference: Run Commands

CommandWhat It DoesRun Order
docker compose up -dStart Kafka + UI containersFirst (once)
npm run adminCreate topics (order-events, DLQ, confirmations)Second (once)
npm run consumerStart consumer (keeps running)Third (Terminal 1)
npm run producerSend order eventsFourth (Terminal 2)
npm run batchSend 100 events at onceAnytime
npm run dlqConsumer with retry + DLQAnytime
npm run txnExactly-once transaction consumerAnytime
Chapter 39

Multi-Broker Docker Setup (3 Brokers)

What is itA multi-broker Docker setup runs a small Kafka cluster (typically 3 brokers) on your laptop via Docker Compose so you can develop and test against realistic cluster semantics — replication, leader election, failure simulation, rebalancing — before shipping to production. Modern setups use KRaft mode (no ZooKeeper needed) and tools like Bitnami's Kafka image or Confluent's cp-kafka. Three brokers is the minimum for replication.factor=3 + min.insync.replicas=2, which gives you real fault-tolerance testing.
Key components
  • 3 broker services in Docker Compose, each with unique broker.id and advertised listeners.
  • KRaft controller quorum: in combined mode, all 3 nodes are controller+broker; for production, separate them.
  • Internal listeners: for broker-to-broker communication (INTERNAL://kafka-1:29092).
  • External listeners: for host access from your laptop (EXTERNAL://localhost:9092).
  • Named volumes: for persistent data across restarts.
  • Optional: Kafka UI (Redpanda Console, Kowl, kafka-ui, AKHQ) for visualization.
How it differs
  • vs single-broker: single-broker setups can't test replication, ISR behavior, or broker failover.
  • vs Kubernetes (Strimzi): Strimzi provides a full operator-driven Kafka on K8s — production-grade, but heavier to run locally.
  • vs Confluent Cloud / MSK: managed options skip local Docker entirely but cost money and need internet.
Why run locallyLocal multi-broker clusters let you test failure scenarios (kill a broker, watch partitions rebalance), practice admin operations (describe cluster, reassign partitions), and validate client retry behavior. All of this is free and offline.
Common gotchasAdvertised listener misconfiguration: clients get broker metadata pointing to unreachable hostnames — the biggest setup pain point. Fix: use PLAINTEXT://localhost:9092, PLAINTEXT://localhost:9093, etc., for host-visible ports. Host memory: 3 JVM brokers + app can be heavy; tune heap sizes. Clock skew between containers is rare but matters for time-based retention.
Real-world examplesMost Kafka tutorials from Confluent, Bitnami, and Redpanda ship with Docker Compose files for multi-broker setups. Testcontainers provides Kafka integration for integration testing in CI.

A single broker is fine for learning, but production needs at least 3 brokers for fault tolerance. If one broker dies, the other two keep serving data with zero downtime.

In Simple Words

One broker = one copy of your data. If it dies, everything is gone. Three brokers = three copies of your data on three different machines. Even if one machine catches fire, the other two have a full copy and keep working. That's why production always uses 3+ brokers.

What We're Building
┌──────────────────────────────────────────────────────────────┐ │ DOCKER NETWORK │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Broker 1 │ │ Broker 2 │ │ Broker 3 │ │ │ │ :9092 │ │ :9094 │ │ :9096 │ │ │ │ │ │ │ │ │ │ │ │ P0-Leader │◄──▶│ P0-Follow │ │ P0-Follow │ │ │ │ P1-Follow │ │ P1-Leader │◄──▶│ P1-Follow │ │ │ │ P2-Follow │ │ P2-Follow │ │ P2-Leader │ │ │ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ ┌──────────┐ Controller = Broker 1 (elected via KRaft) │ │ │ Kafka UI │ │ │ │ :8080 │ Replication Factor = 3 (all brokers have │ │ └──────────┘ a copy of every partition) │ └──────────────────────────────────────────────────────────────┘ Your App connects to: localhost:9092, localhost:9094, localhost:9096

docker-compose.yml (3 Brokers + KRaft + Kafka UI)

# docker-compose.yml # Production-like setup: 3 Kafka brokers using KRaft (no ZooKeeper) # Each broker is both a broker AND a controller (for simplicity) # In real production, you'd separate controller nodes from broker nodes version: "3.8" services: # ════════════════════════════════════════════════════════════════ # BROKER 1 — Port 9092 # ════════════════════════════════════════════════════════════════ kafka-1: image: apache/kafka:3.7.0 container_name: kafka-1 ports: - "9092:9092" # Your app connects here for Broker 1 environment: # ── Identity ── KAFKA_NODE_ID: 1 # Unique ID. Must be different for each broker. # WHY: Kafka uses this to track which broker owns which partitions. # ── Roles ── KAFKA_PROCESS_ROLES: broker,controller # WHY: This node acts as both a data broker AND a controller voter. # In large production clusters (50+ brokers), you'd set dedicated # controller-only nodes: KAFKA_PROCESS_ROLES: controller # and separate broker-only nodes: KAFKA_PROCESS_ROLES: broker # ── Controller Quorum ── KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093 # WHY: Lists ALL controllers that participate in leader election. # Format: nodeId@host:controllerPort # With 3 voters: quorum = 2 (majority). Can tolerate 1 controller failure. # ── Listeners ── # WHY we need 3 different listeners: # INTERNAL = broker-to-broker traffic (replication, metadata) # EXTERNAL = your app connects from outside Docker # CONTROLLER = KRaft controller communication KAFKA_LISTENERS: INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093 KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-1:29092,EXTERNAL://localhost:9092 # WHY advertised != listeners? # "listeners" = what Kafka binds to inside the container (0.0.0.0 = all interfaces) # "advertised" = what Kafka tells clients to connect to # INTERNAL://kafka-1:29092 → other brokers inside Docker use this # EXTERNAL://localhost:9092 → your app outside Docker uses this KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT # WHY: Maps each listener name to a security protocol. # PLAINTEXT = no encryption. In production, use SSL or SASL_SSL. KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL # WHY: Tells Kafka which listener to use for broker-to-broker communication. # Replication traffic uses this. Must be one of the defined listeners. KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER # ── Topic Defaults (production-like) ── KAFKA_DEFAULT_REPLICATION_FACTOR: 3 # WHY: Every new topic will have 3 copies by default. # With 3 brokers, RF=3 means every broker has a copy of every partition. KAFKA_MIN_INSYNC_REPLICAS: 2 # WHY: For acks=all, at least 2 of 3 replicas must confirm the write. # This means you can lose 1 broker and still accept writes. # If 2 brokers die, writes are rejected (data safety over availability). KAFKA_NUM_PARTITIONS: 3 # WHY: Default partition count for auto-created topics. # 3 partitions = 3 consumers can read in parallel. KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2 # WHY: Internal topics (__consumer_offsets, __transaction_state) also need # replication. If these lose data, consumer offsets and transactions break. KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false" KAFKA_LOG_RETENTION_HOURS: 168 # ── KRaft Cluster ID (must be SAME for all brokers) ── CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qg" # WHY: All brokers in the same cluster must share this ID. # Generate your own with: kafka-storage.sh random-uuid # ════════════════════════════════════════════════════════════════ # BROKER 2 — Port 9094 # ════════════════════════════════════════════════════════════════ kafka-2: image: apache/kafka:3.7.0 container_name: kafka-2 ports: - "9094:9094" environment: KAFKA_NODE_ID: 2 # Different ID! KAFKA_PROCESS_ROLES: broker,controller KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093 KAFKA_LISTENERS: INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9094,CONTROLLER://0.0.0.0:29093 KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-2:29092,EXTERNAL://localhost:9094 # NOTE: EXTERNAL is localhost:9094 (different port than Broker 1) KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER KAFKA_DEFAULT_REPLICATION_FACTOR: 3 KAFKA_MIN_INSYNC_REPLICAS: 2 KAFKA_NUM_PARTITIONS: 3 KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2 KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false" KAFKA_LOG_RETENTION_HOURS: 168 CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qg" # SAME cluster ID # ════════════════════════════════════════════════════════════════ # BROKER 3 — Port 9096 # ════════════════════════════════════════════════════════════════ kafka-3: image: apache/kafka:3.7.0 container_name: kafka-3 ports: - "9096:9096" environment: KAFKA_NODE_ID: 3 # Different ID! KAFKA_PROCESS_ROLES: broker,controller KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093 KAFKA_LISTENERS: INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9096,CONTROLLER://0.0.0.0:29093 KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-3:29092,EXTERNAL://localhost:9096 KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER KAFKA_DEFAULT_REPLICATION_FACTOR: 3 KAFKA_MIN_INSYNC_REPLICAS: 2 KAFKA_NUM_PARTITIONS: 3 KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2 KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false" KAFKA_LOG_RETENTION_HOURS: 168 CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qg" # SAME cluster ID # ════════════════════════════════════════════════════════════════ # KAFKA UI — Visual Dashboard # ════════════════════════════════════════════════════════════════ kafka-ui: image: provectuslabs/kafka-ui:latest container_name: kafka-ui ports: - "8080:8080" environment: KAFKA_CLUSTERS_0_NAME: production-local KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka-1:29092,kafka-2:29092,kafka-3:29092 depends_on: - kafka-1 - kafka-2 - kafka-3

Start the 3-Broker Cluster

# Start all 3 brokers + UI docker compose up -d # Verify all containers are running docker compose ps # Expected output: # kafka-1 running 0.0.0.0:9092->9092/tcp # kafka-2 running 0.0.0.0:9094->9094/tcp # kafka-3 running 0.0.0.0:9096->9096/tcp # kafka-ui running 0.0.0.0:8080->8080/tcp # Create a topic with replication factor 3 docker exec -it kafka-1 /opt/kafka/bin/kafka-topics.sh \ --bootstrap-server localhost:9092 \ --create \ --topic order-events \ --partitions 3 \ --replication-factor 3 # Verify partition distribution across brokers docker exec -it kafka-1 /opt/kafka/bin/kafka-topics.sh \ --bootstrap-server localhost:9092 \ --describe \ --topic order-events # Output shows: # Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3 # Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1 # Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2 # ↑ Each partition has a DIFFERENT leader = load is distributed! # ↑ Replicas: 1,2,3 means ALL 3 brokers have a copy # ↑ ISR: 1,2,3 means ALL copies are in-sync

Update client.ts for Multi-Broker

The only change in your code — list all 3 broker addresses:

// src/client.ts — Updated for 3-broker cluster import { Kafka, logLevel } from "kafkajs"; const kafka = new Kafka({ clientId: "order-service", // List ALL broker addresses. // WHY: The client tries each address until it finds a live broker. // If Broker 1 is down, it connects to Broker 2 or 3 automatically. // Once connected, Kafka tells the client about the full cluster topology. // NOTE: You don't NEED all 3 — even 1 is enough to discover the cluster. // But listing all 3 gives better availability during startup. brokers: [ "localhost:9092", // Broker 1 "localhost:9094", // Broker 2 "localhost:9096", // Broker 3 ], logLevel: logLevel.WARN, // retry: Controls what happens when a broker is temporarily unavailable. // WHY: In a multi-broker setup, one broker might restart while others are fine. // The client should retry connecting to another broker automatically. retry: { retries: 10, // try up to 10 times initialRetryTime: 300, // first retry after 300ms maxRetryTime: 30000, // max wait between retries: 30 seconds }, }); export default kafka;

Test Fault Tolerance — Kill a Broker

# Terminal 1: Start the consumer npm run consumer # Terminal 2: Start producing events continuously npm run producer # Terminal 3: Kill Broker 2 while events are flowing docker stop kafka-2 # What happens: # 1. Controller (Broker 1) detects kafka-2 is dead (no heartbeat) # 2. For partitions where kafka-2 was leader → new leader elected from ISR # 3. For partitions where kafka-2 was follower → removed from ISR # 4. Producer/Consumer get metadata update → reconnect to new leaders # 5. Events keep flowing! Zero data loss! # Check the topic — notice ISR shrunk from 3 to 2: docker exec -it kafka-1 /opt/kafka/bin/kafka-topics.sh \ --bootstrap-server localhost:9092 \ --describe \ --topic order-events # Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,3 ← Broker 2 gone from ISR # Partition: 1 Leader: 3 Replicas: 2,3,1 Isr: 3,1 ← New leader! Was 2, now 3 # Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1 ← Broker 2 gone from ISR # Bring Broker 2 back docker start kafka-2 # Broker 2 catches up with the leaders, re-joins ISR: # Partition: 0 Replicas: 1,2,3 Isr: 1,3,2 ← Broker 2 is back!
This is Real Fault Tolerance You just killed a broker and nothing broke. No events were lost. Producers and consumers reconnected automatically. This is why production uses 3 brokers with replication factor 3 and min.insync.replicas 2.
Chapter 40

Production-Grade Producer

What is itA production-grade producer is one hardened for real-world conditions: high throughput, network unreliability, broker failures, schema evolution, and operational requirements like metrics, tracing, and graceful shutdown. It's fundamentally different from a tutorial "send one message" snippet. The design goals are: zero data loss under failure, backpressure handling, observability, security, and maintainability.
Must-have configurations
  • acks: 'all' — wait for all ISR.
  • enable.idempotence: true — deduplicate retries.
  • retries: high (10+), with exponential backoff.
  • max.in.flight.requests.per.connection: ≤5 with idempotence.
  • compression.type: zstd or lz4.
  • linger.ms: 10–50 for better batching.
  • batch.size: tune based on record size.
  • request.timeout.ms and delivery.timeout.ms: bound total delivery time.
Operational must-haves
  • Singleton producer: one instance per process, shared across handlers.
  • Structured logging: include trace IDs, topic, partition, offset.
  • Prometheus metrics: throughput, latency, error rate, buffer usage.
  • Distributed tracing: inject trace context into record headers (OpenTelemetry, W3C Trace Context).
  • TLS + SASL: encrypted + authenticated connections.
  • Graceful shutdown: SIGTERM handler flushes pending sends.
  • Circuit breakers: stop sending if Kafka is persistently unreachable.
How it differs
  • vs dev producer: production producers prioritize correctness and durability over simplicity.
  • vs RabbitMQ production publisher: RabbitMQ uses publisher confirms and mandatory routing flags — different failure modes.
Common gotchasSilent buffer overflow: if buffer.memory fills up, sends block or throw. Missing graceful shutdown: loses unsent batches on deploys. Ignoring send errors: fire-and-forget is fine for logs but dangerous for business events. Not monitoring producer metrics: latency regressions go unnoticed.
Real-world examplesNetflix's Keystone producer library wraps the Kafka producer with Netflix-specific retries, circuit breakers, and metrics. Uber's heatpipe is a production library for their producer needs. Shopify's Rails Kafka producer is open-sourced as kafka-ruby.

The basic producer works, but production needs retry handling, error callbacks, graceful shutdown, and proper batching configs. Here's a production-ready producer.

// src/production/producer.ts // Production-grade Kafka producer with all the bells and whistles. import kafka from "../client"; import { CompressionTypes, Partitioners } from "kafkajs"; const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5, allowAutoTopicCreation: false, createPartitioner: Partitioners.DefaultPartitioner, }); // ═══════════════════════════════════════════════════════════════ // EVENT LISTENERS — Know what's happening inside the producer // ═══════════════════════════════════════════════════════════════ // WHY: In production, you need visibility into producer behavior. // These events help you debug issues, monitor health, and alert on failures. producer.on("producer.connect", () => { console.log("[PRODUCER] Connected to Kafka cluster"); // PRODUCTION: Emit a metric to your monitoring system (Prometheus, Datadog, etc.) }); producer.on("producer.disconnect", () => { console.log("[PRODUCER] Disconnected from Kafka cluster"); // PRODUCTION: This could mean a crash. Trigger an alert. }); producer.on("producer.network.request", (event) => { // Fires on EVERY request to Kafka (very noisy, use for debugging only) // event.payload: { apiKey, apiName, broker, clientId, correlationId, duration, size } // PRODUCTION: Track request latency → producer.network.request.duration }); producer.on("producer.network.request_timeout", (event) => { console.error("[PRODUCER] Request timeout:", event.payload); // PRODUCTION: Alert! Kafka might be overloaded or network issues. }); // ═══════════════════════════════════════════════════════════════ // SEND FUNCTION — Wrapped with error handling // ═══════════════════════════════════════════════════════════════ interface OrderEvent { orderId: string; customerId: string; product: string; quantity: number; price: number; status: string; } async function sendOrderEvent(order: OrderEvent): Promise<void> { try { const result = await producer.send({ topic: "order-events", acks: -1, timeout: 30000, compression: CompressionTypes.GZIP, messages: [ { key: order.orderId, value: JSON.stringify({ ...order, timestamp: new Date().toISOString(), version: 1, }), headers: { "event-type": `ORDER_${order.status}`, "source": "order-service", "correlation-id": crypto.randomUUID(), // WHY correlation-id: When this event flows through 5 services, // you can trace the entire journey using this single ID. // Essential for debugging in distributed systems. }, }, ], }); const meta = result[0]; console.log( `[SENT] ${order.orderId} → partition:${meta.partition} offset:${meta.baseOffset}` ); } catch (error: any) { // ── Handle specific Kafka errors ── if (error.type === "TOPIC_AUTHORIZATION_FAILED") { console.error("[ERROR] No permission to write to this topic. Check ACLs."); } else if (error.type === "REQUEST_TIMED_OUT") { console.error("[ERROR] Kafka is not responding. Check broker health."); } else { console.error("[ERROR] Failed to send event:", error.message); } // PRODUCTION: Send failed event to a retry queue or DLQ throw error; } } // ═══════════════════════════════════════════════════════════════ // MAIN — Connect, send events, handle shutdown // ═══════════════════════════════════════════════════════════════ async function main() { await producer.connect(); // Simulate continuous order production let orderNum = 1; const interval = setInterval(async () => { const id = String(orderNum++).padStart(4, "0"); await sendOrderEvent({ orderId: `ORD-${id}`, customerId: `CUST-${orderNum % 10}`, product: "Laptop", quantity: 1, price: 49999, status: "PLACED", }); }, 1000); // Send one event per second // ── Graceful Shutdown ── // WHY: When deploying new code (rolling restart), your app gets SIGTERM. // You must: // 1. Stop sending new events // 2. Wait for in-flight events to complete (acks received) // 3. Disconnect cleanly // If you don't, in-flight events may be lost or duplicated. const shutdown = async (signal: string) => { console.log(`\n[SHUTDOWN] Received ${signal}. Cleaning up...`); clearInterval(interval); // 1. Stop sending new events await producer.disconnect(); // 2. Flush pending + disconnect console.log("[SHUTDOWN] Producer stopped cleanly."); process.exit(0); }; process.on("SIGINT", () => shutdown("SIGINT")); // Ctrl+C process.on("SIGTERM", () => shutdown("SIGTERM")); // Docker stop / K8s pod kill process.on("uncaughtException", async (err) => { console.error("[FATAL] Uncaught exception:", err); await shutdown("uncaughtException"); }); } main().catch(console.error);
Chapter 41

Production-Grade Consumer

What is itA production-grade consumer is hardened for the full lifecycle of events in a real pipeline: bursty traffic, poison pills, slow downstream dependencies, rebalancing, graceful shutdown, exactly-once processing, and observability. The pattern involves careful offset management, retry/DLQ strategy, idempotent processing, and backpressure handling. Getting this right is harder than getting production producers right because consumers deal with processing failures and time-varying load.
Must-have practices
  • Manual offset commits: commit after processing, not before.
  • Idempotent processing: handle duplicate delivery gracefully.
  • Retry + DLQ strategy: distinguish transient from permanent errors.
  • Rebalance listeners: commit on partition revocation to avoid double-processing.
  • Backpressure: don't fetch faster than you can process.
  • Graceful shutdown: finish in-flight work, commit offsets, then disconnect.
  • Static membership: reduce rebalancing during rolling deploys.
  • Lag monitoring: alert on per-partition lag thresholds.
Key configs
  • group.id: stable, descriptive.
  • enable.auto.commit: false: take manual control.
  • max.poll.interval.ms: longer than worst-case processing time.
  • max.poll.records: tune based on per-record processing cost.
  • session.timeout.ms and heartbeat.interval.ms: for liveness detection.
  • isolation.level: read_committed: skip aborted transactional records.
  • partition.assignment.strategy: CooperativeStickyAssignor for incremental rebalance.
How it differs
  • vs dev consumer: handles real failure modes instead of happy path.
  • vs RabbitMQ consumer: RabbitMQ has per-message acks and no rebalancing — different operational model.
  • vs SQS consumer: SQS has visibility timeouts and DLQs at the queue level, more prescriptive.
Common gotchasAuto-commit + processing after: loses messages on crash. Infinite retry of poison pills: halts everyone. Long processing in main loop: triggers rebalance. Not handling rebalance events: reprocesses records on partition handoff. Consumer lag not alerted: silent SLA breaches.
Real-world examplesKafka Streams and Flink's Kafka source are examples of production-grade consumer patterns. Burrow (LinkedIn) is the gold standard lag monitoring tool. Spring Kafka provides production-grade abstractions for Java consumers.

A production consumer needs manual commits, rebalance listeners, error handling, DLQ integration, and graceful shutdown.

// src/production/consumer.ts // Production-grade consumer: manual commit, DLQ, rebalance handling, graceful shutdown. import kafka from "../client"; const consumer = kafka.consumer({ groupId: "notification-service", // ── Session & Heartbeat ── sessionTimeout: 30000, heartbeatInterval: 3000, // RULE: heartbeatInterval should be 1/3 of sessionTimeout. // If heartbeat is missed for 30s, Kafka thinks consumer is dead. // ── Rebalance Timeout ── rebalanceTimeout: 60000, // WHY: During rebalance, consumers must join within this time. // If your consumer takes too long to commit offsets during rebalance, // it gets kicked out of the group. // ── Poll Interval ── maxWaitTimeInMs: 5000, // WHY: Broker waits up to 5 seconds for new data before returning empty. // Lower = lower latency but more network traffic. // Higher = higher latency but fewer requests. // ── Fetch Controls ── maxBytesPerPartition: 1048576, // 1MB per partition per fetch maxBytes: 10485760, // 10MB total per fetch across all partitions // WHY: Controls memory usage. If you have 100 partitions and 1MB each, // one fetch could use 100MB of memory. maxBytes caps the total. // ── Read Isolation (for transactions) ── // readUncommitted: false, // default // Set to: readUncommitted: false to only read committed transactions. // This is important if your producers use transactions. }); // DLQ producer for failed events const dlqProducer = kafka.producer(); // ═══════════════════════════════════════════════════════════════ // EVENT LISTENERS — Monitor consumer lifecycle // ═══════════════════════════════════════════════════════════════ consumer.on("consumer.group_join", (event) => { console.log(`[CONSUMER] Joined group. Member: ${event.payload.memberId}`); console.log(`[CONSUMER] Assigned partitions: ${JSON.stringify(event.payload.memberAssignment)}`); // WHY: Know which partitions this consumer instance is responsible for. // PRODUCTION: Log this so you can debug partition assignment issues. }); consumer.on("consumer.rebalancing", () => { console.warn("[CONSUMER] Rebalancing started! Partitions being reassigned..."); // WHY: During rebalance, this consumer STOPS reading. // PRODUCTION: Track rebalance frequency. Too many rebalances = problem. // Possible causes: consumers crashing, long processing, network issues. }); consumer.on("consumer.stop", () => { console.log("[CONSUMER] Stopped."); }); consumer.on("consumer.crash", (event) => { console.error("[CONSUMER] CRASHED!", event.payload.error); // PRODUCTION: Alert immediately! Consumer is dead. // The event.payload.restart indicates if KafkaJS will auto-restart. console.error(`[CONSUMER] Will auto-restart: ${event.payload.restart}`); }); consumer.on("consumer.fetch", () => { // Fires every time the consumer fetches a batch from Kafka. // PRODUCTION: Use to track fetch rate and throughput metrics. }); // ═══════════════════════════════════════════════════════════════ // PROCESSING LOGIC // ═══════════════════════════════════════════════════════════════ const MAX_RETRIES = 3; async function processOrder(order: any): Promise<void> { // Your actual business logic goes here. // Example: Send notification email, update database, call external API. console.log(` Processing: ${order.orderId} — ${order.status}`); // Simulate occasional failure if (Math.random() < 0.1) { throw new Error("Email service timeout"); } } // ═══════════════════════════════════════════════════════════════ // MAIN // ═══════════════════════════════════════════════════════════════ async function main() { await dlqProducer.connect(); await consumer.connect(); await consumer.subscribe({ topic: "order-events", fromBeginning: false, // WHY false in production: You usually don't want to reprocess // millions of old events when deploying a new consumer. // Only new events from now on. Set true only for backfill jobs. }); await consumer.run({ autoCommit: false, // Manual commit for safety eachMessage: async ({ topic, partition, message, heartbeat }) => { const value = JSON.parse(message.value!.toString()); let retries = 0; let success = false; // ── Retry loop with exponential backoff ── while (retries < MAX_RETRIES && !success) { try { await processOrder(value); success = true; } catch (err) { retries++; if (retries < MAX_RETRIES) { const backoff = 1000 * Math.pow(2, retries); // 2s, 4s, 8s... console.warn(` Retry ${retries}/${MAX_RETRIES} in ${backoff}ms...`); await new Promise((r) => setTimeout(r, backoff)); await heartbeat(); // Keep alive during wait! } } } // ── Send to DLQ if all retries failed ── if (!success) { await dlqProducer.send({ topic: "order-events.DLQ", messages: [{ key: message.key, value: message.value, headers: { ...message.headers, "dlq-error": "Max retries exhausted", "dlq-original-topic": topic, "dlq-original-partition": String(partition), "dlq-original-offset": message.offset, "dlq-retry-count": String(MAX_RETRIES), "dlq-timestamp": Date.now().toString(), }, }], }); console.error(` ✗ DLQ: ${value.orderId}`); } // ── Commit offset (whether processed or sent to DLQ) ── await consumer.commitOffsets([{ topic, partition, offset: (parseInt(message.offset) + 1).toString(), }]); }, }); // ── Graceful Shutdown ── // WHY in production: When Kubernetes rolls out a new deployment, // it sends SIGTERM to old pods. You must: // 1. Stop fetching new events // 2. Finish processing current events // 3. Commit final offsets // 4. Leave the consumer group cleanly (triggers immediate rebalance) // If you skip this, the group waits for sessionTimeout (30s) before rebalancing. const shutdown = async (signal: string) => { console.log(`\n[SHUTDOWN] ${signal} received. Draining...`); await consumer.stop(); // stop fetching, finish current batch await consumer.disconnect(); // commit offsets + leave group await dlqProducer.disconnect(); console.log("[SHUTDOWN] Consumer stopped cleanly."); process.exit(0); }; process.on("SIGINT", () => shutdown("SIGINT")); process.on("SIGTERM", () => shutdown("SIGTERM")); } main().catch(console.error);
Running Multiple Consumer Instances In production, you run 3 instances of this consumer (e.g., 3 Kubernetes pods). Since they share the same groupId: "notification-service", Kafka automatically assigns 1 partition to each consumer. If one pod crashes, its partition is reassigned to a surviving pod within sessionTimeout (30 seconds).
3 Consumer Instances Sharing 3 Partitions
Topic: order-events (3 partitions) Consumer Group: notification-service Pod 1 (consumer instance) ──── reads from ──── Partition 0 Pod 2 (consumer instance) ──── reads from ──── Partition 1 Pod 3 (consumer instance) ──── reads from ──── Partition 2 If Pod 2 crashes: Pod 1 ──── Partition 0 + Partition 1 (takes over Pod 2's work) Pod 3 ──── Partition 2 When Pod 2 recovers: Pod 1 ──── Partition 0 (rebalance gives partition back) Pod 2 ──── Partition 1 (resumes from last committed offset) Pod 3 ──── Partition 2
Chapter 42

Monitoring & Health Checks

What is itMonitoring Kafka means tracking the health of brokers, producers, consumers, and the data flowing through the system. Kafka exposes a huge surface of JMX metrics (broker, producer, consumer) that need to be scraped and visualized. Critical dimensions include cluster health (URP, offline partitions, controller status), throughput (bytes in/out per topic), latency (request/produce/fetch time), consumer lag (per group per partition), and resource usage (disk, CPU, network, GC). Without monitoring, Kafka problems are silent until customers notice.
Key metrics to alert on
  • UnderReplicatedPartitions: should be 0. Non-zero = durability compromised.
  • OfflinePartitionsCount: should be 0. Non-zero = unavailable data.
  • ActiveControllerCount: should be exactly 1 cluster-wide.
  • UnderMinIsrPartitionCount: should be 0 — producers with acks=all will block.
  • Consumer lag (per group): alert if lag > threshold or growing.
  • Request handler pool utilization: >80% = broker is overloaded.
  • Disk usage: alert at 70%, page at 85%.
  • JVM GC pauses: long G1 pauses cause heartbeat timeouts.
Tooling
  • Prometheus + Grafana: industry standard. Use jmx_exporter or the newer kafka-exporter.
  • Burrow: LinkedIn's consumer lag monitor — de-facto standard for lag.
  • Cruise Control: LinkedIn's automated rebalancing + capacity monitoring.
  • Confluent Control Center: commercial dashboard for Confluent Platform.
  • AKHQ, Kowl, Redpanda Console, kafka-ui: open-source UIs for topic browsing and lag inspection.
  • Datadog, New Relic, Splunk: commercial APMs with Kafka integrations.
How it differs
  • vs RabbitMQ monitoring: RabbitMQ has a built-in management plugin with UI and metrics. Kafka relies on external scrapers.
  • vs SQS/Kinesis: AWS provides CloudWatch metrics out of the box.
Common gotchasMonitoring only producers/consumers: missing broker-side issues like URP. Not monitoring consumer lag per partition: one stuck partition can be hidden by an "average lag" view. Alert fatigue: too many noisy alerts get ignored — tune thresholds per topic importance.
Real-world examplesLinkedIn runs Burrow + Cruise Control + internal tools. Netflix has custom Atlas dashboards for every Kafka cluster. Confluent Cloud bundles monitoring as a managed service.

In production, you need to monitor consumer lag, broker health, and throughput. Here's how to build monitoring into your Node.js app.

In Simple Words

Imagine a factory assembly line. "Consumer lag" is how many boxes are piled up on the belt, waiting to be picked up. If the pile grows, your workers (consumers) are too slow. You need to either speed them up or add more workers.

Check Consumer Lag Programmatically

// src/production/monitor.ts // Checks consumer lag — how far behind each consumer group is. import kafka from "../client"; async function checkConsumerLag() { const admin = kafka.admin(); await admin.connect(); const groupId = "notification-service"; const topic = "order-events"; // ── Step 1: Get the latest offset for each partition ── // WHY: This is the "end" of the log — where the newest event is. const topicOffsets = await admin.fetchTopicOffsets(topic); // Returns: [{ partition: 0, offset: "500" }, { partition: 1, offset: "450" }, ...] // ── Step 2: Get the committed offset for this consumer group ── // WHY: This is where the consumer has read up to. const groupOffsets = await admin.fetchOffsets({ groupId, topics: [topic] }); // Returns: [{ topic: "order-events", partitions: [{ partition: 0, offset: "480" }, ...] }] // ── Step 3: Calculate lag = latest offset - committed offset ── console.log(`\n[LAG REPORT] Consumer Group: ${groupId}`); console.log(`${"─".repeat(60)}`); let totalLag = 0; const groupPartitions = groupOffsets[0].partitions; for (const tp of topicOffsets) { const latestOffset = parseInt(tp.offset); const committed = groupPartitions.find((p) => p.partition === tp.partition); const committedOffset = committed ? parseInt(committed.offset) : 0; const lag = latestOffset - committedOffset; totalLag += lag; const status = lag === 0 ? "CAUGHT UP" : lag < 100 ? "OK" : lag < 1000 ? "WARNING" : "CRITICAL"; console.log( ` Partition ${tp.partition}: latest=${latestOffset} committed=${committedOffset} lag=${lag} [${status}]` ); } console.log(`${"─".repeat(60)}`); console.log(` TOTAL LAG: ${totalLag} events`); if (totalLag > 10000) { console.error(" ⚠ ALERT: Consumer lag is too high! Consider adding more consumers."); // PRODUCTION: Send alert to Slack/PagerDuty/email } // ── Step 4: Describe the consumer group ── // WHY: See which consumers are active, which partitions they own. const groupDesc = await admin.describeGroups([groupId]); const group = groupDesc.groups[0]; console.log(`\n[GROUP INFO]`); console.log(` State: ${group.state}`); // Stable, Rebalancing, Dead, Empty console.log(` Members: ${group.members.length}`); group.members.forEach((member, i) => { console.log(` Member ${i + 1}: ${member.clientId} (${member.clientHost})`); }); await admin.disconnect(); } // ── Run monitoring every 30 seconds ── async function main() { console.log("Starting Kafka monitor..."); while (true) { await checkConsumerLag(); await new Promise((r) => setTimeout(r, 30000)); } } main().catch(console.error);

Health Check Endpoint

Add a simple HTTP health check so Kubernetes knows your consumer is alive:

// Add this to your consumer file // WHY: Kubernetes uses HTTP health checks to decide if a pod is healthy. // If the health check fails, K8s restarts the pod automatically. import http from "http"; let isHealthy = true; let lastMessageTime = Date.now(); // Track when last message was processed consumer.on("consumer.fetch", () => { lastMessageTime = Date.now(); }); consumer.on("consumer.crash", () => { isHealthy = false; }); // Simple HTTP server for health checks const server = http.createServer((req, res) => { if (req.url === "/health") { // Liveness: Is the process alive? res.writeHead(isHealthy ? 200 : 503); res.end(isHealthy ? "OK" : "UNHEALTHY"); } else if (req.url === "/ready") { // Readiness: Is the consumer actively processing? // If no fetch in 60 seconds, something might be wrong. const stale = Date.now() - lastMessageTime > 60000; res.writeHead(stale ? 503 : 200); res.end(stale ? "STALE" : "READY"); } else { res.writeHead(404); res.end(); } }); server.listen(3000, () => { console.log("Health check server on :3000 (/health, /ready)"); });

Key Metrics to Monitor

MetricWhat It MeansAlert When
Consumer LagEvents waiting to be processedLag > 10,000 or growing continuously
Rebalance RateHow often partitions are reassignedMore than 5 rebalances / hour
Fetch LatencyTime to fetch a batch from Kafka> 5 seconds consistently
Commit LatencyTime to commit an offset> 3 seconds consistently
Under-Replicated PartitionsPartitions where ISR < replication factorAny value > 0
Offline PartitionsPartitions with no active leaderAny value > 0 (critical!)
DLQ Message CountFailed events in the Dead Letter QueueAny increase (investigate immediately)
Consumer Group StateStable, Rebalancing, Empty, DeadAnything other than "Stable"
Chapter 43

Production Deployment Checklist

What is itA production deployment checklist is the set of cross-cutting concerns you must get right before putting a Kafka cluster and its clients in front of real users and real money. Missing any one of these can lead to data loss, outages, or cost surprises. This is the "tribal knowledge" that separates a working Kafka demo from a system that survives 3 AM incidents.
Cluster-level checklist
  • Replication factor ≥ 3 for all production topics.
  • min.insync.replicas = 2 for durable topics.
  • unclean.leader.election.enable = false.
  • Rack awareness (broker.rack) across AZs.
  • Disable auto.create.topics.enable; manage topics via IaC.
  • KRaft mode (for new clusters) with dedicated controller quorum of 3 or 5.
  • TLS + SASL/mTLS for all listeners.
  • ACLs defined per service account.
  • Monitoring + alerts on URP, controller count, lag, disk.
  • Backup/disaster recovery strategy: MirrorMaker 2 or Confluent Replicator for cross-region.
Client-level checklist
  • Producers: acks=all, idempotence enabled, sensible retries, graceful shutdown.
  • Consumers: manual commits, idempotent processing, DLQ strategy, rebalance listeners.
  • Singletons: one producer/consumer instance per process.
  • Structured logging with correlation IDs.
  • Metrics scraped and dashboards wired up.
  • Distributed tracing integrated (headers propagation).
Operational checklist
  • Runbooks: for broker failure, partition reassignment, consumer stalls, disk full.
  • On-call rotation with clear escalation paths.
  • Capacity planning: headroom for 2× peak traffic.
  • Chaos testing: kill a broker in staging regularly.
  • Schema Registry with compatibility enforcement.
  • Regular upgrades on supported Kafka versions.
How it differsCompared to RabbitMQ, SQS, or other brokers, Kafka deployments require significantly more planning around partitioning, durability settings, and client config — because Kafka exposes so many knobs.
Common gotchasSkipping rack awareness: all replicas on the same AZ = AZ outage kills data. RF=2 or min.isr=1: looks fine until something fails. No capacity headroom: surprise traffic spikes saturate brokers. Ignoring client-side backpressure: producer buffer exhaustion cascades into application hangs.
Real-world examplesConfluent's Site Reliability Engineering guide is a gold standard. LinkedIn's SRE team published extensive Kafka operational guides. AWS MSK Best Practices documentation covers most of this for managed clusters. Strimzi docs cover K8s-specific deployment best practices.

Everything you need to verify before going to production. This is the summary of everything we've learned, as a checklist.

Cluster Configuration

ConfigDev ValueProduction ValueWhy
Broker Count13+ (odd number)Fault tolerance. Odd for clean quorum majority.
replication.factor13Survive 2 broker failures without data loss.
min.insync.replicas12With RF=3, at least 2 must confirm writes.
auto.create.topics.enabletruefalseTopics must be created deliberately with correct configs.
Controller Nodes1 (combined)3 dedicatedDedicated controllers = faster metadata operations.
log.retention.hours168 (7 days)Based on needBalance between replay ability and disk cost.

Producer Configuration

ConfigDev ValueProduction ValueWhy
acks1-1 (all)All ISR replicas must confirm. No data loss.
idempotentfalsetruePrevents duplicate events on retry.
compressionnoneGZIP or ZSTDReduces network bandwidth and disk usage 60-80%.
retries0MAX_INTAuto-set by idempotent=true. Never give up.
maxInFlightRequests55Max for idempotent. Sequence numbers maintain order.
Message KeynullBusiness keyorderId, userId, etc. Guarantees ordering per key.
Graceful ShutdownNoneSIGTERM handlerFlush pending events before exit.

Consumer Configuration

ConfigDev ValueProduction ValueWhy
autoCommittruefalseManual commit after processing. Prevents data loss.
fromBeginningtruefalseOnly read new events. Don't reprocess millions of old events.
sessionTimeout30s30sBalance between detection speed and false alarms.
heartbeatInterval3s3s (1/3 of session)Regular heartbeats prevent false death detection.
Instances Count1= partition count1 consumer per partition = max parallelism.
DLQNonetopic.DLQFailed events go to DLQ, not blocking the queue.
Health CheckNone/health + /readyKubernetes needs HTTP endpoints to manage pods.
Graceful ShutdownNoneSIGTERM handlerCommit offsets and leave group cleanly on deploy.

Operational Checklist

Before Launch
  • All topics created manually with correct partition count and RF=3.
  • SSL/TLS enabled for all connections.
  • SASL authentication configured per service.
  • ACLs restrict topic access (service A can't read service B's topics).
  • Schema Registry configured with BACKWARD compatibility.
  • DLQ topics created for every consumer topic.
  • Monitoring dashboards set up (consumer lag, broker health).
  • Alerts configured (lag > threshold, offline partitions, DLQ growth).
During Operation
  • Monitor consumer lag trends (growing = problem).
  • Watch rebalance frequency (too many = consumers are unstable).
  • Review DLQ events daily (fix root causes, replay fixed events).
  • Check under-replicated partitions (ISR < RF = broker may be failing).
  • Track disk usage on brokers (set up alerts at 80% capacity).
  • Test disaster recovery: periodically kill a broker, verify auto-recovery.
Scaling Decisions
  • Consumer lag growing? Add more consumer instances (up to partition count).
  • Need more parallelism? Add more partitions (can't reduce later!).
  • Disk full? Add more brokers or reduce retention period.
  • High latency? Check compression, batch size, fetch size configs.
  • Too many rebalances? Use CooperativeSticky assignor, tune timeouts.
  • Network bottleneck? Enable compression (ZSTD is fastest with good ratio).
You're Production-Ready! You've learned Kafka from the very basics (what is an event?) all the way to production-grade distributed systems with 3-broker clusters, fault tolerance testing, monitoring, DLQ patterns, transactions, and deployment checklists. The theory chapters (1-32) explain WHY things work. The implementation chapters (33-43) show HOW to build them. Go ship something amazing.