Apache Kafka: A Comprehensive Guide to Real-Time Data Streaming
Apache Kafka has emerged as a cornerstone technology for organizations seeking to harness the power of real-time data. This article provides a comprehensive overview of Kafka, exploring its core concepts, architecture, setup, operations, use cases, and its role in modern data streaming applications.
Introduction to Apache Kafka
Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant handling of real-time data feeds. It's more than just a messaging system; it's a foundational technology for building real-time data pipelines and event-driven applications. Kafka enables you to read, write, store, and process events across many machines.
Core Concepts of Kafka
Understanding Kafka's core components is crucial for leveraging its capabilities.
Events (Messages or Records)
At the heart of Kafka is the concept of an event, also known as a message or record. An event represents any action, incident, or change identified or recorded by software or applications. Examples include payment transactions, geolocation updates, shipping orders, and sensor measurements. Events are modeled as key/value pairs, with keys often being primitive types like strings or integers. The key is not necessarily a unique identifier, but rather a means to associate related events.
Topics
Producers publish messages to categorized feeds called topics. Consumers subscribe to specific topics, receiving only relevant messages. Topics are logical channels or categories to which events are published, serving as the fundamental organizing principle in Kafka. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder.
Partitions
Each topic in Kafka is divided into partitions: ordered, immutable sequences of records. Partitioning takes the single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster, allowing a topic's data and write load to scale beyond a single broker. Within one partition, each message is assigned a sequential order number called an offset, which ensures that messages are delivered to a consumer in the same order in which they were stored in that partition.
When messages have no specified key, they are distributed across partitions in a round-robin fashion. If the message does have a key, then the destination partition will be computed from a hash of the key.
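A minimal sketch of this keyed-routing rule is shown below. Note this is a simplification for illustration: Kafka's actual default partitioner uses a murmur2 hash of the serialized key, not Java's hashCode.

```java
// Simplified illustration of Kafka's keyed partitioning rule:
// records with the same key always land in the same partition.
// (Kafka's real DefaultPartitioner uses a murmur2 hash, not hashCode.)
public class PartitionSketch {
    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then take the modulo.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 3);
        int p2 = partitionFor("order-42", 3);
        System.out.println("Same key maps to same partition: " + (p1 == p2));
    }
}
```

Because the partition is a pure function of the key, all events for one entity (for example, one order ID) preserve their relative order.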
Brokers
A single Kafka server is called a broker, and brokers operate as part of a cluster. Brokers are the core of the Kafka cluster. They receive messages from producers, store them in partitions, and deliver them to consumers. Each broker hosts some of the partitions from various topics and handles requests from producers, consumers, and other brokers.
Producers
Producers are applications that publish events to Kafka topics. They send messages to specific topics within the Kafka cluster. The producer API allows producers to publish messages to topics, configure message serialization formats, set compression options, and control message acknowledgment behavior.
Consumers
Consumers are applications that subscribe to topics and receive messages from them. They can process the received messages, store them, or perform other actions. Consumer groups allow a group of consumers to collaborate in processing messages from one or more topics. The consumer API enables subscribing to topics, defining message deserialization formats, managing consumer offsets, and configuring consumer groups.
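The effect of a consumer group is that each partition is owned by exactly one group member at a time. The toy simulation below illustrates the idea with a round-robin split; in real Kafka the group coordinator performs this assignment using a configurable strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy simulation of how a consumer group splits a topic's partitions:
// each partition goes to exactly one consumer in the group.
public class GroupAssignmentSketch {
    public static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Round-robin: partition p goes to consumer p mod group size.
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers sharing four partitions: c1 gets {0, 2}, c2 gets {1, 3}.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
    }
}
```

This is why adding consumers to a group (up to the number of partitions) increases parallelism: each new member takes over some partitions from the others.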
ZooKeeper and KRaft
Traditionally, Kafka relied on Apache ZooKeeper for cluster coordination, metadata management, and leader election. However, current versions of Apache Kafka are moving towards a ZooKeeper-less architecture using KRaft (Kafka Raft). KRaft is a consensus protocol that provides a more integrated and efficient way to manage metadata within the Kafka cluster itself.
Offsets
Offsets are sequential identifiers that mark the position of a message within a specific partition of a topic. Each consumer tracks its progress by committing offsets, so it can resume where it left off after a restart. Note that offset tracking alone provides at-least-once delivery by default; exactly-once processing requires Kafka's transactional features on top of it.
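The mechanics can be pictured with a toy model: a partition is an append-only log, and the committed offset is the position of the next record to read.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of offsets: a partition is an append-only log, and a consumer
// resumes from its last committed offset after a restart.
public class OffsetSketch {
    public static List<String> readFrom(List<String> partitionLog, long committedOffset) {
        // The committed offset is the position of the NEXT record to read.
        return partitionLog.subList((int) committedOffset, partitionLog.size());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("m0", "m1", "m2", "m3");
        // A consumer that previously committed offset 2 resumes at m2.
        System.out.println(readFrom(log, 2));
    }
}
```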
Kafka Architecture and Data Flow
Kafka utilizes a publish-subscribe messaging model for managing data streams. Producers send messages to specific topics within a Kafka cluster. These topics are split into partitions and distributed across the brokers within the cluster. Consumers, part of a consumer group, subscribe to these topics and read messages from their partitions delivered by the brokers.
When a producer publishes a message, it connects to any broker in the cluster, which acts as a bootstrap server. Consumers operate similarly, first connecting to a bootstrap server to discover partition leaders, then establishing connections to those leaders to stream messages.
To ensure fault tolerance, Kafka replicates partition data across multiple brokers. Each partition has one broker designated as the leader, handling all reads and writes, while other brokers maintain replicas that stay in sync with the leader. If a broker fails, Kafka automatically elects new leaders for its partitions from the in-sync replicas on the remaining brokers, minimizing downtime and keeping data available.
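The failover idea can be sketched as follows. This is only an illustration: in real Kafka the controller performs leader election, and the in-sync replica set is tracked per partition.

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of leader failover: when the leader broker dies, a new
// leader is chosen from the partition's in-sync replicas.
public class FailoverSketch {
    public static String electLeader(String failedLeader, List<String> inSyncReplicas) {
        for (String replica : inSyncReplicas) {
            // Pick the first surviving member of the in-sync replica set.
            if (!replica.equals(failedLeader)) return replica;
        }
        throw new IllegalStateException("no in-sync replica available");
    }

    public static void main(String[] args) {
        List<String> isr = Arrays.asList("broker-1", "broker-2", "broker-3");
        System.out.println(electLeader("broker-1", isr)); // broker-2 takes over
    }
}
```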
Setting Up Kafka
The process of setting up Apache Kafka is relatively straightforward. Here's a guide to getting Kafka and ZooKeeper running on Windows:
Prerequisites
Java Installation: Make sure that Java 8+ is installed in your local environment. Verify the version using the java -version command. To install Java, go to the official Java download page and download the installer for the latest version. After the download completes, open the installer and follow the on-screen prompts to install Java on your machine.
Kafka Download: Go to the Kafka Downloads page and download the latest binary release. At the time of writing, the latest release was Kafka 3.7.0, built for Scala 2.13.
Installation Steps
Extraction: Extract the downloaded files in your desired location. After downloading Kafka, go to the directory where the archive is downloaded. Then, extract the contents from the archive to a new folder and rename it to kafka.
Environment Variable: Add the
KAFKA_HOMEvariable in Environment Variables and set it to the directory where you placed the Kafka extract.
Starting Kafka and ZooKeeper
Windows runs programs using batch files (.bat) instead of shell scripts (.sh). The Kafka installation provides these batch files specifically for Windows.
Start ZooKeeper: Open CMD, navigate to your Kafka directory, and run the following command:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
Start Kafka: Open another terminal window and run the following command:
.\bin\windows\kafka-server-start.bat .\config\server.properties
Once Kafka and ZooKeeper run, you can verify their functionality by sending and receiving messages via a topic in a terminal window.
Creating a Topic
Use the following command to create a topic from CMD:
.\bin\windows\kafka-topics.bat --create --topic redpanda-demo --bootstrap-server localhost:9092
--bootstrap-server: Specifies the Kafka broker (server) to connect to. Apache Kafka requires a broker to manage the topic and perform operations like publishing or consuming messages.
localhost:9092: The address of the Kafka broker. localhost indicates that the broker is running on the same machine where the command is executed, and 9092 is the default port Apache Kafka uses to listen for incoming connections.
Sending a Message
Send a message to the newly created redpanda-demo topic:
.\bin\windows\kafka-console-producer.bat --topic redpanda-demo --bootstrap-server localhost:9092
Receiving a Message
Use the following command to receive a message that you just sent as a producer:
.\bin\windows\kafka-console-consumer.bat --topic redpanda-demo --from-beginning --bootstrap-server localhost:9092
Basic Kafka Operations
In this section, we'll explore basic operations such as listing topics, creating partitions, and deleting topics.
Listing Topics
To list the topics in the Kafka server, use the following command:
kafka-topics.bat --bootstrap-server localhost:9092 --list
Creating Partitions
This command creates a new topic named MyTopic with three partitions and a replication factor of 1:
kafka-topics.bat --bootstrap-server localhost:9092 --topic MyTopic --create --partitions 3 --replication-factor 1
Deleting Topics
To delete a topic from the Kafka server, use the following command:
kafka-topics.bat --bootstrap-server localhost:9092 --topic MyTopic --delete
If you list the topics in the Kafka server again, MyTopic will no longer appear in the output, because it has been deleted from the server.
Producer and Consumer Applications
Now that Kafka is set up, let's look at the code for the producer and consumer applications.
Producer API
The producer API allows producers to publish messages to topics, configure message serialization formats, set compression options, and control message acknowledgment behavior. The SimpleProducer class below demonstrates a basic Kafka producer: it sends a series of messages to the specified topic.
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        String topicName = "redpanda-demo";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 10; i++) {
            producer.send(new ProducerRecord<>(topicName, Integer.toString(i), "Message " + i));
            System.out.println("Message sent successfully");
        }
        producer.close();
    }
}
Consumer API
Similarly, the consumer API enables subscribing to topics, defining message deserialization formats, managing consumer offsets, and configuring consumer groups. The SimpleConsumer class in the code below demonstrates a basic Kafka Consumer. It subscribes to a specified topic and continuously polls for new messages.
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.Properties;
import java.util.Collections;

public class SimpleConsumer {
    public static void main(String[] args) throws Exception {
        String topicName = "redpanda-demo";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("session.timeout.ms", "30000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topicName));
        while (true) {
            // poll(long) was removed in Kafka 3.x; use the Duration overload instead.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
Compiling and Running the Applications
Save the above code in two separate files named SimpleProducer.java and SimpleConsumer.java inside a new folder.
Compile the code using the following commands:
javac -cp "D:\kafka_2.13-3.7.0\libs\*" SimpleProducer.java
javac -cp "D:\kafka_2.13-3.7.0\libs\*" SimpleConsumer.java
This will create two new .class files inside your directory.
With Kafka running in the background and a topic created, you can run the producer and consumer applications in two separate terminals.
Run the producer:
java -cp ".;D:\kafka_2.13-3.7.0\libs\*" SimpleProducer
Run the consumer:
java -cp ".;D:\kafka_2.13-3.7.0\libs\*" SimpleConsumer
Experiment with different configurations in the sample code, send messages, create new topics, and receive them via consumers.
Producer and Consumer Configurations
The producer and consumer have configurations to control various features. You can set these configurations directly into the code or a separate properties file.
Producer Configurations
acks: Controls the criteria under which requests are considered complete. Use acks=1 to have the leader broker acknowledge the message once it has written it, providing a balance between reliability and latency. For higher reliability, consider acks=all (or -1), where the message is only considered committed once it is written to all in-sync replicas.
compression.type: Sets the compression type for all data generated by the producer.
linger.ms: This setting tells the producer to wait a short time to collect more messages into a batch, potentially improving throughput.
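These producer settings can be combined in a Properties object, as in the sketch below. The specific values are illustrative examples, not recommendations; appropriate settings depend on your throughput and durability requirements.

```java
import java.util.Properties;

// Illustrative producer tuning (values are examples, not recommendations).
public class ProducerConfigSketch {
    public static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");              // wait for all in-sync replicas
        props.put("compression.type", "lz4");  // compress batches before sending
        props.put("linger.ms", "5");           // wait up to 5 ms to fill a batch
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedProducerProps().getProperty("acks"));
    }
}
```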
Consumer Configurations
fetch.min.bytes: Controls the minimum amount of data that the consumer wants to receive from the broker when fetching records.
max.poll.records: Controls the maximum number of records that a single call to poll() returns.
partition.assignment.strategy: Specifies the consumer's strategy for assigning partitions to consumer group members. Strategy types include Range, RoundRobin, Sticky, and CooperativeSticky.
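As with the producer, these consumer settings are plain key/value properties. The sketch below shows one possible combination; the values are illustrative only, and the assignor class name is the fully qualified name Kafka expects for the RoundRobin strategy.

```java
import java.util.Properties;

// Illustrative consumer tuning (values are examples, not recommendations).
public class ConsumerConfigSketch {
    public static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("fetch.min.bytes", "1024");   // wait for at least 1 KB per fetch
        props.put("max.poll.records", "500");   // cap records returned per poll()
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConsumerProps().getProperty("max.poll.records"));
    }
}
```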