Author : MD TAREQ HASSAN | Updated : 2021/04/06
Topic
- Topic is a particular stream of data
- Similar to a table in a database (without all the constraints)
- You can have as many topics as you want
- A topic is defined by it’s name
- Topic can be think of “a anmed container for similar events” or “append only / ordered log of events”
- If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic
- A consumer pulls messages off of a Kafka topic while producers push messages into a Kafka topic
- I n Kafka, the word topic refers to a category/feed name or a common name used to store and publish a particular stream of data
- All Kafka records are organized into topics
- Data in Kafka is stored in topics
- Producer applications write data to topics and consumer applications read from topics
Partition
- A kafka topic is split in partitions
- Partitions are separated in an order, starting from zero
- While creating a topic, we need to specify the number of partitions(the number is arbitrary and can be changed later)
- Data content gets stored in the partitions within the topic
- The data once written to a partition can never be changed. It is immutable
- The data is kept in a partition for a limited time only (retaintion period can be set)
- Data is assigned randomly to a partition unless a key is provided
Each topic is divided into partions (i.e. 3 partitions) and then each partion will be replicated
- Topic: T0 (T-zero)
- Partitions:
- T0_P0
- T0_P1
- T0_P2
- Replication Factor: 3
- Replicas
- T0_P0_R0
- T0_P0_R1
- T0_P0_R2
Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel. Consumers can also be parallelized so that multiple consumers can read from multiple partitions in a topic allowing for very high message processing throughput.
Offset
- Each message gets stored into partitions with an incremental id known as its Offset value
- Each message within a partition has an identifier called its offset
- Offset starts with zero
- The offsets for a partition are infinite
- The order of the offset value is guaranteed within the partition only and not across the partition
- The ordering of messages in a partition is an immutable sequence
- The offset value always remains in an incremental state, it never goes back to an empty space
- data in offset:
- Does not have any relation with the data in offset in different partition
- Inter-related with data in offset in the same partition
Another way to view a partition is as a log. A data source writes messages to the log and one or more consumers reads from the log at the point in time they choose.
Hierarchy of topic, partion and offset
Data Log
Kafka retains messages for a configurable period of time and it is up to the consumers to adjust their behaviour accordingly. For instance, if Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages. However, if the consumer is down for an hour it can begin to read messages again starting from its last known offset. From the point of view of Kafka, it keeps no state on what the consumers are reading from a topic.
Broker
- Kafka, as a distributed system, runs in a cluster (a cluster is a set of nodes where each node is a VM or Physical Server). Each node in the cluster is called a Kafka broker
- A broker is a Kafka instance running on a node
- As far as the definition goes, a node is a typically a physical entity, a machine or a VM, but a broker is a process
- We can run multiple brokers in one physical node also
- The brokers in the cluster are identified by an integer id only
- Kafka brokers are also known as Bootstrap brokers because connection with any one broker means connection with the entire cluster
- Although a broker does not contain whole data, but each broker in the cluster knows about all other brokers, partitions as well as topics
What broker does?
- Manages partitions
- Manages replication of partitions
- Manages read and write requests
Leader Partition
- Each topic is divided into partions (i.e. 3 partitions) and then each partion will be replicated
- The kafka cluster chooses one of the broker’s partition as a leader, and the rest of them becomes its followers
- The followers(brokers) will be allowed to synchronize the data. But, in the presence of a leader, none of the followers is allowed to serve the client’s request. These replicas are known as in-sync-replica. So, Apache Kafka offers multiple in-sync-replica for the data
- Only the leader is allowed to serve the client request. The leader handles all the read and writes operations of data for the partitions. The leader and its followers are determined by the zookeeper
In Sync Replicas
- In-Sync Replicas = ISR
- Each broker holds a number of partitions and each of these partitions can be either:
- A leader replica (of a partition)
- A follower replica
- All writes and reads to a topic go through the leader and the leader coordinates updating replicas with new data
- The follower replicas that are co-ordinating with leader and synced properly (no data mismatch) are called ISR
- Lets say we have 3 replicas - PartitionX_R0, PartitionX_R1, PartitionX_R3
- R0 is leader and other two are followers
- At a given moment, PartitionX_R1 and PartitionX_R2 are properly synced with PartitionX_R0 (no data mismatch, all 3 replicas containing same data)
- PartitionX_R0 -> Leader
- PartitionX_R1 and PartitionX_R2 -> ISR
- If the broker holding the leader for the partition fails to serve the data due to any failure, one of its respective in-sync-replica replicas will takeover the leadership. Afterward, if the previous leader returns back, it tries to acquire its leadership again
Producer
- A producer is the client that publishes or writes data/message to the topics
- A producer is an external application that writes messages to a Kafka cluster, communicating with the cluster using Kafka’s network protocol
- Producers write to a single leader. then leader replicates data to in-sync-replicas
- A producer uses following strategies to write data to the cluster:
- Message Keys
- Acknowledgment
Acknowledgment
- wait for all in sync replicas to acknowledge the message (
acks=all
) - wait for only the leader to acknowledge the message (
acks=1
) -> this is default - do not wait for acknowledgement (
acks=0
)
Writing to partition
- If message does not have any key, then data will published by producer will be written as ‘round-robin’
- If message does have a key, then same key messages will always be written to same partition
Message with key
- The given key will be hashed
- Based on the hash value, message with same key will be written to the same partition (of a topic)
- Gurantees that **message with same key will always be written to same partition
- It does not guarantees in which partion “a particular message with key” will be written (decided randomly based on hash value of the given key)
Consumer
- A Consumer is the client that reads (consumes) data/message from a topic
- The consumer is an external application that reads messages from Kafka topics and does some work with them
- A consumer knows that from which broker it should read the data (a consumer can read data from multiple brokers at the same time)
- The consumer reads the data within each partition in an orderly manner (i.e. consumer is not supposed to read data from offset 1 before reading from offset 0)
- Consumer can only ever read committed messages (ommitted messages == those that have been written to all in sync replicas)
Providing consistency as a consumer
- receive each message at most once
- receive each message at least once
- usually preffered
- make sure yous consumer client is idempotent otherwise duplication will happen
- receive each message exactly once. Each of these scenarios deserves a discussion of its own
Consumer Group
- Consumers can also be organized into consumer groups for a given topic
- Each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic
- Reading partition by a group
- If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from
- If you have more partitions than consumers then consumers will receive messages from multiple partitions
- If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition
Bootstrap Server
- Every Kafka broker is also called a “bootstrap server”. Why? because:
- All brokers know about all other brokers
- Access to any broker gives access to all brokers
- That means that you only need to connect to one broker and you will be connected to the entire cluster
- Each broker knows about all brokers, topics and partitions (it’s called metadata)
Zookeeper
- Apache Zookeeper is used in distributed computing for managing cluster
- Zookeeper in Kafka
- Prior to 2.8 release, Kafka can’t work without Zookeeper
- Version 2.8+ : Zookeeper is removed
- Zookeeper manages brokers (keeps a list of them)
- Zookeeper helps in performing leader election for partitions
- Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc….)
- Zookeeper by design operates with an odd number of servers (3,5,7)
- Zookeeper has a leader (handle writes) the rest of the servers are followers (handle reads)
Kafka in a Nutshell
Core APIs
- Producer API : This API allows/permits an application to publish streams of records to one or more topics. (discussed in later section)
- Consumer API : This API allows an application to subscribe one or more topics and process the stream of records produced to them.
- Streams API : This API allows an application to effectively transform the input streams to the output streams. It permits an application to act as a stream processor which consumes an input stream from one or more topics, and produce an output stream to one or more output topics.
- Connector API: This API executes the reusable producer and consumer APIs with the existing data systems or applications.
Kafka Connect
- Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems
- Kafka Connect is a framework to stream data into and out of Apache Kafka
- Kafka connect is an application running outside of cluster (independent of brokers)
- It makes it simple to quickly define connectors that move large data sets into and out of Kafka
- Kafka Stream is the Streams API to transform, aggregate, and process records from a stream and produces derivative streams
- Kafka Connect is the connector API to create reusable producers and consumers
Kafka Connect is a free, open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.
The benefits of Kafka Connect include:
- Connect uses meaningful data abstractions to pull or push data to Kafka
- Connect runs with streaming and batch-oriented systems on a single node (standalone) or scaled to an organization-wide service (distributed)
- Connect leverages existing connectors or extends them to tailor to your needs and provides lower time to production
Connector
- A connector is a plugable component that will used in Kafka connect
- Kafka connect provides API and connector uses that API
- A connector is either source or sink
- Source connector: acts as Producer to Kafka (i.e. Database -> Connector -> Kafka)
- Sink connector: acts as Consumer from Kafka (i.e. Kafka -> Connector -> Database)
- Kafka sees a connector as only producer or consumer
- There are different connectors for different Source/Sink
Kafka Streams
- Stream processing API
- The Kafka Streams API exists to provide layer of abstraction on top of the vanilla consumer
- Consumer can grow into complexity, therefore abstraction on top of the vanilla consumer might be needed. Here comes Stream API
Kafka Component Architecture
Strimzi deployment of Kafka: https://strimzi.io/docs/operators/latest/overview.html#kafka-concepts-components_str