Concepts

Author : MD TAREQ HASSAN | Updated : 2021/04/06

Topic

Topic is a particular stream of data
- Similar to a table in a database (without all the constraints)
- You can have as many topics as you want
- A topic is defined by it’s name
Topic can be think of “a anmed container for similar events” or “append only / ordered log of events”
If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic
A consumer pulls messages off of a Kafka topic while producers push messages into a Kafka topic
I n Kafka, the word topic refers to a category/feed name or a common name used to store and publish a particular stream of data
All Kafka records are organized into topics
Data in Kafka is stored in topics
Producer applications write data to topics and consumer applications read from topics

Partition

A kafka topic is split in partitions
- Partitions are separated in an order, starting from zero
- While creating a topic, we need to specify the number of partitions(the number is arbitrary and can be changed later)
Data content gets stored in the partitions within the topic
- The data once written to a partition can never be changed. It is immutable
- The data is kept in a partition for a limited time only (retaintion period can be set)
Data is assigned randomly to a partition unless a key is provided

Each topic is divided into partions (i.e. 3 partitions) and then each partion will be replicated

Topic: T0 (T-zero)
Partitions:
- T0_P0
- T0_P1
- T0_P2
Replication Factor: 3
Replicas
- T0_P0_R0
- T0_P0_R1
- T0_P0_R2

Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel. Consumers can also be parallelized so that multiple consumers can read from multiple partitions in a topic allowing for very high message processing throughput.

Offset

Each message gets stored into partitions with an incremental id known as its Offset value
Each message within a partition has an identifier called its offset
Offset starts with zero
The offsets for a partition are infinite
The order of the offset value is guaranteed within the partition only and not across the partition
The ordering of messages in a partition is an immutable sequence
The offset value always remains in an incremental state, it never goes back to an empty space
data in offset:
- Does not have any relation with the data in offset in different partition
- Inter-related with data in offset in the same partition

Another way to view a partition is as a log. A data source writes messages to the log and one or more consumers reads from the log at the point in time they choose.

Hierarchy of topic, partion and offset

Kafka - hierarchy of topic, partion and offset

Data Log

Kafka retains messages for a configurable period of time and it is up to the consumers to adjust their behaviour accordingly. For instance, if Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages. However, if the consumer is down for an hour it can begin to read messages again starting from its last known offset. From the point of view of Kafka, it keeps no state on what the consumers are reading from a topic.

Broker

Kafka, as a distributed system, runs in a cluster (a cluster is a set of nodes where each node is a VM or Physical Server). Each node in the cluster is called a Kafka broker
A broker is a Kafka instance running on a node
- As far as the definition goes, a node is a typically a physical entity, a machine or a VM, but a broker is a process
- We can run multiple brokers in one physical node also
The brokers in the cluster are identified by an integer id only
Kafka brokers are also known as Bootstrap brokers because connection with any one broker means connection with the entire cluster
Although a broker does not contain whole data, but each broker in the cluster knows about all other brokers, partitions as well as topics

What broker does?

Manages partitions
Manages replication of partitions
Manages read and write requests

Leader Partition

Each topic is divided into partions (i.e. 3 partitions) and then each partion will be replicated
The kafka cluster chooses one of the broker’s partition as a leader, and the rest of them becomes its followers
The followers(brokers) will be allowed to synchronize the data. But, in the presence of a leader, none of the followers is allowed to serve the client’s request. These replicas are known as in-sync-replica. So, Apache Kafka offers multiple in-sync-replica for the data
Only the leader is allowed to serve the client request. The leader handles all the read and writes operations of data for the partitions. The leader and its followers are determined by the zookeeper

In Sync Replicas

In-Sync Replicas = ISR
Each broker holds a number of partitions and each of these partitions can be either:
- A leader replica (of a partition)
- A follower replica
All writes and reads to a topic go through the leader and the leader coordinates updating replicas with new data
The follower replicas that are co-ordinating with leader and synced properly (no data mismatch) are called ISR
- Lets say we have 3 replicas - PartitionX_R0, PartitionX_R1, PartitionX_R3
- R0 is leader and other two are followers
- At a given moment, PartitionX_R1 and PartitionX_R2 are properly synced with PartitionX_R0 (no data mismatch, all 3 replicas containing same data)
  - PartitionX_R0 -> Leader
  - PartitionX_R1 and PartitionX_R2 -> ISR
If the broker holding the leader for the partition fails to serve the data due to any failure, one of its respective in-sync-replica replicas will takeover the leadership. Afterward, if the previous leader returns back, it tries to acquire its leadership again

Producer

A producer is the client that publishes or writes data/message to the topics
A producer is an external application that writes messages to a Kafka cluster, communicating with the cluster using Kafka’s network protocol
Producers write to a single leader. then leader replicates data to in-sync-replicas
A producer uses following strategies to write data to the cluster:
- Message Keys
- Acknowledgment

Acknowledgment

wait for all in sync replicas to acknowledge the message (acks=all)
wait for only the leader to acknowledge the message (acks=1) -> this is default
do not wait for acknowledgement (acks=0)

Writing to partition

If message does not have any key, then data will published by producer will be written as ‘round-robin’
If message does have a key, then same key messages will always be written to same partition

Message with key

The given key will be hashed
Based on the hash value, message with same key will be written to the same partition (of a topic)
Gurantees that **message with same key will always be written to same partition
It does not guarantees in which partion “a particular message with key” will be written (decided randomly based on hash value of the given key)

Consumer

A Consumer is the client that reads (consumes) data/message from a topic
The consumer is an external application that reads messages from Kafka topics and does some work with them
A consumer knows that from which broker it should read the data (a consumer can read data from multiple brokers at the same time)
The consumer reads the data within each partition in an orderly manner (i.e. consumer is not supposed to read data from offset 1 before reading from offset 0)
Consumer can only ever read committed messages (ommitted messages == those that have been written to all in sync replicas)

Providing consistency as a consumer

receive each message at most once
receive each message at least once
- usually preffered
- make sure yous consumer client is idempotent otherwise duplication will happen
receive each message exactly once. Each of these scenarios deserves a discussion of its own

Consumer Group

Consumers can also be organized into consumer groups for a given topic
Each consumer within the group reads from a unique partition and the group as a whole consumes all messages from the entire topic
Reading partition by a group
- If you have more consumers than partitions then some consumers will be idle because they have no partitions to read from
- If you have more partitions than consumers then consumers will receive messages from multiple partitions
- If you have equal numbers of consumers and partitions, each consumer reads messages in order from exactly one partition

Bootstrap Server

Every Kafka broker is also called a “bootstrap server”. Why? because:
- All brokers know about all other brokers
- Access to any broker gives access to all brokers
That means that you only need to connect to one broker and you will be connected to the entire cluster
Each broker knows about all brokers, topics and partitions (it’s called metadata)

Zookeeper

Apache Zookeeper is used in distributed computing for managing cluster
Zookeeper in Kafka
- Prior to 2.8 release, Kafka can’t work without Zookeeper
- Version 2.8+ : Zookeeper is removed
Zookeeper manages brokers (keeps a list of them)
Zookeeper helps in performing leader election for partitions
Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topics, etc….)
Zookeeper by design operates with an odd number of servers (3,5,7)
Zookeeper has a leader (handle writes) the rest of the servers are followers (handle reads)

Kafka in a Nutshell

Picture of 'Kafka in a Nutshell'

Core APIs

Producer API : This API allows/permits an application to publish streams of records to one or more topics. (discussed in later section)
Consumer API : This API allows an application to subscribe one or more topics and process the stream of records produced to them.
Streams API : This API allows an application to effectively transform the input streams to the output streams. It permits an application to act as a stream processor which consumes an input stream from one or more topics, and produce an output stream to one or more output topics.
Connector API: This API executes the reusable producer and consumer APIs with the existing data systems or applications.

Kafka Connect

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems
Kafka Connect is a framework to stream data into and out of Apache Kafka
Kafka connect is an application running outside of cluster (independent of brokers)
It makes it simple to quickly define connectors that move large data sets into and out of Kafka
Kafka Stream is the Streams API to transform, aggregate, and process records from a stream and produces derivative streams
Kafka Connect is the connector API to create reusable producers and consumers

Kafka Connect is a free, open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.

The benefits of Kafka Connect include:

Connect uses meaningful data abstractions to pull or push data to Kafka
Connect runs with streaming and batch-oriented systems on a single node (standalone) or scaled to an organization-wide service (distributed)
Connect leverages existing connectors or extends them to tailor to your needs and provides lower time to production

Connector

A connector is a plugable component that will used in Kafka connect
Kafka connect provides API and connector uses that API
A connector is either source or sink
- Source connector: acts as Producer to Kafka (i.e. Database -> Connector -> Kafka)
- Sink connector: acts as Consumer from Kafka (i.e. Kafka -> Connector -> Database)
- Kafka sees a connector as only producer or consumer
There are different connectors for different Source/Sink

Kafka Streams

Stream processing API
The Kafka Streams API exists to provide layer of abstraction on top of the vanilla consumer
Consumer can grow into complexity, therefore abstraction on top of the vanilla consumer might be needed. Here comes Stream API

Kafka Component Architecture

Strimzi deployment of Kafka: https://strimzi.io/docs/operators/latest/overview.html#kafka-concepts-components_str