Apache Kafka is an open-source, distributed event streaming platform designed for fault-tolerant, high-throughput data processing. It is used to build data pipelines and streaming applications, and it allows streams of records to be processed in a fault-tolerant manner.
We will keep this blog hands-on, with just enough theory to be easily digestible by anyone. We are not going to compare Kafka against traditional messaging systems, as we are focusing on the distributed environment.
Kafka can serve a variety of purposes, especially large-scale messaging and message processing. The following are a few of its uses:
- Real-time streaming
- Messaging
- Event sourcing
- Real-time analytics
- Data pipelines
Beyond these, Kafka is used to design loosely coupled systems that provide the benefits that come with distributed systems, including the following and many more:
- Low latency
- Fault tolerance
- High throughput
- Scalability
- Durability
Let's take a deep dive into the Kafka world.
Kafka is used for a variety of messaging use cases, which can be consolidated into three categories:
- Batch processing: ETL jobs
- Real-time processing: transaction management (Saga pattern)
- Publish & subscribe processing: messaging systems
In general, the following are the key terms of the Kafka system.
| Kafka Terminology | Details |
|---|---|
| Broker | A running instance of Kafka is called a broker. Multiple brokers may run together in a cluster. |
| Controller | In a Kafka cluster, one of the available brokers becomes the controller (typically the first broker to join the cluster). The controller receives administrative requests and acts on them. |
| Topics | A topic is a named entity where messages (data) reside. Producers write data to a topic and consumers read from it. |
| Partition | To achieve parallelism, a single topic can be divided into multiple instances called partitions. The partition count can be increased (but not decreased) as consumer requirements grow. Within a consumer group, each partition is consumed by a single consumer, and ordering is guaranteed within a single partition. |
| Offset | An offset is a locator that points to the position of a record within a partition. A record can be located through its partition and offset. |
| Message & Ordering | Data residing in a topic is called a message. Ordering of messages is guaranteed within a partition. |
| Followers | Replicas of a partition that copy data from the partition's leader. If the leader fails, one of the in-sync followers is promoted to leader. |
| Leader | For each partition, one broker acts as the leader; produce and consume requests for that partition go through the leader. |
| Client APIs | The APIs through which clients interact with a Kafka cluster, such as the Producer, Consumer, Admin, Connect, and Streams APIs. |
| Source Connector | A Kafka Connect connector that imports data from an external system into Kafka topics. |
| Sink Connector | A Kafka Connect connector that exports data from Kafka topics to an external system. |
| Producer | A client that publishes (writes) messages to a topic. |
| Consumer | A client that subscribes to topics and reads messages from them. |
| Stream | A client library (Kafka Streams) for processing and transforming data stored in Kafka topics. |
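To make the partition and ordering terms concrete: Kafka's default partitioner deterministically maps a keyed message to a partition (it hashes the key with murmur2 and takes it modulo the partition count). Below is a simplified Python sketch of that idea; it substitutes CRC32 for murmur2 purely for illustration, so the exact partition numbers will not match a real broker's.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Deterministically map a message key to a partition index.

    CRC32 stands in for Kafka's murmur2 hash in this sketch.
    """
    return zlib.crc32(key) % num_partitions

# All messages with the same key land in the same partition,
# which is why per-key ordering holds within a partition.
p1 = choose_partition(b"order-42", 3)
p2 = choose_partition(b"order-42", 3)
assert p1 == p2
```

Because the mapping depends only on the key and the partition count, every message for a given key is appended to the same partition, and the partition's ordering guarantee becomes a per-key ordering guarantee.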
A Kafka cluster contains multiple Kafka brokers (nodes). Specifically, they are called message brokers and are responsible for establishing the interaction between systems, namely the producers and consumers. The health of the cluster is maintained by ZooKeeper.
Producers produce data, and consumers consume it. This data is shipped to a specific topic, which is further divided into partitions. When a topic is created, the partitions associated with it are distributed across the brokers. Let's see what happens when a topic is created.
When topic creation is initiated, the request is received by ZooKeeper. ZooKeeper passes the request to the controller, an additional responsibility taken on by one of the brokers. The controller's job is to distribute ownership of the partitions among the available brokers; this distribution is called leader assignment. With this, the topic is created and its partitions are distributed.
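The leader assignment above can be sketched as a simple round-robin spread of partitions over brokers. This is a toy model of what the controller does, not the actual assignment algorithm (real Kafka also accounts for replicas and rack awareness); the broker names are hypothetical.

```python
def assign_leaders(num_partitions: int, brokers: list[str]) -> dict[int, str]:
    """Round-robin leader assignment: partition p is led by broker p mod N.

    A simplified view of the controller's job on topic creation.
    """
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

leaders = assign_leaders(6, ["broker-1", "broker-2", "broker-3"])
# Partitions 0 and 3 -> broker-1, 1 and 4 -> broker-2, 2 and 5 -> broker-3,
# so ownership (and therefore load) is spread evenly across the cluster.
```

The point of spreading leadership is that produce and consume traffic for a topic is balanced across brokers instead of hitting a single node.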
The topic is divided into partitions, and data is arranged sequentially within each partition. Kafka stores this data on the file system.
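A partition can be pictured as an append-only log where each record's position is its offset. The following toy in-memory model (not real Kafka storage, which is a segmented on-disk log) shows how sequential appends yield sequential offsets and how a record is located by offset:

```python
class PartitionLog:
    """Toy model of a partition: an append-only list whose
    index plays the role of the Kafka offset."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def append(self, message: str) -> int:
        """Append a message and return the offset it was written at."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset: int) -> str:
        """Locate a record by its offset."""
        return self._records[offset]

log = PartitionLog()
first = log.append("first")    # offset 0
second = log.append("second")  # offset 1
assert (first, second) == (0, 1)   # offsets grow sequentially
assert log.read(1) == "second"     # data is located via its offset
```

This is why ordering is guaranteed within a partition: appends are strictly sequential, and a consumer simply advances its offset through the log.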
