Simple Explanation of Apache Kafka

Can anybody explain Apache Kafka to me in plain language? I'd appreciate an explanation with a practical example instead of abstract theoretical definitions, so I can understand it better.

What is it used for? What does messaging mean? Messaging between what? At which stage of a Big Data analysis is it used?

And what are prerequisites for learning it?

PS: Please explain it as you would to a non-technical person



It might be easier for you to first take a look at the event-streaming services offered by Amazon Web Services (AWS), mainly Kinesis.

The easiest of them is Kinesis Firehose (http://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html), which gives you an endpoint to write your events to; they then appear in Amazon Redshift, a data warehouse service that can scale to huge sizes (https://aws.amazon.com/redshift/).
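
To make that concrete, writing an event to Firehose is a single API call. Here is a minimal sketch using the AWS SDK for Java; the delivery stream name clicks-to-redshift is a hypothetical placeholder for a stream you would create yourself:

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FirehoseSketch {
        public static void main(String[] args) {
            // Uses credentials and region from the default AWS configuration chain
            AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

            // One click event, serialized as JSON; Firehose delivers it onward for you
            String event = "{\"type\":\"click\",\"page\":\"/home\"}";
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("clicks-to-redshift") // hypothetical stream name
                    .withRecord(new Record().withData(
                            ByteBuffer.wrap(event.getBytes(StandardCharsets.UTF_8)))));
        }
    }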

AWS has also announced Kinesis Analytics, which will allow you to write SQL queries on top of a Kinesis event stream. You can calculate statistics such as moving averages on your incoming events.

These events, processed in near real time, can be clicks on your web site, purchases on your e-commerce site, security events that should be monitored for potential intrusions, and so on.


Can anybody explain Apache Kafka to me in plain language?

As the official Kafka documentation says:

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It follows the publish-subscribe messaging style, with speed and durability built in.

Now, let me give a simplified explanation of the same:

A publish-subscribe (commonly called pub-sub) architecture basically contains a publisher (who sends the messages, for example a Twitter stream) and a subscriber (who receives the messages, for example an analyst or a user's activity stream).

A clearer explanation would be:

In software architecture, publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead characterize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are.

And keep in mind that Kafka is a distributed pub-sub messaging system, designed to scale.
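
To make this concrete, here is a minimal sketch using Kafka's Java client, keeping the Twitter example from above. It assumes a broker running at localhost:9092 and a topic named tweets (both placeholders). Notice that the producer only names the topic, never a receiver, and the consumer only names the topic, never a sender:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class PubSubSketch {
        public static void main(String[] args) {
            // --- Publisher: sends a message to the "tweets" topic ---
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", StringSerializer.class.getName());
            producerProps.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("tweets", "user42", "hello kafka"));
            }

            // --- Subscriber: reads messages from the same topic ---
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "analytics");
            consumerProps.put("key.deserializer", StringDeserializer.class.getName());
            consumerProps.put("value.deserializer", StringDeserializer.class.getName());
            consumerProps.put("auto.offset.reset", "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("tweets"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }

Any number of independent consumer groups can subscribe to the same topic, which is exactly the decoupling the pub-sub definition above describes.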


Now,

At which stage of a Big Data analysis is it used?

It is basically used in the Extraction step of the ETL pipeline. It can handle large volumes of data arriving at high speed (high throughput, in technical terms).
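
To give a feel for what "high throughput" looks like in code, here is a hedged sketch of producer settings that are commonly tuned for it; the broker address and the numbers are illustrative assumptions, not recommendations:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ThroughputProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Throughput-oriented knobs: batch records together and compress each batch
            props.put(ProducerConfig.LINGER_MS_CONFIG, "20");      // wait up to 20 ms to fill a batch
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");  // 64 KB batches
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // ... send extracted records here, as in the earlier sketch
            }
        }
    }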

It can also store messages, but it is not a permanent store: the data is not kept forever. Stored data has an expiry, a configurable retention period after which it is deleted.
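
How long data is kept is just a per-topic setting (retention.ms). Here is a sketch using Kafka's Java admin client that creates a hypothetical clicks topic whose messages expire after seven days:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 1; keep messages for 7 days
                // (604,800,000 ms), after which Kafka deletes them automatically
                NewTopic topic = new NewTopic("clicks", 3, (short) 1)
                        .configs(Map.of("retention.ms", "604800000"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }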

And what are prerequisites for learning it?

This is a tough question: anyone who can program can get started with it. But if you really want to use it in your architecture, these are the prerequisites I can think of:

  • Know what a persistent data store is and where to use one
  • Know what a pub-sub messaging system is
  • Know what the extraction step in ETL is, and how to deal with huge volumes of data arriving at high velocity

Further reading:

Kafka has excellent documentation, which can help you get started with it.
