Can anybody explain Apache Kafka for me in a plain language?
As the official documentation of Kafka says:
Apache Kafka is publish-subscribe messaging rethought as a distributed
commit log. It follows the publish-subscribe messaging style, with speed and durability built in.
Now, let me give a dumbed down explanation of the same:
A publish-subscribe(commonly called pub-sub) architecture basically contains a publisher(who sends the messages, for example: A twitter stream) and a subscriber(who receives the messages Ex: the analyst/the user's activity stream)
A clearer explanation would be:
In software architecture, publish–subscribe is a messaging pattern
where senders of messages, called publishers, do not program the
messages to be sent directly to specific receivers, called
subscribers, but instead characterize published messages into classes
without knowledge of which subscribers, if any, there may be.
Similarly, subscribers express interest in one or more classes and
only receive messages that are of interest, without knowledge of which
publishers, if any, there are.
And keep in mind that Kafka is a distributed pub-sub messaging system, designed to scale.
![enter image description here](https://i.stack.imgur.com/gcBD0.png)
Now,
At which stage of a BigData analysis is it used?
It is basically used in the Extraction step of the ETL pipeline. It can contain high volumes and speed of data flow (we call it high throughput in technical terms).
It can store messages also, but it is a persistent store; which means that the data is not stored forever. The stored data has an expiry date.
And what are prerequisites for learning it?
This is a tough question, as everyone who can program can get started with it, but if you really want to implement this in your architecture, then these are the pre-requisites which I can think of:
- Know what is a persistent data store and where to use them
- Know what a pub-sub messaging system is
- Know what an extraction step is, in ETL. And also know how to deal with data with huge volume coming at you at high velocity.
Further reading:
Kafka has an excellent documentation, which can help to get started with it.