Things change quickly in the IT industry – so quickly that what is relevant one day can be obsolete the next. So what can be expected from a system launched back in 2011, like Apache Kafka?
In this article, we look at whether this platform remains relevant a decade after its development and we explore its key functions and typical applications.
Brief history of Apache Kafka
Kafka was born at LinkedIn and open-sourced in 2011, then remained in the Apache Incubator until 2012, when it graduated as a top-level Apache project.
Initially, Kafka was developed for internal use at LinkedIn, which struggled to handle high volumes of event data at low latency. In the early 2010s, mature real-time stream-processing technologies were scarce, and organizations instead had to rely on enterprise messaging or batch-based solutions.
Kafka changed this thanks to its superb data ingestion capabilities (LinkedIn reports processing trillions of messages per day at the time of writing). Currently, tens of thousands of organizations have adopted Kafka, among them companies behind very popular real-time, event-driven experiences such as Netflix, PayPal, Airbnb and Pinterest.
How Apache Kafka works
A typical Kafka environment consists of:
- One or more Producers, which write or publish data.
- A Kafka cluster containing one or more Brokers. Connecting to any one Broker means being connected to the entire cluster.
- Topics or data streams (similar to database tables).
- Partitions, into which topics are split and distributed across brokers.
- One or more Consumers, which read data contained in a topic.
A Kafka environment works through five APIs:
- Admin API, used to manage and inspect topics, brokers, and other Kafka objects.
- Producer API, used by producer applications to write data into topics.
- Consumer API, used by consumer applications to read data from topics.
- Connect API, which handles import and export by streaming data between Kafka and external systems.
- Streams API, which transforms input topics into output topics, powering real-time stream-processing applications and microservices.
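To make the Producer and Consumer APIs concrete, here is a minimal sketch assuming the third-party kafka-python client and a broker running at localhost:9092 (the topic name is illustrative):

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish a few raw-byte events to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-events", value=f"event {i}".encode("utf-8"))
producer.flush()  # block until all buffered messages reach the broker

# Consumer API: read the same topic from the oldest retained message.
consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive for 5 s
)
for message in consumer:
    print(message.value.decode("utf-8"))
```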
What is Kafka used for?
As a system, Kafka serves three main functions:
- Real-time data processing.
- Accurate, durable, and sequential data storage.
- As a message broker, it lets applications publish or subscribe to one or more event streams.
This means that Kafka isn’t only used to process and analyse real-time data, but also to respond to data events as they happen. As such, it’s well-suited to power data-rich interactive ecosystems that rely on multiple layers of databases, events, and applications and that therefore need to integrate these layers to make optimal use of data.
Its characteristics also make it a popular foundation for applications built on event-driven architectures, which offer distinct advantages over traditional counterparts, namely scalability, fault tolerance, speed, and reliability.
Another common use case is event streaming, a booming market following the widespread adoption of remote or hybrid work models and the growing implementation of machine learning, artificial intelligence and the Internet of Things in corporate environments.
In summary, Kafka is used in environments with complex data needs, especially in large organizations with multiple departments and different lines of business, such as e-commerce, banking, cybersecurity, manufacturing, and social media.
Specific applications include online and offline analytics, anomaly detection, trend analysis, platforms that handle interactive queries, payment processing, infrastructure monitoring, and financial trading, to mention just a few.
Real-time data processing
Kafka excels at ingesting large volumes of real-time data from multiple sources. As a result, typical use cases include:
- Building or operating distributed applications with real-time streaming capabilities.
- Building or operating applications that need to function as real-time data pipelines.
- Big Data applications.
Kafka’s data processing capabilities provide an interface between different systems and databases. For example, let’s take a supermarket with nationwide operations that generates data on sales transactions. Here, the store would typically have some software in place that monitors sales data. This software publishes data to Kafka, which then sorts it into ordered Topics or data streams (for example, sales between 7 a.m. and 9 a.m., sales under $100, sales transactions for product XYZ, etc.).
Data is written to these topics through the Producer API, after which it can be read and fed to processing pipelines (think Spark or Storm) or to data lakes through the Consumer API. Using the above example, Kafka would serve as the interface between the data streams and the different departments that underpin the store’s operations, for instance by providing real-time data relevant to the company’s sales strategy, to its warehousing practices, or to trend forecasting.
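As a hedged sketch of what this supermarket scenario might look like with the same kafka-python client (the topic name, field names, and the idea of keying by store are all illustrative assumptions):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Point-of-sale side: publish each sale as a JSON event.
# Keying by store ID keeps every store's sales ordered within a single partition.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
sale = {"store_id": "store-042", "product": "XYZ", "amount": 87.50}
producer.send("sales-transactions", key=sale["store_id"], value=sale)
producer.flush()

# Downstream side: a pipeline that only cares about sales under $100.
consumer = KafkaConsumer(
    "sales-transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
small_sales = [r.value for r in consumer if r.value["amount"] < 100]
```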
Netflix is a real-life example of Kafka’s data processing capabilities, since it uses the platform to ingest up to 500 billion events per day, using a setup that includes Java and REST APIs, ElasticSearch, Spark, multiple analytics engines, and Kafka as the structuring backbone of it all.
Messaging
As a message broker, Kafka enables both queueing and publish-subscribe functionalities, and acts as the intermediary or bridge that enables communication between applications.
A typical Kafka messaging flow looks like this:
- A Producer publishes messages to a Topic.
- Brokers store the messages as serialized bytes, appending them in order to the message log structure.
- Consuming applications can subscribe to those messages, read them, and deserialize them.
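A minimal sketch of this flow, again assuming kafka-python, shows the raw bytes on the broker side and the partition and offset each record is stored at (topic and group names are illustrative):

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish: the broker appends each record, as opaque bytes, at the next offset.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", value=b'{"order_id": 1, "status": "created"}')
producer.flush()

# Subscribe and read: every record carries the partition and offset it was written to.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",   # consumer group for this application
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    payload = record.value.decode("utf-8")  # de-serialize the stored bytes
    print(f"partition={record.partition} offset={record.offset} payload={payload}")
```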
There are additional things to note about this process. First, the Producer can create an unlimited number of topics, as long as each topic has a unique name. These topics are not stored in a single location or on a single broker, but are instead subdivided into Partitions, each of which contains multiple messages or records.
Second, messages or records published to topic partitions don’t have a message ID, but rather an offset. The offset is assigned in an incremental manner, which makes sense given that Kafka records cannot be deleted or altered (they’re append-only).
Lastly, Kafka persists and replicates messages across one or more brokers for a given period of time, which is set by default to 7 days but can be configured depending on needs. This contributes to the high durability and high reliability that make Kafka relevant as a message broker.
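As an illustration of how partitioning and retention can be set per topic, here is a hedged sketch using kafka-python's admin client (the topic name and values are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create a topic split into 6 partitions with 3 replicas, and override the
# broker-wide 7-day default with a 3-day retention period for this topic only.
topic = NewTopic(
    name="payments",
    num_partitions=6,
    replication_factor=3,
    topic_configs={"retention.ms": str(3 * 24 * 60 * 60 * 1000)},
)
admin.create_topics([topic])
```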
Log Aggregation
Apache Kafka doesn’t use application logs as such, and instead conceptualizes a log as the basic structure upon which a database can be built. In other words, log aggregation in Kafka is the process of appending data in a sequential manner.
In this respect, Kafka offers some advantages over other log aggregation tools. Instead of simply collecting log files and placing them in a centralized location for further processing, Kafka is capable of abstracting away file details and presenting event or log data as a clean stream of messages. As a result, users benefit from low latency, which facilitates the processing and consumption of distributed data.
This last aspect is crucial, since centralized log aggregation systems (CLS) run into performance issues once they need to operate in complex environments, such as multi-cloud or distributed applications – think log bursts or the inability to meet uptime as defined in SLAs. This isn’t the case with Kafka, which certainly adds to its relevance.
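A minimal sketch of this aggregation pattern, assuming kafka-python and a per-service topic naming scheme (the `logs.*` prefix is an illustrative assumption): each service publishes its log events to its own topic, and a single aggregator consumes them all as one stream of messages.

```python
from kafka import KafkaConsumer

# Aggregator: subscribe to every topic matching the per-service naming scheme
# and treat the combined output as a single stream of log messages.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="log-aggregator",
    auto_offset_reset="earliest",
)
consumer.subscribe(pattern=r"^logs\..*")  # e.g. logs.checkout, logs.payments

for record in consumer:
    print(record.topic, record.timestamp, record.value.decode("utf-8"))
```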
When Kafka shouldn’t be used
Despite its remarkable capabilities, Kafka doesn’t suit every possible scenario. The technology has limitations in:
- IoT applications where hard real-time behaviour, with sub-millisecond latency and no spikes, is a must. When we say Kafka delivers real-time processing, we mean soft or near real time.
- Settings that process safety-critical data, such as vehicle-to-vehicle communication, robotics, industrial process controllers, or medical systems, because Kafka is not a deterministic system.
- ETL pipelines, in tasks other than supporting real-time analytics.
- Replacing a traditional database for persistent storage, since Kafka lacks important elements like indexes and transactions.
- Systems that only handle a few thousand messages per day. Kafka will still serve its purpose, but implementing it outside of Big Data environments is overkill.
This last point brings us onto the issue of Kafka alternatives, such as RabbitMQ.
Apache Kafka vs RabbitMQ
RabbitMQ is an open-source message broker. As a messaging framework, its key function is to provide point-to-point message queueing so that server resources aren’t overwhelmed and responses aren’t delayed. Rabbit is also used to balance loads across different workers or to distribute messages to different consumers.
The message flow is similar to Kafka, as data goes from Producer to Consumer via a Broker, but these systems can’t be used interchangeably.
Main differences
- In terms of architecture, Kafka uses a partitioned log model, which is more complex than the messaging queue architecture used by Rabbit.
- Message retention isn’t a feature in Rabbit, since messages are deleted upon consumption.
- Kafka retains messages according to the specified retention policy and lets consumers replay topics, so multiple consumers can subscribe to and read the same message, which isn’t possible in Rabbit because messages are deleted once consumed (see the sketch after this list).
- You can assign high-priority messages in RabbitMQ, but not in Kafka, where message ordering is embedded in its architecture.
- Scalability in Kafka is achieved by distributing partitions across multiple servers, whereas in Rabbit it is achieved by increasing the number of consumers.
- Lastly, although both use binary protocols over TCP, Kafka relies on its own custom protocol, while Rabbit’s native protocol is AMQP, with STOMP and MQTT supported through plugins.
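To illustrate the retention and replay difference mentioned above, here is a minimal kafka-python sketch of two independent consumer groups each reading the full, retained history of the same topic, something RabbitMQ’s delete-on-consume model doesn’t allow (group and topic names are illustrative):

```python
from kafka import KafkaConsumer

def read_all(group_id):
    """Each consumer group tracks its own offsets, so every group sees every message."""
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",  # a new group starts from the oldest retained message
        consumer_timeout_ms=5000,
    )
    return [record.value for record in consumer]

# Both services independently receive the same retained stream of messages.
billing_events = read_all("billing-service")
analytics_events = read_all("analytics-service")
```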
Which one should you choose?
This depends on your needs, so take the following considerations into account:
Kafka is the gold standard for event streaming and was designed to be used in stream processing scenarios. Therefore, consider Kafka if you require log aggregation, metrics, event sourcing, activity tracking, applications with a stream history, or multi-stage pipelines.
On the other hand, RabbitMQ’s main functions are delivering a quick response, sharing loads, and efficient message delivery. This makes Rabbit best suited to “Little Data” scenarios, whereas Kafka delivers in Big Data applications.
In terms of performance, they both do well in their respective ranges: one of the main Kafka advantages is that it can handle 1 million messages / second, whereas RabbitMQ’s upper limit is around 10,000 / second.
Undoubtedly, Kafka’s highlight is that it’s built to perform at scale. There are no partitions in Rabbit, so if you’re looking for both scalability and redundancy, Kafka is the way to go.
Conclusions
Kafka is a powerful technology used by 80% of Fortune 100 companies, and for good reason. The platform offers reliability, quick scalability, and consistently high performance in stream processing, persistent storage, and messaging, making it the ideal choice for soft real-time applications in enterprise environments with complex data needs.