Thursday, December 8

Stream Processing At Scale : Kafka & Samza

Businesses today generate millions of events as part of their daily operations. One such example is Uber that generates thousands of events like when you open the Uber app to see how many cars are near by that is a eye ball event, your booking of a cab is an event, the uber driver accepting your request is another event and many more such events.
Unbounded un-ordered and large scale data sets have become increasingly common and come from varied sources like satellite data , scientific instruments, stock data and traffic control. Basically data that arrives continuously and is large in scale and is never ending. To summarize entire business can be represented as streams of data. Turning these streams into valuable, actionable information in real time is critical to the success of any organization.

Challenges

These requirements lead to development of applications whose primary job is to consume these never ending continuous stream of events and process them successfully in near real time. The number of events from such a business are extremely high or large scale in nature. Each of these events needs to be sent somewhere and in most cases there would be multiple applications that would like to process a single event.

Stream Processing

Stream Processing can be looked into in 2 parts. One is how the application gets its input and the way it produces the output. The aspect of getting the stream input can be owned by a message broker and the job to produce the output can be owned by the processing framework.

Message Broker

A Message Broker would need to deal with the never ending fire hose of events (100k+/sec) and hence needs to be scalable, fault-tolerant and with high throughput. The Message Broker should also support multiple subscribers as there could be more than one application that would process the messages. The message broker should also have the ability to persist the messages. Another important requirement from the message broker is performance.

Apache Kafka

Currently Apache Kafka is the unanimous choice when it comes to message broker in stream processing applications. Performance-wise, it looks like Kafka blows away the competition as it handles tens of millions of reads and writes per second from thousands of clients all day long on modest hardware. In Kafka, messages belonging to a topic are distributed among partitions.

This ability of a Kafka topic to be divided into partitions allows Kafka to score high on scalability. Kafka is designed as distributed from the ground up as it runs as a cluster comprised of one or more servers each of which is called a broker. Kafka's message delivery is durable as it writes everything to the disk while maintaining the performance. Kafka can process 8 million messages per second at peak load.

​Processing Framework

​The other half of the stream processing is the Processing Framework that would consume the message, process it and produce the output message. Stream Processing framework need to have a one-at-a-time processing model and the data has to be processed immediately upon arrival. A Stream Processing framework also need to achieve low sub-second latency as it is critical to keep the data moving. The requirement to produce result for every event needs the processing framework to be scalable, highly available and fault tolerant.

streaming-comparision.png
Stream Processing Frameworks Comparision. Source : http://www.cakesolutions.net


​Apache Samza

​The Stream Processing framework space is crowded with multiple players like Strom, Flink, Spark Streaming, Samza, Kafka Streams, DataFlow. Apache Samza is probably the least well known stream processing framework that is trying to make a space for itself. Data Intensive organizations like Uber, NetFlix, Linkedin that process millions of events every second have Samza in their Stream Processing Architecture.
Samza has an advantage when it comes to performance, stability and support for a variety of input sources. Since Samza and Kafka were developed at Linkedin around the same time Samza is very Kafka-centric and has excellent integration with Kafka.
The key difference between Samza and other streaming technologies is its stateful streaming processing capability. Samza tasks have dedicated key/value store co-located on the same machine as the task. This approach delivers better read/write performance than any other streaming processing software. The Kafka, Samza based Stream Processing Architecture has proved its mettle in data intensive high frequency unbound stream processing use cases.