Saturday, November 28

Stream Processing Using Apache Apex


The open source streaming landscape is getting crowded: the Apache Software Foundation lists no fewer than four distinct streaming projects, namely Apache Storm, Apache Apex, Apache Spark Streaming, and Apache Flink.
                                     
There is a spotlight on applications that process high-volume streaming data in real time. The primary driver is the need to deliver analytics and actionable insights to businesses with very low latency. Use cases include real-time bidding for ads, credit card fraud detection, and in-store customer experience. All such applications push the limits of traditional data processing infrastructure.
                      
In this context, a couple of days back I attended a Meetup event on Apache Apex, DataTorrent's open source stream and batch processing platform. One of the sessions, on an implementation of Apache Apex for real-time insights in advertising tech, helped explain its capabilities and the kinds of problems it is equipped to solve.

Directed Acyclic Graph (DAG)

Apex applications differ from typical MapReduce applications. An Apache Apex application is a directed acyclic graph (DAG) of operators connected by streams. A MapReduce application with multiple iterations is inefficient because the output of each map-reduce pair is written to the file system and read back by the next pair. A DAG-based application avoids this round trip after every reduce step by passing data from operator to operator, which adds much-needed efficiency.
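
To make the DAG idea concrete, here is a minimal sketch in Python. It is not the Apex API (Apex applications are written in Java). The `Dag` class, its method names, and the ad-click records are all hypothetical and chosen only for illustration. It shows tuples flowing from operator to operator in memory, with no intermediate write to a file system between stages.

```python
from collections import defaultdict, deque


class Dag:
    """A toy in-memory DAG of operators (hypothetical, not the Apex API)."""

    def __init__(self):
        self.operators = {}                  # name -> function(tuple) -> list of tuples
        self.downstream = defaultdict(list)  # name -> names of downstream operators

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, source, sink):
        # Both operators are assumed to have been added already.
        self.downstream[source].append(sink)

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph has a cycle."""
        indegree = {name: 0 for name in self.operators}
        for sinks in self.downstream.values():
            for sink in sinks:
                indegree[sink] += 1
        ready = deque(n for n, d in indegree.items() if d == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for sink in self.downstream[name]:
                indegree[sink] -= 1
                if indegree[sink] == 0:
                    ready.append(sink)
        if len(order) != len(self.operators):
            raise ValueError("graph contains a cycle")
        return order

    def run(self, source, tuples):
        """Push tuples through the graph in memory; return the output of leaf operators."""
        inbox = defaultdict(list)
        inbox[source] = list(tuples)
        results = {}
        for name in self.topological_order():
            out = [o for t in inbox[name] for o in self.operators[name](t)]
            sinks = self.downstream[name]
            if not sinks:
                results[name] = out
            for sink in sinks:
                inbox[sink].extend(out)
        return results


# Made-up ad events: parse each line, keep clicks, emit (ad_id, 1) pairs.
dag = Dag()
dag.add_operator("parse", lambda line: [tuple(line.split(","))])
dag.add_operator("clicks_only", lambda rec: [rec] if rec[1] == "click" else [])
dag.add_operator("to_pair", lambda rec: [(rec[0], 1)])
dag.add_stream("parse", "clicks_only")
dag.add_stream("clicks_only", "to_pair")

results = dag.run("parse", ["ad1,click", "ad2,view", "ad1,click"])
```

A real streaming engine would process tuples continuously and distribute operators across machines. This sketch only captures the graph structure and the absence of a disk round trip between stages.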

Code Reuse & Operability                       

Apex also scores on code reuse: the same business logic can be used for both stream and batch processing. Apex is built for operability, so concerns such as scalability, performance, security, fault tolerance, and high availability are taken care of by the platform. Apex achieves fault tolerance by ensuring that the master is backed up and that application state is retained in HDFS, a persistent store.
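
The code reuse point can be illustrated with a small Python sketch. It is an assumption-laden toy, not Apex code: the function `flag_suspicious`, the `live_feed` source, and the threshold are hypothetical. The idea is that the business logic is written once against "a sequence of records" and does not care whether that sequence is a bounded batch or an unbounded stream.

```python
import itertools
from typing import Iterable, Iterator


def flag_suspicious(amounts: Iterable[float], limit: float = 1000.0) -> Iterator[float]:
    """Business logic: yield every transaction amount above the limit."""
    for amount in amounts:
        if amount > limit:
            yield amount


def live_feed() -> Iterator[float]:
    """Stand-in for an unbounded source such as a message bus (hypothetical)."""
    for amount in itertools.cycle([250.0, 4200.0, 80.0]):
        yield amount


# Batch: a bounded collection, processed to completion.
batch_hits = list(flag_suspicious([120.0, 1500.0, 990.0, 2300.0]))

# Stream: the same function over an endless feed; take only the first two hits.
stream_hits = list(itertools.islice(flag_suspicious(live_feed()), 2))
```

Because the logic is lazy and record-at-a-time, switching from batch to stream only changes the source that is plugged in, not the logic itself.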

Pre-built Operators

Apex is bundled with a library of pre-built operators named Malhar. It provides operators for data sources and destinations, covering popular message buses, file systems, and databases such as Kafka, Flume, Oracle, Cassandra, MongoDB, and HDFS. This library significantly reduces the time to market and the cost of developing a streaming analytics application.

Summary

All in all, Apache Apex seems promising, especially since it is a unified framework for batch and stream processing. The fact that some leading organizations have implemented their streaming applications using Apex signals that positive things are in the offing for this open source framework.