The landscape of open source streaming projects is getting
crowded, with the Apache Software Foundation listing no fewer than four distinct
streaming projects: Apache Storm, Apache Apex, Apache Spark Streaming, and
Apache Flink.
Applications that process high-volume streaming data in real time
are in the spotlight. The primary driver for this push is
the need to deliver analytics and actionable insights to businesses with very
low latency. Use cases include real-time bidding for ads, credit card
fraud detection, and in-store customer experience. All such applications
push the limits of traditional data processing infrastructure.
In this context, a couple of days back I attended a meetup
on DataTorrent’s open source stream and batch processing platform, “Apache
Apex”. One of the sessions, on implementing Apache Apex for
real-time insights in advertising tech, helped explain its capabilities and the
kinds of problems it is equipped to solve.
Directed Acyclic Graph (DAG)
Apex applications differ from typical MapReduce
applications. An Apache Apex application is a directed acyclic graph
(DAG) of multiple operators. A MapReduce application with multiple iterations
is inefficient because the data passed between each map-reduce pair is written to
and read back from the file system. A DAG-based application avoids
this round trip to storage after every reduce step, which makes the
application more efficient.
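
To make the idea concrete, here is a minimal sketch in plain Java of operators wired into a small graph, with tuples handed from one operator to the next in memory. It deliberately does not use the Apache Apex API: the class `DagSketch`, the `Operator` type, and the `connect` and `run` methods are all hypothetical names chosen for illustration, and a real Apex application would be assembled with the platform's own classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: plain Java with hypothetical names, not the Apache Apex API.
class DagSketch {

    /** An operator applies its logic to one tuple and forwards the result downstream. */
    static final class Operator<I, O> implements Consumer<I> {
        private final Function<I, O> logic;
        private final List<Consumer<O>> downstream = new ArrayList<>();

        Operator(Function<I, O> logic) {
            this.logic = logic;
        }

        /** Adds one edge of the graph: this operator's output feeds `next`. */
        void connect(Consumer<O> next) {
            downstream.add(next);
        }

        @Override
        public void accept(I tuple) {
            O result = logic.apply(tuple);
            // The result is passed on in memory; nothing is written to storage between stages.
            for (Consumer<O> next : downstream) {
                next.accept(result);
            }
        }
    }

    /** Builds parse -> double -> sink, pushes the input through, and returns the sink contents. */
    static List<Integer> run(List<String> input) {
        List<Integer> sink = new ArrayList<>();
        Operator<String, Integer> parse = new Operator<>(Integer::parseInt);
        Operator<Integer, Integer> doubler = new Operator<>(n -> n * 2);
        parse.connect(doubler);
        doubler.connect(sink::add);

        for (String tuple : input) {
            parse.accept(tuple);
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("1", "2", "3"))); // prints [2, 4, 6]
    }
}
```

Each tuple travels the whole chain before the next one is read, so there is no intermediate data set to persist between stages. A multi-pass MapReduce job, by contrast, would materialize the output of every pass.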
Code Reuse & Operability
Apex also scores on code reuse, since an Apex application
lets the same business logic be used for stream as well as batch processing. Apex
is built for operability: concerns such as scalability, performance, security,
fault tolerance, and high availability are taken care of by the platform. Apex
achieves fault tolerance by ensuring that the master is backed up and that
application state is retained in HDFS, a persistent store.
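
The code-reuse point can be sketched in plain Java as one piece of business logic driven by either a bounded batch or a one-tuple-at-a-time source. This is only an illustration under stated assumptions, not how Apex itself is implemented: `ReuseSketch`, `CLASSIFY`, `runBatch`, `runStream`, and the 1000.0 threshold are hypothetical, and the fraud-style check is a made-up stand-in for real business logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: plain Java with hypothetical names, not the Apache Apex API.
class ReuseSketch {

    /** Made-up threshold above which a card transaction is sent for review. */
    static final double REVIEW_THRESHOLD = 1000.0;

    /** The business logic, written once and shared by both modes. */
    static final Function<Double, String> CLASSIFY =
        amount -> amount > REVIEW_THRESHOLD ? "REVIEW" : "OK";

    /** Batch mode: the whole bounded data set is available up front. */
    static List<String> runBatch(List<Double> amounts) {
        List<String> results = new ArrayList<>();
        for (Double amount : amounts) {
            results.add(CLASSIFY.apply(amount));
        }
        return results;
    }

    /** Streaming mode: tuples arrive one at a time and are handled as they arrive. */
    static void runStream(Iterator<Double> source, Consumer<String> sink) {
        while (source.hasNext()) {
            sink.accept(CLASSIFY.apply(source.next()));
        }
    }

    public static void main(String[] args) {
        List<Double> amounts = Arrays.asList(25.0, 4200.0, 980.5);

        System.out.println(runBatch(amounts)); // prints [OK, REVIEW, OK]

        List<String> streamed = new ArrayList<>();
        runStream(amounts.iterator(), streamed::add);
        System.out.println(streamed); // prints [OK, REVIEW, OK]
    }
}
```

Only the driver differs between the two modes; `CLASSIFY` is untouched, which is the property the paragraph above describes.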
Pre-built Operators
Apex is bundled with a library of pre-built operators named Malhar. It covers
data sources and destinations such as popular message buses, file systems, and
databases, including Kafka, Flume, Oracle, Cassandra, MongoDB, HDFS, and others. This library
significantly reduces the time-to-market and the cost
of developing a streaming analytics application.
Summary
All in all, Apache Apex seems promising, especially since it is a unified framework for batch and stream processing.
The fact that some leading organizations have implemented their streaming applications
using Apex is a positive sign for this open source framework.