Sunday, March 20

Schema On Read

According to one study, 90% of today's data warehouses process just 20% of the enterprise's data. One of the primary reasons for such a low percentage of data being processed is that traditional data warehouses are schema-on-write, which requires the schema, partitions, and indexes to be pre-built before the data can be read.

For a long time now the data world has adopted the schema-on-write approach, where systems require users to create a schema before loading any data. The flow is to define the schema, then write the data, then read the data back in the schema that was defined at the start. This allows such systems to tightly control the placement of the data at load time, enabling them to answer interactive queries very fast. However, it comes at the cost of agility.
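
To make the flow concrete, here is a minimal sketch of schema-on-write using the H2 in-memory database; the events table and its columns are made up for illustration. The schema must exist before a single row can be loaded, and a row that does not fit the schema is rejected at write time.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SchemaOnWriteDemo {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = conn.createStatement()) {
                // 1. Define the schema first; nothing can be written until it exists.
                st.execute("CREATE TABLE events (user_id INT, action VARCHAR(32))");

                // 2. Write the data; a row that does not match the schema fails
                //    here, at load time.
                st.execute("INSERT INTO events VALUES (1, 'click')");

                // 3. Read the data back in the schema defined at the start.
                try (ResultSet rs = st.executeQuery("SELECT user_id, action FROM events")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("user_id") + " " + rs.getString("action"));
                    }
                }
            }
        }
    }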

But there is an alternative to this approach: schema-on-read. Hadoop's schema-on-read capability allows data to start flowing into the system in its original form; the schema is then applied at read time, where each user can bring their own view to interpret the data.

In the "Schema on read" world the flow is to load the data in its current form and apply your own lens to the data when you read it back out. The reason to do it this way is because nowadays data is shared with large and varied set of users with each category of user having a unique way to look at the data and derive a insight that is specific to them. With schema-on-read approach the data is not restricted to a one-size-fits-all format. This approach allows one to present the data back in a schema that is most relevant to the task at hand and allows for extreme agility while dealing with complex evolving data structures..
The schema-on-read approach allows one to load the data as is and start to get value from it right away. This is equally true for structured,  semi-structured and unstructured data. It is easier to create multiple and different views of the same data with schema on read.
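
As a rough sketch of the idea (using the Jackson library; the raw event and the two view classes are invented for illustration), the same raw JSON bytes can be read through two different schemas, each declaring only the fields its user cares about:

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class SchemaOnReadDemo {
        // A marketing analyst's view of the raw event.
        @JsonIgnoreProperties(ignoreUnknown = true)
        public static class CampaignView {
            public String campaign;
            public String action;
        }

        // A security analyst's view of the very same bytes.
        @JsonIgnoreProperties(ignoreUnknown = true)
        public static class AuditView {
            public String ip;
            public long timestamp;
        }

        public static void main(String[] args) throws Exception {
            // The raw data was loaded as is; no schema was imposed at write time.
            String raw = "{\"campaign\":\"spring\",\"action\":\"click\","
                + "\"ip\":\"10.0.0.7\",\"timestamp\":1458432000}";

            ObjectMapper mapper = new ObjectMapper();

            // Each reader applies its own schema at read time.
            CampaignView c = mapper.readValue(raw, CampaignView.class);
            AuditView a = mapper.readValue(raw, AuditView.class);

            System.out.println(c.campaign + " " + c.action);
            System.out.println(a.ip + " " + a.timestamp);
        }
    }
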
The popularity of Hadoop and related technologies in today's enterprise can be credited in part to their ability to support the schema-on-read strategy. Organisations hold large amounts of raw data that, once transformed, powers all sorts of business processes through corporate data warehouses and other large data assets.

Saturday, November 28

Stream Processing Using Apache Apex

The open source streaming landscape is getting crowded, with the Apache Software Foundation listing no fewer than four distinct streaming projects: Apache Storm, Apache Apex, Apache Spark Streaming, and Apache Flink.

There is a certain spotlight on applications that do real-time processing of high-volume streaming data. The primary driver for this push is the need to deliver analytics and actionable insights to businesses at blazing speed. The use case could be real-time bidding for ads, credit card fraud detection, or in-store customer experience. All such applications are pushing the limits of traditional data processing infrastructures.

In this context, a couple of days back I attended a meetup on DataTorrent's open source stream and batch processing platform, Apache Apex. One of the sessions, covering an implementation of Apache Apex for real-time insights in advertising tech, helped explain its capabilities and the kinds of problems it is equipped to solve.

Directed Acyclic Graph (DAG)

Apex applications are different from typical MapReduce applications. An Apache Apex application is a directed acyclic graph (DAG) of multiple operators. A MapReduce application with multiple iterations is inefficient, as the data between each map-reduce pair gets written to and read from the file system. Compare this to a DAG-based application, which avoids writing data back and forth after every stage and thus adds much-needed efficiency.
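
A minimal Apex application expresses this DAG in its populateDAG method. In the sketch below, LineReader and WordCounter are hypothetical operator classes, and the wiring API shown is from com.datatorrent.api as I understand it:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;

    // Operators are the vertices of the DAG; streams are its edges.
    public class Application implements StreamingApplication {
        @Override
        public void populateDAG(DAG dag, Configuration conf) {
            // LineReader and WordCounter are hypothetical operators for this sketch.
            LineReader reader = dag.addOperator("reader", new LineReader());
            WordCounter counter = dag.addOperator("counter", new WordCounter());

            // Tuples flow along this edge from the reader's output port to the
            // counter's input port, without a file system round trip in between.
            dag.addStream("lines", reader.output, counter.input);
        }
    }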

Code Reuse & Operability                       

Apex also scores on code reuse, as an Apex application enables the same business logic to be used for stream as well as batch processing. Apex as a platform is built for operability: scalability, performance, security, fault tolerance, and high availability are taken care of by the platform itself. Apex achieves fault tolerance by ensuring that the master is backed up and that application state is retained in HDFS, a persistent store.
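
To illustrate the reuse point, the business logic lives inside an operator that neither knows nor cares whether its upstream source is an unbounded stream or a bounded batch read. A sketch of such an operator follows; the package names are from the Apex 3.x line as I recall them, so adjust for your version:

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Per-tuple business logic that is identical for stream and batch pipelines;
    // only the upstream operator feeding the input port changes.
    public class UpperCaseOperator extends BaseOperator {
        public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

        public final transient DefaultInputPort<String> input = new DefaultInputPort<String>() {
            @Override
            public void process(String tuple) {
                output.emit(tuple.toUpperCase());
            }
        };
    }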

Pre-built Operators

Apex is bundled with a library of operators named Malhar: pre-built operators for data sources and destinations such as popular message buses, file systems, and databases, including Kafka, Flume, Oracle, Cassandra, MongoDB, and HDFS. This library of pre-built operators significantly reduces the time to market, and the cost, of developing a streaming analytics application.
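
For example, wiring a Kafka source to a console sink takes little more than picking two Malhar operators. The class and port names below are from the Malhar library as I recall them, so verify them against the Malhar version in use; topic and broker settings, normally supplied through operator properties in the application configuration, are omitted here:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.contrib.kafka.KafkaSinglePortStringInputOperator;
    import com.datatorrent.lib.io.ConsoleOutputOperator;

    // A pipeline built entirely from pre-built Malhar operators:
    // no hand-rolled Kafka consumer code is needed.
    public class KafkaToConsoleApp implements StreamingApplication {
        @Override
        public void populateDAG(DAG dag, Configuration conf) {
            KafkaSinglePortStringInputOperator in =
                dag.addOperator("kafkaInput", new KafkaSinglePortStringInputOperator());
            ConsoleOutputOperator out =
                dag.addOperator("console", new ConsoleOutputOperator());

            dag.addStream("messages", in.outputPort, out.input);
        }
    }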

Summary

All in all, Apache Apex seems promising, especially since it is a unified framework for batch and stream processing. The fact that some leading organizations have implemented their streaming applications using Apex signals that positive things are in the offing for this open source framework.