Monday, May 30

Serverless Computing

Introduction

Serverless is the latest buzzword in the software architecture world. It is an approach to development that removes the need to create and maintain servers, whether physical machines, VMs or cloud instances. In practice, this usually means an application that delivers its functionality by combining multiple third-party APIs/services with self-created, non-server-based APIs.
Serverless computing can also mean that while the server-side logic is written by the application developer, it runs in stateless compute containers that are event-triggered and fully managed by a third party. Going by the amount of attention that the major cloud vendors (Amazon, Microsoft, Google and IBM) are giving it, serverless technologies could well be the future of the cloud.


Existing Platforms

Naturally, not all applications can be implemented in the serverless way. There are limitations, especially when it comes to legacy systems and the use of a public cloud. However, adoption of existing serverless frameworks is growing by the day. The following serverless computing frameworks are currently available:

  • AWS Lambda
  • Google Cloud Functions
  • Iron.io
  • IBM OpenWhisk
  • Microsoft Azure WebJobs

Function As A Service

Function-as-a-Service lets you run code without provisioning or managing servers. The paradigm of serverless computing is based on the microservices architecture: serverless frameworks invoke autonomous code snippets when they are triggered by external events.
These snippets are loosely coupled with each other and are essentially designed to perform one task at a time. The serverless framework is responsible for orchestrating the snippets at runtime. This way, one can deploy an application as independent functions that respond to events, are charged for only when they run, and scale automatically.
Each snippet is versioned and maintained independently of the other snippets, an approach that marries the microservices concept with serverless computing.
Serverless computing also lets developers skip provisioning resources for current or anticipated loads and reduces the planning effort needed for new projects. Just as virtual machines made it easy to spin up servers for new applications, serverless computing services make it simple to grow.
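
To make this concrete, here is a minimal sketch of such a function, written in the style of an AWS Lambda handler in Python. The event shape, the field names and the order-total logic are hypothetical illustrations; a real deployment would also package the function and wire it to an event source (an HTTP gateway, an object-store notification, a queue) through the provider's tooling.

import json

# A minimal sketch of a function-as-a-service handler. The platform invokes
# it once per event; the developer never provisions or manages a server.
def handler(event, context):
    # Hypothetical event body: an order submitted through an HTTP trigger.
    order = json.loads(event.get("body", "{}"))
    total = sum(item["price"] * item["quantity"] for item in order.get("items", []))

    # The function is stateless: anything worth keeping would be written to
    # an external service (a database, a queue, object storage).
    return {
        "statusCode": 200,
        "body": json.dumps({"orderId": order.get("id"), "total": total}),
    }

Because the platform invokes, scales and bills the function per event, the operational concerns stay with the provider rather than the application team.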

Conclusion

The rise of containers and microservices is the driving force behind serverless computing. Serverless computing turns out to be an excellent choice for applications such as event-driven systems, mobile backends, IoT applications, ETL and APIs. For certain use cases, the serverless approach can significantly reduce operational cost.

Friday, April 8

Apache Sentry: Hadoop Security Authorization

                           

Hadoop has gained significant momentum in the past few years as a platform for scalable data storage and processing. More and more organisations are adopting Hadoop as an integral part of their data strategy. With the concepts of data lakes and the democratization of data becoming more popular, more users are getting access to data that was previously restricted to a select few.

In the absence of a robust security model, this approach poses the risk of sensitive data ending up in the hands of an unintended audience. With government regulations and compliance requirements such as HIPAA and rules around PII, it is becoming ever more important for organizations to implement a meaningful security model around the Data Lake. The topmost concerns for anyone building a Data Lake today are security and governance. The problem becomes all the more complex because there are multiple components within Hadoop that access the data, and multiple security mechanisms that one can use.

While implementing security means taking care of Authentication, Authorization and Auditing, today's blog will focus on the Authorization aspect, and particularly on one solution, Apache Sentry. Apache Sentry, an Apache Top-Level Project, is an authorization module for Hive, Search, Impala and other components that helps define what users and applications can do with data.

One current option is Hive's default authorization, where users can grant themselves permissions; it is intended only to prevent accidental deletion of data, so it does not guard against malicious users. The other option is HDFS impersonation, where data is protected at the file level by HDFS permissions; but file-level permissions are not granular enough, and they do not support role-based access control.

              
Apache Sentry overcomes most of these problems by providing secure, role-based authorization. Sentry provides fine-grained authorization, a key requirement for any authorization service: the granularity extends to servers, databases, tables, views, indexes and collections. Sentry supports multi-tenant administration, with separate policies for each database/schema that can be maintained by separate administrators. In order to use Sentry one needs CDH 4.3.0 or later and a secure HiveServer2 with strong authentication (Kerberos or LDAP).
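
As a rough illustration of what this fine-grained, role-based model looks like in practice, the sketch below issues Sentry-style CREATE ROLE and GRANT statements through a secured HiveServer2 connection using PyHive. The host, role, group and table names are hypothetical, and the exact statement syntax can vary with the Sentry and CDH versions in use; Beeline would work just as well as a client.

from pyhive import hive

# Connect to a Kerberos-secured HiveServer2 (hypothetical host).
conn = hive.connect(
    host="hiveserver2.example.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",
)
cursor = conn.cursor()

# Sentry-style role-based grants: create a role, tie it to a group, and
# give it read-only access to a single table (hypothetical names).
for stmt in [
    "CREATE ROLE analyst_role",
    "GRANT ROLE analyst_role TO GROUP analysts",
    "GRANT SELECT ON TABLE sales.transactions TO ROLE analyst_role",
]:
    cursor.execute(stmt)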

In short, Apache Sentry makes it possible to store sensitive data in Hadoop, thereby extending Hadoop to more users and helping organizations comply with regulations. The fact that Sentry recently graduated out of the Incubator and is now an Apache Top-Level Project says a lot about its ability to provide robust authorization for the Hadoop ecosystem.

Wednesday, April 6

Data Lake

A data lake can be defined as an unstructured data warehouse into which you pull all of your different sources to form one large pool of data. As a concept, the data lake can be applied more broadly to any type of system, but in today's Big Data world it most likely means storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes.
More and more organisations are moving away from traditional data warehouses and adopting Hadoop data lakes, which provide a less expensive central location for analytics data. The reasons are many, but the primary ones are the ability to build the data lake on cheap commodity hardware and the use of open source technology for Hadoop development.
James Dixon, the founder and CTO of Pentaho, who has been credited with coming up with the term "Data Lake" describes it as follows:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The focus of the Data Lake model is on storing unrelated data, largely setting aside the questions of how and why the data is used, governed and secured. The Data Lake solves the problem of isolated pieces of information: it removes the need to maintain lots of independently managed collections of data by combining these sources in a single data lake. This leads to increased information use and sharing, while reducing costs through server and license reduction.
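As a small, hedged sketch of what combining sources in a single data lake can look like, the snippet below lands files from several hypothetical source systems, unchanged and in their original formats, under one HDFS path using the standard hdfs dfs commands; the paths and file names are illustrative only.

import subprocess
from datetime import date

LAKE_ROOT = "/data/lake/raw"   # hypothetical landing zone in HDFS

# Hypothetical extracts from different source systems, kept in their
# original formats rather than forced into a single schema up front.
sources = {
    "crm": "exports/crm_accounts.csv",
    "clickstream": "exports/web_events.json",
    "billing": "exports/invoices.xml",
}

for source, local_file in sources.items():
    target_dir = f"{LAKE_ROOT}/{source}/dt={date.today():%Y-%m-%d}"
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, target_dir], check=True)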
In today's scenario the security model within data lakes (which in all probability are Hadoop-based) is still immature and not comprehensive. The reason could be that Hadoop is not a single technology stack but a collection of projects.
Anybody working on defining the architecture of a Hadoop-based Data Lake needs to think hard about how to provide robust authentication, authorization and auditing. There are several open source security efforts in the Apache community, such as Knox, Ranger, Sentry and Falcon. One needs to understand that dumping data into the Data Lake with no processes, procedures or data governance will only lead to a mess.
To summarize, the "Data Lake" as a concept is increasingly recognized as an important part of a data strategy. Multiple use cases for the Data Lake exist within businesses, and the key challenges in building one are security and governance.

Sunday, March 20

Schema On Read

According to one study, 90% of today's data warehouses process just 20% of enterprise data. One of the primary reasons for such a low percentage is that traditional data warehouses are schema-on-write, which requires schemas, partitions and indexes to be built before the data can be read.

For a long time now the data world has followed the schema-on-write approach, where systems require users to create a schema before loading any data. The flow is to define the schema, then write the data, then read the data back in the schema you defined at the start. This allows such systems to tightly control the placement of the data at load time, which lets them answer interactive queries very fast. However, it comes at the cost of agility.

But there is an alternative: schema-on-read. Hadoop's schema-on-read capability allows data to start flowing into the system in its original form; the schema is then applied at read time, where each user can apply their own view to interpret the data.

In the schema-on-read world the flow is to load the data in its current form and apply your own lens to the data when you read it back out. The reason to do it this way is that data is now shared with a large and varied set of users, with each category of user having a unique way of looking at the data and deriving an insight specific to them. With schema-on-read, the data is not restricted to a one-size-fits-all format. This approach lets you present the data back in the schema most relevant to the task at hand and allows for extreme agility when dealing with complex, evolving data structures.
The schema-on-read approach lets you load the data as-is and start getting value from it right away. This is equally true for structured, semi-structured and unstructured data, and it makes it easier to create multiple, different views of the same data.
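Here is a small sketch of the idea, assuming a hypothetical file of raw JSON events stored exactly as they arrived: each consumer supplies its own lens only when the data is read back, so the same raw data serves several different views without any schema having been fixed at write time.

import json

RAW_FILE = "events.jsonl"   # hypothetical raw landing file, stored as-is

def read_events(projection):
    # Apply a caller-supplied view (projection) at read time.
    with open(RAW_FILE) as f:
        for line in f:
            record = json.loads(line)
            yield projection(record)

# A marketing user only cares about who did what.
marketing_view = lambda r: (r.get("user_id"), r.get("action"))

# An operations user cares about latency and ignores the rest.
ops_view = lambda r: (r.get("service"), r.get("latency_ms"))

marketing_rows = list(read_events(marketing_view))
ops_rows = list(read_events(ops_view))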
The popularity of Hadoop and related technologies in today's enterprise landscape can be partly credited to their ability to support the schema-on-read strategy. Organisations hold large amounts of raw data that, once transformed, powers all sorts of business processes through systems such as corporate data warehouses and other large data assets.

Saturday, November 28

Stream Processing Using Apache Apex


The open source streaming landscape is getting crowded, with the Apache Software Foundation listing no fewer than four distinct streaming projects: Apache Storm, Apache Apex, Apache Spark Streaming and Apache Flink.

There is a certain spotlight on applications that do real-time processing of high-volume streaming data. The primary driver is the need to provide analytics and actionable insights to businesses at blazing-fast speed. The use case could be real-time bidding for ads, credit card fraud detection or in-store customer experience. All such applications are pushing the limits of traditional data processing infrastructures.

In this context, a couple of days back I attended a MeetUp event on DataTorrent's open source stream and batch processing platform, Apache Apex. One of the sessions, on an implementation of Apache Apex for real-time insights in advertising tech, helped explain its capabilities and the kind of problems it is equipped to solve.

Directed Acyclic Graph (DAG)

Apex applications are different from typical MapReduce applications. An Apache Apex application is a directed acyclic graph (DAG) of multiple operators. A MapReduce application with multiple iterations is inefficient because the data between each map-reduce pair gets written to and read from the file system. A DAG-based application, by contrast, avoids writing the data back and forth after every reduce, adding much-needed efficiency.
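
Apex applications themselves are written in Java against its operator API, so the following is only a language-agnostic sketch (in Python) of the idea: small, single-purpose operators wired into a DAG, with tuples streaming from one operator to the next in memory rather than being written to and read from the file system between stages, as happens between MapReduce iterations.

def parse(lines):
    # Operator 1: turn raw "key,value" lines into (key, value) tuples.
    for line in lines:
        key, _, value = line.partition(",")
        yield key, int(value)

def enrich(tuples):
    # Operator 2: a single-purpose transformation.
    for key, value in tuples:
        yield key.upper(), value * 2

def aggregate(tuples):
    # Operator 3: emit a running total downstream each time a key updates.
    totals = {}
    for key, value in tuples:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]

# Wiring the DAG: parse -> enrich -> aggregate. Chained generators keep the
# data flowing through the stages without intermediate writes to disk.
for key, total in aggregate(enrich(parse(["a,1", "b,2", "a,3"]))):
    print(key, total)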

Code Reuse & Operability                       

Apex also scores on code reuse, as an Apex application lets the same business logic be used for stream as well as batch processing. The platform is built for operability: scalability, performance, security, fault tolerance and high availability are taken care of by Apex itself. Apex achieves fault tolerance by ensuring that the master is backed up and that application state is retained in HDFS, a persistent store.

Pre-built Operators

Apex is bundled with Malhar, a library of pre-built operators for the data sources and destinations of popular message buses, file systems and databases, including Kafka, Flume, Oracle, Cassandra, MongoDB, HDFS and others. This library significantly reduces both time-to-market and the cost of developing a streaming analytics application.

Summary

All in all, Apache Apex seems promising, especially since it is a unified framework for batch and stream processing. The fact that some leading organizations have implemented their streaming applications using Apex signals that positive things are in the offing for this open source framework.