Friday, April 8

Apache Sentry: Hadoop Security Authorization

                           

Hadoop has gained significant momentum in the past few years as a platform for scalable data storage and processing. More and more organisations are adopting Hadoop as an integral part of their data strategy. With the concepts of data lakes and the democratization of data becoming more popular, more users are getting access to data that was once restricted to a select few.
             
In the absence of a robust security model, this approach poses the risk of sensitive data falling into the hands of an unintended audience. With government regulations and compliance requirements such as HIPAA and rules around personally identifiable information (PII), it is becoming ever more important for organizations to implement a meaningful security model around the Data Lake. The topmost concerns for anyone building a Data Lake today are security and governance. The problem becomes all the more complex because there are multiple components within Hadoop that access the data, and multiple security mechanisms that one can use.
                  
While implementing security means taking care of Authentication, Authorization and Auditing, in today's blog we will focus on the Authorization aspect, and particularly on one solution: Apache Sentry. Apache Sentry, an Apache Top-Level Project, is an authorization module for Hive, Search, Impala and others that helps define what users and applications can do with data.

Currently, Hadoop offers two limited alternatives. In Hive's default authorization model, users can grant themselves permissions, which at best prevents accidental deletion of data; the problem with this approach is that it does not guard against malicious users. The other option is HDFS impersonation, where data is protected at the file level by HDFS permissions. But file-level permissions are not granular enough, and they do not support role-based access control.
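To see why file-level checks are too coarse, consider the toy sketch below. It is purely illustrative (the function name, path and permission bits are hypothetical, not the real HDFS API): a POSIX-style check can only answer "can this user read this file?", so a user who needs one column of a table ends up with access to every column stored in the file.

```python
# Toy illustration of POSIX/HDFS-style file permissions (hypothetical,
# not the real HDFS API). The check is all-or-nothing per file.

def can_read_file(user, user_groups, file_owner, file_group, mode):
    """Evaluate owner/group/other read bits, the way file systems do."""
    if user == file_owner:
        return bool(mode & 0o400)   # owner read bit
    if file_group in user_groups:
        return bool(mode & 0o040)   # group read bit
    return bool(mode & 0o004)       # other read bit

# 'alice' only needs one column of this (hypothetical) table file, but
# the file-level answer is all-or-nothing: she either reads every
# column in the file or none of them.
table_file = "/user/hive/warehouse/sales.db/orders/part-00000"
print(can_read_file("alice", {"analysts"}, "hive", "analysts", 0o640))  # True: whole file
```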

              
Apache Sentry overcomes most of these problems by providing secure role-based authorization. Sentry provides fine-grained authorization, a key requirement for any authorization service: privileges can be granted at the level of servers, databases, tables, views, indexes and collections. Sentry also supports multi-tenant administration, with separate policies for each database or schema that can be maintained by separate administrators. In order to use Sentry, one needs CDH 4.3.0 or later and HiveServer2 secured with strong authentication (Kerberos or LDAP).
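The sketch below is a minimal, hypothetical model of this role-based scheme in Python. It is not Sentry's actual API or policy format; the class and method names are invented for illustration. It captures the essential shape of the model: privileges are granted to roles, roles are assigned to groups, and users inherit access through group membership.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Privilege:
    action: str   # e.g. "SELECT", "INSERT" or "ALL"
    scope: str    # e.g. "server", "database" or "table"
    obj: str      # e.g. "sales" or "sales.orders"

@dataclass
class Role:
    name: str
    privileges: set = field(default_factory=set)

class PolicyStore:
    """Hypothetical in-memory stand-in for Sentry's policy store."""

    def __init__(self):
        self.roles = {}         # role name -> Role
        self.group_roles = {}   # group -> set of role names
        self.user_groups = {}   # user -> set of groups

    def create_role(self, name):
        self.roles[name] = Role(name)

    def grant(self, role, privilege):
        self.roles[role].privileges.add(privilege)

    def add_role_to_group(self, role, group):
        self.group_roles.setdefault(group, set()).add(role)

    def add_user_to_group(self, user, group):
        self.user_groups.setdefault(user, set()).add(group)

    def is_authorized(self, user, action, scope, obj):
        # A user is authorized if any role reachable through any of the
        # user's groups holds a matching privilege (or ALL) on the object.
        for group in self.user_groups.get(user, ()):
            for role_name in self.group_roles.get(group, ()):
                for p in self.roles[role_name].privileges:
                    if p.scope == scope and p.obj == obj \
                            and p.action in (action, "ALL"):
                        return True
        return False

# Usage: analysts may read the sales database but not write to it.
store = PolicyStore()
store.create_role("analyst")
store.grant("analyst", Privilege("SELECT", "database", "sales"))
store.add_role_to_group("analyst", "analysts")
store.add_user_to_group("alice", "analysts")

print(store.is_authorized("alice", "SELECT", "database", "sales"))  # True
print(store.is_authorized("alice", "INSERT", "database", "sales"))  # False
```

Because the role, not the user, carries the privilege, an administrator can change what every analyst may do by editing one role, which is what makes per-database, multi-tenant administration practical.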

In short, Apache Sentry makes it possible to store sensitive data in Hadoop securely, thereby extending Hadoop to more users and helping organizations comply with regulations. The fact that Sentry recently graduated out of the Incubator and is now an Apache Top-Level Project says a lot about its ability to provide robust authorization for the Hadoop ecosystem.

Wednesday, April 6

Data Lake

A data lake can be defined as an unstructured data warehouse where you pull all of your different data sources together into one large pool of data. The data lake as a concept can be applied more broadly to any type of system, but in today's Big Data world it is most likely to mean storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes.
More and more organisations are moving away from traditional data warehouses and adopting Hadoop data lakes, which provide a less expensive central location for analytics data. The reasons are many, but the primary ones among them are the ability to build the data lake on cheap commodity hardware and the use of open source technology for Hadoop development.
James Dixon, the founder and CTO of Pentaho, who has been credited with coming up with the term "Data Lake" describes it as follows:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The focus of a model Data Lake is on storing disparate data, and it ignores how and why the data is used, governed and secured. The Data Lake solves the problem of isolated pieces of information: it removes the need for lots of independently managed collections of data by combining these sources in a single lake. This increases information use and sharing, and at the same time reduces costs through server and license reduction.
In today's scenario, the security model within Data Lakes (which in all probability are Hadoop based) is still immature and not comprehensive. One reason for this could be that Hadoop is not a single technology stack but a collection of projects.
Anybody working on defining the architecture of a Hadoop-based Data Lake needs to think hard about how to provide robust authentication, authorization and auditing. There are several open source security efforts in the Apache community, such as Knox, Ranger, Sentry and Falcon. One needs to understand that dumping data into the Data Lake with no processes, procedures or data governance will only lead to a mess.
To summarize "Data Lake" as a concept is increasingly recognized as a important part of data strategy. There are multiple use cases within businesses that exist for the Data Lake. The key challenges for building the Data Lake are identified as security and Governance.