Hadoop has gained significant momentum in the past few years as a platform for scalable data storage and processing. More and more organisations are adopting Hadoop as an integral part of their data strategy. As the concepts of data lakes and the democratization of data become more popular, more users are getting access to data that was once available only to a select few.
Without a robust security model in place, this approach poses a risk of sensitive data getting into the hands of an unintended audience. With government regulations such as HIPAA and compliance requirements around personally identifiable information (PII), it is also becoming more important for organizations to implement a meaningful security model around the data lake. The top concerns for anyone building a data lake today are security and governance. The problem becomes all the more complex because multiple components within Hadoop access the data, and there are multiple security mechanisms that one can use.
While implementing security means taking care of authentication, authorization and auditing, in today's blog we will focus on the authorization aspect, and particularly on one solution: Apache Sentry. Apache Sentry, an Apache Top-Level Project, is an authorization module for Hive, Search, Impala and others that helps define what users and applications can do with data. Currently, one option lets users grant themselves permissions, which is meant only to prevent accidental deletion of data. The problem with this approach is that it does not guard against malicious users. The other option is HDFS impersonation, where data is protected at the file level by HDFS permissions. But file-level permissions are not granular enough, and they do not support role-based access control.
Apache Sentry overcomes most of these problems by providing secure, role-based authorization. Sentry provides fine-grained authorization, a key requirement for any authorization service: privileges can be defined for servers, databases, tables, views, indexes and collections. Sentry supports multi-tenant administration, with separate policies for each database or schema that can be maintained by separate administrators. To use Sentry, one needs CDH 4.3.0 or later and a secure HiveServer2 with strong authentication (Kerberos or LDAP).
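To make the idea of role-based, fine-grained authorization concrete, here is a minimal sketch in Python. It is a toy model and not Sentry's actual API or policy format: the class names (Privilege, Policy), the user-to-group-to-role mapping, and the sample names (server1, sales, alice, bob) are all assumptions made for illustration. It only shows how a privilege granted at database level can cover every table beneath it, while a table-level privilege covers just that table.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical toy model for illustration; this is NOT Sentry's real API.
@dataclass(frozen=True)
class Privilege:
    action: str                      # e.g. "select", "insert", or "all"
    server: str
    database: Optional[str] = None   # None means every database on the server
    table: Optional[str] = None      # None means every table in the database

    def allows(self, action, server, database, table):
        # A privilege matches when the action fits and every scope level
        # it names is equal to the requested object.
        if self.action not in ("all", action):
            return False
        if self.server != server:
            return False
        if self.database is not None and self.database != database:
            return False
        if self.table is not None and self.table != table:
            return False
        return True


@dataclass
class Policy:
    role_privileges: dict = field(default_factory=dict)  # role -> set of Privilege
    group_roles: dict = field(default_factory=dict)      # group -> set of role names
    user_groups: dict = field(default_factory=dict)      # user -> set of group names

    def grant(self, role, privilege):
        self.role_privileges.setdefault(role, set()).add(privilege)

    def assign_role(self, group, role):
        self.group_roles.setdefault(group, set()).add(role)

    def add_user(self, user, group):
        self.user_groups.setdefault(user, set()).add(group)

    def is_allowed(self, user, action, server, database, table):
        # Walk user -> groups -> roles -> privileges; users never hold
        # privileges directly, only through a role.
        for group in self.user_groups.get(user, ()):
            for role in self.group_roles.get(group, ()):
                for privilege in self.role_privileges.get(role, ()):
                    if privilege.allows(action, server, database, table):
                        return True
        return False


policy = Policy()
# Table-level privilege: read access to a single table.
policy.grant("analyst", Privilege("select", "server1", "sales", "orders"))
# Database-level privilege: full access to everything in one database.
policy.grant("sales_admin", Privilege("all", "server1", "sales"))
policy.assign_role("analysts", "analyst")
policy.assign_role("sales_admins", "sales_admin")
policy.add_user("alice", "analysts")
policy.add_user("bob", "sales_admins")

print(policy.is_allowed("alice", "select", "server1", "sales", "orders"))     # True
print(policy.is_allowed("alice", "select", "server1", "sales", "customers"))  # False
print(policy.is_allowed("bob", "insert", "server1", "sales", "customers"))    # True
print(policy.is_allowed("bob", "select", "server1", "hr", "salaries"))        # False
```

In this sketch the administrator of the sales database could manage the sales roles without touching policies for any other database, which is the kind of separation the multi-tenant administration point describes.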
In short, Apache Sentry makes it possible to store sensitive data in Hadoop, thereby extending Hadoop to more users and helping organizations comply with regulations. The fact that Sentry recently graduated from the Apache Incubator and is now an Apache Top-Level Project says a lot about its ability to provide robust authorization for the Hadoop ecosystem.