Wednesday, April 6

Data Lake

A data lake can be defined as a kind of unstructured data warehouse: you pull all of your different sources into one large pool of raw data. As a concept the data lake can apply to any type of system, but in today's Big Data world it is most likely to mean storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes.
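As a minimal sketch of what "pulling all of your sources into one pool" can look like, the snippet below lands raw files from two unrelated source systems into a single HDFS directory tree using the Python hdfs (WebHDFS) client. The NameNode URL, user and paths are illustrative assumptions, not a prescribed layout:

    # Minimal ingestion sketch using the Python "hdfs" WebHDFS client.
    # NameNode host/port, user and paths below are illustrative only.
    import posixpath
    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:50070', user='etl')

    # Raw extracts from unrelated source systems all land in one pool,
    # kept in their native formats; no up-front schema is imposed.
    sources = {
        'exports/crm/customers.csv': '/lake/raw/crm/customers.csv',
        'exports/web/clickstream.json': '/lake/raw/web/clickstream.json',
    }

    for local_path, hdfs_path in sources.items():
        client.makedirs(posixpath.dirname(hdfs_path))
        client.upload(hdfs_path, local_path)

The point here is the layout rather than the tooling: everything arrives raw, side by side, on one cluster.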
More and more organisations are moving away from traditional data warehouses and adopting Hadoop data lakes, which provide a less expensive central location for analytics data. The reasons vary, but chief among them are the ability to build the data lake on cheap commodity hardware and the use of open source technology for Hadoop development.
James Dixon, the founder and CTO of Pentaho, who is credited with coining the term "Data Lake", describes it as follows:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The focus of a model Data Lake is on storing disparate data in its raw form, ignoring how and why the data is used, governed and secured. The Data Lake solves the problem of isolated pieces of information: it removes the need for lots of independently managed collections of data by combining those sources in a single lake. This increases information use and sharing while reducing costs through server and license consolidation.
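One practical consequence of storing data this raw is that structure gets applied only when someone reads it, often called schema-on-read. A hedged sketch, reusing the hypothetical client and clickstream path from the ingestion example above:

    # Schema-on-read sketch: the clickstream was stored as-is, and each
    # consumer imposes whatever structure it needs at read time.
    # (Client, path and field names are illustrative assumptions.)
    import json

    with client.read('/lake/raw/web/clickstream.json',
                     encoding='utf-8', delimiter='\n') as lines:
        for line in lines:
            if not line:
                continue
            event = json.loads(line)
            # Nothing was cleansed or conformed on the way in; this
            # consumer simply picks the fields it cares about.
            print(event.get('user_id'), event.get('page'))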
Today the security model within Data Lakes (which in all probability are Hadoop based) is still immature and not comprehensive. One reason for this is that Hadoop is not a single technology stack but a collection of projects.
Anybody defining the architecture of a Hadoop based Data Lake needs to think hard about how to provide robust authentication, authorization and auditing. There are several open source security efforts in the Apache community, such as Knox, Ranger, Sentry and Falcon. One needs to understand that dumping data into the Data Lake with no process, procedures or data governance will only lead to a mess, as the sketch below illustrates.
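To make that governance point concrete, here is a hypothetical sketch of the kind of lightweight gate that keeps a lake from turning into a swamp: an ingestion wrapper that refuses to land a file unless minimal ownership and lineage metadata comes with it. The function, required fields and sidecar-file convention are illustrative assumptions, not part of any Apache project:

    # Hypothetical governance gate: land a file only when minimal
    # metadata (owner, source system, description) accompanies it.
    import json
    import posixpath

    REQUIRED_FIELDS = {'owner', 'source_system', 'description'}

    def governed_upload(client, hdfs_path, local_path, metadata):
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            raise ValueError('refusing ungoverned ingest; missing: %s'
                             % ', '.join(sorted(missing)))
        client.makedirs(posixpath.dirname(hdfs_path))
        client.upload(hdfs_path, local_path)
        # A metadata sidecar next to the data lets auditors and other
        # consumers discover who owns it and where it came from.
        client.write(hdfs_path + '.meta.json',
                     json.dumps(metadata, indent=2),
                     overwrite=True, encoding='utf-8')

The mechanism matters less than the habit: every dataset that enters the lake should carry enough context for someone else to find, trust and audit it later.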
To summarize "Data Lake" as a concept is increasingly recognized as a important part of data strategy. There are multiple use cases within businesses that exist for the Data Lake. The key challenges for building the Data Lake are identified as security and Governance.