Sunday, March 20

Schema On Read

According to a study, 90% of today's data warehouses process just 20% of enterprise data. One of the primary reasons for such a low percentage is that traditional data warehouses are schema-on-write: the schema, partitions and indexes must be built before the data can be read.

For a long time now, the data world has adopted the schema-on-write approach, where systems require users to create a schema before loading any data into the system. The flow is to define the schema, then write the data, then read the data back in the schema that was defined at the start. This allows such systems to tightly control the placement of the data at load time, which enables them to answer interactive queries very quickly. However, this comes at the cost of agility, because any change in the shape of the data means revisiting the schema before that data can be loaded.
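
The define, write, read flow described above can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database as a stand-in for a warehouse; the `orders` table, its columns and the `load_with_schema` helper are hypothetical names chosen for the example.

```python
import sqlite3


def load_with_schema(rows):
    """Schema-on-write sketch: the table must exist before any row is loaded."""
    conn = sqlite3.connect(":memory:")
    # Step 1: define the schema (and an index) up front.
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

    accepted, rejected = 0, []
    # Step 2: write the data; rows that do not fit the schema are refused.
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
                (row.get("id"), row.get("customer"), row.get("amount")),
            )
            accepted += 1
        except sqlite3.IntegrityError:
            rejected.append(row)
    return conn, accepted, rejected


# Hypothetical input: the third record has no "amount" and carries an
# extra "coupon" field that the schema knows nothing about.
rows = [
    {"id": 1, "customer": "acme", "amount": 120.0},
    {"id": 2, "customer": "globex", "amount": 75.5},
    {"id": 3, "customer": "acme", "coupon": "SPRING"},
]

conn, accepted, rejected = load_with_schema(rows)

# Step 3: read the data back in the schema defined at the start.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

In this sketch the third record never reaches the table: it is rejected at load time, which is the loss of agility in miniature. Accepting it would mean changing the schema first.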

But there is an alternative to this approach: schema-on-read. Hadoop's schema-on-read capability allows data to start flowing into the system in its original form. The schema is then applied at read time, when each user can apply their own view to interpret the data.

In the schema-on-read world, the flow is to load the data in its current form and apply your own lens to the data when you read it back out. The reason to do it this way is that data is nowadays shared with a large and varied set of users, with each category of user having a unique way of looking at the data and deriving an insight that is specific to them. With the schema-on-read approach, the data is not restricted to a one-size-fits-all format. This approach allows one to present the data back in the schema that is most relevant to the task at hand, and allows for extreme agility when dealing with complex, evolving data structures.

The schema-on-read approach allows one to load the data as is and start getting value from it right away. This is equally true for structured, semi-structured and unstructured data. With schema-on-read, it is also easier to create multiple, different views of the same data.
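
As a rough illustration of "load as is, then apply your own lens", the sketch below keeps raw JSON lines untouched and lets two readers interpret them differently. It uses plain Python rather than Hadoop, and the event fields and the `read_raw`, `finance_view` and `web_view` helpers are hypothetical names invented for the example.

```python
import json

# Raw events stored exactly as they arrived; nothing was validated on load.
RAW_EVENTS = [
    '{"user": "ana", "action": "purchase", "amount": 30.0}',
    '{"user": "bob", "action": "click", "page": "/home"}',
    '{"user": "ana", "action": "click", "page": "/cart"}',
    "not valid json",
]


def read_raw(lines):
    """Parse each line at read time, skipping anything that is not a JSON object."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def finance_view(lines):
    """One reader's schema: total purchase amount per user."""
    totals = {}
    for rec in read_raw(lines):
        if rec.get("action") == "purchase":
            user = rec.get("user", "unknown")
            totals[user] = totals.get(user, 0.0) + float(rec.get("amount", 0.0))
    return totals


def web_view(lines):
    """Another reader's schema: number of clicks per page."""
    counts = {}
    for rec in read_raw(lines):
        if rec.get("action") == "click":
            page = rec.get("page", "unknown")
            counts[page] = counts.get(page, 0) + 1
    return counts


print(finance_view(RAW_EVENTS))  # {'ana': 30.0}
print(web_view(RAW_EVENTS))      # {'/home': 1, '/cart': 1}
```

Both views read the same stored lines, and neither required the data to be reshaped before it was loaded. The trade-off in this sketch is that parsing and validation happen on every read instead of once at load time.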

The popularity of Hadoop and related technologies in today's enterprise technology can be credited in part to their ability to support the schema-on-read strategy. Organisations hold large amounts of raw data that, once transformed, power all sorts of business processes involving corporate data warehouses and other large data assets.