Analytics: The emergence of data lakes

Unlike traditional Data Warehouses, Data Lakes encourage schema-free entry of raw data into their stores. They can therefore hold structured, unstructured and semi-structured data in the same environment, setting the stage for Machine Learning.

IDG Staff Jan 23rd 2018

The Data Lake is a data storage architecture that organizations across the globe are adopting at a fast pace. Unlike traditional Data Warehouses, which usually impose a schema on the data being stored, Data Lakes encourage schema-free entry of raw data into the stores. This means a Data Lake can contain structured, unstructured and semi-structured data, all in the same environment. Once an organization deploys a Data Lake, Machine Learning tools can be used to mine insights from the unstructured and semi-structured data and combine them with insights mined by traditional means from structured data, powering data-driven decision-making within the organization.

The data lakes advantage

Data Lakes are essentially data stores into which an organization can dump any kind of data, whether structured, semi-structured or unstructured. They differ from a Data Warehouse, which often has a fixed schema that incoming data must fit. Data Lakes, on the other hand, impose no restriction on the schema the data must have. In fact, the data can be completely schema-free. A Data Lake is convenient for users because it serves as a single repository for almost all kinds of data produced by the organization's data generation processes, be it sales, marketing, production or HR.
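The contrast with a fixed warehouse schema can be made concrete with a minimal Python sketch of "schema-on-read": raw objects are stored untouched and only interpreted when someone reads them. The store contents, paths and the read_with_schema helper are hypothetical and chosen purely for illustration; a real lake would sit on a distributed file system or object store rather than an in-memory dictionary.

```python
import csv
import io
import json

# Hypothetical raw objects as they might sit in a lake, keyed by path.
# Nothing is validated or reshaped when the data is written.
RAW_STORE = {
    "sales/2018-01.csv": "order_id,amount\n1,250.0\n2,99.5\n",
    "marketing/clicks.json": '[{"campaign": "spring", "clicks": 120}]',
    "hr/notes.txt": "Onboarding feedback: laptop arrived late.",
}


def read_with_schema(path):
    """Interpret a raw object only at read time (schema-on-read)."""
    raw = RAW_STORE[path]
    if path.endswith(".csv"):
        # Structured data: parse rows and cast columns while reading.
        rows = csv.DictReader(io.StringIO(raw))
        return [
            {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
            for r in rows
        ]
    if path.endswith(".json"):
        # Semi-structured data: keep whatever fields the producer wrote.
        return json.loads(raw)
    # Unstructured data: hand back the text for downstream tools.
    return {"text": raw}


total_sales = sum(row["amount"] for row in read_with_schema("sales/2018-01.csv"))
```

The same store holds all three kinds of data, and each consumer decides how to interpret what it reads. A warehouse would instead reject or reshape records at load time.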

Data Lake Solutions

Several vendors now offer Data Lakes as a service. Dell EMC, a leader in warehouse solutions, offers multiple Data Lake solutions and claims high ROI and massive scaling benefits. Cloud providers such as Microsoft Azure have also started providing Data Lake solutions to enterprise customers.

We are also seeing a massive rise in interest in setting up Data Lakes from scratch on cloud infrastructure. To achieve this, organizations are spinning up multiple virtual machines and storage volumes on their favorite cloud provider and installing HDFS (the Hadoop Distributed File System) on them. Apache Hadoop, which runs on top of HDFS, is a favorite platform for setting up Data Lakes. Hadoop is also inexpensive and good at handling raw files, which is a primary requirement of Data Lakes. It comes in handy as well because analysts can use MapReduce to perform computations on the files and extract insights from the mass of unstructured data.
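To give a feel for the MapReduce style of computation mentioned above, here is a small single-process sketch in plain Python. It is not Hadoop code: the file names, contents and function names are hypothetical, and the shuffle step that Hadoop performs across machines is simulated with a dictionary.

```python
from collections import defaultdict

# Hypothetical raw text files, standing in for files stored on HDFS.
RAW_FILES = {
    "support/ticket1.txt": "delivery late refund requested",
    "support/ticket2.txt": "refund processed delivery confirmed",
}


def map_phase(text):
    """Emit a (word, 1) pair for every word in one file."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(files):
    pairs = [pair for text in files.values() for pair in map_phase(text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(RAW_FILES)
```

With these two sample files, counts["refund"] and counts["delivery"] are both 2. On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data, which is what lets the approach scale to large volumes of raw files.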

Data Lakes: How they help analytics

Advantages of having a Data Lake

As technology evolves, technologists will develop better ways of processing fuzzy, unstructured or semi-structured data. Specifically, the quality of processing of unstructured data depends on the machine learning techniques being used, as these are the primary tools for extracting insights from it.

Given that industries have analysts who are trained to use machine-learning tools on unstructured data, a Data Lake can be a huge asset. Since there is no restriction on the schema of data entering the lake, the data is likely to be rich in insight. Unstructured data can be viewed from multiple points of view, and each can yield useful information. Data Lakes also aggregate information from multiple cross-domain functions within the organization.

Cross-domain analytics therefore becomes easy. For example, skilled analysts can quickly find out how well a sales strategy is aligned with a recent product change, or how a product could be changed to drive up the success metrics of a marketing campaign.
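As a rough illustration of such a cross-domain question, the Python sketch below combines records from two functions, sales and product, that would normally live in separate systems. All records, field names and the sales_around_change helper are hypothetical, invented only to show the shape of the analysis.

```python
from datetime import date
from statistics import mean

# Hypothetical records from two different functions, both read from the lake.
daily_sales = [
    {"day": date(2018, 1, 1), "units": 100},
    {"day": date(2018, 1, 2), "units": 110},
    {"day": date(2018, 1, 3), "units": 150},
    {"day": date(2018, 1, 4), "units": 160},
]
product_changes = [{"day": date(2018, 1, 3), "change": "new packaging"}]


def sales_around_change(sales, change_day):
    """Return average daily units before and from the day of a product change."""
    before = [s["units"] for s in sales if s["day"] < change_day]
    after = [s["units"] for s in sales if s["day"] >= change_day]
    return mean(before), mean(after)


before_avg, after_avg = sales_around_change(daily_sales, product_changes[0]["day"])
```

A before-and-after comparison like this only hints at a relationship and does not show that the product change caused the difference. It does show how having both data sets in one repository makes the question cheap to ask.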

Because Data Lakes hold both structured and unstructured data, the analyst needs to be well-versed both in domain knowledge and in techniques for extracting insights from unstructured data sources.

Leveraging Data Lakes

Data Lakes are indeed an interesting architecture. The positive impact of a Data Lake on an organization depends on how well its analysts leverage the data in the lake and how effectively they combine domain knowledge with machine learning tools for insight extraction to get the most impactful insights out of the data.