Data lakes have emerged as a storage and repository mechanism for raw, unprocessed data in its native form. Organizations today generate enormous volumes of data about every aspect of their activity. Data lakes are created to address the resulting challenges of data storage, integration and accessibility. They allow data to be refined, explored and enriched according to organizational requirements.
As Wikipedia states, a data lake is a large storage repository and processing engine. Data lakes provide “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”.
The data lake concept emerges from the need to manage and exploit new forms of data. A data lake allows distinct data records to be stored in their original formats for subsequent analysis, rather than being integrated up front as in a data warehouse. This preserves the original data, so different analyses can be performed against it in different contexts.
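To make the idea of keeping records in their original format concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory dictionary named lake standing in for real storage, and the helper names ingest, read_json and read_csv are illustrative, not part of any product. Records are stored untouched and only interpreted when an analysis reads them.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": object key -> raw bytes, stored untouched.
lake = {}

def ingest(key, raw_bytes):
    """Store a record exactly as received, with no transformation."""
    lake[key] = raw_bytes

def read_json(key):
    """Interpret a stored object as JSON at read time."""
    return json.loads(lake[key].decode("utf-8"))

def read_csv(key):
    """Interpret a stored object as CSV rows (one dict per row) at read time."""
    text = lake[key].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# Sample records invented for illustration only.
ingest("events/click.json", b'{"user": "u1", "action": "click"}')
ingest("sales/2020.csv", b"region,amount\nnorth,10\nsouth,25\n")

# Two different analyses over the same untouched raw data.
action = read_json("events/click.json")["action"]
total = sum(int(row["amount"]) for row in read_csv("sales/2020.csv"))
```

Because nothing is converted at ingestion, a later analysis is free to read the same bytes in a different way without any reloading.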
Organizations increasingly collect, extract and save data into a single repository without first investing the effort to load it into relational data warehouses. They require a platform or framework that offers a flexible, cost-effective data processing model and can accommodate growing data volumes. Moreover, data lakes can hold unstructured data and are compatible with mobile, cloud and IoT platforms.
Initially, data lakes were created by web-based organizations that handled data at large scale and performed analytics on it. Eventually, as other organizations adopted big data, they too created data lakes to complement their existing enterprise data warehouses. A data lake can support batch workloads and an interactive UI. It also offers agility and ease of data capture at efficient pricing.
Data lakes follow an ‘ETL’ (extract, transform, load) model to analyze the data. The ETL process can include data from both enterprise applications and big data sources. The data is generally categorized into video, audio, images and other unstructured and semi-structured types so that analytic techniques can be applied to extract insights. The categories may vary by industry and client requirements. Once the analysis is performed, the results provide actionable insights into the situation or act as a base for further analysis. In this way, the data becomes more structured and delivers business value.
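The categorization step described above can be sketched as a tiny extract, transform, load pipeline in Python. This is an illustration under stated assumptions, not a prescribed design: the CATEGORY_BY_EXTENSION mapping, the function names and the sample object keys are all hypothetical, and a real pipeline would likely inspect content rather than file extensions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to category; adjust per industry.
CATEGORY_BY_EXTENSION = {
    ".mp4": "video", ".avi": "video",
    ".mp3": "audio", ".wav": "audio",
    ".png": "image", ".jpg": "image",
    ".json": "semi-structured", ".xml": "semi-structured",
}

def extract(keys):
    """Extract: collect the object keys to process (here, passed in directly)."""
    return list(keys)

def transform(keys):
    """Transform: tag each key with a category, defaulting to 'unstructured'."""
    tagged = []
    for key in keys:
        suffix = PurePosixPath(key).suffix.lower()
        tagged.append((key, CATEGORY_BY_EXTENSION.get(suffix, "unstructured")))
    return tagged

def load(tagged):
    """Load: group keys by category into a catalog ready for analysis."""
    grouped = defaultdict(list)
    for key, category in tagged:
        grouped[category].append(key)
    return dict(grouped)

# Sample object keys invented for illustration only.
sample_keys = [
    "raw/intro.MP4", "raw/call.wav", "raw/logo.png",
    "raw/events.json", "raw/notes.txt",
]
catalog = load(transform(extract(sample_keys)))
```

The resulting catalog groups each object under its category, which is the point at which category-specific analytic techniques could be applied.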
Nevertheless, organizations face concerns about data security, access control, regulatory compliance and data lifecycle management. Data lakes are built on experimentation and progressive growth, and are adopted to extract value from the organization’s extensive business data.
With storage becoming cheaper day by day, we are no longer restricted to storing only a limited set of cleansed, structured data for operational reporting. With data stored in a data lake, we also have the resources available to answer ad-hoc queries.