Data Lake

What is a Data Lake?

A data lake is a centralized repository, typically in a cloud architecture, that holds large amounts of data in its raw, native format. Unlike hierarchical data warehouses or isolated data silos, data lakes use a flat architecture with object storage, in which each stored object carries metadata that describes it.

How were data lakes developed?

The term “data lake” is generally credited to James Dixon, then chief technology officer of Pentaho, who used it in 2010, although the underlying concept had been in practice before it had a name. Data lakes address the need for scalable data repositories that can store data of many file types from many sources and make it available for analysis.

A data lake can be viewed as a centralized location capable of holding petabytes of data in its raw, native format. Whereas a hierarchical data warehouse keeps data in files and folders, a data lake uses a flat architecture with object-based storage. Metadata tags and unique identifiers make it easier for Big Data operations to locate and retrieve data across regions with improved performance, and they allow multiple applications to work with the same stored objects.
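The flat, tag-based layout described above can be illustrated with a small sketch. This is a minimal in-memory model, not any vendor's API: the `FlatObjectStore` class, its `put`, `get`, and `find` methods, and the sample keys and tags are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object in the lake: raw bytes plus descriptive metadata tags."""
    key: str
    data: bytes
    tags: dict = field(default_factory=dict)


class FlatObjectStore:
    """Hypothetical in-memory stand-in for an object store (not a real API)."""

    def __init__(self):
        # A single flat namespace: every object is addressed by its key,
        # with no folder hierarchy to walk.
        self._objects = {}

    def put(self, key, data, **tags):
        self._objects[key] = StoredObject(key, data, dict(tags))

    def get(self, key):
        return self._objects[key].data

    def find(self, **wanted):
        """Return the keys of objects whose tags match every requested value."""
        return sorted(
            obj.key
            for obj in self._objects.values()
            if all(obj.tags.get(name) == value for name, value in wanted.items())
        )


store = FlatObjectStore()
store.put("a1f3", b'{"user": 7, "event": "login"}', source="web", format="json")
store.put("b2e4", b"\x89PNG...", source="mobile", format="png")
store.put("c3d5", b"id,amount\n1,9.99\n", source="web", format="csv")

web_keys = store.find(source="web")
print(web_keys)  # ['a1f3', 'c3d5']
```

Objects of different formats sit side by side, and retrieval depends on the tags rather than on where a file sits in a directory tree.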

Why do organizations choose data lakes?

Data lakes enable enterprises to transform raw data into structured data that is ready for SQL-based analytics, data science, and machine learning, with lower latency. Data of any type, including streaming data, images, video, and binary files, can be collected and retained indefinitely. Because the data lake accepts multiple file types and acts as a “safe harbor” for new data, it is more easily kept up to date.

With this kind of flexibility, data lakes enable users with different skill sets, locales, and languages to perform the tasks they need. Compared with the data warehouses and silos that data lakes have in many cases replaced, the flexibility they provide to Big Data and machine learning applications is increasingly evident.

Data lake vs. data warehouse

While both data lakes and data warehouses can be used for storing large amounts of data, there are several key differences in how that data can be accessed and used. Data lakes store raw data of any file type. A data warehouse, by contrast, stores data that has already been structured and filtered for a specific purpose.
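The difference is sometimes described as schema-on-write (warehouse) versus schema-on-read (lake). The sketch below is a simplified illustration under that assumption; the function names, the `WAREHOUSE_COLUMNS` schema, and the sample records are hypothetical and do not represent any particular product.

```python
import json

WAREHOUSE_COLUMNS = ("order_id", "amount")  # hypothetical fixed schema


def warehouse_write(table, record):
    """Schema-on-write: reject anything that does not fit the schema."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))


def lake_write(lake, raw_line):
    """The lake keeps the raw text untouched; no schema is checked here."""
    lake.append(raw_line)


def lake_read(lake, columns):
    """Schema-on-read: parse at query time, keep rows that have the columns."""
    rows = []
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # not JSON; still stored, just skipped by this reader
        if isinstance(record, dict) and all(c in record for c in columns):
            rows.append(tuple(record[c] for c in columns))
    return rows


incoming = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": 2, "amount": 20.0, "coupon": "SPRING"}',
    "free-text support note",
]

table, lake, rejected = [], [], 0
for line in incoming:
    lake_write(lake, line)  # the lake accepts everything
    try:
        warehouse_write(table, json.loads(line))
    except ValueError:  # covers both bad JSON and schema mismatches
        rejected += 1

print(len(lake), len(table), rejected)         # 3 1 2
print(lake_read(lake, ("order_id", "amount")))  # [(1, 9.5), (2, 20.0)]
print(lake_read(lake, ("coupon",)))             # [('SPRING',)]
```

The warehouse-style table holds only the one record that fit its schema, while the lake retains all three inputs and lets a later reader decide which structure to apply, including a question (the coupon) that nobody planned for at write time.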

With their open format, data lakes do not require a specific file type, nor are users subject to proprietary vendor lock-in. One advantage of data lakes over silos or warehouses is the ability to store any type of data or file, rather than only what fits a more structured environment. Another is that the purpose of the data need not be defined when the data lake is set up, whereas a data warehouse is created as a repository for filtered data that has already been processed with a specific intention.

A centralized data lake is preferable to silos and warehouses because it helps eliminate issues like data duplication, redundant security policies, and difficulty with multi-user collaboration. To the downstream user, a data lake appears as a single place to look for or combine multiple sources of data.

Data lakes are also, by comparison, very durable and economical because they scale well and can leverage object storage. And since advanced analytics and machine learning with unstructured data are an increasing priority for many businesses today, the ability to “ingest” raw data in structured, semi-structured, and unstructured formats makes data lakes an increasingly popular choice for data storage.

Understanding data lake architecture

The initial point of contact with a data lake is the ingestion tier. Simply put, this is where raw data is added to the data lake. On-premises environments typically store that data in the Hadoop Distributed File System (HDFS). Files and data then move to what is often referred to as an “insights tier,” where the environment caches relevant information for data analysis. Whether the user chooses SQL or NoSQL for queries, insights on the raw data then move to the distillation (or processing) tier, where the data and its metadata are converted to structured data for use by data managers and administrators.
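As a rough illustration of how raw data moves from ingestion to a structured, queryable form, the sketch below collapses the tiers into two steps. It is a simplification under stated assumptions: the `raw_zone` list, the `ingest` and `distill` functions, and the `events` table are hypothetical names, and an in-memory SQLite database stands in for whatever query engine a real deployment would use.

```python
import json
import sqlite3

raw_zone = []  # ingestion tier: raw payloads are stored exactly as received


def ingest(payload, source):
    raw_zone.append({"source": source, "payload": payload})


def distill(conn):
    """Distillation tier: turn raw JSON payloads into structured rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(source TEXT, user_id INTEGER, action TEXT)"
    )
    for item in raw_zone:
        try:
            record = json.loads(item["payload"])
        except (TypeError, ValueError):
            continue  # non-JSON objects stay in the raw zone untouched
        if isinstance(record, dict) and "user_id" in record and "action" in record:
            conn.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (item["source"], record["user_id"], record["action"]),
            )
    conn.commit()


ingest('{"user_id": 1, "action": "view"}', source="web")
ingest('{"user_id": 1, "action": "buy"}', source="web")
ingest('{"user_id": 2, "action": "view"}', source="mobile")
ingest(b"\x89PNG\r\n", source="camera")  # binary data is accepted as-is

conn = sqlite3.connect(":memory:")
distill(conn)
counts = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
print(counts)  # [('buy', 1), ('view', 2)]
```

Nothing is discarded at ingestion; the binary object simply has no structured representation yet and remains available in the raw zone for other tools.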

At the unified operations tier, systems management and monitoring tools perform extensive auditing and handle proficiency, data, and workflow management. It is also important that the data is continuously checked for potential security and compliance issues. Data-driven enterprise clients need real-time updates to properly identify the trends and insights that a data lake provides.

The purpose of a data lake is data access and consumption. Without a catalog, users must profile data sets themselves to check their integrity, which slows operations. Likewise, governance requires that compliance and security issues are addressed, for example by removing personally identifiable information from files, and that the data lake can delete specific data without disrupting consumption.
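The catalog and governance requirements above can be sketched in a few lines. This is an illustrative model only: the `catalog` and `datasets` dictionaries, the `PII_FIELDS` set, and the `register`, `read_masked`, and `delete_subject` functions are hypothetical, and a production data lake would rely on its platform's own catalog and access controls.

```python
PII_FIELDS = {"email", "phone"}  # hypothetical list of fields treated as PII

catalog = {}   # dataset name -> description, owner, row count
datasets = {}  # dataset name -> list of records


def register(name, description, owner, records):
    """Record a dataset in the catalog so users can find it without profiling."""
    catalog[name] = {"description": description, "owner": owner, "rows": len(records)}
    datasets[name] = list(records)


def read_masked(name):
    """Serve records to consumers with PII fields removed."""
    return [
        {k: v for k, v in record.items() if k not in PII_FIELDS}
        for record in datasets[name]
    ]


def delete_subject(customer_id):
    """Delete one customer's records everywhere; other rows stay readable."""
    removed = 0
    for name, records in datasets.items():
        kept = [r for r in records if r.get("customer_id") != customer_id]
        removed += len(records) - len(kept)
        datasets[name] = kept
        catalog[name]["rows"] = len(kept)
    return removed


register("orders", "Order history", "sales", [
    {"customer_id": 1, "email": "a@example.com", "total": 30},
    {"customer_id": 2, "email": "b@example.com", "total": 12},
])
register("tickets", "Support tickets", "support", [
    {"customer_id": 1, "phone": "555-0100", "topic": "refund"},
])

removed = delete_subject(1)
remaining = read_masked("orders")
print(removed)    # 2
print(remaining)  # [{'customer_id': 2, 'total': 12}]
```

The catalog entry is updated alongside the data, so consumers still see accurate row counts after a targeted deletion.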

What are data lake platforms?

Virtually all major cloud service providers offer modern data lake solutions. On-premises data centers continue to use the Hadoop Distributed File System (HDFS) as a near-standard. As enterprises continue to adopt the cloud, however, numerous options are available to data scientists, engineers, and IT professionals who want to move their data storage to a cloud-based data lake environment.

Data lakes are particularly helpful when working with streaming data, such as JSON event records. The three most typical business use cases are business analytics or intelligence, data science focused on machine learning, and data serving, that is, high-performance applications that depend on real-time data.

All major cloud service providers, from Amazon Web Services (AWS) to Microsoft Azure to Google Cloud, offer the storage and services necessary for cloud-based data lakes. Whatever level of integration an organization is looking for, from simple backup to complete integration, there is no shortage of options.

How are data lakes used?

Unlike two or three decades ago, most business decisions are no longer based solely on transactional data stored in warehouses. The sea change from the structured data warehouse to the fluidity of the modern data lake is a response to the changing needs and abilities of modern Big Data and data science applications.

Though new applications continue to emerge on an almost-daily basis, some of the more typical applications for the modern data lake focus on fast acquisition and analysis of new data. For example, a data lake can combine a CRM platform’s customer data with social media analytics, or with data from a marketing platform that tracks a customer’s buying history. When these are combined, a business can better understand potential areas of profit or the cause of customer churn.
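Combining sources in this way amounts to joining records on a shared identifier. The sketch below assumes both sources carry a `customer_id` field; the sample data, the field names, the `churn_candidates` function, and the sentiment threshold are all hypothetical and chosen only to show the idea.

```python
# Hypothetical CRM export: one row per customer.
crm = [
    {"customer_id": 1, "name": "Ada", "orders_last_90d": 0},
    {"customer_id": 2, "name": "Grace", "orders_last_90d": 3},
    {"customer_id": 3, "name": "Linus", "orders_last_90d": 0},
    {"customer_id": 4, "name": "Edsger", "orders_last_90d": 0},
]

# Hypothetical social media analytics: sentiment score from -1.0 to 1.0.
social = [
    {"customer_id": 1, "sentiment": -0.6},
    {"customer_id": 2, "sentiment": -0.8},
    {"customer_id": 3, "sentiment": 0.4},
]


def churn_candidates(crm_rows, social_rows, sentiment_floor=-0.3):
    """Flag customers with no recent orders and negative social sentiment."""
    sentiment = {row["customer_id"]: row["sentiment"] for row in social_rows}
    flagged = []
    for customer in crm_rows:
        score = sentiment.get(customer["customer_id"])
        if score is None:
            continue  # no social data for this customer, so no signal
        if customer["orders_last_90d"] == 0 and score < sentiment_floor:
            flagged.append(customer["name"])
    return flagged


at_risk = churn_candidates(crm, social)
print(at_risk)  # ['Ada']
```

Neither source alone identifies the at-risk customer: the CRM shows three inactive customers, and the social data shows two unhappy ones, but only the combination narrows it to one.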

Likewise, a data lake enables research and development teams to test hypotheses and assess the results. With more and more ways to collect data in real time, a data lake makes storage and analysis faster, more intuitive, and accessible to more engineers.

HPE and data lakes

Big Data is how businesses today tackle their biggest challenges. While Hadoop has been successful in distilling value from unstructured data, organizations are looking for newer, better ways to simplify how they do it.

Today’s businesses make enormous expenditures in analytics, from systems to data scientists to their IT workforce, to implement, operate, and maintain their on-premises Hadoop-based data management. As with any data environment, capacity needs can change rapidly.

HPE GreenLake offers organizations a scalable, cloud-based solution that can fundamentally simplify their Hadoop experience, reducing complexity and cost so that they can focus on gaining the insights that the data provides. HPE GreenLake offers a complete end-to-end solution with hardware, software, and HPE Services.

By maximizing the potential of your data, HPE GreenLake takes full advantage of the HDFS data lake already contained in the on-premises environment, while leveraging the advantages and insights offered in the cloud.