What is a Data Lakehouse?
A data lakehouse is an open data management architecture that combines the flexibility and scalability benefits of a data lake with the data structures and data management features of a data warehouse.
How have data warehouses changed over the past few decades?
Organizations have used data warehouses, also known as enterprise data warehouses (EDWs), for decades to store and manage the data they need to drive business insight. But as the types, sources, and amounts of data generated have multiplied over the years, traditional data warehouse architectures have not been able to fully keep up with the velocity, variety, and volumes of business data being created within enterprises every day. And as enterprises increasingly adopted artificial intelligence (AI) and machine learning (ML) technologies, the algorithms used by these tools required direct access to the data.
What are data lakes?
Data lakes are architectures used to store the vast amounts of unstructured and semi-structured data that organizations collect from their various business applications, systems, and devices. Data lakes typically employ low-cost storage infrastructure with a file application programming interface (API) that holds data in generic, open file formats. This means data lakes are useful for storing data at scale and making it available for AI and ML algorithms, but they do not address data quality or governance requirements. As duplicate, irrelevant, and unorganized data accumulates in a poorly managed data lake, the lake can become what is known as a data swamp, making it difficult to extract meaningful insights from the data it contains.
How do data lakehouses prevent data swamps?
A data lakehouse adds the data structures and management features of a data warehouse to the flexibility and scalability of a data lake, which addresses the data quality and governance gaps that turn lakes into swamps. This combination gives data science teams the agility to use data without needing to access multiple systems. Data lakehouses also ensure that data scientists have the most complete and up-to-date data available for business analytics, AI, and ML projects.
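One way to picture the "structures and management capabilities" mentioned above is schema enforcement on write: records that do not match a declared structure are set aside instead of landing in the table. The following is a minimal sketch of that idea only, not how any particular lakehouse product works. The `ManagedTable` class and the `ORDERS_SCHEMA` columns are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical schema for an illustrative "orders" table: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "amount": float}


@dataclass
class ManagedTable:
    """Toy table that validates every record before accepting it."""
    schema: dict
    rows: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def append(self, record: dict) -> bool:
        """Accept the record only if its columns and types match the schema."""
        same_columns = set(record) == set(self.schema)
        if same_columns and all(
            isinstance(record[name], expected) for name, expected in self.schema.items()
        ):
            self.rows.append(record)
            return True
        # Keep rejects separately so they can be inspected instead of polluting the table.
        self.rejected.append(record)
        return False


orders = ManagedTable(ORDERS_SCHEMA)
orders.append({"order_id": 1, "customer": "acme", "amount": 19.99})
orders.append({"order_id": "2", "customer": "acme", "amount": 5.0})  # wrong type for order_id
orders.append({"customer": "globex"})  # missing columns
print(len(orders.rows), len(orders.rejected))
```

In this sketch the malformed records never reach `rows`, which is the property that keeps a managed table from drifting toward a swamp. A plain data lake, by contrast, would have stored all three records as is.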
What are the advantages of a data lakehouse?
The data lakehouse architecture offers a number of advantages:
1. It eliminates simple extract, transform, and load (ETL) jobs because query engines connect directly to the data lake.
2. It reduces data redundancy by using a single tool to process data, instead of managing data on multiple platforms with multiple tools.
3. It enables direct connection to multiple business intelligence (BI) and analytics tools.
4. It makes data governance easier because sensitive data does not have to be moved from one data pool to another and can be managed from one point.
5. It helps reduce costs because data can be stored in one location using object storage.
What’s the difference between a data lakehouse, a data warehouse, and a data lake?
A data warehouse is a large collection of business data aggregated from multiple sources into a single, consistent data store. These platforms are specifically designed to perform analytics on large amounts of structured data. A data warehouse system regularly pulls data from various business intelligence (BI) systems, then formats and imports that data to match the format and standards of the data already within the data warehouse. This allows data to be stored in organized files or folders so it is readily available for reporting and data analysis.
A data lake stores all types of raw, structured, and unstructured data from all enterprise data sources in its native format at scale. Data is added to the data lake as is, meaning there is no reformatting of the new data to align with other data already in the system. Data lakes play a key role in making data available for AI and ML systems and Big Data analytics.
A data lakehouse is a new, open architecture that combines the flexibility and scalability benefits of a data lake with the data structures and data management features of a data warehouse. This combination of features enables agility for data science teams because they can use data without needing to access multiple systems. Data lakehouses also ensure that data scientists have the most complete and up-to-date data available.
What are the elements of a data lakehouse?
At a high level, there are two primary layers to the data lakehouse architecture. The lakehouse platform manages the ingest of data into the storage layer (i.e., the data lake). The processing layer is then able to query the data in the storage layer directly using a variety of tools without requiring the data to be loaded into a data warehouse or transformed into a proprietary format. The data can then be used by BI applications as well as AI and ML tools.
This architecture provides the economics of a data lake, but because any type of processing engine can read this data, organizations have the flexibility to make the prepared data available for analysis by a variety of systems. In this way, processing and analysis can be done with higher performance and lower cost.
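The split between a storage layer and a processing layer can be sketched in a few lines. This is an illustration under stated assumptions, not a real lakehouse engine: a temporary directory stands in for object storage, JSON Lines stands in for an open file format, and the function names `write_batch` and `total_by_region` are hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_batch(storage_dir: Path, name: str, records: list) -> None:
    """Ingest step: land a batch as a JSON Lines file in the storage layer."""
    with open(storage_dir / f"{name}.jsonl", "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def total_by_region(storage_dir: Path) -> dict:
    """Processing step: aggregate straight from the files, with no load into another store."""
    totals = defaultdict(float)
    for path in sorted(storage_dir.glob("*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                totals[record["region"]] += record["amount"]
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)  # stands in for low-cost object storage
    write_batch(storage, "day1", [{"region": "emea", "amount": 10.0},
                                  {"region": "apac", "amount": 4.0}])
    write_batch(storage, "day2", [{"region": "emea", "amount": 2.5}])
    result = total_by_region(storage)

print(result)
```

Because the files stay in an open format, any other reader (a BI tool, an ML training job) could scan the same directory without a copy being made first, which is the point the paragraphs above make about flexibility and cost.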
The architecture also enables multiple parties to concurrently read and write data within the system because it supports database transactions that comply with ACID (atomicity, consistency, isolation, and durability) principles, detailed below:
Atomicity means that when processing transactions, either the entire transaction succeeds or none of it does. This helps prevent data loss or corruption in case of an interruption in a process.
Consistency makes sure that transactions take place in predictable, consistent ways. It ensures all data is valid according to predefined rules, maintaining integrity of the data.
Isolation guarantees that no transaction can be affected by any other transaction in the system until it is completed. This makes it possible for multiple parties to read and write from the same system at the same time without them interfering with one another.
Durability ensures that changes made to the data in a system persist once a transaction is complete, even if there is a system failure. Any changes that result from a transaction are stored permanently.
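Atomicity and consistency in particular are easy to see in a small example. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a transactional engine; it is not a lakehouse table format, and the `accounts` table and `transfer` function are hypothetical. A transfer that would break a predefined rule (no negative balances) is rolled back as a whole, including the part that had already been written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "predefined rule" that every committed state must satisfy.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move funds in one transaction; return True if it committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # Credit first, so a failed debit leaves a partial write that must be undone.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(ok, failed, balances)
```

The second transfer credits bob before the debit fails, yet bob's balance afterward reflects only the first transfer. Either the whole transaction succeeds or none of it does.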
HPE data lakehouse solutions
HPE Ezmeral Unified Analytics is the first cloud-native solution to bring Kubernetes-based Apache Spark analytics and the simplicity of unified data lakehouses using Delta Lake on-premises. The service modernizes legacy data and applications to optimize data-intensive workloads from edge to cloud to deliver the scale and elasticity required for advanced analytics. Built from the ground up to be open and hybrid, its 100% open source stack frees organizations from vendor lock-in for their data platform.
Instead of requiring all of an organization’s data to be stored in a public cloud, HPE Ezmeral Unified Analytics is optimized for on-premises and hybrid deployments and uses open source software to ensure as-needed data portability. Its flexibility and scale can accommodate large enterprise data sets, or lakehouses, so customers have the elasticity they need for advanced analytics, everywhere.
Available on the HPE GreenLake edge-to-cloud platform, this unified data experience allows teams to securely connect to data where it resides today without disrupting existing data access patterns. It includes a scale-up data lakehouse platform optimized for Apache Spark that is deployed on premises. Data scientists are able to leverage an elastic, unified analytics platform for data and applications on-premises, across the edge, and throughout public clouds, enabling them to accelerate AI and ML workflows.