XtremeData is a scalable, distributed processing solution that makes it economical and simple to continuously profile millions of full and large datasets. XtremeData takes an ab initio approach that lets the data speak for itself, without bias or ambiguity: it evaluates every data value in a dataset to provide comprehensive metadata insights covering structure, content, and relationships.
- HPE Ezmeral Runtime Enterprise
- HPE Ezmeral Data Fabric
Data lakes are growing exponentially in scale and cost. Metadata insights from data lakes are vital to increasing efficiency across data engineering, data science, quality, privacy, and more. Conventional sampling-based data profiling is not viable for large datasets. XtremeData is a deep data profiling solution for full and large datasets. XtremeData helps you rapidly organize, monitor, and extract value from millions of datasets in your exabyte-scale data lakes for quality, privacy, anomaly detection, drift detection, and more. XtremeData unlocks the metadata insights hidden in your datasets to help you accelerate and automate your AI, analytics, and data governance journeys.
XtremeData comprises three modules: discovery, analysis, and remediation. Discovery generates metadata from full and large datasets, including schema inferencing. Analysis provides a rich, SQL-API-enabled toolkit for developing insights, detection, and alerting. Remediation corrects data quality issues in source data using the metadata in dbX, or exports the metadata to other SQL databases for remediation there.
XtremeData is blazingly fast, requires no development or optimization, and generates metadata directly from datasets in open formats (CSV, Text, Parquet, ORC, and Avro) and from databases (Oracle, SQL Server, Teradata, Snowflake, and others) via ETL, CDC, and data virtualization tools. It requires over 50x less compute than custom queries in Spark and SQL solutions, enabling full-fidelity metadata for millions of datasets at rest and in motion in exabyte-scale data lakes.
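To illustrate what full-fidelity, no-sampling profiling entails, here is a minimal Python sketch. It is a conceptual illustration only, not XtremeData's implementation or API: every value in every column is examined, and per-column metadata (null count, distinct count, min/max, a crude inferred type) is collected.

```python
import csv
import io

def profile_rows(rows, columns):
    """Full-scan profile: examine every value (no sampling) and collect
    per-column metadata: null count, distinct count, min/max, inferred type."""
    stats = {c: {"nulls": 0, "distinct": set(), "min": None, "max": None,
                 "all_numeric": True} for c in columns}
    for row in rows:
        for col, val in zip(columns, row):
            s = stats[col]
            if val is None or val == "":
                s["nulls"] += 1
                continue
            s["distinct"].add(val)
            try:
                float(val)
            except ValueError:
                s["all_numeric"] = False
            # Values compared as strings for simplicity in this sketch.
            if s["min"] is None or val < s["min"]:
                s["min"] = val
            if s["max"] is None or val > s["max"]:
                s["max"] = val
    return {c: {"nulls": s["nulls"],
                "distinct": len(s["distinct"]),
                "min": s["min"], "max": s["max"],
                "inferred_type": "numeric" if s["all_numeric"] else "text"}
            for c, s in stats.items()}

# Example: profile a small in-memory CSV (hypothetical data).
data = io.StringIO("id,city\n1,Austin\n2,Boston\n3,\n")
reader = csv.reader(data)
columns = next(reader)
report = profile_rows(reader, columns)
```

Because this style of profiling streams over the data once and keeps only aggregates, it can run in a distributed fashion over many datasets at once, which is the property that makes full-dataset (rather than sampled) profiling tractable.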
- Eliminates need to write complex custom SQL for each dataset
- Reduces need to scan/query source data
- Provides comprehensive metadata insights for engineers, scientists, analysts and more
- Discovers metadata from the full dataset (no sampling)
- Discovers schema and generates DDL directly from a dataset's files
- Describes the exact schema of each file in a dataset
- Reads CSV, Text, Parquet, ORC & Avro formats directly
- Data lakes, object stores, HDFS, files and more
- No schema or table creation; no copying or persistence of source data
- Metadata can be exported in JSON, CSV & Text
- 230+ simple APIs for metadata analysis
- Monitor data lakes for data quality, privacy and more
- Monitor ETL and ELT pipelines to detect anomalies before data is ingested into data warehouses
- Monitor AI/ML pipelines for data and schema drift
- Reduce data duplication and associated processing in data platform
- Generate high-quality test data
- Automate external table creation
- Accelerate data conversions & migrations
- Consolidate data across multiple databases
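The schema-discovery and DDL-generation capability listed above can be sketched conceptually in a few lines of Python. This is a hypothetical illustration under simple assumptions (CSV input, three candidate SQL types), not XtremeData's inference algorithm:

```python
import csv
import io

def _is_int(v):
    try:
        int(v)
        return True
    except ValueError:
        return False

def _is_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def infer_sql_type(values):
    """Infer a SQL type by examining every non-empty value in a column."""
    non_empty = [v for v in values if v != ""]
    if all(_is_int(v) for v in non_empty):
        return "BIGINT"
    if all(_is_float(v) for v in non_empty):
        return "DOUBLE PRECISION"
    return "VARCHAR(%d)" % max((len(v) for v in non_empty), default=1)

def generate_ddl(table, csv_text):
    """Read a CSV and emit a CREATE TABLE statement inferred from its data."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = list(zip(*reader))  # transpose rows into columns
    types = [infer_sql_type(list(col)) for col in cols]
    body = ",\n  ".join(f"{name} {t}" for name, t in zip(header, types))
    return f"CREATE TABLE {table} (\n  {body}\n);"

# Hypothetical example: infer DDL for a small trips dataset.
ddl = generate_ddl("trips", "id,fare\n1,12.50\n2,8.00\n")
```

Generated DDL of this kind is what makes it possible to automate external table creation and accelerate data conversions and migrations: the schema is derived from the data itself rather than hand-written for each dataset.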