Data scientists take the mystery out of data fabrics
It's well understood that data is among an enterprise's most valuable assets. Yet, making good use of data—which requires storing, managing, moving, and accessing it—is often harder than it should be. That's in part because enterprise data is typically scattered across many cloud systems, data centers, and data clusters.
This is where a data fabric comes in. A data fabric enables data that is dispersed across various locations and used by different applications to be accessed and analyzed in real time within a unifying data layer, under the same management and security.
The term data fabric is not self-explanatory. While it's a metaphor for how data is accessed and consumed, it's also a very real resource. To get insight into what exactly a data fabric is, we spoke with two experts: Ted Dunning, chief technologist for data fabric at Hewlett Packard Enterprise, and Ellen Friedman, principal technologist at Hewlett Packard Enterprise.
Both Dunning and Friedman worked at MapR Technologies before HPE acquired the company in 2019. Friedman has extensive experience in the field of large-scale data. Dunning, who was chief technology officer at MapR Technologies, has built large-scale data and machine learning systems for a number of companies.
I think there is a lot of misunderstanding out there. Why don't you start by defining what exactly a data fabric is?
Ted Dunning: There are roughly three definitions in the industry. I'll start with the wrong ones first. The first definition is that a data fabric is just a metadata management system: a system that virtualizes and hides other data sources. Several things handicap a system like this: it has to discover metadata, it has to bridge the inconsistencies between the underlying storage platforms, and it's almost always dead slow. Because of those limitations, that's not what we mean when we say data fabric.
A second meaning, which is much closer, describes the accessibility of file-level data from any machine in your data center. That's pretty close, but it's not a comprehensive definition of data fabric.
We believe in a third definition: that a data fabric should go further. A data fabric applies that accessibility across multiple clusters that serve data, including large, small, remote, or local clusters. They are all mutually addressable via the same path names. In our vision of a data fabric, you can also move computation to those places if that's appropriate, and it often is. Or, if it's more suitable for the data fabric to move data toward a local point, it can be stored and processed locally.
Often, this means that you have data sources in many places across the enterprise: you need to gather data at the edge, bring it together so you can learn globally, and then send results back out to act locally.
This third definition of data fabric gives you planetary scale.
In what way is this kind of data fabric genuinely innovative? Not just in the sense of what it is, but what capabilities it enables for enterprises.
Ellen Friedman: People use the word innovative loosely for everything. But a data fabric is genuinely innovative: it's an excellent technology built around a very new idea.
Those who designed and engineered our data fabric knew what they wanted it to be able to do. They had a particular goal in mind, but part of what makes an innovation work is not just whether those who build it have vision but also whether the people who will use it have vision. Because it's such a new idea, part of the point of the term data fabric is to help people grasp the concept and to help users understand how to take full advantage of it.
A data fabric makes the world work differently from the way users previously thought it worked. To take full advantage of that, enterprises need to perceive what they're doing in a different way.
The fundamentally different thing is that the data fabric works across the whole range of what people need to do with data—what individuals need to do as developers, analysts, data scientists, and IT team members. The data fabric also addresses the needs of the entire organization collectively.
Dunning: Yes, we also separate app developers' needs and DevOps needs from the IT infrastructure support teams. That provides very concise ways for DevOps to express an infrastructural reality without getting into all the details underneath.
The key management tool for this separation of concerns is a data fabric volume. This volume is a construct that is similar to a directory but has special management capabilities. We may have some data that needs to be accessed from SSDs and some data that needs to be in San Diego initially but mirrored to Tokyo. Simultaneously, other data may need superhigh performance for three days and then be managed more economically. With a data fabric, each one of those different requirements is met by using different volumes. That's an easy thing to express for developers and application operators.
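To make the volume idea concrete, here is a hedged sketch of how Dunning's examples might be expressed with MapR-style volume commands. The volume names, topology label, and cluster name below are illustrative assumptions, not taken from the interview, and exact `maprcli` flags vary by data fabric release; treat this as a sketch rather than working syntax for any particular installation.

```shell
# Illustrative sketch only: names and topology labels are assumptions,
# and flags vary by release -- consult your release's maprcli reference.

# A volume whose data should live on SSD-backed nodes, expressed by
# assigning the volume to an SSD topology rather than by application code.
maprcli volume create -name hot-analytics -path /hot-analytics \
    -topology /data/ssd

# Data that starts in San Diego but is mirrored to Tokyo: create the
# source volume on one cluster, then a mirror volume on the other that
# names the source as volume@cluster.
maprcli volume create -name telemetry -path /telemetry
maprcli volume create -name telemetry-mirror -path /mirrors/telemetry \
    -type mirror -source telemetry@sandiego
```

The point of the sketch is the separation of concerns: the placement and mirroring policies are attached to the volume, so developers and application operators only ever see the path names.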
The administrators just see this as a sea of infrastructural capability. They see their separate data fabric volumes and configure them without caring what's in them. They have a straightforward interface through which they can reason about the infrastructure.
That means if the infrastructure team changes or upgrades the hardware or moves a storage cluster to another area, operations teams or developers won't care. They can just consume the data they need.
Friedman: Part of what you're referring to is based on the data fabric's capability for self-healing. If a disk or machine goes down, the overall system is absolutely reliable and available. You don't lose data, and you don't get any interruption.
A system administrator gets a signal and realizes that one or more machines are down, which means at some point they'll want to take action to fix that. But as in the example you're giving, the data fabric covers for the failure. The systems that depend on the data fabric keep running, and users don't realize, and don't need to realize, that they are now accessing data on different machines. That applies even when you are adding or changing hardware.
What about different types of data in various formats? How do enterprises consume all of the different kinds of data they have?
Dunning: That's mostly an application concern, so the data fabric deals with it any way you like. Your application software is what has to think about data formats. We provide enough richness in the actual data access APIs so you can do that. For instance, consider tar, which is a tape archiving format on Unix, or HDF5 [Hierarchical Data Format version 5], which is a scientific archive format. Both formats can be used on the data fabric. Nobody needed to put specific tar or HDF5 support into the data fabric: it just works. We put in capabilities sufficient for applications that need access to those data types to get it.
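Dunning's point, that format support lives in the application rather than in the fabric, can be sketched in a few lines of Python. Because the fabric presents a POSIX filesystem interface, the standard library's `tarfile` module works unchanged; in this sketch a temporary directory stands in for a fabric mount point (a path such as `/mapr/<cluster>` is an assumption, not something from the interview):

```python
import tarfile
import tempfile
from pathlib import Path

# Stand-in for a data fabric mount point; to the application it is
# simply a POSIX directory, so no fabric-specific code is needed.
fabric = Path(tempfile.mkdtemp())

# Write a file and pack it into a tar archive using only the stdlib.
# The fabric has no tar-specific support -- and needs none.
member = fabric / "app.log"
member.write_text("started\n")
archive = fabric / "logs.tar"
with tarfile.open(archive, "w") as tar:
    tar.add(member, arcname="app.log")

# Any other application can read the archive back through the same path.
with tarfile.open(archive) as tar:
    names = tar.getnames()
print(names)  # ['app.log']
```

The same pattern applies to HDF5 or any other format: the application brings its own format library, and the fabric just serves the bytes.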
Friedman: To be this unifying data layer, the data fabric must not only deal with many different data formats but also allow access via multiple open APIs. Applications written with different tools, such as legacy applications, machine learning tools, applications that use POSIX file access, or even things like Apache Spark that use HDFS, can all access data in the data fabric directly.
It doesn't have to be copied to a special-purpose system. And so, having this multi-API access that all goes back to the same data fabric is actually a massive innovation. You don't have to have a separate system for each kind of API.
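Friedman's multi-API point can be illustrated with a small sketch. Assuming the fabric is mounted at an ordinary POSIX path (a temporary directory stands in for the mount here, which is an assumption for the sake of the example), a legacy-style program can write data with plain file I/O and an analytics tool can read the very same file through the same path, with no copy into a separate special-purpose system:

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for a data fabric mount point shared by every application.
fabric = Path(tempfile.mkdtemp())

# A legacy-style application writes a CSV using ordinary POSIX file I/O.
sales_file = fabric / "sales.csv"
with sales_file.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["region", "revenue"], ["west", "1200"], ["east", "900"]])

# An analytics tool reads the very same file through the very same path.
# A Spark job could likewise be pointed at this location via an
# HDFS-compatible URI -- still without copying the data anywhere.
with sales_file.open() as f:
    rows = list(csv.reader(f))

total = sum(int(r[1]) for r in rows[1:])
print(total)  # 2100
```

The design choice this illustrates is that one copy of the data serves every API, so teams never have to maintain per-API duplicates.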
Why would a CIO care about what we're talking about?
Friedman: One of the things we talk about is that you need a system that handles not only current scale (and these can be very large systems) but also future growth: a system that scales not just with increasing data size but with the increasing complexity of the applications built on it, and does so without having to be rearchitected.
Dunning: Companies are experiencing a lot of success with data fabrics. For instance, major auto manufacturers are developing autonomous vehicles faster and more efficiently through their ability to manage, store, and move global test data.
The manufacturer can collect data from their test vehicles and easily synchronize it to their central infrastructure. They can then share data across the company's many development sites so data scientists and developers can collaborate.
Friedman: There are also several very large retailers that have been using data fabrics for years. Two important ways these retailers leverage the data fabric are in meeting their data demands and in improving collaboration.
Retailers have regular seasonal patterns in which business increases tremendously during the holiday season. To deal with that, they typically have to set up a special war room and bring in extra IT resources to handle the massive holiday traffic, which puts an additional burden on the system. But the parts of their businesses running on the data fabric didn't require special resources for holiday traffic: the data fabric handled the spikes in data usage with only a few system administrators.
Retailers have also been able to improve collaboration because of data sharing among analytics teams, business teams, and people doing machine learning.
Dunning: There's an excellent story about this. In their lunchroom, a couple of people from different teams were having lunch together. When they sat down, one of them said, "Darn, we could make this price match feature if only we had this comprehensive web crawl. But there's no way that we could do that crawl within our budget." Someone at the table from another team replied, "But we've already done that!"
Since they were on the same data fabric, they could prototype that new capability that afternoon. That single feature was ultimately evaluated to be responsible for a billion, with a B, dollars in marginal revenue.
Lessons for leaders
- Making data easily accessible where needed creates opportunities for cost savings and revenue increases.
- Modern enterprises are complex and need systems that unify data infrastructure and foster sharing.
- A data fabric makes applications resilient against changes and errors in data sources.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.