A data fabric enables a comprehensive data strategy
For the modern enterprise, data is both its core asset and a major challenge. The whole body of company data represents the sum of the company's knowledge of its products, its assets, its customers, and its people.
And yet, that data likely exists in a wide variety of formats, managed by a wide variety of applications, and resides all over the world. Gaining access to the data an employee needs can be difficult and fraught with the risks that come with redundant copies.
There may well be no person or group with a full picture of the company's data, and achieving one may be impossible without advanced software infrastructure. That's where a data fabric comes in. With a data fabric, an enterprise can make the full assortment of company data available to everyone who has rights to it, in the form they need it, no matter where the data resides.
The challenge of the unmanaged data ecosystem
Where is your enterprise's data stored? Chances are that some of it is in a cloud, or multiple clouds, managed by different cloud systems. Some of it is in SQL databases on corporate servers. Some of it is in Microsoft Office documents. Some of it is in text files.
There may be several copies of some of the data. Administrators may consider it too risky to allow direct access to the main store of a critical database, so they give the user a snapshot, or a subset, of it instead. The employee may prefer this anyway: they may want to work with the data in a different format, or the software they want to use may not be able to work with the main database directly, at least not efficiently.
In the end, the user gets their snapshot, but by then the data may no longer be current. Any results they derive from it may no longer reflect the actual state of the data.
The normal data mess is not acceptable
If you've been working with computers long enough, this situation may just seem normal to you, maybe even inevitable. But it's not. It is the result of a lack of a comprehensive data strategy that supports a truly multi-tenant system.
Certainly, it is not optimal when users cannot access data with the applications they prefer, cannot reach data on the company network when they need it and have the rights to it, or must work with partial datasets of uncertain accuracy.
Nor is it optimal, or necessary, for data requests to burden IT, making it a chokepoint for people just trying to get their work done, or for all of these inefficiencies to impose high costs on new projects, on innovation, and on the ability to pivot as situations change.
All of these problems, stemming from the lack of a managed, consistent data strategy, threaten companies by endangering the service-level agreements they have with customers and partners. When legitimate but resource-intensive workloads such as machine learning and large analytical queries are running, the afflicted enterprise cannot ensure that scheduled jobs will start and complete on time, as promised in the SLA.
It's hard for even the most skilled IT professionals to anticipate all the ways users will want to access data. This is why so much shadow IT persists: because the facilities provided by the company don't meet users' needs. How much better would it be if the company really could provide access to the data users need, using the software they prefer?
What does a comprehensive data strategy look like?
A comprehensive data strategy, in contrast, makes it practical and affordable to run a multipurpose system that takes full advantage of the value of data, bringing useful applications (projects) into production in a timely manner. Analysts, developers, and data scientists are able to work with a comprehensive and consistent collection of data and to add new data sources without either breaking the bank or overwhelming IT.
This comprehensive approach makes it possible to optimize resource use by avoiding unnecessary duplication of hardware and system administration, as well as by simplifying how people architect solutions.
To do all this, a data fabric must have certain important capabilities:
- A global namespace: All data must be available through a single, consistent global namespace, whether it resides on premises, in a public or private cloud, or spread across the network edge.
- Multiple protocols and data formats: It must support a broad variety of protocols, data formats, and open APIs, including HDFS, POSIX, NFS, S3, REST, JSON, HBase, and Kafka, so the same data can be reached from different tools (see the sketch after this list).
- Automatic policy-based optimization of storage and access: The data fabric must let the enterprise specify when data is kept in hot, warm, or cold storage tiers, or in a cloud or on premises, among other important storage policies.
- Rapidly scalable distributed data store: Enterprise data needs can grow quickly and unpredictably; the data fabric needs to accommodate that growth, not obstruct it.
- Multi-tenancy and security: The data fabric must have a security scheme that implements authentication, authorization, and access control in a consistent manner, no matter where the data is or what type of system it runs on.
- Resiliency at scale: Even under heavy usage, it must provide instant snapshots, and all applications must share the same consistent view of the data at the moment a snapshot is taken.
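To make the first two capabilities concrete, here is a minimal sketch in Python. It assumes, purely for illustration, a fabric that exposes the same logical volume both as a POSIX mount (at /mnt/fabric, an assumed path) and through an S3-compatible endpoint (https://fabric.example.com, also assumed, with placeholder credentials); neither the paths nor the endpoint belong to any particular product. The point is that the same bytes are reachable through either protocol, so nothing has to be exported or copied before another tool can use it.

```python
# Hypothetical example: one dataset, two protocols, one namespace.
# Mount path, endpoint, bucket, and credentials below are all assumed.
from pathlib import Path

import boto3  # standard AWS SDK; works against any S3-compatible endpoint

# 1) A batch job writes a report through the POSIX interface.
report = Path("/mnt/fabric/sales/2023-q4-report.csv")   # assumed mount path
report.parent.mkdir(parents=True, exist_ok=True)
report.write_text("region,revenue\nEMEA,1200000\nAPAC,950000\n")

# 2) An analytics tool reads the very same object over S3,
#    without the data ever being copied out of the fabric.
s3 = boto3.client(
    "s3",
    endpoint_url="https://fabric.example.com",  # assumed fabric endpoint
    aws_access_key_id="FABRIC_KEY",             # placeholder credentials
    aws_secret_access_key="FABRIC_SECRET",
)
obj = s3.get_object(Bucket="sales", Key="2023-q4-report.csv")
print(obj["Body"].read().decode())
```

Because both interfaces see the same data, there is no exported snapshot to drift out of date.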
The security aspect is worth repeating and emphasizing: A data fabric needs to present a consistent security framework for all data, across the enterprise. It should have cluster-level permissions and full Boolean expressions for defining access control (a simple illustration follows). Siloed applications don't have this luxury.
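As an illustration of what Boolean access expressions buy you, the sketch below stands in for a rule such as "(group:analytics & !user:mallory) | user:cfo" with a plain Python predicate over the requesting user and their groups. The rule, the names, and the evaluation are all made up for clarity; real fabrics express such rules declaratively rather than in application code.

```python
# Hypothetical illustration of a Boolean access-control expression:
#   (group:analytics & !user:mallory) | user:cfo
# Names and the rule itself are invented for this example.
def can_read(user: str, groups: set[str]) -> bool:
    return ("analytics" in groups and user != "mallory") or user == "cfo"

print(can_read("alice", {"analytics"}))    # True: analyst, not excluded
print(can_read("mallory", {"analytics"}))  # False: explicitly denied
print(can_read("cfo", set()))              # True: allowed by name
```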
What a data fabric won't do
Relational database management systems are very good at certain tasks, and they should be used for those tasks. It would be as foolish to shoehorn RDBMS work into the fabric as it is to force applications into an RDBMS that is poorly suited to them, an all-too-common practice.
In fact, the RDBMS will become an increasingly specialized tool rather than the general-purpose one it is often treated as. A data fabric with access to a variety of formats and protocols allows developers to choose the best solution for the problem rather than the default one.
Some operations, such as ETL (extract, transform, load), will not go away, but the need for them will decrease. ETL copies data from one system into another that represents the data differently (a minimal sketch follows). This will still make sense at times, particularly when dealing with an RDBMS or other specialized systems that are not part of the fabric. But in the absence of a data fabric, ETL is often necessary before data can be accessed at all.
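To ground the term, here is a minimal ETL sketch in Python. It assumes a CSV export (orders_export.csv) as the source and a SQLite table as the destination; file, column, and table names are illustrative only. The point is that the data is copied and reshaped into another system's representation, which is exactly the step a fabric can often make unnecessary.

```python
# Minimal ETL sketch: extract from a CSV export, transform, load into a
# relational table. All names here are illustrative assumptions.
import csv
import sqlite3

# Extract: read rows from a flat-file export of the source system.
with open("orders_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize types and reshape to the target schema.
records = [
    (r["order_id"], r["customer"].strip().lower(), float(r["amount_usd"]))
    for r in rows
]

# Load: write the reshaped copy into the destination database.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
conn.commit()
conn.close()
```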
Data fabrics: Lessons for leaders
- There is a difference between data and storage. The strategies for each are different and diverging.
- IT professionals can't know all the ways users will want to work with data.
- A data strategy should allow users to work with a master dataset without fear of endangering it.
A data fabric manages, transfers, and secures data across multiple remote and incompatible deployments and so is a critical component of a multicloud data strategy.