AI is the key to unlock insights from unstructured data
As organizations, communities, businesses, and products grow ever more intelligent, the number of data-generating endpoints grows massively. The more we make sense of that data, the more its potential crystallizes: data can have a tangible impact on the way we work and live our lives. Smart cars are a great example: the more data that is stored and analyzed, the more we're able to iterate on new safety features. The ability to glean actionable insights and derive new paths forward, however, rests on our ability to make sense of that data in highly specific and specialized ways. It has to be organized and structured.
As data-generating capabilities proliferate across industries ranging from healthcare to consumer products, much of that data is increasingly unstructured. Consider massive data sources such as conversations held across internal messaging platforms or even the human genome. These examples don't necessarily fit into analytics models as we currently understand them, but if their potential can be harnessed, we can redefine what it means to truly be able to access the right information at the right time and use it to achieve better outcomes.
Unstructured and semi-structured data represent a new multibillion-dollar opportunity, one that will create value by delivering new levels of access, services, and insights. The seeds have already been planted across well-known brands we encounter every day: Netflix offers services to stream video content; Uber and Lyft offer ride-share services by combining maps with driver-availability and traffic information; and Meta and Twitter offer social interaction platforms for sharing images and ideas. Within all these organizations, the deployment of artificial intelligence across unstructured datasets plays a critical role in operations. Recommendation engines, fake-news detection tools, and dynamic pricing models are all born from insights gathered through the analysis of unstructured data.
This is only the beginning. The National Geospatial-Intelligence Agency currently compiles 2 exabytes of geospatial data, including satellite scans and elevation maps, every day, and by 2025, sequenced human genomes alone will be a 40-exabyte dataset. Both of these examples represent hugely valuable data stores that we have only just begun to tap into.
However, important hurdles must be overcome before we can realize the promise of unstructured data. Do we understand the complexities of these new modalities? Are we ready to handle the size and complexity of these datasets? How should we prepare to achieve faster, better, and more useful insights from such data in the future? More important, how can we better process unstructured data, using automation and AI algorithms, to deliver timely services by routing the right information at the right time to the right stakeholders?
Understanding unstructured data
In its raw form, unstructured data is decidedly unwieldy. For example, while a single credit card transaction generates a few bytes of data, a single sample of a human genome from a sequencer is nearly 200 GB in size (10^9 times bigger). Further complicating the situation is the shape of unstructured data when compared with conventional data. Unstructured data could be stored as sequences, point clouds, images, irregular meshes, and so on, and in any number of different shapes, including multi-resolution, multi-channel, non-tabular, and sparse.
Because of these factors, unstructured data is inherently harder to analyze. Previously established methods and techniques are often not applicable. For example, AI algorithms trained to automatically detect a car or a human from a 2D video captured as pixels will not work with 3D video data, which is captured as a 3D point cloud—even though both videos are digital representations of the same real-world environment. This is true across different use cases, like creating real-time metaverse environments or using genomes for personalized precision medicine.
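To make the pixel-versus-point-cloud mismatch concrete, here is a minimal Python sketch. The type aliases, function names, and sample values are all hypothetical and chosen only for illustration; real video and lidar pipelines are far richer. It shows that logic written for a dense grid depends on row and column positions that an unordered list of 3D points simply does not have.

```python
from typing import List, Tuple

# Assumption: a 2D video frame is a dense grid of grayscale intensities (H x W).
Frame2D = List[List[int]]

# Assumption: a 3D scan is an unordered list of (x, y, z) points with no grid.
PointCloud = List[Tuple[float, float, float]]


def brightest_pixel(frame: Frame2D) -> Tuple[int, int]:
    """Return (row, col) of the brightest pixel; relies on grid indexing."""
    best = (0, 0)
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            if value > frame[best[0]][best[1]]:
                best = (r, c)
    return best


def highest_point(cloud: PointCloud) -> Tuple[float, float, float]:
    """Return the point with the largest z; there are no rows or columns."""
    return max(cloud, key=lambda p: p[2])


# Hypothetical sample data for the two representations.
frame = [[0, 10, 0],
         [0, 0, 255],
         [5, 0, 0]]
cloud = [(0.0, 0.0, 1.5), (2.0, 1.0, 0.2), (1.0, 3.0, 4.0)]

print(brightest_pixel(frame))  # position in the grid
print(highest_point(cloud))    # a coordinate in space, not a grid position
```

Passing the point cloud to `brightest_pixel` would not necessarily raise an error, but the answer would be meaningless because the function would treat coordinates as intensities. Silent mismatches of this kind are one reason models built for one modality cannot simply be reused on another.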
Currently, there are no standard schemas or query languages for exploring and sifting through unstructured data structures, as there are for structured SQL databases and NoSQL key-value stores. The mathematics and statistics for analyzing shapes of unstructured data are complex and remain in their early stages.
Today's state-of-the-art practice is to store data in a data lake and use that to conduct search and exploratory analysis. Data scientists translate this information into a structured representation of unstructured data that can then be analyzed using conventional analysis and machine learning algorithms. Unfortunately, workflows involving unstructured data are often complex and computationally expensive, and they rarely offer an impressive return on investment when compared with conventional data analysis.
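As a rough illustration of what translating unstructured data into a structured representation can look like, the sketch below turns free-text messages into fixed-width rows of word counts (a simple bag-of-words table). This is one basic technique among many, not a description of any particular data lake product; the function names and sample messages are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def to_feature_rows(messages: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Turn free-text messages into fixed-width rows of term counts."""
    counts = [Counter(tokenize(m)) for m in messages]
    # The sorted vocabulary acts as the column header of the resulting table.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical messages standing in for an internal chat export.
messages = [
    "Server down again",
    "Is the server back up?",
    "Lunch at noon?",
]

vocabulary, rows = to_feature_rows(messages)
print(vocabulary)
for row in rows:
    print(row)
```

Every message becomes a row with the same columns, so conventional tools such as regression, clustering, or SQL-style aggregation can operate on it. The cost is visible even in this toy: word order and context are discarded, and the table widens with every new term.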
Preparing for AI on unstructured data
AI is going to play a vital role in using unstructured data to solve business problems and identify new opportunities. To bridge the gap between unstructured data's innate challenges and AI's current maturity, some adaptability will be needed across system architectures, storage and analytic services, programming models, and user experience.
For example, technologies will be required to do a better job of handling different shapes of data at different scales to produce analytic results at the needed speed and accuracy. This includes developing highly specialized offerings designed with performance and scalability at the forefront. Further, the algorithms themselves will require a new perspective that marries emerging techniques with federated learning processes. In short, we need a platform that approaches search in a way that accounts for the what is, what if, what else, and what could be.
The ideal solution for handling unstructured data at the necessary scale, size, and complexity will need several key features. Specifically, it:
- Is able to host and serve data in different shapes.
- Allows pattern search on the hosted data with AI models.
- Supports a query language to orchestrate database retrieval (exact search), pattern search using machine learning (approximate search), and user-defined functions (domain-specific search).
- Offers easy-to-use programming interfaces for database operations.
- Runs on emerging server architectures (shared-memory, distributed-memory, or fabric-attached-memory technologies).
- Embeds high-performance computing constructs that can scale out to handle ever-increasing data sizes and scale up to reduce time to insight.
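The three search modes named in the list above (exact, approximate, and domain-specific) can be sketched in a few lines of Python. This is a toy under stated assumptions, not a proposed design: `Record`, `ToyStore`, and their methods are hypothetical names, the embeddings are made-up vectors that a real system would obtain from an AI model, and brute-force cosine ranking stands in for the indexed approximate search a production system would use.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Record:
    key: str
    embedding: Sequence[float]  # assumed to be produced by some AI model
    metadata: Dict[str, object] = field(default_factory=dict)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyStore:
    """Hypothetical in-memory store exposing three kinds of search."""

    def __init__(self, records: List[Record]) -> None:
        self._by_key = {r.key: r for r in records}

    def exact(self, key: str) -> Optional[Record]:
        """Database-style retrieval by key."""
        return self._by_key.get(key)

    def approximate(self, query: Sequence[float], k: int = 2) -> List[Record]:
        """Pattern search: rank every record by similarity to the query."""
        ranked = sorted(self._by_key.values(),
                        key=lambda r: cosine(query, r.embedding),
                        reverse=True)
        return ranked[:k]

    def where(self, predicate: Callable[[Record], bool]) -> List[Record]:
        """Domain-specific search via a user-defined function."""
        return [r for r in self._by_key.values() if predicate(r)]


# Hypothetical records; keys, vectors, and metadata are illustrative only.
store = ToyStore([
    Record("scan-001", [1.0, 0.0], {"modality": "point_cloud"}),
    Record("scan-002", [0.9, 0.1], {"modality": "image"}),
    Record("scan-003", [0.0, 1.0], {"modality": "image"}),
])

hit = store.exact("scan-002")
similar = store.approximate([1.0, 0.0], k=2)
images = store.where(lambda r: r.metadata["modality"] == "image")
print(hit.key, [r.key for r in similar], [r.key for r in images])
```

A query language of the kind described would let a user compose these three calls in a single statement and leave it to the platform to decide how to execute them at scale.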
Among organizations that have begun experimenting with unstructured data, the early benchmarks have been positive. The specificity with which they can understand customers, systems, and the organization at large indicates tremendous potential. To date, however, full-scale adoption of high-performance systems has not been achieved. As we move toward greater integration of AI and these types of unstructured datasets, it is essential to rethink traditional modalities and interfaces. For the enterprise to succeed in getting value out of this data, there's no time to spare in making that happen.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.