How ontologies help data science make sense of disparate data
We all know the basic challenges of sharing business data: departmental silos, incompatible software, legal and regulatory hurdles, colleagues who sincerely don’t have time to make it a priority, and sadly, colleagues who passive-aggressively won’t make it a priority. On top of all that, we can be tripped up by each other’s mental models.
As a simple example of this complex issue: How do you define “close of business”? When someone promises to send feedback by close of business today, does it mean 5 p.m., 6 p.m., midnight? In whose time zone? Imagine what happens if you don’t realize until it's too late that you’re thinking 6 p.m. while your boss is thinking 5 p.m.
Multiple definitions for one term are bad enough. Worse, we also create dozens of terms that mean essentially the same thing. That problem is especially vexing in the data-swamped world of academic science. Fortunately, scientists have developed successful ways to navigate the deluge, and those methods can be of great benefit to enterprise organizations.
A FAIR choice
University labs are surprisingly independent realms of purpose-built devices and localized vocabularies, with results entered into private databases (or Excel spreadsheets or even paper notebooks). This worked fine in its jury-rigged way when scientists were merely expected to publish methods and results. But recently, both funders and journals have pushed researchers to share raw data as well.
In response, and very much in that spirit, a working group of scientists, scholarly publishers, funding agencies, and corporate representatives developed the FAIR Guiding Principles for scientific data management and stewardship. The principles promote data management as “the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”
The FAIR acronym stands for findable, accessible, interoperable, reusable. Arguably, the third principle, interoperability, is key to the other three. There are now several data repositories where researchers in specific disciplines can upload files, allowing a broader and deeper understanding of research questions.
Unfortunately, there’s such a tremendous amount of data that no human can search it. And the semantic confusion isn’t a simple one-to-one correspondence. Rather, it’s a multidimensional mess that includes spatial, temporal, and methodological differences, along with functional definitions. The obvious answer is machine learning, but it’s an enormous semantic challenge to parse hundreds of bespoke terms.
Is there an answer for scientists—and other data-challenged professionals—that’s more productive than giving up in despair? FAIR co-author Maryann Martone, a neuroscience professor at the University of California, San Diego, is a specialist in ontologies. She is optimistic, particularly so for someone looking at what might at first appear to be an intractable problem.
According to Martone, “Ontologies are a critical tool for creating and managing knowledge bases that help us communicate and verify the knowledge we think we have. If someone says X is in a motor region, how does a machine know what a motor region is? There’s an ontologic structure that says some features obligatorily appear together.”
In an ontology, individual terms are tagged to central concepts, each identified by a Uniform Resource Identifier (URI), with no weight given to the tagging. Dog, Canis lupus familiaris, and Mr. Fluffy all map to the same URI. Because ideas can overlay each other, explains Martone, “ontologies allow you to construct a reasonable theory. Previously, we couldn’t even bring all the data together because we first had to navigate all the terms.”
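To make the idea concrete, here is a minimal Python sketch of unweighted term-to-URI tagging. The `example.org` URI, the lookup table, and the helper names `resolve` and `same_concept` are all hypothetical; a real ontology publishes its own identifiers and is far richer than a dictionary.

```python
from typing import Optional

# Hypothetical URI for illustration; not taken from any published ontology.
DOG_URI = "http://example.org/ontology/dog"

# Every label points at the URI with equal standing (no weights).
LABEL_TO_URI = {
    "dog": DOG_URI,
    "canis lupus familiaris": DOG_URI,
    "mr. fluffy": DOG_URI,
}


def resolve(label: str) -> Optional[str]:
    """Return the URI for a label, ignoring case and surrounding whitespace."""
    return LABEL_TO_URI.get(label.strip().lower())


def same_concept(label_a: str, label_b: str) -> bool:
    """True only when both labels are known and resolve to the same URI."""
    uri_a, uri_b = resolve(label_a), resolve(label_b)
    return uri_a is not None and uri_a == uri_b
```

With this table, `same_concept("Dog", "Mr. Fluffy")` is true, while an unknown label such as "cat" resolves to `None` and matches nothing.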
Beyond buckets of data
Finding those co-occurrences leaves room for humans to sort the bigger questions: Yes, you can now see that X apparently relates to Y, but is that meaningful in any causal or other sense? For example, a medical researcher can search through a data repository and have a chance to understand that dozens of terms that may have once seemed unrelated are all evidence of a specific condition. Equally important, URIs can help researchers detect gaps and outliers in data. If everything else says "rose" equates to pink or light red, what to make of the study that maps it to blue?
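One rough way to surface such an outlier is to count how often each mapping appears and flag the rare ones. The sketch below assumes invented study records, made-up `color:` identifiers, and an arbitrary 20 percent threshold; it only raises a question for a human to investigate and does not decide which study is right.

```python
from collections import Counter

# Invented records: (study id, term, identifier the study mapped the term to).
RECORDS = [
    ("study-1", "rose", "color:pink"),
    ("study-2", "rose", "color:pink"),
    ("study-3", "rose", "color:light-red"),
    ("study-4", "rose", "color:pink"),
    ("study-5", "rose", "color:light-red"),
    ("study-6", "rose", "color:blue"),
]


def flag_outliers(records, term, min_share=0.2):
    """Return ids of studies whose mapping for `term` is shared by
    fewer than `min_share` of all studies that use the term."""
    mappings = [(study, uri) for study, t, uri in records if t == term]
    counts = Counter(uri for _, uri in mappings)
    total = len(mappings)
    return [study for study, uri in mappings if counts[uri] / total < min_share]


suspects = flag_outliers(RECORDS, "rose")
```

Here only the study that maps "rose" to blue falls below the threshold, so it is the one flagged for a closer look.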
“Ontologies aren’t hierarchies. They don’t force you into categories; they just put some structure around the experimental edge of science,” says Martone. “It’s just a data pattern. You can compare the patterns and analyze them—and maybe learn there’s nothing fundamentally different or maybe the distinction is important.”
URIs are possible because every field has common understandings—at least about big ideas. “The centroids of our concepts are usually pretty clear,” says Martone. “We don’t confuse a penguin with a starling. We have a shared understanding, and ontology is good at expressing that in a computable way.”
Nevertheless, it would seem the scheme might fall apart as you move away from common understanding: I get “cat” and you get “cat,” and you get “tabby” and I get “tabby,” but we can argue for years over “silver tabby” or “grey tabby.”
Ontologies can help answer those arguments. If we agree on separate URIs for silver tabby and grey tabby, we can tie our theoretical distinctions to actual data and then have an evidence-based comparison. Most important, because they link assertions to evidence that can be examined by others, ontologies tied to data are objective. “An ontology just builds up evidence for you; it won’t win your argument,” Martone says. “If you keep calling your tabbies silver, but the rest of the world says grey, the ontology plus the data merely says sorry, but the rest of the world won.”
Of course, as more data is entered into the repositories, the preponderance of evidence may shift—and ideas can be altered or disproven. For accuracy’s sake, says Martone, “you have to manage these as computational artifacts with clear change logs.” Otherwise, you can’t capture the consistency in data that at first appears to be distinct. For example, “Manhattan” and “New Amsterdam” are geographically but not temporally the same.
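A change log of this kind could be sketched as a list of labels, each with a validity period, all pointing to one place identifier. Everything below is an assumption made for illustration: the `example.org` URI, the `LabelVersion` structure, and especially the years, which are rough placeholders and not authoritative history.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifier for the place both labels refer to.
PLACE_URI = "http://example.org/place/manhattan-island"


@dataclass(frozen=True)
class LabelVersion:
    label: str
    uri: str
    valid_from: int          # first year the label applies (inclusive)
    valid_to: Optional[int]  # first year it no longer applies; None = current


# Rough, simplified years used only to show the mechanism.
CHANGE_LOG = [
    LabelVersion("New Amsterdam", PLACE_URI, 1625, 1664),
    LabelVersion("Manhattan", PLACE_URI, 1664, None),
]


def same_place(label_a: str, label_b: str) -> bool:
    """True when the two labels share at least one URI."""
    uris_a = {v.uri for v in CHANGE_LOG if v.label == label_a}
    uris_b = {v.uri for v in CHANGE_LOG if v.label == label_b}
    return bool(uris_a & uris_b)


def labels_in_year(uri: str, year: int) -> List[str]:
    """Labels recorded as valid for `uri` in the given year."""
    return [
        v.label
        for v in CHANGE_LOG
        if v.uri == uri
        and v.valid_from <= year
        and (v.valid_to is None or year < v.valid_to)
    ]
```

The two labels pass `same_place`, yet `labels_in_year` returns a different one depending on the year asked about, which is the geographic-but-not-temporal equivalence described above.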
From meta-analysis to mega-analysis
For all the positives of data repositories, sharing still can make researchers nervous. The old fear was that you might be scooped by a rival who saw more in your data than you did. The new fear is that your work might not be reproducible—calling into question whether it was valid at all. But as any scientist knows, hundreds of conditions can affect reproducibility—and that’s under controlled lab conditions. It’s even worse if you’re working with human patients, where the best you can hope for is broad similarities.
But Martone and her open-standards colleagues have begun to rethink the idea of reproducibility. It may not mean redoing one set of experiments from one lab—it could mean comparing a cluster of experiments that all shed light on each other. “The reproducibility lies in this multidimensional space; there’s correlations between variables that you didn’t know existed that start to come out of the data,” she explains.
The gold standard used to be the meta-analysis, a deeply researched paper that compared the published literature to decide what the preponderance of evidence suggested. (For example, 798 out of 800 studies concluded the good effects of chocolate outweigh the bad.)
In the new world of data sharing, researchers are beginning to consider the mega-analysis, a comparison of the pools of raw data, which show correlations among variables collected across multiple experiments. Each individual experiment may not be reproducible, but the pooled data may show robust correlations that can shed light on what is happening in individual experiments. In essence, a meta-analysis compares the tops of icebergs and a mega-analysis compares the bottoms, thus increasing the breadth and depth of our understanding.
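A toy calculation can show why pooling raw data may reveal what single experiments cannot. The numbers below are invented, and `pearson` is a hand-written helper, not a call into any statistics library. Each small experiment covers only a narrow range of the variable and shows a modest correlation, while the pooled data show a much stronger one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented raw data: each experiment samples a narrow slice of x.
EXPERIMENTS = {
    "lab_a": ([1, 2, 3], [2, 1, 3]),
    "lab_b": ([4, 5, 6], [5, 4, 6]),
    "lab_c": ([7, 8, 9], [8, 7, 9]),
}

# Correlation within each experiment taken alone.
per_experiment = {name: pearson(xs, ys) for name, (xs, ys) in EXPERIMENTS.items()}

# Mega-analysis step: pool the raw observations, then correlate once.
pooled_x = [x for xs, _ in EXPERIMENTS.values() for x in xs]
pooled_y = [y for _, ys in EXPERIMENTS.values() for y in ys]
pooled = pearson(pooled_x, pooled_y)
```

In this toy data each lab on its own yields a correlation of 0.5, while the pooled value is 0.95. Pooling is not automatically safe: if labs differ systematically in method or population, the combined data can also create or hide a correlation, which is one reason shared terms and identifiers matter before the data are merged.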
“You have to ask what serves the science best—I’ve seen nothing that says open data isn’t the future. AI and reproducible science need it,” concludes Martone.
Ontologies: Lessons for leaders
- Terms can be treacherous. Your idea of what X means may differ in subtle but important ways from your colleague’s idea. Worse, definitions for the same term can change over time or differ between sectors.
- To solve that problem, create ontologies when you engage in data sharing and map all equivalent terms to a shared Uniform Resource Identifier.
- Manage ontologies with clear change logs. When definitions change over time, the record ensures all terms still map correctly. For example, New Amsterdam = New York.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.