How supercomputers are identifying COVID-19 therapeutics
Christopher Rickett, Kristi Maschhoff, and Sreenivas Sukumar were investigating potential therapies for COVID-19 when an unusual data point presented itself: Individuals exposed to COVID-19 who had been previously vaccinated for tetanus displayed fewer and less severe symptoms. One recent study of pregnant women found that 88 percent of those who tested positive for the virus were asymptomatic, a rate approximately double that of the general population. Was it possible that the TDaP vaccine, which is commonly administered to pregnant women, was offering an unexpected—and unintuitive—level of immunity? The article detailing the research and proposing the theory has been accepted for publication in the journal Medical Hypotheses.
What's unusual about these findings, aside from their sheer novelty, is that Rickett and Maschhoff are not medical researchers. They're engineers for Cray, the supercomputing arm of Hewlett Packard Enterprise. Prior to COVID-19, they had no experience with medical research, but earlier this year, they theorized that the powerful, massively parallel-processing graph database of Cray's supercomputers could be leveraged to investigate therapies for the emerging COVID-19 pandemic—and on a far more efficient scale than had ever been done before.
"We didn't have much information, but we wondered how we could make reasonable sense out of all the data regarding something that's so new," says Rickett. "The idea that we came up with was creating the ability to do a protein sequence analysis, a comparison for similarities between one protein sequence [known as the COVID-19 Spike protein] and all the rest of them in the known universe. If we could find a way to map that information back to something that medicine already knew more about, we could then look for compounds that are more useful as treatments because they target a similar protein."
A challenge of epic scale
That challenge is immense because the scale of data involved in COVID-19 research is so vast. This includes millions of known proteins to model against the critical COVID spike, 30 terabytes of accumulated medical data to process, and more than 150 billion facts of medical knowledge available for analysis. For a human researcher, digesting even a small fraction of this information is impossible, even if armed with a state-of-the-art computer. Attempting to model protein structures and drug interactions with a single target molecule is a process that can take months. But the Cray Graph Engine (CGE) approach slices through the job by leveraging hundreds or thousands of CPU cores simultaneously to evaluate millions of molecules in a matter of seconds, potentially changing the game for COVID-19 therapeutic research and paving the way for a breakthrough.
"These evaluations typically take a long time," says Maschhoff, "but this is a problem that we knew needed an immediate solution."
Conceptually, the CGE works by building a database of data points called triples, each of which is simply a collection of three facts in the form of subject, verb, object. For example: COVID-19, causes, fever. These triples are drawn from nine gargantuan and evolving medical datasets that combine to total more than 155 billion data points. This is a job that would be impossible to even fathom in a conventional computing environment, but the extreme power of the CGE means all of this data can be loaded into its memory banks and ready for analysis in less than an hour.
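To make the idea of a triple concrete, here is a minimal Python sketch of an in-memory triple store. It is a toy illustration only, not how the CGE is implemented: the class and method names are hypothetical, and the facts beyond the article's own "COVID-19, causes, fever" example are generic placeholders.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    verb: str
    obj: str


class TripleStore:
    """Toy in-memory store (hypothetical), indexed by subject and by object."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, subject, verb, obj):
        triple = Triple(subject, verb, obj)
        self.triples.add(triple)
        self.by_subject[subject].add(triple)
        self.by_object[obj].add(triple)

    def match(self, subject=None, verb=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        if subject is not None:
            candidates = self.by_subject.get(subject, set())
        elif obj is not None:
            candidates = self.by_object.get(obj, set())
        else:
            candidates = self.triples
        return sorted(
            t for t in candidates
            if (subject is None or t.subject == subject)
            and (verb is None or t.verb == verb)
            and (obj is None or t.obj == obj)
        )


store = TripleStore()
store.add("COVID-19", "causes", "fever")   # the article's example triple
store.add("COVID-19", "causes", "cough")
store.add("influenza", "causes", "fever")

# Pattern query: which subjects are recorded as causing fever?
fever_causes = [t.subject for t in store.match(verb="causes", obj="fever")]
print(fever_causes)
```

The point of the indexes is that a question such as "what causes fever?" becomes a lookup rather than a scan, which is the kind of access pattern a graph database is built around.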
With the data prepared, Rickett and team then put this knowledge graph to work, using artificial intelligence algorithms to look for connections and commonalities hidden throughout the various datasets. They began with an attempt to determine whether the protein sequence that makes up the COVID-19 virus overlaps with that of any other known virus. From there, the researchers queried the data to find whether any existing drugs had already been used successfully to treat disorders with those overlapping protein sequences. In traditional medical research, this type of work would have to be done one database at a time. But within the CGE environment, the Cray team was able to search all databases simultaneously and find cross-database connections that would be invisible to standard research tactics.
"By having all of this information integrated into the same database, we're simplifying the query time—and we can write more complex queries that span multiple datasets," Maschhoff says.
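What Maschhoff describes can be sketched in a few lines of Python: once separate datasets are merged into one graph, a single query can hop from one to the next. Everything below is an assumption made for illustration. The dataset names, verbs such as similar_to and targets, and the placeholder entities (protein_X, drug_A) are hypothetical and do not represent real findings or the team's actual queries.

```python
# Hypothetical placeholder facts from three separate "datasets".
protein_db = [
    ("spike", "similar_to", "protein_X"),
    ("spike", "similar_to", "protein_Y"),
]
drug_db = [
    ("drug_A", "targets", "protein_X"),
    ("drug_B", "targets", "protein_Z"),
]
trial_db = [
    ("drug_A", "trial_outcome", "safe"),
]

# Integration step: one graph, so one query can span all three sources.
graph = set(protein_db) | set(drug_db) | set(trial_db)


def objects(graph, subject, verb):
    """All objects o such that (subject, verb, o) is in the graph."""
    return {o for s, v, o in graph if s == subject and v == verb}


def subjects(graph, verb, obj):
    """All subjects s such that (s, verb, obj) is in the graph."""
    return {s for s, v, o in graph if v == verb and o == obj}


def drugs_for_similar_proteins(graph, protein):
    """Join across datasets: similar proteins, then drugs, then trials."""
    results = []
    for similar in sorted(objects(graph, protein, "similar_to")):
        for drug in sorted(subjects(graph, "targets", similar)):
            outcomes = sorted(objects(graph, drug, "trial_outcome"))
            results.append((drug, similar, outcomes))
    return results


hits = drugs_for_similar_proteins(graph, "spike")
print(hits)
```

With the datasets kept apart, the same question would take three separate lookups and manual stitching of the results; in the merged graph it is one function call.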
Searching for connections
The Cray knowledge graph involves complex logic that connects those nine databases through a series of logical questions. Drug databases that catalog the properties of an individual chemical are cross-referenced to find that drug's interactions with protein sequences that are similar to that of the COVID spike. Another database is queried to determine what the side effects of this potential treatment might be, and another is examined to determine whether the drug has been used in a previous clinical trial and whether it was deemed effective and safe. Yet another dataset is used to determine whether the drug can feasibly be synthesized, and how. After all the relevant facts in each of these datasets are considered, the AI then infers whether a compound is worth considering as a potential treatment, and then it moves on to the next candidate.
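The sequence of questions above can be pictured as a screening routine that gathers evidence about one compound and then decides whether to keep it. The Python sketch below is a simple rule-based stand-in for that idea, not the team's actual inference logic, which the article does not detail. The Evidence fields, the screen function, and the compound names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    """Facts gathered about one compound; every field is hypothetical."""
    name: str
    targets_similar_protein: bool
    severe_side_effects: List[str] = field(default_factory=list)
    prior_trial_safe: Optional[bool] = None  # None means no trial on record
    synthesizable: bool = False


def screen(evidence):
    """Apply each check in turn; return (keep, reasons_for_rejection)."""
    reasons = []
    if not evidence.targets_similar_protein:
        reasons.append("no interaction with a spike-like protein")
    if evidence.severe_side_effects:
        reasons.append(
            "severe side effects: " + ", ".join(evidence.severe_side_effects)
        )
    if evidence.prior_trial_safe is False:
        reasons.append("deemed unsafe in a prior clinical trial")
    if not evidence.synthesizable:
        reasons.append("no feasible synthesis route")
    return (not reasons, reasons)


# Placeholder candidates, not real compounds.
candidates = [
    Evidence("compound_1", True, [], True, True),
    Evidence("compound_2", True, ["hypothetical_effect"], None, True),
    Evidence("compound_3", False, [], None, False),
]

shortlist = [c.name for c in candidates if screen(c)[0]]
print(shortlist)
```

Recording the reasons for each rejection, rather than only a yes or no, keeps the result open to review by the medical researchers who would act on it.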
In the end, the CGE whittled down the data to unearth about 160 drugs that showed promising interaction with COVID-19 analogs, including a number that had been identified by other researchers (such as Dexamethasone and Lopinavir) and that are already undergoing clinical trial evaluation. The potential link between the tetanus vaccine and reduced symptoms was also uncovered through this analysis. In total, more than 49 million protein sequences were compared against the COVID spike protein.
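The article does not say which similarity measure the team used to compare protein sequences. One standard choice for this kind of comparison is Smith-Waterman local alignment, sketched below in Python as an illustration only. The scoring parameters and the short sequences are made-up toy values, and a real run over tens of millions of sequences would be spread across many processors.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Toy query and toy "database"; these are not real protein sequences.
query = "MKVLA"
database = {"seq_1": "MKVLA", "seq_2": "MKALA", "seq_3": "GGGGG"}

scores = {name: local_alignment_score(query, seq) for name, seq in database.items()}
ranked = sorted(scores, key=lambda name: (-scores[name], name))
print(ranked, scores)
```

Each comparison is independent of the others, which is why this kind of workload divides cleanly across a large number of processors.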
"Our process scales fairly linearly," says Rickett. "Each time you double the number of nodes, it takes half as long." With a single-process computer, this analysis would have taken days. Using Cray's CGE platform, the team eventually got this analysis down to less than 20 seconds, with room for additional performance improvements down the road.
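Rickett's rule of thumb can be worked through with a little arithmetic. The Python sketch below assumes ideal linear scaling, where runtime equals the single-node runtime divided by the node count, and an assumed single-node runtime of one day (86,400 seconds). That figure is chosen purely for illustration: the article says only "days" and does not report node counts.

```python
import math


def ideal_runtime(single_node_seconds, nodes):
    """Runtime under ideal linear scaling: T(n) = T(1) / n."""
    return single_node_seconds / nodes


def doublings_needed(single_node_seconds, target_seconds):
    """How many times the node count must double to reach the target."""
    ratio = single_node_seconds / target_seconds
    return max(0, math.ceil(math.log2(ratio)))


ONE_DAY = 86_400   # assumed single-node runtime in seconds (illustrative only)
TARGET = 20        # the "less than 20 seconds" figure from the article

doublings = doublings_needed(ONE_DAY, TARGET)
nodes = 2 ** doublings
print(doublings, nodes, ideal_runtime(ONE_DAY, nodes))
```

Under these assumptions the job would need about 13 doublings, or roughly 8,000 nodes, to drop from a day to under 20 seconds. Real systems rarely scale perfectly, which is consistent with Rickett's "fairly linearly" qualifier.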
That speed matters because supercomputing is being leveraged widely across the COVID-19 research landscape, including at Exscalate4CoV, a partnership of 50 entities spread across Europe. Earlier this year, Exscalate4CoV used four separate supercomputers to test 400,000 molecules for their potential to interact with the COVID-19 virus and eventually settled on one pharmaceutical, Raloxifene, as its most promising candidate. Clinical trials were announced at the end of October and will last for 12 weeks. The prospect of speeding up this kind of research via the CGE approach is a boon, as anything that accelerates the analysis of these massive datasets can help in the development of a successful therapy.
COVID-19 research and beyond
Naturally, Rickett, Maschhoff, and Sukumar's research has implications beyond COVID-19 drug repurposing. With the entirety of nine medical databases loaded into the knowledge graph, the consolidated dataset can be queried for, well, anything. While the immediate focus is on finding COVID-19 treatments that can be tested in vivo—the team is sharing its findings with medical researchers and pharmacologists—the CGE research highlights how AI can be a critical tool in finding cures for emerging illnesses with drugs that already exist.
All it takes is a smartly designed system to sniff out the right connections. And about 20 seconds.
Lessons for leaders
- Supercomputers can solve problems that seem out of reach for conventional systems.
- Sophisticated analytics with enough data can find truths that seem counterintuitive to experts.
- Supercomputers and big data are changing the way scientific research is done.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.