It would be an overstatement to say the days of bubbling beakers and test tubes in medical research labs are gone. But today, you're as likely to see a supercomputer in the laboratory as you are racks of tissue samples.
Advanced modern medical research has become computational. Nowhere can you see this better than in the work the German Center for Neurodegenerative Diseases (DZNE) is doing on Alzheimer's research.
According to Prof. Joachim Schultze, founding director of DZNE's Platform for Single Cell Genomics and Epigenomics (PRECISE), the computing demands of just one aspect of Alzheimer's research, genomics, are enormous.
A neurodegenerative disease like Alzheimer's requires an understanding of the role of genetics. To do that, researchers survey an immense number of individual genomes and assemble those genomes back into comprehensible pictures.
This requires significant computational power. Reassembling one genome into a genetic picture of an individual starts with roughly 180 gigabytes of uncompressed data; computation on that genome adds another 500 GB, and long-term storage requires a further 100 GB.
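A quick back-of-the-envelope tally shows why even a single genome strains conventional infrastructure; the figures below simply restate the numbers cited above:

```python
# Approximate data footprint for one individual's genome, in gigabytes,
# using the figures cited in the text.
raw_reads_gb = 180     # uncompressed sequencing data
computation_gb = 500   # intermediate data generated during analysis
long_term_gb = 100     # long-term archival storage

total_gb = raw_reads_gb + computation_gb + long_term_gb
print(total_gb)        # 780 GB for a single genome
```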
This next-generation genomics work is part of what is known as systems medicine. Systems medicine, says Schultze, consists of collecting and measuring a great deal of data from a great many patients.
That permits doctors and researchers to identify "biomarkers that divide people into those with better and worse prognoses, who therefore respond better to different drugs or treatment," says Schultze. "You combine radiology, imaging, blood, and genomics data to identify treatment targets. That is the development cycle in systems medicine."
But systems medicine requires researchers to apprehend a tremendous amount of data and find nuanced patterns within it. It is, says Schultze, beyond the scope of traditional research to do so.
What you might know as big data, or the data flood, Schultze calls the "knowledge gain." Medicine has not been spared the inundation of information. Systems medicine is a solution to the problem, but it requires high-performance computing to practice. As an example, in the next decades, says Schultze, the number of genomes sequenced will rise from 1 million to 1 billion.
"There is currently no way of accessing all the data at hand," says Schultze. "We need systems that help us utilize this knowledge gain. There will be ethical and data safety issues, but if we can figure it out, once we have access, we can learn so much."
The urge to refine its research led DZNE to The Machine, Hewlett Packard Enterprise's 160 terabyte, memory-centric prototype. DZNE uses HPE's new Memory-Driven Computing architecture to tame its data; doing so has made its genomics pipeline run 100 times faster.
Its process is worth exploring.
To understand how a disease works, you have to understand how a cell works, says Dr. Matthias Becker, a postdoctoral researcher at the University of Bonn. To do that, he says, you have to look at the blueprint for the cell's proteins, which means sequencing its DNA.
To turn that sequencing output into information a researcher can use, the snippets of data gathered from a single person have to be reassembled. The snippets of genetic data are aligned against a reference genome, a complete genome that serves as a guide. This is "a computationally expensive process," Becker says. To make it as time-efficient as possible, Becker's team uses Kallisto, an open source "pseudo-alignment" tool developed at Caltech.
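Pseudo-alignment sidesteps full base-by-base alignment: each read is broken into k-mers, the k-mers are looked up in a precomputed index of the reference, and the read is assigned to whichever reference sequences are compatible with all of its k-mers. The following is a deliberately simplified toy sketch of that idea, not Kallisto's actual data structures; the gene names and the tiny k value are made up:

```python
from collections import defaultdict

K = 5  # toy k-mer length; real tools such as Kallisto default to k = 31

def kmers(seq, k=K):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references):
    """Map each k-mer to the set of reference sequences containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq):
            index[km].add(name)
    return index

def pseudo_align(read, index):
    """Return the references compatible with *every* k-mer of the read."""
    compatible = None
    for km in kmers(read):
        hits = index.get(km, set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:
            break  # no reference matches; the read stays unassigned
    return compatible or set()

refs = {"geneA": "ACGTACGTTT", "geneB": "GGGGCCCCAA"}  # hypothetical names
idx = build_index(refs)
print(pseudo_align("ACGTACG", idx))  # {'geneA'}
```

The intersection-of-hit-sets step is what makes the approach cheap: no per-base scoring is done, only hash lookups.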
Earlier tools took almost two days to process 30 million "reads," reviewing and assembling 30 million snippets stored in the FASTQ format. Kallisto, running a larger dataset of 127 million reads on DZNE's existing hardware, took 22 minutes. When DZNE began using Kallisto in 2016 on HPE's Superdome X server with Memory-Driven Computing tools, the same data was processed in 13 seconds.
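Those figures are worth checking against each other; the arithmetic below simply reproduces the speedups from the numbers quoted in the text:

```python
# Per-read throughput: earlier tools vs. Kallisto on conventional hardware.
old_s_per_read = (2 * 24 * 3600) / 30e6    # ~2 days for 30 million reads
kallisto_s_per_read = (22 * 60) / 127e6    # 22 minutes for 127 million reads
print(round(old_s_per_read / kallisto_s_per_read))  # ~554x faster per read

# Memory-Driven Computing on the Superdome X vs. the same Kallisto run.
print(round((22 * 60) / 13))  # ~102, the "100 times faster" cited above
```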
To get this improvement, the researchers and computer scientists modified the k-mer access and memory management, using the Librarian File System (LFS) instead of traditional storage. The team also considered what could be shared among multiple instances; as a result, it distributed the work across nodes that access a single index in parallel, moving the FASTQ files to LFS so different tools could work on the same datasets. They also used memory mapping, which, unlike linear file reading, allows data to go to any available processing node without waiting.
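Memory mapping is the key to that "without waiting" property: once a file is mapped into the address space, any worker can jump directly to any byte range instead of streaming through the file from the start. A minimal illustration using Python's standard mmap module follows; the file name and contents are made up for the example:

```python
import mmap
import os

# Create a tiny stand-in for a FASTQ file (name and content illustrative).
path = "reads.fastq"
with open(path, "wb") as f:
    f.write(b"@read1\nACGTACGT\n+\nFFFFFFFF\n")

with open(path, "rb") as f:
    # Map the file into the process's address space: any byte range can be
    # accessed directly, with no seek-and-read loop, and pages are loaded
    # only when actually touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        seq = mm[7:15].decode()  # jump straight to the sequence line
        print(seq)               # ACGTACGT

os.remove(path)
```

On a single machine this saves copies and seeks; on a fabric-attached memory pool like The Machine's, the same access pattern lets many nodes share one copy of the data.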
By holding data in memory, DZNE eliminated repeated initialization of the reference genome, since it stays the same from run to run. The researchers split the reads into shorter k-mers and stored the resulting graph as a hash table. Finally, they lowered the hash table's hard-coded load factor of 95 percent, taking advantage of the huge memory pool available to reduce collisions.
At this point, DZNE asked, "What would new, better hardware do to speed this process up even more?" The application was now limited by the scale-up architecture of the Superdome X. However, The Machine prototype had a memory fabric that scaled as additional processing nodes were added. Although the prototype was slower than the Superdome X, because it used FPGAs rather than ASICs, results suggested that the application performance would scale linearly as additional nodes were added to the problem.
"Happily, that turned out to be the case," says Schultze. This enabled DZNE to do more in less time at a smaller computing cost.
It was also a qualitative change.
The shortened processing time changed the team's workflow. When DZNE's researchers ran their pipelines before, they had to fill the five or six days between starting a run and seeing the results.
"You have to do something else in that time," says Schultze. The researchers might work on several projects at a time, each of which might be quite different. Not everything multitasks well. "When the results come back you wonder, 'What was my question?'" he adds. "Now, we can think of doing research completely differently, not interrupting our thought processes anymore. Creativity increased with reduced data analysis time."
This experience and these experiments are just the beginning. Genomic data is not the only data DZNE needs to work with to understand the behavior of dementia and Alzheimer's, according to Becker. Clinical data, labs, imaging, and environmental data all have to be brought together. They need to be stored locally under the proper anonymized identities but remain parsable by different criteria in order to find patterns that can lead to effective treatment.
In the future, systems medicine will employ data integration, machine learning, neural networks, and visualization.
As just one example, Schultze points out, computational modeling has uncovered possible treatments for spinal injuries.
"Neurons are either growing or functioning; they're not capable of doing both," he says. "We knew there were molecular switches which moved the neurons from one state to the other. By classic laboratory experiments, we discovered that one pain medication seemed to switch the neurons into growth mode. By processing a large amount of data from clinics across Europe that treated spinal cord injuries, we verified that people who had been prescribed that drug had much better outcomes than those who did not."
Because of the access to a large amount of data and the tools to recognize patterns in that data, Schultze's team achieved enlightening results that they published in a scientific journal. The drug is now being reviewed for clinical trials.
In any endeavor, but particularly in medical research, a data flood is a challenge. But, properly met, it provides hope to those facing otherwise insoluble diagnoses. Faster and more powerful computers can help. The Machine, and other ways around the end of Moore's Law, are more than an engaging puzzle or the business of business. The need to move forward is a part of our very DNA.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.