Inside the Broad Institute's genomic research infrastructure
High-powered computing, cloud computing, and a new generation of gene sequencing technologies are ushering in tremendous breakthroughs in genetics research. But it’s not only hardware leading this revolution. Specialized software designed for genetics research is also key to advancing our understanding of biology and the hunt for therapies for diseases.
Nowhere is this truer than at the Broad Institute of MIT and Harvard, one of the largest genetics research institutes in the world. Founded in 2004 to take full advantage of the Human Genome Project’s mapping of the human genome, the nonprofit institute takes a cross-disciplinary approach to understanding the biological basis of disease and generating knowledge that can be used for more effective prevention, diagnosis, and treatment.
In addition to laboratory science, Broad teams also develop software to aid research efforts and share it openly with researchers around the world. At the forefront of this work is the creation and release of software to standardize and speed up the complex genetic analyses conducted by researchers globally.
One such suite of tools is the Genome Analysis Toolkit (GATK), version 4 of which was released as open source in early 2018. This article examines GATK4, details its benefits, and provides an overview of the hardware designed to run it.
Looking for genetic variants
GATK4 was designed to help researchers identify human genetic variants, which can be a difficult, time-consuming task. This process starts after sequence data has been produced from actual biological DNA, using a technique called next-generation sequencing. Typically, the technique creates massive image files that need to be pieced together to decipher a person’s genome. Christopher Davidson, manager of Hewlett Packard Enterprise’s Life Science Solutions, explains the process this way: “Putting together a human genome from these images is like doing a puzzle, but the puzzle has 3 billion pieces.”
The first draft sequence of the human genome was released in 2001 by the Human Genome Project, an international effort that created what is referred to as a reference genome. The reference genome includes the entire set of human genes and spans approximately 3 billion “letters” of DNA. Researchers use it as a standard of comparison when they study individual genome sequences. For example, specific genes in the reference genome have been found to determine eye color, but within those genes, different people have different variants for blue eyes, brown eyes, and so on. People may also have genetic variants or mutations that can predispose them to diseases such as breast cancer or early-onset Alzheimer’s disease.
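To make the idea of a "standard of comparison" concrete, here is a toy sketch in Python. It is not how GATK4 works: it assumes two short sequences that are already aligned and of equal length, which real sequencing data is not, and the function name `find_variants` and the example sequences are invented for illustration.

```python
from typing import List, Tuple


def find_variants(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Toy assumption: both sequences are pre-aligned and equally long.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Made-up sequences, ten "letters" each
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

variants = find_variants(reference, sample)
for pos, ref_base, alt_base in variants:
    print(f"position {pos}: {ref_base} -> {alt_base}")
```

A real genome is about 3 billion letters long and arrives as many overlapping fragments, which is why purpose-built tools are needed for the actual work.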
Researchers try to discover the links between mutations, or variants, and human health. Sequencing the genomes of many people with a particular health condition, for example, can help scientists learn more about the effects of gene variants, deepen their understanding of the biology of the disease, and potentially open a pathway to new diagnostics or preventive measures.
That’s where GATK4 comes in
“With the Genome Analysis Toolkit, you can more easily find those variants related to what you’re studying and then perform an analysis on it,” Davidson says. “With it, you look for variance patterns across the population.”
GATK4 lets researchers perform these analyses for very specific purposes, such as looking for genetic causes of diabetes. It can be used to analyze the genome of a single individual or of a massive population. In China, for example, a project aims to sequence the genomes of 100 million people, and researchers will use GATK4 to analyze the resulting data.
Digging deeper into GATK4
GATK4 isn’t a single, one-size-fits-all piece of software. Instead, it’s a suite of more than 100 tools that can be strung together in a pipeline. The output of the first tool is passed automatically to the second, whose output is passed to the third, and so on. Different pipelines serve different types of research, such as cancer research or genomics research. Depending on the pipeline, researchers may use only 10 to 15 of the tools.
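The chaining idea can be sketched in a few lines of Python. This is only an illustration of the pattern, assuming each tool is a function that takes a list of reads and returns a new list. The three stages below are hypothetical stand-ins, not real GATK4 tools, and `run_pipeline` is an invented helper.

```python
from functools import reduce
from typing import Callable, Iterable, List

# A stage takes a list of reads and returns a new list of reads.
Stage = Callable[[List[str]], List[str]]


def run_pipeline(stages: Iterable[Stage], data: List[str]) -> List[str]:
    """Feed data through each stage in order; each output is the next input."""
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical stand-in stages (NOT real GATK4 tools)
def drop_short_reads(reads: List[str]) -> List[str]:
    return [read for read in reads if len(read) >= 4]


def uppercase_reads(reads: List[str]) -> List[str]:
    return [read.upper() for read in reads]


def deduplicate_reads(reads: List[str]) -> List[str]:
    # dict.fromkeys keeps the first occurrence and preserves order
    return list(dict.fromkeys(reads))


pipeline = [drop_short_reads, uppercase_reads, deduplicate_reads]
result = run_pipeline(pipeline, ["acgt", "ac", "ACGT", "ttga"])
print(result)
```

Swapping, adding, or removing entries in the `pipeline` list changes the workflow without touching the stages themselves, which is the appeal of building different pipelines from one shared set of tools.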
GATK4, developed by Broad Institute software engineers and data scientists at the Intel-Broad Center for Genomic Data Engineering, is designed to be used by a wide variety of people, from researchers to clinicians, explains Lee Lichtenstein, Broad’s associate director for somatic computational methods.
“There are a lot of users out there using it in many different ways,” Lichtenstein says. “They generally fall into several broad categories: for example, researchers who are identifying mutations that can cause disease. And it can also be used by clinicians who are analyzing the genome of a patient to better inform treatment decisions.”
GATK4 is a significant improvement over previous versions, Lichtenstein says. Particularly useful is a new parallelization framework that substantially speeds up its work. “Right out of the box, most tools work about 30 percent faster,” he says. “It more efficiently leverages parallelization and the cloud.”
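The general shape of parallel processing, splitting the data into pieces, working on the pieces at the same time, and combining the partial results, can be sketched as follows. The article does not describe GATK4’s framework in detail, so this is an assumption-laden illustration using a thread pool from Python’s standard library. The function names and the GC-counting task are invented for the example, and in standard Python a thread pool shows the pattern rather than delivering a real speedup for this kind of computation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_chunks(sequence: str, chunk_size: int) -> List[str]:
    """Scatter step: cut the sequence into consecutive pieces."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def count_gc(chunk: str) -> int:
    """Work done independently on each piece: count G and C letters."""
    return sum(1 for base in chunk if base in "GC")


def parallel_gc_count(sequence: str, chunk_size: int = 4, workers: int = 4) -> int:
    """Process all chunks concurrently, then gather the partial counts."""
    chunks = split_into_chunks(sequence, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)


# Made-up twelve-letter sequence
total = parallel_gc_count("ACGTGGCCATAT")
print(total)
```

The pattern works because each chunk can be counted without knowing anything about the others. Analyses that need context across chunk boundaries require more careful splitting.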
In addition, GATK4 makes it easier and faster to analyze tremendous amounts of genomic data. Some tasks that used to take months now take only a few weeks, Lichtenstein says, and Broad is looking to speed them up even more. Faster processing, he says, can lead to significant breakthroughs because researchers can do more work in less time.
“There are things we can do with GATK4 that simply couldn’t be accomplished in earlier GATK versions,” Lichtenstein says.
The benefits go beyond speed. Researchers can do all their work in a single suite rather than having to cobble together tools from multiple sources, which lets them concentrate on their work rather than on the tools and infrastructure. “Researchers can go to one place and not have to worry about things such as their software being supported by only one graduate student who has since gone off to other things,” Lichtenstein says. “We want to have a unified place where researchers can focus only on their work and not face distractions.”
That benefit is echoed by Eric Banks, senior director of the data sciences platform at Broad and a creator of the original GATK software package. “We wanted to remove traditional barriers of scale while offering the same high level of data quality our users expect,” he says. “Thanks to the rapid adoption of cloud computing, researchers can finally do away with many of the infrastructure-related complications that have hampered progress, especially at smaller institutions and startups.”
Why open source?
It was important to Broad that this latest iteration of GATK be released as open source, and many scientists in the research community agree. Jeremy Freeman, manager of computational biology at the Chan Zuckerberg Initiative, says, “Open sourcing the GATK is a big deal for open genomics and for open science in general. Not only does it make this critical tool available to as broad as possible an audience for use, reuse, inspection, and contribution, it provides a powerful example to the community for how an existing project can embrace open source.”
The importance of BIGstack
Broad Institute researchers know that software by itself can’t do the job. Hardware is also crucially important. After all, the institute is one of the largest producers of human genomic data in the world—approximately 24 terabytes of new data every day. Overall, it manages more than 50 petabytes of data.
To ensure that GATK4 operates properly and efficiently, the institute worked with Intel to establish a reference software-hardware architecture for running it. Called the Broad-Intel Genomics Stack (BIGstack), the architecture has delivered a fivefold performance improvement compared with previous versions and has reduced the time it takes to deploy genomic workflow infrastructure. BIGstack combines software that improves the genomics-analysis workflow with high-performance data analytics computing clusters and other high-end hardware, including Intel solid-state drives and field-programmable gate arrays.
HPE has participated in the project as well, and offers compute and storage solutions for it, including HPE Apollo systems and HPE ProLiant servers.
Ultimately, the benefits of GATK4 and BIGstack go well beyond feeds and speeds and letting researchers collaborate using open source tools. More than 55,000 teams of researchers use GATK4. That means breakthroughs in genetics that dramatically improve human health are closer than ever, and progress is likely to accelerate as GATK4 reaches even more researchers around the world.
Broad Institute's genomic research infrastructure: Lessons for leaders
- Including the community leads to better results.
- Open sourcing the software toolkit encourages greater participation and a broader range of results.
- Collaboration on hardware technologies is required for optimum results.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.