Behind the scenes on building an exascale supercomputer
OCTOBER 18, 2023 • BLOG POST • TRISH DAMKROGER, CHIEF PRODUCT OFFICER AND SENIOR VICE PRESIDENT, HPC & AI, AT HEWLETT PACKARD ENTERPRISE
IN THIS ARTICLE
- HPE celebrates Exascale Day 2023 by recognizing teams and the milestones achieved to surpass an exaflop
- HPE reflects on powerful public-private collaboration with the U.S. Department of Energy’s national laboratories to overcome challenges in building new technologies, preparing data centers, and constructing and testing world-first innovation
This Exascale Day, HPE’s Trish Damkroger reflects on the journey to exascale and highlights the major accomplishments and key lessons learned in realizing a first-of-its-kind innovation
More than a decade ago, after accomplishing many world-first scientific and engineering milestones using petascale supercomputers, scientists, researchers, and engineers still required more compute to solve problems.
That is when exascale computers – which are ten times more powerful than those petascale machines – were envisioned. There is a greater need for more compute to better understand full systems, whether it is the human body from inside a cell to its full organ, or weather patterns from meters to kilometers.
That is why every year on Exascale Day, we celebrate those visionaries. We celebrate the power of 10^18 – or one billion billion – calculations per second and its transformative impact on scientific discovery and innovation.
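To get a feel for that number, here is a minimal sketch of the arithmetic. The world population figure of roughly 8 billion is an assumption for illustration and does not come from this article; the function name is likewise hypothetical.

```python
# One exaflop: 10^18 calculations per second ("one billion billion").
EXAFLOP = 10**18
BILLION = 10**9

# Assumption (not from the article): about 8 billion people,
# each performing one calculation per second.
WORLD_POPULATION = 8 * 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_for_humanity(calculations: int, people: int = WORLD_POPULATION) -> float:
    """Years needed if every person performs one calculation per second."""
    seconds = calculations / people
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Confirm that "one billion billion" really is 10^18.
    print(EXAFLOP == BILLION * BILLION)
    # Time for humanity to match one second of exascale work.
    print(f"{years_for_humanity(EXAFLOP):.1f} years")
```

Under that assumption, matching a single second of exascale computing would take everyone on Earth roughly four years of nonstop calculating.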
Exascale Day 2023 - Celebrating those who continue to ask what if, why not, and what’s next [VIDEO]
Since we ushered in exascale supercomputing more than a year ago with Frontier, built for the U.S. Department of Energy’s Oak Ridge National Laboratory (ORNL), researchers have already dramatically sped up discovery, including:
- Improving weather forecasts by simulating a year’s worth of global climate data in just a day
- Accelerating patient diagnosis and care by making millions of biomedical publications accessible in an online search
- Cutting the time needed to analyze tens of thousands of COVID-19 mutations and predict their spread from one week to just 24 hours
- Accelerating GE Aerospace’s understanding of a new jet engine design that could reduce CO2 emissions by 20%
This is just the beginning. We have come incredibly far as an industry, and I look forward to seeing many future discoveries made by Frontier and the upcoming exascale systems Aurora at Argonne National Laboratory (ANL) and El Capitan at Lawrence Livermore National Laboratory (LLNL).
Reflecting on the journey to exascale
However, getting to exascale was no easy feat.
It came with many challenges, from design and engineering to unpredictable constraints triggered by a global pandemic.
Nonetheless, these challenges taught us important lessons that demonstrate the power of our community and of strong collaboration across public and private organizations to realize groundbreaking innovation.
I want to reflect on some of these obstacles and how we navigated them.
1. Designing completely new technologies for the exascale era
One of the biggest challenges to achieving exascale was addressing its immense power consumption. At the time, building an exaflop computer with existing technologies – to deliver 10 times the speed – would have consumed more than 600 megawatts. That is roughly the output of a typical power plant.¹
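A rough sketch of what the 600-megawatt figure implies for energy efficiency follows. The 20-megawatt budget in the example is a hypothetical value chosen only for illustration; it is not stated in this article, and the function names are illustrative.

```python
EXAFLOP = 1e18        # calculations per second
POWER_WATTS = 600e6   # 600 megawatts, the figure cited above


def gigaflops_per_watt(flops: float, watts: float) -> float:
    """Energy efficiency: billions of calculations per second per watt."""
    return flops / watts / 1e9


def efficiency_gain_needed(target_watts: float, baseline_watts: float = POWER_WATTS) -> float:
    """Factor by which efficiency must improve to fit within a power budget."""
    return baseline_watts / target_watts


if __name__ == "__main__":
    # Efficiency implied by one exaflop at 600 megawatts.
    print(f"{gigaflops_per_watt(EXAFLOP, POWER_WATTS):.2f} gigaflops per watt")
    # Hypothetical 20-megawatt budget (an assumption, not from the article).
    print(f"{efficiency_gain_needed(20e6):.0f}x improvement needed")
```

At 600 megawatts, an exaflop works out to under two gigaflops per watt, which is why the power problem could not be solved by scaling up existing designs alone.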
The U.S. DOE national laboratories, with which we have partnered for decades to power critical testbed projects for advanced computing, explored ways to address energy efficiency on existing petaflop systems.
Oak Ridge National Laboratory’s engineers experimented with optimizing compute resources. They coupled traditional compute (CPU) with accelerated compute (GPU) so that some of the specialized, data-intensive workloads could be offloaded from the former to the latter.
In another approach, Argonne National Laboratory transitioned one of its supercomputers from air cooling to liquid cooling, eliminating fans that consumed more electricity.
While these experiments made a difference, there was still a larger need for end-to-end technologies to deliver a reliable, resilient system that could support the magnitude of exascale with greater efficiency.
After a multi-year, nationwide R&D project involving a strong private-public partnership of government and industry leaders, we developed a powerful end-to-end platform, built from the ground up, that uniquely packages purpose-built compute, accelerated compute, networking, storage, software, and liquid cooling to support the scale and performance of exascale.
In fact, when it debuted, Frontier, an HPE Cray EX supercomputer, claimed the title of world’s fastest supercomputer while also showing green leadership, ranking number one as the world’s most energy-efficient supercomputer.
In addition to powering the U.S. exascale systems, these new supercomputing platforms have been deployed around the world in next-generation systems to power a range of scientific and AI-driven projects.
2. Preparing an exascale-class data center and accompanying facilities
Exascale supercomputers require advanced data centers to support higher requirements for power, cooling, and overall structural support. Due to the massive scale and performance, data center facilities need to undergo a significant electrical and structural upgrade.
This large and complex project starts by making room in existing data center facilities, where older systems are hosted. Each of the national laboratories hosting the U.S. exascale systems had to take on this task, followed by a complete outfitting. This meant decommissioning older supercomputers and stripping everything out of the room, including piping, electrical infrastructure, and even the data center’s floor.
When preparing for Frontier, ORNL’s Oak Ridge Leadership Computing Facility (OLCF) decommissioned its Titan supercomputer, a Cray system, after seven years of service to free up a 20,000-square-foot room. An HPE team of technicians worked for a month just to dismantle, remove, and recycle 430,000 pounds of Titan’s components before revamping the room to accommodate Frontier.
In addition to the facility, OLCF had to outfit a mechanical room to include 130,000 gallons worth of cooling water towers and 350-horsepower pumps that can each move more than 6,000 gallons of water per minute throughout Frontier.
Similarly, ANL and LLNL had to undergo major facility upgrades to prepare for their exascale systems.
LLNL, in particular, had to completely modernize the Livermore Computing Center, which was originally built for terascale computing that utilized far less power than exascale computers.
Its latest infrastructure upgrade will support 85 megawatts of power and 28,000 tons of water to cool LLNL’s upcoming El Capitan supercomputer.
Working through the COVID-19 pandemic, construction crews installed a 115-kilovolt (kV) transmission line, air switches, substation transformers, switchgear, relay control enclosures, 13.8 kV secondary feeders, and cooling towers, supplying the computing facility with enough energy to power a medium-sized city.
The upgrade will allow LLNL to run El Capitan and future exascale-class machines. Additionally, for more than two years, HPE and LLNL employees have worked together to prepare the computing floor for El Capitan, running cables and routing pipes for power and water, and installing the system’s data storage and compute components.
Because the laboratory is based in California, where earthquakes and their aftershocks are a risk, the teams also earthquake-proofed the data center to prevent El Capitan from shifting or being damaged.
3. Constructing an exascale computer for the very first time
With Frontier, we began our work to deliver, build, and test an exascale system for the very first time, and we did so during the COVID-19 pandemic.
Shelter-in-place mandates and shutdowns around the world presented many constraints, from supply chains to safely staffing projects to meet aggressive timelines.
Frontier, and the upcoming Aurora and El Capitan systems, are also uniquely built in that they are completely constructed and tested onsite at their respective laboratories. These projects require large teams, across various functions, to carefully install the many pieces.
The systems are also composed of tens of millions of components, including hundreds of networking cables and dozens of HPE Cray EX cabinets, each weighing more than 8,000 lbs.
These brand-new technologies had to be transported in multiple truckloads, with deliveries spread out over weeks and sometimes even months. These resource- and engineering-intensive tasks added to the complexity of the journey as we worked with our trusted suppliers to source components and ensure timely delivery during the pandemic.
It took a village, to say the least.
With Frontier, more than 100 team members across ORNL, HPE, and AMD worked around the clock, up to the final minutes and seconds of system performance testing, to surpass an exaflop.
Similarly, a massive team across Argonne, Intel, and HPE is working tirelessly to ready Aurora.
Here’s a look at how, together with the labs and our industry partners, we build and install these exascale systems:
How to celebrate Exascale Day
Our collective efforts with exascale helped us arrive at a pivotal moment in the supercomputing industry, one that will only take us further.
Exascale will greatly benefit scientific and research communities. I welcome you to celebrate this Exascale Day by learning more from experts on what’s in store for the next era of scientific discovery, including how we can further our efforts in sparking new, creative ideas by fostering a diverse environment.
- Unlocking the Future of Science with Aurora at Argonne National Laboratory [VIDEO]
- Revolutionizing Supercomputing with El Capitan at Lawrence Livermore National Laboratory [VIDEO]
- Advancing Science and Innovation with Frontier at Oak Ridge National Laboratory [VIDEO]
- Empowering diversity in supercomputing and AI [VIDEO]
1 Exascale Study Group report “Technology Challenges in Achieving Exascale Systems”