Exascale computing and its enabling technologies
As supercomputing moves from petascale to exascale performance, it brings a new generation of capabilities to fields such as scientific exploration, artificial intelligence, and data analysis. The first U.S.-made exascale computers have been contracted for by the Lawrence Livermore, Oak Ridge, and Argonne National Laboratories. Two of these systems will be coming directly from HPE Cray, while the third is being built as part of a partnership with Intel.
The seemingly sudden shift to next-generation supercomputing isn’t just a matter of winning a speeds-and-feeds race. The technologies being deployed in these systems are meant to ensure they deliver on the promise of exascale computing rather than simply top a benchmark.
Where do I plug this in?
Power has been a major concern as supercomputers have scaled up to reach greater performance, and recent developments have focused on delivering performance improvements that outpace increases in required power. As an example, the Japanese Fugaku supercomputer, currently the world’s fastest, is also relatively efficient, ranking ninth in the world on the Green 500 list, which ranks supercomputers by energy efficiency (measured in GFLOPS per watt). Its 7.3 million processor cores require more than 28 megawatts of power to actively process data. While that is a huge number, it represents twice the efficiency of the current No. 4 supercomputer, China’s Sunway MPP, which requires 50 percent of the power of Fugaku while delivering only 25 percent of the peak performance.
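The "twice the efficiency" figure follows directly from the two percentages quoted above. A minimal sketch of that arithmetic, using only the ratios given in the text (the function name is illustrative):

```python
def relative_efficiency(perf_fraction: float, power_fraction: float) -> float:
    """Efficiency of system B relative to system A, where B's performance
    and power are expressed as fractions of A's."""
    return perf_fraction / power_fraction

# Ratios from the text: the No. 4 system draws about 50 percent of Fugaku's
# power and delivers about 25 percent of its peak performance.
other_vs_fugaku = relative_efficiency(perf_fraction=0.25, power_fraction=0.50)
fugaku_vs_other = 1 / other_vs_fugaku

print(f"No. 4 system efficiency relative to Fugaku: {other_vs_fugaku:.1f}x")
print(f"Fugaku efficiency relative to No. 4 system: {fugaku_vs_other:.1f}x")
```

Because only ratios are involved, the result does not depend on the absolute megawatt or FLOPS figures.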
Clearly, there are new technologies at work here that are allowing performance to scale faster than power demand. In the case of Fugaku, one of the innovations at work is its use of an extension of the ARM architecture known as the Scalable Vector Extension (SVE), designed in part through multinational exascale research efforts. SVE makes use of vector technologies, many pioneered by Cray architectures, that enable CPUs to more efficiently execute computationally heavy workloads while maintaining programmer productivity.
More efficient execution reduces the processors’ power demand. And SVE technology, by virtue of its inclusion in the ubiquitous ARM architecture, will eventually spread to a much wider range of platforms, particularly in places where efficient AI and media processing are critical applications.
One of the side effects of the industry’s work to increase the efficiency of processors is an increasing amount of silicon in every socket. This can be seen most easily in the chiplet-based products coming to market, such as AMD’s Epyc CPUs and the upcoming Intel Ponte Vecchio GPU slated to be used in the U.S. Department of Energy’s Aurora system.
Packing more silicon into a package keeps more compute and data close together, greatly reducing the power required to move data around. However, one consequence of this trend is higher power per socket and higher overall power density within a system. CPU power requirements have climbed from 100 to 130 to 170 watts per socket, and CPUs are now being delivered at more than 280 watts per socket. This trend will continue for the foreseeable future in order to maintain performance growth, which in turn puts significant pressure on server power delivery and cooling.
Modern supercomputing infrastructure has been developed to deploy better power delivery and cooling capabilities to get the most out of CPUs, memories, and accelerators. While typical power for a moderately loaded standard server rack is in the range of 30 to 40 kilowatts, liquid-cooled high-performance computing (HPC) infrastructures are capable of five to 10 times that density. Not only do these solutions reduce the data center footprint of your system, but they allow you to get full use of your CPUs and accelerators at performance levels that would not be feasible in standard racks, where power limitations often require you to leave performance on the table.
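To make the footprint argument concrete, here is a rough sketch of the rack arithmetic. The rack power range and the density multiplier come from the text; the 20 MW system size is a hypothetical figure chosen only for illustration, and the calculation ignores networking, storage, and cooling overhead.

```python
import math

STANDARD_RACK_KW = (30, 40)   # range quoted in the text
DENSITY_MULTIPLIER = (5, 10)  # "five to 10 times that density"

def racks_needed(system_kw: float, rack_kw: float) -> int:
    """Number of racks required to host a system of the given power."""
    return math.ceil(system_kw / rack_kw)

# Implied power range for a liquid-cooled HPC rack.
liquid_low_kw = STANDARD_RACK_KW[0] * DENSITY_MULTIPLIER[0]
liquid_high_kw = STANDARD_RACK_KW[1] * DENSITY_MULTIPLIER[1]

SYSTEM_KW = 20_000  # hypothetical 20 MW system, not a figure from the article

standard_racks = racks_needed(SYSTEM_KW, STANDARD_RACK_KW[1])
liquid_racks = racks_needed(SYSTEM_KW, liquid_high_kw)

print(f"Liquid-cooled rack range: {liquid_low_kw} to {liquid_high_kw} kW")
print(f"Racks for {SYSTEM_KW} kW: {standard_racks} standard vs {liquid_racks} liquid-cooled")
```

Under these assumptions the same system fits in a tenth of the racks, which is where the footprint reduction comes from.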
While liquid cooling has long been a staple of HPC deployments, it’s now becoming more common on platforms ranging from desktop gaming computers to the data center, where deployments range from rack-mounted radiator doors to fully immersed systems in which the entire rack is submerged in non-conductive fluid. The high-density infrastructure used for exascale supercomputer systems is a bellwether for the increase in adoption of liquid-cooling technologies.
Memory performance, in terms of available bandwidth, is often a major impediment to the performance of supercomputer applications. Analyses of HPC and AI applications have shown that memory performance, with an emphasis on increased bandwidth, is a critical component of reaching exascale performance.
OpenFOAM, a leading fluid dynamics framework used by many engineering and science applications, particularly in the automotive industry, is a prime example of the problem. Analysis of OpenFOAM shows its performance improves almost exclusively in proportion to memory bandwidth on modern platforms, while seeing very little improvement from increasing floating-point operations per second.
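One common way to reason about this kind of behavior is the roofline model, which the article does not name but which matches the description: attainable performance is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity (floating-point operations per byte moved). The sketch below uses hypothetical machine and kernel numbers, not measurements of OpenFOAM or of any real platform.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: performance is capped either by peak compute
    or by how fast memory can feed the compute units."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical platform and a hypothetical low-intensity (bandwidth-bound) kernel.
PEAK_GFLOPS = 3000.0
BANDWIDTH_GBS = 200.0
FLOPS_PER_BYTE = 0.25

baseline = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_BYTE)
double_flops = attainable_gflops(2 * PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_BYTE)
double_bandwidth = attainable_gflops(PEAK_GFLOPS, 2 * BANDWIDTH_GBS, FLOPS_PER_BYTE)

print(f"Baseline:           {baseline:.0f} GFLOPS")
print(f"2x peak FLOPS:      {double_flops:.0f} GFLOPS")
print(f"2x memory bandwidth: {double_bandwidth:.0f} GFLOPS")
```

For a kernel this far below the compute ceiling, doubling peak FLOPS changes nothing while doubling bandwidth doubles throughput, which is the pattern the text describes.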
Typical server memories, such as DDR4, are incapable of accomplishing exascale bandwidth goals. Common DDR4 memory modules are neither cost effective nor, more importantly, sufficiently energy efficient for deployment in exascale computing. A memory system designed to deliver the necessary performance for exascale applications using DDR4 would consume more than 50 MW in DIMM power, with the cost of the memory alone exceeding $1 billion.
The industry realized that to reach these system-level goals in a power- and cost-effective way, a more bandwidth-focused memory technology was needed, which was a driving factor in the development of High Bandwidth Memory (HBM).
HBM technology provides an order of magnitude more bandwidth than DDR4 SDRAM and has now been adopted across the industry in solutions that target HPC and AI applications. HBM differs from typical memories in that it is integrated directly within the processor or accelerator socket instead of being purchased and populated separately. This tight integration is what enables HBM’s bandwidth to be achieved in a much more power-efficient way. Further, the bandwidth focus of the technology means much less of it is required to meet system-level bandwidth goals, greatly reducing the overall cost. Memory solutions using HBM are provided widely in GPUs and AI accelerators (such as Google’s TPU) today and will become more widespread over the next few years as other platforms tackle these same challenges.
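The order-of-magnitude claim can be checked with a back-of-the-envelope peak bandwidth calculation: interface width times per-pin transfer rate. The figures below are nominal values assumed for illustration (a 64-bit DDR4-3200 channel and a 1024-bit HBM2 stack at 2.0 Gb/s per pin); they are not taken from the article, and real sockets combine several channels or stacks.

```python
def peak_bandwidth_gbs(bus_width_bits: int, gigatransfers_per_s: float) -> float:
    """Theoretical peak bandwidth in GB/s for one memory interface."""
    return bus_width_bits * gigatransfers_per_s / 8  # 8 bits per byte

# Assumed nominal interface parameters (not from the article).
ddr4_channel = peak_bandwidth_gbs(bus_width_bits=64, gigatransfers_per_s=3.2)
hbm2_stack = peak_bandwidth_gbs(bus_width_bits=1024, gigatransfers_per_s=2.0)
ratio = hbm2_stack / ddr4_channel

print(f"DDR4-3200 channel: {ddr4_channel:.1f} GB/s")
print(f"HBM2 stack:        {hbm2_stack:.1f} GB/s")
print(f"Ratio:             {ratio:.1f}x")
```

The very wide interface is only practical because the memory sits inside the package, which ties the bandwidth advantage back to the tight integration described above.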
Anyone have a map?
Seymour Cray liked to say, “Anyone can build a fast CPU. The trick is to build a fast system.” And as John Gage of Sun Microsystems famously said back in the 1980s, “The network is the computer.” Modern supercomputing relies heavily on technology that allows the millions of processor cores in a single system to be interconnected and to communicate effectively. The larger the number of processors in a supercomputer, the more difficult it becomes to deconstruct problems to make efficient use of parallel processing. When running applications at exascale, you need to keep consistent, low-jitter communications running at high speed so that the CPUs and accelerators keep moving forward rather than wasting time and energy waiting for data transfers.
This is especially challenging because modern HPC and AI workloads tend to have hybrid components, which require multiple traffic types on a network―from high-speed message passing to high-throughput parallel storage and data ingest. As a result, the lines between highly performant HPC fabrics and the data center standard of Ethernet are beginning to blur as more diverse programming paradigms push the need for more standard network support. Further, failure to isolate these different workflow elements will result in applications being swamped by communication stragglers that were held up in traffic.
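The cost of stragglers can be illustrated with a toy model of a synchronized application, where every step finishes only when the slowest participant's data has arrived. This is a simplified sketch: the exponential delay distribution, the timing values, and the function name are all assumptions made for illustration, not a description of any real fabric.

```python
import random

def mean_step_time(ranks: int, compute_ms: float, jitter_mean_ms: float,
                   steps: int = 200, seed: int = 0) -> float:
    """Average time per synchronized step when each rank sees a random
    communication delay and everyone waits for the slowest one."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        slowest = 0.0
        for _ in range(ranks):
            # Assumed model: exponentially distributed delay per rank.
            delay = rng.expovariate(1.0 / jitter_mean_ms) if jitter_mean_ms > 0 else 0.0
            slowest = max(slowest, delay)
        total += compute_ms + slowest
    return total / steps

for ranks in (10, 100, 1000):
    t = mean_step_time(ranks, compute_ms=1.0, jitter_mean_ms=0.1)
    print(f"{ranks:>5} ranks: {t:.2f} ms per step ({1.0 / t:.0%} of time computing)")
```

Even with a small average delay, the step time is set by the worst case across all ranks, so the penalty grows with system size. That is why consistent, low-jitter communication matters more as core counts climb.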
Solutions are emerging in the industry to handle network congestion. While data center networks have had basic congestion notification mechanisms (such as ECN) for a while, these technologies didn’t react to congestion at the speed required to handle high-performance traffic. New congestion management technologies that dynamically mitigate or prevent network traffic interference at HPC-relevant speeds are being brought to market now, with HPE’s Slingshot as one example. These fabrics will enable reliable quality of service on high-performance traffic at the system level, which will keep communications flowing in heavily used systems.
HPC interconnect solutions that address demands for higher speed and lower latency are an important component in turning a collection of compute elements into a fast system. And as scalable multinode workloads, particularly AI, become more common in as-a-service deployments in the data center, exascale-style interconnects will be needed there as well to maintain good performance guarantees.
Coding for performance
Organizations that do technical computing and AI typically invest heavily in intellectual capital: the scientists, engineers, and data scientists who bring creativity and insight to the organization. Supercomputers are the tools these people need to do their jobs, but handing them a great hardware solution without thinking through, and investing in, the software development environment can sometimes do their productivity more harm than good. What use is a well-designed supercomputer if no one can effectively program it to solve your technical or business problems?
Thankfully, most technology vendors provide some level of software to help use their hardware effectively: compilers to optimize code, tuned versions of commonly used libraries, and tools to analyze performance and debug problems. However, care must be taken to avoid locking your developers into a vendor-specific solution, which could pose a major obstacle to moving to new and better technologies in the future.
The industry has made significant advancements in high-performance languages and frameworks, which can make porting between platforms a much easier task. This is particularly true in the AI space, where domain-specific frameworks are widely supported across nearly every hardware vendor. Compatibility with these open source projects is now seen as the price of entry for hardware vendors, while simultaneously making it easier for them to support high-performance workloads by not having to invent their programming paradigms from scratch.
This trend is also true of network-heavy data analytics use cases, with open source solutions such as Arkouda being developed to give users the tools necessary to focus on their technical outcomes instead of the logistics of full-system programming.
Perhaps more important, while prior HPC-specific language efforts focused on making their languages friendly to newcomers, these new efforts are bringing HPC performance capabilities to the languages most users already work with, such as Python. This, more than any hardware advancement, is enabling the democratization of HPC technologies all the way down to edge and mobile devices.
And here we are
The exascale era is arriving soon, with new technologies set to deliver advancements across energy, climate, healthcare, and anywhere else science and engineering can improve our world. What these efforts also show is that exascale computing technologies will contribute much more broadly to the coming age of digital transformation.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.