What is an HPC Cluster?
An HPC cluster, or high-performance computing cluster, is a combination of specialized hardware, including a group of large and powerful computers, and a distributed processing software framework configured to handle massive amounts of data at high speeds with parallel performance and high availability.
How do you build an HPC cluster?
While building an HPC cluster is fairly straightforward, it requires an organization to understand the level of compute power needed on a daily basis to determine the setup. You need to carefully assess questions such as: how many servers are required; what software layer can handle the workloads efficiently; where the cluster will be housed; and what the system’s power and cooling requirements are. Once these are decided, you can proceed with building the cluster, following the steps listed below:
- Build a compute node: Configure a head node by installing tools for monitoring and resource management as well as high-speed interconnect drivers/software. Create a shared cluster directory, capture an image of the compute node, and clone the image out to the rest of the cluster that will run the workloads.
- Configure IP addresses: For peak efficiency, HPC clusters contain a high-speed interconnect network that uses a dedicated IP subnet. As you connect worker nodes to the head node, you will assign additional IP addresses for each node.
- Configure jobs as CMU user groups: As workloads arrive in the queue, you will need a script to dynamically create CMU user groups for each currently running job.
What are the key components of an HPC cluster?
There are three basic components to an HPC cluster that each have different requirements: compute hardware, software, and facilities.
Compute hardware includes servers, storage, and a dedicated network. Typically, you will need to provision at least three servers that function as primary, worker, and client nodes. With such a limited setup, you’ll need to invest in high-end servers with ample processors and storage for more compute capacity in each. But you can scale that up by virtualizing multiple servers, making more compute power available to the cluster. The networking infrastructure to support them will require high-bandwidth TCP/IP network equipment, such as Gigabit Ethernet, NICs, and switches.
The software layer includes the tools you intend to use to monitor, provision, and manage your HPC cluster. Software stacks comprise libraries, compilers, debuggers, and file systems as well to execute cluster management functions. You may decide to adopt an HPC framework such as Hadoop, which carries out the same functions, but it’s fault-tolerant and can detect failed systems and redirect traffic to available systems automatically.
To house your HPC cluster, you need actual physical floor space to hold and support the weight of racks of servers, which can include up to 72 blade-style servers and five top-of-rack switches weighing in at up to 1,800 pounds. You also must have enough power to operate and cool the servers, which can demand up to 43 kW.
HPE and HPC clusters
HPE offers an industry-leading portfolio of HPC solutions to help organizations of all sizes improve efficiency, reduce downtime, and accelerate productivity.
HPE Performance Cluster Manager provides everything you need to manage your HPE cluster to keep it running at peak performance. With a comprehensive set of tools that are fully integrated for HPE HPC/AI systems, it’s a flexible, easy-to-use system management solution that’s been used by hundreds of customers around the globe for more than a decade. Scalable to manage systems of any size, from dozens of nodes up to Exascale in both on-premises and hybrid HPC environments, you can go to production in minutes and run health checks and tests on a regular schedule to make the most use of available resources.
HPE Slingshot is a modern high-performance interconnect for HPC and AI clusters that delivers industry-leading performance, bandwidth, and low latency for HPC, AI/ML, and data analytics applications. It tracks real-time information on load across each switch-to-switch path and dynamically re-routes traffic to balance load.
HPE GreenLake delivers the flexibility, scalability, and control you need for your HPC environment with a cloud service consumption model on-premises. And you can use our skilled experts to implement and operate the environment for you, helping you reduce the cost and complexity of maintaining your own HPC architecture.