Speed up time to market for AI

HPE Ezmeral Container Platform, Run:AI team up to enhance GPU utilization, accelerate AI initiatives

Solution brief

Demand for GPUs that data scientists use for deep learning (DL) and machine learning (ML) is on the rise. As demand increases, providing efficient and granular GPU access to teams of researchers running hundreds of jobs on many clusters over weeks (and even months) is quickly becoming a huge challenge.

Run:AI creates a virtualization and acceleration layer over GPU resources that manages granular scheduling, prioritization, and allocation of compute power for the HPE Ezmeral Container Platform. Run:AI provides a dedicated batch scheduler, running on top of HPE Ezmeral Container Platform, to manage GPU-based workloads. It includes mechanisms for creating multiple GPU queues, setting fixed and guaranteed resource quotas, and managing priorities, policies, and multi-node training.

This combination offers an efficient approach to simplifying complex ML scheduling and delivering GPU-as-a-service. Hewlett Packard Enterprise and Run:AI have partnered to simplify the orchestration of artificial intelligence (AI) workloads and improve the utilization of scarce GPU resources.
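
To make this concrete, the sketch below shows how a data scientist might queue a training job against a project quota using Run:AI's `runai submit` command from Python. The job name, image, and project are placeholders, and flag spellings should be verified against the installed CLI version with `runai submit --help`.

```python
# Minimal sketch: queuing a GPU-backed training job through the Run:AI
# CLI from Python. Assumes the `runai` CLI is installed and logged in;
# the job name, image, and project below are placeholders.
import subprocess

def submit_training_job(name: str, image: str, project: str, gpus: float) -> None:
    """Queue a batch job under a Run:AI project quota."""
    cmd = [
        "runai", "submit", name,
        "--image", image,      # container holding the training code
        "--gpu", str(gpus),    # whole or fractional GPU count
        "--project", project,  # project determines quota and priority
    ]
    subprocess.run(cmd, check=True)

submit_training_job("bert-finetune", "registry.example.com/train:latest",
                    project="nlp-research", gpus=2)
```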

  • Remove resource limitations

    The HPE Ezmeral Container Platform and Run:AI GPU orchestration solution enables dynamic provisioning of GPUs so that resources can be easily shared, efficiently orchestrated, and optimally used. With HPE Ezmeral Container Platform and Run:AI, end users such as data scientists can seamlessly and efficiently consume large amounts of GPU resources to accelerate their research.

Figure 1. Run:AI and HPE Ezmeral Container Platform help optimize GPU orchestration

  • Achieve model accuracy faster

    By automating workload scheduling and applying policies and priorities that dynamically allocate resources, workloads created on the HPE Ezmeral Container Platform can efficiently tap large pools of GPU resources. Run:AI applies distributed computing principles, enabling quota-based GPU allocation across projects so teams reach results faster.
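
As an illustration of the distributed pattern such quotas feed, here is a minimal multi-GPU data-parallel training loop in PyTorch (one of the frameworks named later in this brief). The model and data are toy stand-ins; the script assumes it is launched with `torchrun`, which sets the `LOCAL_RANK` environment variable.

```python
# Illustrative sketch of multi-node data-parallel training in PyTorch,
# the pattern that benefits from pooled GPUs. Launch with `torchrun`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(512, 10).to(device)    # toy stand-in model
    model = DDP(model, device_ids=[local_rank])    # sync gradients across ranks
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):                           # toy training loop
        x = torch.randn(32, 512, device=device)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()    # all-reduce across every allocated GPU
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```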

  • Manage GPU infrastructure based on business goals

    Run:AI’s quota and fairness mechanisms enable users to control, prioritize, and align computing needs with business goals. HPE Ezmeral Container Platform can use Run:AI’s advanced scheduling, queueing mechanisms, and automatic priority-based preemption of jobs to help IT organizations meet service-level agreements (SLAs) tied to business objectives.
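
The toy scheduler below illustrates the general idea of priority-based preemption; it is a conceptual sketch, not Run:AI's actual algorithm. A high-priority job that cannot fit on free GPUs evicts lower-priority running jobs, which return to the queue.

```python
# Conceptual toy, not Run:AI's implementation: priority-based GPU
# scheduling with preemption, the mechanism that lets SLA-bound jobs
# displace lower-priority work when GPUs are scarce.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                       # lower value = higher priority
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class ToyScheduler:
    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.queue: list[Job] = []      # pending jobs, kept as a min-heap
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.queue, job)
        self._schedule()

    def _schedule(self) -> None:
        while self.queue:
            job = self.queue[0]
            # Preempt lower-priority running jobs if that frees enough GPUs.
            victims = sorted(self.running, key=lambda j: j.priority, reverse=True)
            while (self.free < job.gpus and victims
                   and victims[0].priority > job.priority):
                v = victims.pop(0)
                self.running.remove(v)
                self.free += v.gpus
                heapq.heappush(self.queue, v)   # preempted job is requeued
                print(f"preempted {v.name}")
            if self.free < job.gpus:
                break                           # head of queue must wait (no backfill)
            heapq.heappop(self.queue)
            self.running.append(job)
            self.free -= job.gpus
            print(f"started {job.name} on {job.gpus} GPUs")

sched = ToyScheduler(total_gpus=4)
sched.submit(Job(priority=5, name="batch-train", gpus=4))
sched.submit(Job(priority=1, name="sla-inference", gpus=2))  # preempts batch-train
```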

  • Gain granular control over GPU resources and reduce costs

    Run:AI’s platform includes a holistic, multicluster view of how resources are consumed and jobs are orchestrated, with monitoring of metrics such as GPU utilization, workload run time, wait time, and more. These metrics help improve operational efficiency and inform scaling decisions.
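
Dashboards like the one shown in Figure 2 aggregate per-node readings of exactly these kinds of metrics. As a node-level illustration, NVIDIA's NVML Python bindings can report GPU utilization and memory use directly:

```python
# Sketch of node-level metric collection with NVIDIA's NVML bindings
# (pip install nvidia-ml-py). Cluster-wide dashboards aggregate
# readings like these per node, project, and job.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        print(f"GPU {i}: {util.gpu}% busy, "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB used")
finally:
    pynvml.nvmlShutdown()
```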

Figure 2. Run:AI analytics dashboard

  • Benefits

    Current approaches to orchestrating AI/ML workloads rely on static resource allocation and cannot schedule fine-grained access to GPUs. Together, HPE Ezmeral Container Platform and Run:AI provide efficient GPU resource allocation that enables:

    • Efficient use of HPE Ezmeral Container Platform GPU resources—jobs run on as many GPUs as they need, based on availability across the entire environment, with predefined quotas and elastic scaling of GPU resources
    • Simplified GPU sharing—dynamic resource allocation enables IT organizations to easily create and update GPU resource allocation policies for HPE Ezmeral Container Platform workloads
    • Fractional GPU—multiple workloads can share a single GPU for more efficient resource utilization, allowing users to perform more tasks on the same GPU hardware (see the sketch after this list)
    • Automated job scheduling—jobs can run concurrently if there are available resources, greatly reducing the time for training tasks such as hyperparameter tuning
    • Granular monitoring of GPU usage—by cluster, node, project, and job, giving HPE Ezmeral Container Platform users detailed insights and data to better manage and profile their GPU workloads
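
As a sketch of the fractional-GPU item above, the snippet below builds a pod that asks the Run:AI scheduler for half a GPU using the Kubernetes Python client. The `gpu-fraction` annotation and `runai-scheduler` scheduler name follow Run:AI's public documentation, but treat them, along with the image and namespace, as assumptions to confirm against your deployment.

```python
# Hedged sketch: requesting half a GPU for a pod via the Kubernetes
# Python client. The `gpu-fraction` annotation and `runai-scheduler`
# name are assumptions drawn from Run:AI's public docs; verify both
# against your deployed version. Image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="half-gpu-notebook",
        annotations={"gpu-fraction": "0.5"},    # share one GPU between two pods
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",       # delegate placement to Run:AI
        containers=[client.V1Container(
            name="notebook",
            image="registry.example.com/jupyter-gpu:latest",  # placeholder image
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="team-a", body=pod)
```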

    HPE Ezmeral Machine Learning Ops addresses the entire ML pipeline, from data preparation to model building, training, deployment, and monitoring. Together with Run:AI, HPE Ezmeral Machine Learning Ops customers can deliver AI solutions to market faster by maximizing the utilization of their GPU clusters. DL workloads running within HPE Ezmeral Machine Learning Ops on frameworks such as TensorFlow and PyTorch use GPUs to reach model accuracy faster, while Run:AI enables many users to share those GPU resources concurrently and optimally.
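
From the data scientist's point of view, no special code is required to benefit from this sharing. A training script uses ordinary framework device calls, as in this minimal PyTorch sketch, and the platform decides which physical GPU (or fraction of one) backs the `cuda` device:

```python
# Minimal sketch: a training script needs only standard framework
# calls; the platform and Run:AI decide what hardware backs `cuda`.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(784, 10).to(device)   # toy stand-in for a real model
batch = torch.randn(64, 784, device=device)
print(f"running on {device}: output shape {model(batch).shape}")
```

Because placement is handled by the scheduler rather than the script, the same code runs unchanged whether it receives a fractional GPU, a full device, or several.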
