Helps organizations manage Graphics Processing Unit (GPU) resource allocation and increase cluster utilization. Applies advanced scheduling mechanisms to set policies dynamically and orchestrate jobs, giving IT Ops teams and data scientists optimal use of resources.
Run:AI Orchestration Platform
HPE Ezmeral Runtime Version:
- Additional Information
Run:AI’s software builds on powerful distributed computing and scheduling concepts and is implemented as a simple plugin to HPE Ezmeral Runtime. Together, HPE Ezmeral Runtime and the Run:AI GPU orchestration solution enable dynamic provisioning of GPUs, so that resources can be shared easily, AI/ML workloads are orchestrated more efficiently, and resource use is optimized. With Run:AI, data scientists can seamlessly consume massive amounts of GPU power to accelerate their research.
Run:AI creates a virtualization and acceleration layer over GPU resources that manages granular scheduling, prioritization, and allocation of compute power. A dedicated batch scheduler, running on top of HPE Ezmeral Runtime, manages GPU-based workloads and includes mechanisms for creating multiple queues, setting fixed and guaranteed resource quotas, and managing priorities, policies, and multi-node training. The result is a simpler way to handle complex ML scheduling processes.
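To make the idea of guaranteed quotas over a shared pool more concrete, here is a minimal Python sketch. It is not Run:AI's scheduler or API; the names `Job`, `schedule`, `quotas`, and the two-pass policy are hypothetical and chosen only for illustration. The first pass admits jobs that fit inside their project's guaranteed quota, and the second pass lets the remaining jobs borrow whatever GPUs are still idle.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description, for illustration only.
    name: str
    project: str
    gpus: int
    priority: int = 0  # higher priority is considered first


def schedule(jobs, quotas, total_gpus):
    """Return (running, pending) lists of job names.

    Pass 1 admits jobs that stay within their project's guaranteed quota.
    Pass 2 lets the remaining jobs borrow GPUs that are still idle.
    """
    free = total_gpus
    used = {project: 0 for project in quotas}
    running, waiting = [], []

    # sorted() is stable, so equal-priority jobs keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        within_quota = used[job.project] + job.gpus <= quotas[job.project]
        if within_quota and job.gpus <= free:
            used[job.project] += job.gpus
            free -= job.gpus
            running.append(job.name)
        else:
            waiting.append(job)

    pending = []
    for job in waiting:
        if job.gpus <= free:  # over quota, but idle GPUs are available
            used[job.project] += job.gpus
            free -= job.gpus
            running.append(job.name)
        else:
            pending.append(job.name)
    return running, pending


if __name__ == "__main__":
    jobs = [
        Job("train-a", "vision", 2),
        Job("train-b", "vision", 4),
        Job("tune-c", "nlp", 2),
        Job("big-d", "nlp", 4),
    ]
    print(schedule(jobs, quotas={"vision": 4, "nlp": 4}, total_gpus=8))
```

In this toy run, `train-b` exceeds the `vision` quota but still starts because idle GPUs remain in the pool, while `big-d` stays pending once the pool is exhausted. A production scheduler would also need preemption to reclaim borrowed GPUs, which this sketch leaves out.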
Current standards for orchestrating AI workloads rely on static resource allocations and lack the ability to schedule dynamic access to GPUs. Run:AI provides GPU resource optimization that enables:
● Efficient use of resources – jobs run on as many GPUs as they need, based on availability across the entire environment, with each effectively receiving a ‘guaranteed quota’ of compute resources from a shared pool
● Simplified GPU sharing – dynamic resource allocation removes the hassles of static allocation
● Fractional GPU – multiple workloads can share a single GPU for more efficient resource utilization
● Automated job scheduling – jobs run concurrently as long as resources are available, greatly reducing the time for training tasks such as hyperparameter tuning
● Granular monitoring of GPU usage – by cluster, node, project, and job
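The fractional GPU idea in the list above can be illustrated with a short Python sketch. This is a simple first-fit placement under stated assumptions, not Run:AI's actual mechanism: the function name `pack_fractional` and the choice to express requests as whole percentages of one GPU are hypothetical.

```python
def pack_fractional(requests, num_gpus):
    """First-fit placement of fractional GPU requests.

    requests: list of (workload_name, percent), where percent is the share
              of a single GPU requested (1 to 100). Integers avoid
              floating-point rounding when shares are summed.
    Returns (placement, unplaced): placement maps a GPU index to the
    workloads sharing that GPU; unplaced lists workloads that did not fit.
    """
    free = [100] * num_gpus
    placement = {gpu: [] for gpu in range(num_gpus)}
    unplaced = []
    for name, percent in requests:
        for gpu in range(num_gpus):
            if percent <= free[gpu]:
                free[gpu] -= percent
                placement[gpu].append(name)
                break
        else:
            # No GPU had enough remaining capacity for this workload.
            unplaced.append(name)
    return placement, unplaced


if __name__ == "__main__":
    requests = [
        ("notebook-1", 50),
        ("notebook-2", 25),
        ("inference-1", 50),
        ("notebook-3", 25),
        ("train-1", 100),
    ]
    print(pack_fractional(requests, num_gpus=2))
```

Here three notebooks end up sharing the first GPU, the inference workload takes half of the second, and the full-GPU training job has to wait. With static whole-GPU allocation, the same two GPUs would have served only two of these workloads.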