NVIDIA Triton Inference Server
NVIDIA Triton™ Inference Server simplifies the deployment of AI models at scale in production. It is open-source inference serving software that lets teams deploy trained deep learning and machine learning models from any framework (TensorFlow, NVIDIA® TensorRT®, PyTorch, ONNX Runtime, OpenVINO, XGBoost, or custom) on any GPU- or CPU-based infrastructure (cloud, data center, or edge) with ease and high performance.
- Product Name: NVIDIA Triton Inference Server
- HPE Ezmeral Runtime Version: 5.3
- Product Version: NA (monthly releases)
- Product Category: AI Inference Serving (model serving)
- Last updated: August 2021
NVIDIA Triton Inference Server is open-source inference serving software that simplifies inference serving across an organization. Triton provides a single standardized inference platform that supports models from multiple frameworks, runs them on both CPUs and GPUs, and works across deployment environments such as data centers, cloud, embedded devices, and virtualized environments.
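As an illustration, the sketch below sends a single HTTP inference request with the tritonclient Python package. The model name "my_model", the tensor names INPUT/OUTPUT, and the input shape are placeholders; they must match whatever model is actually in your Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server exposing its HTTP endpoint (default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder tensor name, shape, and dtype; adapt to the model's config.pbtxt.
infer_input = httpclient.InferInput("INPUT", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(
    np.random.rand(1, 3, 224, 224).astype(np.float32)
)

result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(result.as_numpy("OUTPUT").shape)
```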
- It natively supports multiple framework backends such as TensorFlow, PyTorch, ONNX Runtime, Python, and OpenVINO, as well as custom backends.
- It supports different types of inference queries through advanced batching and scheduling algorithms, supports live model updates, and runs models on both CPUs and GPUs.
- It increases inference performance by maximizing hardware utilization through concurrent model execution and dynamic batching (see the configuration sketch after this list).
- Concurrent execution allows you to run multiple copies of a model, and multiple different models, in parallel on the same GPU or CPU.
- Triton ensembles represent a pipeline of one or more models and the connections of input and output tensors between those models. Triton manages execution of the entire pipeline with just a single inference request to the ensemble from the client application; an illustrative ensemble configuration is sketched after this list.
- In addition to the popular AI backends, Triton also supports execution of custom C++ and Python backends. These are useful for implementing special logic such as pre- and post-processing, or even regular models (a minimal Python backend is sketched after this list).
- Triton has a model control API that can be used to dynamically load and unload models, so applications can bring models into memory only when they are needed. Also, when a model is retrained with new data, Triton can deploy it for inference seamlessly, without application restarts or disruption to the service (see the load/unload example after this list).
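Concurrent model execution and dynamic batching are enabled per model in its config.pbtxt. The sketch below is illustrative only: the model name, backend platform, instance count, and batch sizes are assumptions to adapt to your model.

```
# config.pbtxt (illustrative): two GPU instances of the model run in parallel,
# and incoming requests are dynamically batched up to max_batch_size.
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```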
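An ensemble is also declared in a config.pbtxt, using the "ensemble" platform to wire the output tensors of one model into the inputs of the next. The sketch below chains a hypothetical "preprocess" model into a hypothetical "classifier" model; all model names, tensor names, shapes, and data types are placeholders.

```
name: "preprocess_then_classify"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "RAW_INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "CLASS_PROBS"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map {
        key: "INPUT"
        value: "RAW_INPUT"
      }
      output_map {
        key: "OUTPUT"
        value: "preprocessed"
      }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map {
        key: "INPUT"
        value: "preprocessed"
      }
      output_map {
        key: "OUTPUT"
        value: "CLASS_PROBS"
      }
    }
  ]
}
```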
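A custom Python backend is a model.py file that implements the TritonPythonModel class. The minimal sketch below performs a softmax post-processing step; the tensor names and the post-processing logic are illustrative assumptions, not part of any particular model.

```python
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Illustrative post-processing model for the Triton Python backend."""

    def initialize(self, args):
        # args["model_config"] is the model's config.pbtxt as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT" / "OUTPUT" are assumed tensor names from config.pbtxt.
            scores = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()

            # Example post-processing: softmax over the last dimension.
            exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
            probs = exp / exp.sum(axis=-1, keepdims=True)

            out_tensor = pb_utils.Tensor("OUTPUT", probs.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass
```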
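The model control API can be invoked from the tritonclient package (or via the equivalent HTTP/gRPC endpoints) when the server is started in explicit model control mode. The model name below is a placeholder for a directory in the server's model repository.

```python
import tritonclient.http as httpclient

# The server must be started with --model-control-mode=explicit
# for load/unload requests to be accepted.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Load the model on demand, confirm it is ready, then unload it
# when the application no longer needs it.
client.load_model("my_model")
print(client.is_model_ready("my_model"))
client.unload_model("my_model")
```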
Explore the industry’s first enterprise-grade container platform for cloud-native and distributed non-cloud-native applications, the HPE Ezmeral Container Platform.
Interested in learning more about the HPE Ezmeral Container Platform and NVIDIA? Please contact us to learn more.