NVIDIA Triton™ Inference Server simplifies the deployment of AI models at scale in production. It is open-source inference serving software that lets teams deploy trained deep learning and machine learning models from any framework (TensorFlow, NVIDIA® TensorRT®, PyTorch, ONNX Runtime, OpenVINO, XGBoost, or custom) on any GPU- or CPU-based infrastructure (cloud, data center, or edge) with high performance.

As open-source inference serving software, Triton gives an organization a single, standardized inference platform. It runs inference on models from multiple frameworks, on both CPUs and GPUs, and in different deployment environments such as data centers, clouds, embedded devices, and virtualized environments.

  • It natively supports multiple framework backends such as TensorFlow, PyTorch, ONNX Runtime, Python, and OpenVINO, as well as custom backends.
  • It handles different types of inference queries through advanced batching and scheduling algorithms, supports live model updates, and runs models on both CPUs and GPUs.
  • It increases inference performance by maximizing hardware utilization through concurrent model execution and dynamic batching.
  • Concurrent execution allows you to run multiple copies of a model, and multiple different models, in parallel on the same GPU or CPU.
  • A Triton ensemble represents a pipeline of one or more models and the connections between their input and output tensors. The client application sends a single inference request to the ensemble, and Triton manages the execution of the entire pipeline.
  • In addition to the popular AI backends, Triton also supports custom C++ and Python backends. These are useful for implementing special logic such as pre- and post-processing, or even regular models.
  • Triton has a model control API that can be used to dynamically load and unload models, so a device holds a model only while the application needs it. When a model is retrained with new data, Triton can also deploy the new version for inference without restarting the application or disrupting the service.
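
The model control API mentioned in the last bullet can be sketched from the client side as follows. This is a minimal illustration, not an official client: it assumes Triton was started with `--model-control-mode=explicit` and that its HTTP endpoint is reachable at `localhost:8000`. The model name `resnet50` and the helper names `control_request` and `send` are hypothetical and chosen only for this example. The sketch uses only the Python standard library and targets the `v2/repository/models/<name>/load` and `.../unload` endpoints.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Triton's HTTP endpoint on its default port, on the local machine.
BASE_URL = "http://localhost:8000"


def control_request(model_name: str, action: str,
                    base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that loads or unloads a model."""
    if action not in ("load", "unload"):
        raise ValueError(f"action must be 'load' or 'unload', got {action!r}")
    # Percent-encode the model name so it is safe to place in the URL path.
    name = urllib.parse.quote(model_name, safe="")
    url = f"{base_url}/v2/repository/models/{name}/{action}"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send a prepared request and return the HTTP status code.

    Raises urllib.error.URLError if the server is unreachable and
    urllib.error.HTTPError if the server rejects the request.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With a running server, `send(control_request("resnet50", "load"))` would ask Triton to load (or reload) the model from its repository, and the same call with `"unload"` would release it. Keeping request construction separate from sending makes it possible to inspect the URL and payload without a live server.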
