name: triton-inference-server description: "NVIDIA Triton Inference Server — model repository, config.pbtxt, ensemble/BLS pipelines, backends (TensorRT/ONNX/Python), dynamic batching, model management API, perf_analyzer. Use when serving models with Triton Inference Server. NOT for K8s deployment patterns. NOT for NIM."
Triton Inference Server
Multi-framework model serving platform. Triton's value is the model repository abstraction, pipeline composition, and fine-grained serving controls — not just running a single model.
Model Repository Structure
model_repository/
├── text_encoder/
│ ├── config.pbtxt
│ ├── 1/
│ │ └── model.onnx # version 1
│ └── 2/
│ └── model.onnx # version 2 (latest by default)
├── image_classifier/
│ ├── config.pbtxt
│ └── 1/
│ └── model.plan # TensorRT engine
├── preprocessing/
│ ├── config.pbtxt
│ └── 1/
│ └── model.py # Python backend
└── ensemble_pipeline/
├── config.pbtxt # no model file — orchestration only
└── 1/
└── (empty)
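The layout above is strict: each model gets a directory named after it, a config.pbtxt, and numeric version subdirectories. A minimal sketch that scaffolds this structure with pathlib (the helper name and the touch-only file creation are illustrative, not a Triton API):

```python
from pathlib import Path

def add_model(repo, name, version, model_file=None):
    """Scaffold one model entry: <repo>/<name>/config.pbtxt and <repo>/<name>/<version>/."""
    model_dir = repo / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # config.pbtxt sits next to the version directories, not inside them
    (model_dir / "config.pbtxt").touch()
    if model_file is not None:  # ensembles keep an empty version directory
        (version_dir / model_file).touch()
    return model_dir

repo = Path("model_repository")
add_model(repo, "text_encoder", 1, "model.onnx")
add_model(repo, "text_encoder", 2, "model.onnx")
add_model(repo, "ensemble_pipeline", 1, None)
```

Triton refuses to load models whose version directories are missing or non-numeric, so a scaffolding step like this catches layout mistakes before server startup.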
Repository sources — local path, S3, GCS, or Azure Blob:
# Local
tritonserver --model-repository=/models
# S3
tritonserver --model-repository=s3://bucket/models
# Multiple repos (merged namespace)
tritonserver \
--model-repository=/models/core \
--model-repository=s3://bucket/experimental
config.pbtxt Essentials
Minimal Config (TensorRT)
# image_classifier/config.pbtxt
platform: "tensorrt_plan"
max_batch_size: 32
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
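One gotcha in this config: when max_batch_size is greater than zero, dims exclude the batch dimension, so the tensor a client actually sends is [batch, *dims]. A trivial sketch of the arithmetic:

```python
def full_shape(batch_size, dims):
    # config.pbtxt dims [3, 224, 224] + batch of 8 -> wire shape [8, 3, 224, 224]
    return [batch_size] + list(dims)

print(full_shape(8, [3, 224, 224]))  # [8, 3, 224, 224]
```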
ONNX with Optimization
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
{ name: "input_ids" data_type: TYPE_INT64 dims: [ -1 ] }
{ name: "attention_mask" data_type: TYPE_INT64 dims: [ -1 ] }
]
output [
{ name: "logits" data_type: TYPE_FP32 dims: [ -1, 768 ] }
]
# Use TensorRT EP for GPU acceleration within ONNX Runtime
optimization {
execution_accelerators {
gpu_execution_accelerator: [
{ name: "tensorrt" }
]
}
}
Python Backend
# preprocessing/config.pbtxt
backend: "python"
max_batch_size: 0 # model handles batching internally
input [
{ name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
{ name: "TOKENS" data_type: TYPE_INT64 dims: [ -1 ] }
]
instance_group [
{ count: 2 kind: KIND_CPU }
]
Python model file (1/model.py):
import triton_python_backend_utils as pb_utils
import numpy as np
from transformers import AutoTokenizer
class TritonPythonModel:
def initialize(self, args):
self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def execute(self, requests):
responses = []
for request in requests:
raw = pb_utils.get_input_tensor_by_name(request, "RAW_TEXT")
text = raw.as_numpy().flatten()[0].decode("utf-8")
tokens = self.tokenizer.encode(text, return_tensors="np").astype(np.int64)
out = pb_utils.Tensor("TOKENS", tokens)
responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
return responses
Auto-Generated Config
For supported backends (ONNX, TensorRT, TF SavedModel), skip config.pbtxt:
tritonserver --model-repository=/models --strict-model-config=false
Note: recent releases enable auto-complete by default, and --strict-model-config is deprecated in favor of --disable-auto-complete-config.
Retrieve the generated config: curl localhost:8000/v2/models/<name>/config
Dynamic Batching
Triton accumulates requests and batches them for GPU efficiency:
max_batch_size: 64
dynamic_batching {
preferred_batch_size: [ 16, 32 ] # target these sizes
max_queue_delay_microseconds: 100 # wait up to 100µs to fill batch
priority_levels: 2 # priorities 1..2; 1 is served first, 0 in a request means "use default"
default_priority_level: 1
}
Padding inputs — when inputs vary in length, enable ragged batching or pad:
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1 ] # -1 = variable dimension
allow_ragged_batch: true
}
]
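When ragged batching is not enabled, clients must pad variable-length inputs to a common length themselves before sending. A plain-Python sketch of client-side padding (pad id 0 and the attention-mask convention are assumptions matching typical tokenizer output, not a Triton API):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to the batch max length; return ids + attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2009, 102], [101, 102]])
print(ids)   # [[101, 2009, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

With allow_ragged_batch: true, Triton concatenates the unpadded inputs instead and the backend must understand the ragged layout, so padding stays necessary for backends that don't.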
Sequence Batching
For stateful models (RNNs, streaming ASR, chatbots):
max_batch_size: 8
sequence_batching {
max_sequence_idle_microseconds: 5000000 # 5s timeout
control_input [
{
name: "START"
control [ { kind: CONTROL_SEQUENCE_START fp32_false_true: [ 0, 1 ] } ]
},
{
name: "END"
control [ { kind: CONTROL_SEQUENCE_END fp32_false_true: [ 0, 1 ] } ]
},
{
name: "CORRID"
control [ { kind: CONTROL_SEQUENCE_CORRID data_type: TYPE_UINT64 } ]
}
]
# Direct strategy: each sequence pinned to a slot
direct { }
}
Instance Groups and Resource Management
# 2 instances on GPU 0, 1 on GPU 1, 1 CPU fallback
instance_group [
{ count: 2 kind: KIND_GPU gpus: [ 0 ] },
{ count: 1 kind: KIND_GPU gpus: [ 1 ] },
{ count: 1 kind: KIND_CPU }
]
Rate limiting across models sharing a GPU:
rate_limiter {
resources [
{ name: "GPU_MEMORY" global: true count: 1 }
]
}
Ensemble Models
Chain models as a DAG without custom code:
# ensemble_pipeline/config.pbtxt
platform: "ensemble"
max_batch_size: 0
input [ { name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] } ]
output [ { name: "LOGITS" data_type: TYPE_FP32 dims: [ -1 ] } ]
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map { key: "RAW_TEXT" value: "RAW_TEXT" }
output_map { key: "TOKENS" value: "preprocessed_tokens" }
},
{
model_name: "text_encoder"
model_version: -1
input_map { key: "input_ids" value: "preprocessed_tokens" }
output_map { key: "logits" value: "LOGITS" }
}
]
}
When to use ensembles vs BLS:
- Ensemble: Simple linear/DAG pipelines, no conditionals, Triton manages data flow
- BLS: Loops, if/else branching, dynamic model selection, async fan-out
Business Logic Scripting (BLS)
BLS runs inside a Python backend model and can invoke other Triton models:
import triton_python_backend_utils as pb_utils
import asyncio, numpy as np
class TritonPythonModel:
async def execute(self, requests):
responses = []
for req in requests:
image = pb_utils.get_input_tensor_by_name(req, "IMAGE")
# Step 1: classify
cls_req = pb_utils.InferenceRequest(
model_name="classifier",
requested_output_names=["label", "confidence"],
inputs=[image])
cls_resp = await cls_req.async_exec()
label = pb_utils.get_output_tensor_by_name(cls_resp, "label").as_numpy()
confidence = pb_utils.get_output_tensor_by_name(cls_resp, "confidence").as_numpy()
# Step 2: conditional routing
if confidence[0] < 0.7:
detail_req = pb_utils.InferenceRequest(
model_name="detailed_classifier",
requested_output_names=["label"],
inputs=[image])
detail_resp = await detail_req.async_exec()
label = pb_utils.get_output_tensor_by_name(detail_resp, "label").as_numpy()
out = pb_utils.Tensor("RESULT", label)
responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
return responses
BLS caveat: A model cannot call itself (deadlock). Use async_exec for parallel fan-out.
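The fan-out pattern is ordinary asyncio.gather over the awaitables that async_exec returns. In outline (the coroutines below are stand-ins for pb_utils.InferenceRequest(...).async_exec(), which only exists inside a running Python-backend model):

```python
import asyncio

async def fake_infer(model_name, delay):
    # Stand-in for awaiting one downstream Triton model via async_exec()
    await asyncio.sleep(delay)
    return f"{model_name}:ok"

async def fan_out():
    # Launch both sub-model requests concurrently, then await both results
    results = await asyncio.gather(
        fake_infer("classifier_a", 0.01),
        fake_infer("classifier_b", 0.01),
    )
    return results

print(asyncio.run(fan_out()))  # ['classifier_a:ok', 'classifier_b:ok']
```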
Model Management API
# Explicit model control mode
tritonserver --model-repository=/models --model-control-mode=explicit
# Load / unload / reload
curl -X POST localhost:8000/v2/repository/models/my_model/load
curl -X POST localhost:8000/v2/repository/models/my_model/unload
# Load with override parameters (e.g., switch model file)
curl -X POST localhost:8000/v2/repository/models/my_model/load \
-d '{"parameters": {"config": "{\"backend\": \"onnxruntime\", \"default_model_filename\": \"model_v2.onnx\"}"}}'
# Repository index (list all models + status)
curl -X POST localhost:8000/v2/repository/index
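The config override in the load call is JSON embedded inside JSON, which is easy to mis-escape by hand. A small sketch that builds the payload programmatically (the field values mirror the curl example above):

```python
import json

# The "config" parameter is itself a JSON *string*, so it is encoded twice.
config_override = {"backend": "onnxruntime", "default_model_filename": "model_v2.onnx"}
payload = json.dumps({"parameters": {"config": json.dumps(config_override)}})
print(payload)

# Decoding reverses both layers:
inner = json.loads(json.loads(payload)["parameters"]["config"])
```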
Version policy in config:
version_policy {
specific { versions: [ 1, 3 ] } # serve only versions 1 and 3
# OR: latest { num_versions: 2 } # serve latest 2 versions
# OR: all { } # serve all versions
}
Health, Metrics, and Profiling
Endpoints
| Endpoint | Purpose |
|---|---|
| GET /v2/health/live | Server liveness |
| GET /v2/health/ready | Server readiness (models loaded) |
| GET /v2/models/{name}/ready | Specific model readiness |
| GET /metrics | Prometheus metrics (port 8002) |
Key Metrics
# Request throughput & latency
nv_inference_request_success, nv_inference_request_failure
nv_inference_request_duration_us (histogram)
# Batching efficiency
nv_inference_exec_count # actual batch executions
nv_inference_request_count # individual requests in those batches
# Queue time
nv_inference_queue_duration_us
# GPU utilization
nv_gpu_utilization, nv_gpu_memory_used_bytes
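A useful number derived from these counters: average batch size is nv_inference_request_count divided by nv_inference_exec_count; values near 1 mean dynamic batching is not kicking in. A sketch of the arithmetic (the counter values are made up):

```python
def avg_batch_size(request_count, exec_count):
    # requests per actual model execution; higher = better GPU utilization
    return request_count / exec_count if exec_count else 0.0

# e.g. 12,000 requests served in 1,500 executions -> average batch of 8
print(avg_batch_size(12_000, 1_500))  # 8.0
```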
perf_analyzer
# Basic throughput test (the gRPC endpoint on 8001 requires -i grpc; default is HTTP on 8000)
perf_analyzer -m my_model -u localhost:8001 -i grpc --concurrency-range 1:16:2
# Specify input shape
perf_analyzer -m text_encoder -u localhost:8001 -i grpc \
--shape input_ids:128 --shape attention_mask:128 \
-b 8 --concurrency-range 4:32:4
# Measure latency percentiles
perf_analyzer -m my_model -u localhost:8001 -i grpc \
--percentile=95 --measurement-interval 10000
# Generate CSV report
perf_analyzer -m my_model -u localhost:8001 -i grpc \
-f results.csv --concurrency-range 1:64:4
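The CSV report can be post-processed to find the concurrency that maximizes throughput. A sketch with stdlib csv; the column names here ("Concurrency", "Inferences/Second") and the sample rows are assumptions about the file layout, so check them against your results.csv header:

```python
import csv
import io

# Hypothetical results.csv contents; real perf_analyzer output has more columns.
sample = """Concurrency,Inferences/Second,p95 latency (us)
1,210.4,6100
4,780.2,6800
8,1350.9,7900
"""

def best_concurrency(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda r: float(r["Inferences/Second"]))
    return int(best["Concurrency"]), float(best["Inferences/Second"])

print(best_concurrency(sample))  # (8, 1350.9)
```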
Client SDK (Python)
HTTP
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(url="localhost:8000")
# Check server/model
assert client.is_server_ready()
assert client.is_model_ready("text_encoder")
# Infer
inputs = [httpclient.InferInput("input_ids", [1, 128], "INT64")]
inputs[0].set_data_from_numpy(np.ones([1, 128], dtype=np.int64))
outputs = [httpclient.InferRequestedOutput("logits")]
result = client.infer("text_encoder", inputs, outputs=outputs)
logits = result.as_numpy("logits")
gRPC (streaming)
import numpy as np
import tritonclient.grpc as grpcclient
client = grpcclient.InferenceServerClient(url="localhost:8001")
inputs = [grpcclient.InferInput("input_ids", [1, 128], "INT64")]
inputs[0].set_data_from_numpy(np.ones([1, 128], dtype=np.int64))
result = client.infer("text_encoder", inputs)
logits = result.as_numpy("logits")
For async streaming (e.g., decoupled models):
def callback(result, error):
if error:
print(f"Error: {error}")
else:
print(result.as_numpy("output"))
client.start_stream(callback=callback)
client.async_stream_infer("streaming_model", inputs)
# ... later
client.stop_stream()
Backend Quick Reference
| Backend | platform / backend | Model file | Notes |
|---|---|---|---|
| TensorRT | tensorrt_plan | model.plan | Fastest GPU inference; engine tied to specific GPU arch |
| ONNX Runtime | onnxruntime_onnx | model.onnx | Cross-platform; supports TRT EP acceleration |
| PyTorch | pytorch_libtorch | model.pt | TorchScript only; see naming conventions |
| TensorFlow | tensorflow_savedmodel | model.savedmodel/ | SavedModel or GraphDef |
| Python | backend: "python" | model.py | Custom pre/post-processing, BLS |
| vLLM | backend: "vllm" | uses HF model | LLM serving via Triton; see vllm backend docs |
| OpenVINO | backend: "openvino" | model.xml | Intel CPU/GPU optimized |
Launching Triton (Container)
# Mount model repository and run
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/model_repository:/models \
nvcr.io/nvidia/tritonserver:25.12-py3 \
tritonserver --model-repository=/models
# With explicit model control (don't auto-load all models)
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/model_repository:/models \
nvcr.io/nvidia/tritonserver:25.12-py3 \
tritonserver --model-repository=/models \
--model-control-mode=explicit \
--load-model=text_encoder \
--load-model=preprocessing
Cross-References
- tensorrt-llm — TRT-LLM backend for optimized LLM inference on Triton
- nvidia-nim — NIM packages Triton with pre-configured model profiles
- vllm — Alternative inference backend; can run alongside Triton in mixed deployments
- nvidia-dynamo — Dynamo orchestration layer can front Triton-served models
References
- references/backends-and-optimization.md — Backend-specific tuning, TensorRT EP options, model warmup, response cache
- references/kubernetes-deployment.md — Production K8s deployment: PVC/S3 model repos, health probes, Prometheus metrics, HPA on inference throughput
- references/troubleshooting.md — Server startup failures, model loading issues, inference errors, and performance tuning
- gpu-operator — GPU driver and device plugin prerequisites
- leaderworkerset — Multi-node Triton deployment for large models
- prometheus-grafana — Scrape Triton Prometheus metrics
- keda — Autoscale Triton deployments based on request metrics
- harbor — Store Triton model server images
- model-formats — Model format details for Triton backends