Ray Serve – ServeConfigV2, autoscaling, request batching, multi-model composition, async inference, custom request routing, custom autoscaling policies, RayService on K8s, health checks, and PD disaggregation. Use when serving models with Ray Serve. NOT for inference engine config (see vllm, sglang).
Ray Serve
Ray Serve is a scalable model serving framework built on Ray; it deploys on Kubernetes via the RayService CRD. Version 2.54.0 adds Async Inference for long-running workloads, a Custom Request Routing API, custom autoscaling policies, and an External Scaling API.
ServeConfigV2
The Serve config file is the production deployment format, embedded in the RayService CRD's serveConfigV2 field. Covers proxy location, HTTP/gRPC options, logging, and application definitions. See references/serve-config-schema.md for the full schema with all settings tables.
applications:
  - name: my-app
    route_prefix: /
    import_path: my_module:app
    runtime_env:
      pip: [torch, transformers]
      env_vars:
        MODEL_ID: meta-llama/Llama-3.1-8B
    external_scaler_enabled: false
    deployments:
      - name: MyDeployment
        # ... deployment settings (see below)
Deployment Settings
Every @serve.deployment accepts these settings. Set them in code (decorator or .options()) or override in the config file. Config file takes highest priority, then code, then defaults.
Core Deployment Settings
| Setting | Purpose | Default |
|---|---|---|
| `name` | Deployment name (must match code) | Class/function name |
| `num_replicas` | Fixed replica count, or `"auto"` for autoscaling | 1 |
| `max_ongoing_requests` | Max concurrent requests per replica | 5 |
| `max_queued_requests` | Max queued requests per caller (experimental) | -1 (no limit) |
| `user_config` | JSON-serializable config passed to `reconfigure()` | None |
| `logging_config` | Per-deployment logging override | Global config |
Ray Actor Options
deployments:
  - name: LLMDeployment
    num_replicas: 2
    ray_actor_options:
      num_cpus: 4
      num_gpus: 1
      accelerator_type: A100
      memory: 34359738368  # 32 GiB in bytes
Key settings: num_cpus, num_gpus, memory, accelerator_type, resources, runtime_env. ray_actor_options is replaced as a whole dict in config (not merged with code). Graceful shutdown defaults: graceful_shutdown_wait_loop_s: 2, graceful_shutdown_timeout_s: 20.
Health Checks
Implement check_health() on a deployment – Serve calls it every health_check_period_s (default 10s):
@serve.deployment
class MyModel:
    def check_health(self):
        if not self.model_loaded:
            raise RuntimeError("Model not loaded")
Autoscaling Configuration
Set num_replicas: "auto" or provide explicit autoscaling_config:
autoscaling_config Settings
Key settings: min_replicas (default 1), max_replicas (default 1, 100 with auto), initial_replicas, target_ongoing_requests (default 2), metrics_interval_s (10s), look_back_period_s (30s), smoothing_factor (1.0), upscale_smoothing_factor, downscale_smoothing_factor, upscale_delay_s (30s), downscale_delay_s (600s), downscale_to_zero_delay_s, aggregation_function (MEAN|MAX).
deployments:
  - name: LLMDeployment
    max_ongoing_requests: 10
    autoscaling_config:
      min_replicas: 1
      max_replicas: 8
      target_ongoing_requests: 3
      upscale_delay_s: 10
      downscale_delay_s: 300
      upscale_smoothing_factor: 2.0  # aggressive upscale
      downscale_smoothing_factor: 0.5  # conservative downscale
      metrics_interval_s: 5
      look_back_period_s: 15
Tuning guidelines:
- `target_ongoing_requests` – lower = lower latency, higher = higher throughput
- `upscale_delay_s` – lower for bursty traffic, higher for steady traffic
- `downscale_delay_s` – keep high (300-600s) to avoid thrashing
- `smoothing_factor` – > 1 = more aggressive scaling, < 1 = more conservative
- `min_replicas: 0` – enables scale-to-zero (adds cold start latency)
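For intuition, the core scaling decision can be sketched in a few lines. This is a simplified model assuming the documented behavior (aim for roughly total ongoing requests divided by target_ongoing_requests, with smoothing_factor damping the change from the current count); it is not Serve's exact internals:

```python
import math

def desired_replicas(total_ongoing: float, target: float, current: int,
                     min_replicas: int, max_replicas: int,
                     smoothing: float = 1.0) -> int:
    """Simplified sketch of the autoscaling decision."""
    raw = total_ongoing / target                      # ideal replica count
    adjusted = current + smoothing * (raw - current)  # smoothing damps/amplifies the change
    return max(min_replicas, min(max_replicas, math.ceil(adjusted)))

desired_replicas(30, 3, 4, 1, 8)                 # wants 10, clamped to max_replicas=8
desired_replicas(30, 3, 4, 1, 8, smoothing=0.5)  # conservative: moves only halfway, to 7
```

This shows why a smoothing factor above 1 scales aggressively while one below 1 is conservative, and why max_replicas caps bursty upscaling.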
Request Batching (@serve.batch)
| Setting | Purpose | Default |
|---|---|---|
| `max_batch_size` | Max requests per batch | 10 |
| `batch_wait_timeout_s` | Max wait for a full batch | 0.01 |
| `max_concurrent_batches` | Max batches running concurrently | 1 |
| `batch_size_fn` | Custom function to compute batch size | None |
@serve.deployment
class BatchModel:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1, max_concurrent_batches=2)
    async def handle_batch(self, requests: list[str]) -> list[str]:
        # Process entire batch at once (e.g., batched GPU inference)
        return self.model.predict(requests)

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["text"])
Tuning: Set max_batch_size to your model's optimal batch size. Set batch_wait_timeout_s low for latency-sensitive, higher for throughput-sensitive. Increase max_concurrent_batches if GPU can handle multiple batches.
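To see how the two knobs interact, here is a toy micro-batcher that mimics the semantics (an illustration only, not Serve's implementation): a batch is dispatched when it reaches max_batch_size or when batch_wait_timeout_s has elapsed since the first queued request.

```python
import asyncio

async def micro_batcher(queue, handle_batch, max_batch_size, batch_wait_timeout_s):
    """Collect up to max_batch_size items, waiting at most
    batch_wait_timeout_s after the first item arrives."""
    while True:
        first = await queue.get()
        if first is None:  # sentinel: shut down
            return
        batch = [first]
        deadline = asyncio.get_running_loop().time() + batch_wait_timeout_s
        while len(batch) < max_batch_size:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            if item is None:
                await handle_batch(batch)
                return
            batch.append(item)
        await handle_batch(batch)

async def _demo():
    sizes = []
    async def record(batch):
        sizes.append(len(batch))
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    q.put_nowait(None)
    await micro_batcher(q, record, max_batch_size=2, batch_wait_timeout_s=0.05)
    return sizes

batch_sizes = asyncio.run(_demo())  # five queued requests -> batches of [2, 2, 1]
```

The deadline starts at the first request of each batch, which is why a low batch_wait_timeout_s bounds added latency even when traffic is sparse.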
user_config (Dynamic Reconfiguration)
Update deployment behavior without restarting replicas:
@serve.deployment
class Model:
    def reconfigure(self, config: dict):
        """Called on initial deploy and every config update."""
        self.model_path = config["model_path"]
        self.temperature = config.get("temperature", 1.0)
        self.model = load_model(self.model_path)
deployments:
  - name: Model
    user_config:
      model_path: meta-llama/Llama-3.1-8B
      temperature: 0.7
Update user_config in the config file and re-apply – replicas call reconfigure() without restarting. Useful for model version swaps, A/B test weights, feature flags, and hyperparameters.
Model Composition (DeploymentHandle)
Chain multiple deployments in a pipeline:
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    async def __call__(self, text: str) -> list[int]:
        return tokenize(text)

@serve.deployment
class Model:
    async def __call__(self, tokens: list[int]) -> str:
        return self.model.generate(tokens)

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request) -> str:
        tokens = await self.preprocessor.remote(request.query_params["text"])
        return await self.model.remote(tokens)

app = Pipeline.bind(Preprocessor.bind(), Model.bind())
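DeploymentHandle calls return awaitables, so a composed deployment can also fan out to several models concurrently rather than chaining them. A Ray-free sketch of the pattern with stub handles (StubHandle, model_a, and model_b are illustrative stand-ins, not Serve API):

```python
import asyncio

class StubHandle:
    """Stand-in for a DeploymentHandle: .remote() returns an awaitable."""
    def __init__(self, fn):
        self._fn = fn

    def remote(self, *args):
        return asyncio.ensure_future(self._fn(*args))

async def model_a(x):
    return f"a:{x}"

async def model_b(x):
    return f"b:{x}"

class Ensemble:
    def __init__(self, a, b):
        self.a, self.b = a, b

    async def __call__(self, x):
        # Fan out to both models concurrently, then combine the results
        return await asyncio.gather(self.a.remote(x), self.b.remote(x))

results = asyncio.run(Ensemble(StubHandle(model_a), StubHandle(model_b))("q"))  # ["a:q", "b:q"]
```

With real handles the same gather pattern overlaps the two model calls instead of paying their latencies back-to-back.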
Streaming Responses
Return a StreamingResponse from a deployment for token-by-token output. Use an async generator – passing a sync iterable directly to StreamingResponse may fail with async Serve handlers:
from starlette.responses import StreamingResponse

@serve.deployment
class StreamModel:
    async def __call__(self, request):
        async def generate():
            for token in self.model.stream(request.query_params["prompt"]):
                yield token
        return StreamingResponse(generate())
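The async-generator wrapping can be exercised without Ray: wrap a synchronous token iterator (sync_token_stream below is a stand-in for self.model.stream(...)) in an async generator and yield to the event loop between tokens:

```python
import asyncio

def sync_token_stream():
    """Stand-in for a blocking model token iterator."""
    yield from ["Hello", " ", "world"]

async def token_gen():
    for token in sync_token_stream():
        yield token
        await asyncio.sleep(0)  # let other requests make progress between tokens

async def _collect():
    return [t async for t in token_gen()]

tokens = asyncio.run(_collect())  # ["Hello", " ", "world"]
```

The explicit await between tokens is what keeps one long streaming response from starving other requests on the same replica.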
Multiplexed Models (Multi-LoRA)
Load multiple model variants on the same replica:
@serve.deployment(num_replicas=2)
class MultiLoRAModel:
    def __init__(self):
        self.base_model = load_base_model()
        self.adapters = {}

    @serve.multiplexed(max_num_models_per_replica=10)
    async def get_model(self, model_id: str):
        if model_id not in self.adapters:
            self.adapters[model_id] = load_adapter(model_id)
        return self.adapters[model_id]

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        adapter = await self.get_model(model_id)
        return self.base_model.generate(request, adapter)
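Clients pick the variant per request via the serve_multiplexed_model_id HTTP header, which Serve also uses to route to a replica that already has that model loaded. Building such a request with the standard library (the URL assumes Serve's default port 8000; adapter-42 is a placeholder model ID):

```python
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/",
    data=b'{"prompt": "hello"}',
    headers={"serve_multiplexed_model_id": "adapter-42"},
)
# Send against a running Serve app with urllib.request.urlopen(req)
```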
Async Inference (Long-Running Workloads)
Handle batch processing, transcription, multi-step agent workflows, and other long-running tasks without client timeouts. Integrates with message brokers (Redis, SQS) for at-least-once processing.
from ray import serve
from ray.serve.schema import TaskProcessorConfig, CeleryAdapterConfig
from ray.serve.task_consumer import task_consumer, task_handler

# 1. Configure message broker
processor_config = TaskProcessorConfig(
    queue_name="inference_queue",
    adapter_config=CeleryAdapterConfig(
        broker_url="redis://localhost:6379/0",
        backend_url="redis://localhost:6379/1",
    ),
    max_retries=5,
    failed_task_queue_name="failed_tasks",
)

# 2. Define task consumer - polls queue and processes tasks
@serve.deployment
@task_consumer(task_processor_config=processor_config)
class TranscriptionConsumer:
    @task_handler(name="transcribe")
    def transcribe(self, audio_url):
        return self.model.transcribe(audio_url)

# 3. Ingress for task submission and status polling
from ray.serve.task_consumer import instantiate_adapter_from_config
from fastapi import FastAPI

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class API:
    def __init__(self, consumer_handle, config):
        self.adapter = instantiate_adapter_from_config(config)

    @app.post("/submit")
    def submit(self, request):
        task = self.adapter.enqueue_task_sync()
        return {"task_id": task.id}

    @app.get("/status/{task_id}")
    def status(self, task_id: str):
        return self.adapter.get_task_status_sync(task_id)
TaskConsumers scale via queue-depth autoscaling – set external_scaler_enabled: true or use the custom autoscaling API below.
Custom Request Routing
Override Serve's default routing to implement cache affinity, latency-aware routing, session stickiness, or content-based routing:
from ray.serve.request_router import PendingRequest, RequestRouter, RunningReplica
from typing import List, Optional

class PrefixCacheRouter(RequestRouter):
    """Route requests to replicas with hot prefix caches."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return ranked lists - Serve tries the first list, then falls back
        prefix = pending_request.request_args.get("prefix", "")  # verify .request_args attr against RequestRouter base class for your Ray version
        scored = [(r, self._cache_score(r, prefix)) for r in candidate_replicas]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [[r for r, _ in scored]]

    def on_request_routed(self, pending_request, replica):
        """Callback after routing - update cache tracking."""
        pass
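The _cache_score helper above is left undefined; one hypothetical heuristic is longest-common-prefix overlap between the incoming prefix and the prompts each replica served recently (all names here are illustrative, not Serve API):

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def cache_score(recent_prompts: list[str], prefix: str) -> int:
    """Affinity score: best prefix overlap with any prompt this
    replica served recently (higher = warmer prefix cache)."""
    return max((lcp_len(p, prefix) for p in recent_prompts), default=0)

cache_score(["hello world", "help"], "hello")  # 5: full overlap with "hello world"
```

A real router would track recent prompts per replica in on_request_routed and cap the tracked set to bound memory.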
Attach to a deployment:
from ray.serve.config import RequestRouterConfig
@serve.deployment(
request_router_config=RequestRouterConfig(
request_router_class="my_module:PrefixCacheRouter"
),
num_replicas=10,
)
class MyModel:
def record_routing_stats(self) -> dict:
"""Optional: emit stats accessible in choose_replicas."""
return {"cache_hit_rate": self.cache_hit_rate}
Custom Autoscaling Policies
Define scaling logic driven by custom metrics instead of request counts. Useful for queue-depth scaling, GPU utilization, cross-deployment coordination, or scheduled scaling.
Step 1 – Emit custom metrics from replicas:
@serve.deployment
class GPUModel:
    def record_autoscaling_stats(self) -> dict:
        return {"gpu_util": get_gpu_utilization(), "queue_depth": self.queue.qsize()}
Step 2 – Define a policy function:
from ray.serve.config import AutoscalingContext
from ray.serve import DeploymentID  # public API (ray.serve._private.common is internal, avoid)
from typing import Dict, Tuple

def gpu_aware_policy(
    ctxs: Dict[DeploymentID, AutoscalingContext],
) -> Tuple[Dict[DeploymentID, int], Dict]:
    decisions = {}
    metrics = {}
    for deployment_id, ctx in ctxs.items():
        stats = ctx.replica_stats
        avg_util = sum(s.get("gpu_util", 0) for s in stats) / max(len(stats), 1)
        if avg_util > 0.85:
            decisions[deployment_id] = ctx.current_num_replicas + 1
        elif avg_util < 0.3 and ctx.current_num_replicas > ctx.min_replicas:
            decisions[deployment_id] = ctx.current_num_replicas - 1
        else:
            decisions[deployment_id] = ctx.current_num_replicas
        # Record metrics per deployment (a single avg_util would only keep the last loop value)
        metrics[f"avg_gpu_util:{deployment_id}"] = avg_util
    return decisions, metrics
Step 3 – Attach in config YAML:
applications:
  - name: my-app
    import_path: app:deployment
    autoscaling_policy:
      policy_function: my_module:gpu_aware_policy
Application-level policies receive all deployment contexts, enabling coordinated scaling (e.g., scale preprocessing in sync with model replicas).
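A minimal sketch of such coordination, with plain stand-ins for the autoscaling contexts (the Ctx dataclass and the one-preprocessor-per-two-model-replicas ratio are illustrative assumptions, not the Serve API):

```python
import math
from dataclasses import dataclass

@dataclass
class Ctx:
    """Stand-in for AutoscalingContext."""
    current_num_replicas: int
    min_replicas: int
    max_replicas: int

def coordinated_policy(ctxs: dict) -> dict:
    """Keep roughly 1 preprocessor replica per 2 model replicas,
    so the pipeline scales as a unit."""
    model, pre = ctxs["model"], ctxs["preprocessor"]
    want = math.ceil(model.current_num_replicas / 2)
    want = max(pre.min_replicas, min(pre.max_replicas, want))
    return {"model": model.current_num_replicas, "preprocessor": want}

decisions = coordinated_policy({
    "model": Ctx(6, 1, 8),
    "preprocessor": Ctx(1, 1, 4),
})  # {"model": 6, "preprocessor": 3}
```

Because the policy sees every deployment's context at once, the ratio holds even when the model deployment scales on its own signal.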
Disaggregated Prefill-Decode Serving
Ray Serve provides native PD disaggregation via build_pd_openai_app, which manages separate prefill and decode vLLM instances with automatic KV cache transfer routing via NixlConnector.
Key concepts:
- Prefill replicas process prompts and transfer KV cache (compute-bound, fewer replicas)
- Decode replicas generate tokens using transferred KV cache (memory-bound, more replicas)
- PDProxyServer orchestrates prefill-to-decode routing automatically
- Both configs must use the same model; kv_transfer_config is required in both
Quick start:
from ray.serve.llm import LLMConfig, build_pd_openai_app

app = build_pd_openai_app({
    "prefill_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={
            "tensor_parallel_size": 2,
            "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
        },
    ),
    "decode_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={
            "gpu_memory_utilization": 0.95,
            "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
        },
    ),
})
For full PD configuration (YAML configs, LLMConfig field reference, engine_kwargs tuning, independent autoscaling, DP+PD, component API, and constraints), see references/disaggregated-pd.md.
See also assets/serve_pd_config.yaml for a complete deployment template.
Kubernetes Deployment (RayService)
The ServeConfigV2 is embedded in the RayService CRD:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-service
spec:
  serviceUnhealthySecondThreshold: 900  # time before marking service unhealthy
  deploymentUnhealthySecondThreshold: 300  # time before marking deployment unhealthy
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: serve_app:app
        runtime_env:
          pip: [vllm, transformers]
        deployments:
          - name: VLLMDeployment
            max_ongoing_requests: 24
            ray_actor_options:
              num_cpus: 4
              num_gpus: 1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 4
              target_ongoing_requests: 8
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        minReplicas: 1
        maxReplicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  limits:
                    cpu: "8"
                    memory: 32Gi
                    nvidia.com/gpu: "1"
Upgrade Behavior
Changes to serveConfigV2 apply in-place (hot update, no cluster rebuild). Changes to rayClusterConfig trigger a zero-downtime cluster upgrade. See references/upgrade-strategies.md for upgrade strategies, the serviceUnhealthySecondThreshold footgun with large model cold starts, and monitoring upgrade status.
RayService-Specific Settings
| Setting | Purpose | Default |
|---|---|---|
| `serviceUnhealthySecondThreshold` | Seconds before marking service unhealthy | 900 |
| `deploymentUnhealthySecondThreshold` | Seconds before marking deployment unhealthy | 300 |
High Availability
For HA, set max_replicas_per_node: 1 to spread replicas across nodes:
deployments:
  - name: MyDeployment
    num_replicas: 3
    max_replicas_per_node: 1
Priority of Settings
Config file > application code > defaults. ray_actor_options, user_config, and autoscaling_config are each replaced as whole dicts (not merged) when specified in the config file.
References
- references/serve-config-schema.md – Full ServeConfigV2 schema: proxy, HTTP/gRPC, logging, application settings
- references/upgrade-strategies.md – RayService upgrade strategies, serviceUnhealthySecondThreshold tuning
- references/disaggregated-pd.md – Prefill-decode disaggregation with vLLM and KV cache transfer
- references/performance.md – Autoscaling, batching, and performance troubleshooting
Cross-References
- nvidia-dynamo β Alternative distributed inference orchestration; comparison for LLM serving
- vllm β Serve vLLM models with Ray Serve for autoscaling
- sglang β Alternative model backend deployable behind Ray Serve
- gateway-api-inference β K8s-native inference routing; alternative to Ray Serve's built-in routing
- keda β Event-driven autoscaling; complements Ray Serve's built-in autoscaler for K8s
- ray-core β Ray actors powering Serve deployments
- kuberay β Deploy Serve on Kubernetes via RayService CRD
Reference
- Serve config files
- Deployment configuration
- AutoscalingConfig API
- Advanced autoscaling
- RayService on K8s
- assets/serve_config.yaml – ServeConfigV2 example with multi-model deployment and autoscaling
- assets/serve_pd_config.yaml – Disaggregated prefill-decode config with NixlConnector and independent scaling
- gpu-operator – GPU driver and device plugin for Ray Serve workers
- prometheus-grafana – Scrape Ray Serve Prometheus metrics
