Ray Serve – ServeConfigV2, autoscaling, request batching, multi-model composition, async inference, custom request routing, custom autoscaling policies, RayService on K8s, health checks, and PD disaggregation. Use when serving models with Ray Serve. NOT for inference engine config (see vllm, sglang).
Ray Serve
Ray Serve is a scalable model serving framework built on Ray; it deploys on Kubernetes via the RayService CRD. Version 2.54.0 adds Async Inference for long-running workloads, a Custom Request Routing API, custom autoscaling policies, and an External Scaling API.
ServeConfigV2
The Serve config file is the production deployment format, embedded in the RayService CRD's serveConfigV2 field. Covers proxy location, HTTP/gRPC options, logging, and application definitions. See references/serve-config-schema.md for the full schema with all settings tables.
applications:
  - name: my-app
    route_prefix: /
    import_path: my_module:app
    runtime_env:
      pip: [torch, transformers]
      env_vars:
        MODEL_ID: meta-llama/Llama-3.1-8B
    external_scaler_enabled: false
    deployments:
      - name: MyDeployment
        # ... deployment settings (see below)
Deployment Settings
Every @serve.deployment accepts these settings. Set them in code (decorator or .options()) or override in the config file. Config file takes highest priority, then code, then defaults.
Core Deployment Settings
| Setting | Purpose | Default |
|---|---|---|
| `name` | Deployment name (must match code) | Class/function name |
| `num_replicas` | Fixed replica count, or `"auto"` for autoscaling | 1 |
| `max_ongoing_requests` | Max concurrent requests per replica | 5 |
| `max_queued_requests` | Max queued requests per caller (experimental) | -1 (no limit) |
| `user_config` | JSON-serializable config passed to `reconfigure()` | None |
| `logging_config` | Per-deployment logging override | Global config |
Ray Actor Options
deployments:
  - name: LLMDeployment
    num_replicas: 2
    ray_actor_options:
      num_cpus: 4
      num_gpus: 1
      accelerator_type: A100
      memory: 34359738368  # 32 GiB in bytes
Key settings: num_cpus, num_gpus, memory, accelerator_type, resources, runtime_env. ray_actor_options is replaced as a whole dict in config (not merged with code). Graceful shutdown defaults: graceful_shutdown_wait_loop_s: 2, graceful_shutdown_timeout_s: 20.
Health Checks
Implement check_health() on a deployment – Serve calls it every health_check_period_s (default 10s):
@serve.deployment
class MyModel:
    def check_health(self):
        if not self.model_loaded:
            raise RuntimeError("Model not loaded")
Autoscaling Configuration
Set num_replicas: "auto" or provide explicit autoscaling_config:
autoscaling_config Settings
Key settings: min_replicas (default 1), max_replicas (default 1, 100 with auto), initial_replicas, target_ongoing_requests (default 2), metrics_interval_s (10s), look_back_period_s (30s), smoothing_factor (1.0), upscale_smoothing_factor, downscale_smoothing_factor, upscale_delay_s (30s), downscale_delay_s (600s), downscale_to_zero_delay_s, aggregation_function (MEAN|MAX).
deployments:
  - name: LLMDeployment
    max_ongoing_requests: 10
    autoscaling_config:
      min_replicas: 1
      max_replicas: 8
      target_ongoing_requests: 3
      upscale_delay_s: 10
      downscale_delay_s: 300
      upscale_smoothing_factor: 2.0  # aggressive upscale
      downscale_smoothing_factor: 0.5  # conservative downscale
      metrics_interval_s: 5
      look_back_period_s: 15
Tuning guidelines:
- `target_ongoing_requests` – lower = lower latency, higher = higher throughput
- `upscale_delay_s` – lower for bursty traffic, higher for steady traffic
- `downscale_delay_s` – keep high (300-600s) to avoid thrashing
- `smoothing_factor` – > 1 = more aggressive scaling, < 1 = more conservative
- `min_replicas: 0` – enables scale-to-zero (adds cold start latency)
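For intuition, the core scaling decision can be sketched in a few lines. This is a simplified model assuming the documented behavior (aim for roughly total ongoing requests divided by target_ongoing_requests, with smoothing_factor damping the change from the current count); it is not Serve's exact internals:

```python
import math

def desired_replicas(total_ongoing: float, target: float, current: int,
                     min_replicas: int, max_replicas: int,
                     smoothing: float = 1.0) -> int:
    """Simplified sketch of the autoscaling decision."""
    raw = total_ongoing / target                      # ideal replica count
    adjusted = current + smoothing * (raw - current)  # smoothing damps/amplifies the change
    return max(min_replicas, min(max_replicas, math.ceil(adjusted)))

desired_replicas(30, 3, 4, 1, 8)                 # wants 10, clamped to max_replicas=8
desired_replicas(30, 3, 4, 1, 8, smoothing=0.5)  # conservative: moves only halfway, to 7
```

This shows why a smoothing factor above 1 scales aggressively while one below 1 is conservative, and why max_replicas caps bursty upscaling.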
Request Batching (@serve.batch)
| Setting | Purpose | Default |
|---|---|---|
| `max_batch_size` | Max requests per batch | 10 |
| `batch_wait_timeout_s` | Max wait for a full batch | 0.01 |
| `max_concurrent_batches` | Max batches running concurrently | 1 |
| `batch_size_fn` | Custom function to compute batch size | None |
@serve.deployment
class BatchModel:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1, max_concurrent_batches=2)
    async def handle_batch(self, requests: list[str]) -> list[str]:
        # Process entire batch at once (e.g., batched GPU inference)
        return self.model.predict(requests)

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["text"])
Tuning: Set max_batch_size to your model's optimal batch size. Set batch_wait_timeout_s low for latency-sensitive, higher for throughput-sensitive. Increase max_concurrent_batches if GPU can handle multiple batches.
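To see how the two knobs interact, here is a toy micro-batcher that mimics the semantics (an illustration only, not Serve's implementation): a batch is dispatched when it reaches max_batch_size or when batch_wait_timeout_s has elapsed since the first queued request.

```python
import asyncio

async def micro_batcher(queue, handle_batch, max_batch_size, batch_wait_timeout_s):
    """Collect up to max_batch_size items, waiting at most
    batch_wait_timeout_s after the first item arrives."""
    while True:
        first = await queue.get()
        if first is None:  # sentinel: shut down
            return
        batch = [first]
        deadline = asyncio.get_running_loop().time() + batch_wait_timeout_s
        while len(batch) < max_batch_size:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            if item is None:
                await handle_batch(batch)
                return
            batch.append(item)
        await handle_batch(batch)

async def _demo():
    sizes = []
    async def record(batch):
        sizes.append(len(batch))
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    q.put_nowait(None)
    await micro_batcher(q, record, max_batch_size=2, batch_wait_timeout_s=0.05)
    return sizes

batch_sizes = asyncio.run(_demo())  # five queued requests -> batches of [2, 2, 1]
```

The deadline starts at the first request of each batch, which is why a low batch_wait_timeout_s bounds added latency even when traffic is sparse.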
user_config (Dynamic Reconfiguration)
Update deployment behavior without restarting replicas:
@serve.deployment
class Model:
    def reconfigure(self, config: dict):
        """Called on initial deploy and every config update."""
        self.model_path = config["model_path"]
        self.temperature = config.get("temperature", 1.0)
        self.model = load_model(self.model_path)
deployments:
  - name: Model
    user_config:
      model_path: meta-llama/Llama-3.1-8B
      temperature: 0.7
Update user_config in the config file and re-apply – replicas call reconfigure() without restarting. Useful for model version swaps, A/B test weights, feature flags, and hyperparameters.
Model Composition (DeploymentHandle)
Chain multiple deployments in a pipeline:
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    async def __call__(self, text: str) -> list[int]:
        return tokenize(text)

@serve.deployment
class Model:
    async def __call__(self, tokens: list[int]) -> str:
        return self.model.generate(tokens)

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request) -> str:
        tokens = await self.preprocessor.remote(request.query_params["text"])
        return await self.model.remote(tokens)

app = Pipeline.bind(Preprocessor.bind(), Model.bind())
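DeploymentHandle calls return awaitables, so a composed deployment can also fan out to several models concurrently rather than chaining them. A Ray-free sketch of the pattern with stub handles (StubHandle, model_a, and model_b are illustrative stand-ins, not Serve API):

```python
import asyncio

class StubHandle:
    """Stand-in for a DeploymentHandle: .remote() returns an awaitable."""
    def __init__(self, fn):
        self._fn = fn

    def remote(self, *args):
        return asyncio.ensure_future(self._fn(*args))

async def model_a(x):
    return f"a:{x}"

async def model_b(x):
    return f"b:{x}"

class Ensemble:
    def __init__(self, a, b):
        self.a, self.b = a, b

    async def __call__(self, x):
        # Fan out to both models concurrently, then combine the results
        return await asyncio.gather(self.a.remote(x), self.b.remote(x))

results = asyncio.run(Ensemble(StubHandle(model_a), StubHandle(model_b))("q"))  # ["a:q", "b:q"]
```

With real handles the same gather pattern overlaps the two model calls instead of paying their latencies back-to-back.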
Streaming Responses
Return a StreamingResponse from a deployment for token-by-token output. Use an async generator – passing a sync iterable directly to StreamingResponse may fail with async Serve handlers:
from starlette.responses import StreamingResponse

@serve.deployment
class StreamModel:
    async def __call__(self, request):
        async def generate():
            for token in self.model.stream(request.query_params["prompt"]):
                yield token
        return StreamingResponse(generate())
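The async-generator wrapping can be exercised without Ray: wrap a synchronous token iterator (sync_token_stream below is a stand-in for self.model.stream(...)) in an async generator and yield to the event loop between tokens:

```python
import asyncio

def sync_token_stream():
    """Stand-in for a blocking model token iterator."""
    yield from ["Hello", " ", "world"]

async def token_gen():
    for token in sync_token_stream():
        yield token
        await asyncio.sleep(0)  # let other requests make progress between tokens

async def _collect():
    return [t async for t in token_gen()]

tokens = asyncio.run(_collect())  # ["Hello", " ", "world"]
```

The explicit await between tokens is what keeps one long streaming response from starving other requests on the same replica.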
Multiplexed Models (Multi-LoRA)
Load multiple model variants on the same replica:
@serve.deployment(num_replicas=2)
class MultiLoRAModel:
    def __init__(self):
        self.base_model = load_base_model()
        self.adapters = {}

    @serve.multiplexed(max_num_models_per_replica=10)
    async def get_model(self, model_id: str):
        if model_id not in self.adapters:
            self.adapters[model_id] = load_adapter(model_id)
        return self.adapters[model_id]

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        adapter = await self.get_model(model_id)
        return self.base_model.generate(request, adapter)
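Clients pick the variant per request via the serve_multiplexed_model_id HTTP header, which Serve also uses to route to a replica that already has that model loaded. Building such a request with the standard library (the URL assumes Serve's default port 8000; adapter-42 is a placeholder model ID):

```python
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/",
    data=b'{"prompt": "hello"}',
    headers={"serve_multiplexed_model_id": "adapter-42"},
)
# Send against a running Serve app with urllib.request.urlopen(req)
```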
Async Inference (Long-Running Workloads)
Handle batch processing, transcription, multi-step agent workflows, and other long-running tasks without client timeouts. Integrates with message brokers (Redis, SQS) for at-least-once processing.
from ray import serve
from ray.serve.schema import TaskProcessorConfig, CeleryAdapterConfig
from ray.serve.task_consumer import task_consumer, task_handler

# 1. Configure message broker
processor_config = TaskProcessorConfig(
    queue_name="inference_queue",
    adapter_config=CeleryAdapterConfig(
        broker_url="redis://localhost:6379/0",
        backend_url="redis://localhost:6379/1",
    ),
    max_retries=5,
    failed_task_queue_name="failed_tasks",
)

# 2. Define task consumer - polls queue and processes tasks
@serve.deployment
@task_consumer(task_processor_config=processor_config)
class TranscriptionConsumer:
    @task_handler(name="transcribe")
    def transcribe(self, audio_url):
        return self.model.transcribe(audio_url)

# 3. Ingress for task submission and status polling
from ray.serve.task_consumer import instantiate_adapter_from_config
from fastapi import FastAPI

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class API:
    def __init__(self, consumer_handle, config):
        self.adapter = instantiate_adapter_from_config(config)

    @app.post("/submit")
    def submit(self, request):
        task = self.adapter.enqueue_task_sync()
        return {"task_id": task.id}

    @app.get("/status/{task_id}")
    def status(self, task_id: str):
        return self.adapter.get_task_status_sync(task_id)
TaskConsumers scale via queue-depth autoscaling – set external_scaler_enabled: true or use the custom autoscaling API below.
Custom Request Routing
Override Serve's default routing to implement cache affinity, latency-aware routing, session stickiness, or content-based routing:
from ray.serve.request_router import PendingRequest, RequestRouter, RunningReplica
from typing import List, Optional

class PrefixCacheRouter(RequestRouter):
    """Route requests to replicas with hot prefix caches."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return ranked lists - Serve tries the first list, then falls back
        prefix = pending_request.request_args.get("prefix", "")  # verify .request_args attr against RequestRouter base class for your Ray version
        scored = [(r, self._cache_score(r, prefix)) for r in candidate_replicas]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [[r for r, _ in scored]]

    def on_request_routed(self, pending_request, replica):
        """Callback after routing - update cache tracking."""
        pass
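The _cache_score helper above is left undefined; one hypothetical heuristic is longest-common-prefix overlap between the incoming prefix and the prompts each replica served recently (all names here are illustrative, not Serve API):

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def cache_score(recent_prompts: list[str], prefix: str) -> int:
    """Affinity score: best prefix overlap with any prompt this
    replica served recently (higher = warmer prefix cache)."""
    return max((lcp_len(p, prefix) for p in recent_prompts), default=0)

cache_score(["hello world", "help"], "hello")  # 5: full overlap with "hello world"
```

A real router would track recent prompts per replica in on_request_routed and cap the tracked set to bound memory.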
Attach to a deployment:
from ray.serve.config import RequestRouterConfig
@serve.deployment(
request_router_config=RequestRouterConfig(
request_router_class="my_module:PrefixCacheRouter"
),
num_replicas=10,
)
class MyModel:
def record_routing_stats(self) -> dict:
"""Optional: emit stats accessible in choose_replicas."""
return {"cache_hit_rate": self.cache_hit_rate}
Custom Autoscaling Policies
Define scaling logic driven by custom metrics instead of request counts. Useful for queue-depth scaling, GPU utilization, cross-deployment coordination, or scheduled scaling.
Step 1 – Emit custom metrics from replicas:
@serve.deployment
class GPUModel:
    def record_autoscaling_stats(self) -> dict:
        return {"gpu_util": get_gpu_utilization(), "queue_depth": self.queue.qsize()}
Step 2 – Define a policy function:
from ray.serve.config import AutoscalingContext
from ray.serve import DeploymentID  # public API (ray.serve._private.common is internal, avoid)
from typing import Dict, Tuple

def gpu_aware_policy(
    ctxs: Dict[DeploymentID, AutoscalingContext],
) -> Tuple[Dict[DeploymentID, int], Dict]:
    decisions = {}
    metrics = {}
    for deployment_id, ctx in ctxs.items():
        stats = ctx.replica_stats
        avg_util = sum(s.get("gpu_util", 0) for s in stats) / max(len(stats), 1)
        if avg_util > 0.85:
            decisions[deployment_id] = ctx.current_num_replicas + 1
        elif avg_util < 0.3 and ctx.current_num_replicas > ctx.min_replicas:
            decisions[deployment_id] = ctx.current_num_replicas - 1
        else:
            decisions[deployment_id] = ctx.current_num_replicas
        # Record metrics per deployment (a single avg_util would only keep the last loop value)
        metrics[f"avg_gpu_util:{deployment_id}"] = avg_util
    return decisions, metrics
Step 3 – Attach in config YAML:
applications:
  - name: my-app
    import_path: app:deployment
    autoscaling_policy:
      policy_function: my_module:gpu_aware_policy
Application-level policies receive all deployment contexts, enabling coordinated scaling (e.g., scale preprocessing in sync with model replicas).
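A minimal sketch of such coordination, with plain stand-ins for the autoscaling contexts (the Ctx dataclass and the one-preprocessor-per-two-model-replicas ratio are illustrative assumptions, not the Serve API):

```python
import math
from dataclasses import dataclass

@dataclass
class Ctx:
    """Stand-in for AutoscalingContext."""
    current_num_replicas: int
    min_replicas: int
    max_replicas: int

def coordinated_policy(ctxs: dict) -> dict:
    """Keep roughly 1 preprocessor replica per 2 model replicas,
    so the pipeline scales as a unit."""
    model, pre = ctxs["model"], ctxs["preprocessor"]
    want = math.ceil(model.current_num_replicas / 2)
    want = max(pre.min_replicas, min(pre.max_replicas, want))
    return {"model": model.current_num_replicas, "preprocessor": want}

decisions = coordinated_policy({
    "model": Ctx(6, 1, 8),
    "preprocessor": Ctx(1, 1, 4),
})  # {"model": 6, "preprocessor": 3}
```

Because the policy sees every deployment's context at once, the ratio holds even when the model deployment scales on its own signal.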
Disaggregated Prefill-Decode Serving
Ray Serve provides native PD disaggregation via build_pd_openai_app, which manages separate prefill and decode vLLM instances with automatic KV cache transfer routing via NixlConnector.
Key concepts:
- Prefill replicas process prompts and transfer KV cache (compute-bound, fewer replicas)
- Decode replicas generate tokens using transferred KV cache (memory-bound, more replicas)
- PDProxyServer orchestrates prefill-to-decode routing automatically
- Both configs must use the same model; kv_transfer_config is required in both
Quick start:
from ray.serve.llm import LLMConfig, build_pd_openai_app

app = build_pd_openai_app({
    "prefill_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={
            "tensor_parallel_size": 2,
            "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
        },
    ),
    "decode_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={
            "gpu_memory_utilization": 0.95,
            "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
        },
    ),
})
For full PD configuration (YAML configs, LLMConfig field reference, engine_kwargs tuning, independent autoscaling, DP+PD, component API, and constraints), see references/disaggregated-pd.md.
See also assets/serve_pd_config.yaml for a complete deployment template.
Kubernetes Deployment (RayService)
The ServeConfigV2 is embedded in the RayService CRD:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-service
spec:
  serviceUnhealthySecondThreshold: 900  # time before marking service unhealthy
  deploymentUnhealthySecondThreshold: 300  # time before marking deployment unhealthy
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: serve_app:app
        runtime_env:
          pip: [vllm, transformers]
        deployments:
          - name: VLLMDeployment
            max_ongoing_requests: 24
            ray_actor_options:
              num_cpus: 4
              num_gpus: 1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 4
              target_ongoing_requests: 8
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        minReplicas: 1
        maxReplicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  limits:
                    cpu: "8"
                    memory: 32Gi
                    nvidia.com/gpu: "1"
Upgrade Behavior
Changes to serveConfigV2 apply in-place (hot update, no cluster rebuild). Changes to rayClusterConfig trigger a zero-downtime cluster upgrade. See references/upgrade-strategies.md for upgrade strategies, the serviceUnhealthySecondThreshold footgun with large model cold starts, and monitoring upgrade status.
RayService-Specific Settings
| Setting | Purpose | Default |
|---|---|---|
| `serviceUnhealthySecondThreshold` | Seconds before marking service unhealthy | 900 |
| `deploymentUnhealthySecondThreshold` | Seconds before marking deployment unhealthy | 300 |
High Availability
For HA, set max_replicas_per_node: 1 to spread replicas across nodes:
deployments:
  - name: MyDeployment
    num_replicas: 3
    max_replicas_per_node: 1
Priority of Settings
Config file > application code > defaults. ray_actor_options, user_config, and autoscaling_config are each replaced as whole dicts (not merged) when specified in the config file.
References
- references/serve-config-schema.md – Full ServeConfigV2 schema: proxy, HTTP/gRPC, logging, application settings
- references/upgrade-strategies.md – RayService upgrade strategies, serviceUnhealthySecondThreshold tuning
- references/disaggregated-pd.md – Prefill-decode disaggregation with vLLM and KV cache transfer
- references/performance.md – Autoscaling, batching, and performance troubleshooting
Cross-References
- nvidia-dynamo β Alternative distributed inference orchestration; comparison for LLM serving
- vllm β Serve vLLM models with Ray Serve for autoscaling
- sglang β Alternative model backend deployable behind Ray Serve
- gateway-api-inference β K8s-native inference routing; alternative to Ray Serve's built-in routing
- keda β Event-driven autoscaling; complements Ray Serve's built-in autoscaler for K8s
- ray-core β Ray actors powering Serve deployments
- kuberay β Deploy Serve on Kubernetes via RayService CRD
Reference
- Serve config files
- Deployment configuration
- AutoscalingConfig API
- Advanced autoscaling
- RayService on K8s
- assets/serve_config.yaml – ServeConfigV2 example with multi-model deployment and autoscaling
- assets/serve_pd_config.yaml – Disaggregated prefill-decode config with NixlConnector and independent scaling
- gpu-operator – GPU driver and device plugin for Ray Serve workers
- prometheus-grafana – Scrape Ray Serve Prometheus metrics
