ray-serve
@tylertitsworth/ray-serve
Updated 4/1/2026

Ray Serve — ServeConfigV2, autoscaling, request batching, multi-model composition, async inference, custom request routing, custom autoscaling policies, RayService on K8s, health checks, and PD disaggregation. Use when serving models with Ray Serve. NOT for inference engine config (see vllm, sglang).

Installation

$ npx agent-skills-cli install @tylertitsworth/ray-serve
Supported assistants: Claude Code, Cursor, Copilot, Codex, Antigravity

Details

Path: ray-serve/SKILL.md
Branch: main
Scoped Name: @tylertitsworth/ray-serve

Usage

After installing, this skill will be available to your AI coding assistant.

Verify installation:

npx agent-skills-cli list

Skill Instructions


name: ray-serve
description: "Ray Serve — ServeConfigV2, autoscaling, request batching, multi-model composition, async inference, custom request routing, custom autoscaling policies, RayService on K8s, health checks, and PD disaggregation. Use when serving models with Ray Serve. NOT for inference engine config (see vllm, sglang)."

Ray Serve

Ray Serve is a scalable model serving framework built on Ray. It deploys on Kubernetes via the RayService CRD. As of version 2.54.0, it adds Async Inference for long-running workloads, a Custom Request Routing API, custom autoscaling policies, and an External Scaling API.

ServeConfigV2

The Serve config file is the production deployment format, embedded in the RayService CRD's serveConfigV2 field. Covers proxy location, HTTP/gRPC options, logging, and application definitions. See references/serve-config-schema.md for the full schema with all settings tables.

applications:
  - name: my-app
    route_prefix: /
    import_path: my_module:app
    runtime_env:
      pip: [torch, transformers]
      env_vars:
        MODEL_ID: meta-llama/Llama-3.1-8B
    external_scaler_enabled: false
    deployments:
      - name: MyDeployment
        # ... deployment settings (see below)

Deployment Settings

Every @serve.deployment accepts these settings. Set them in code (decorator or .options()) or override in the config file. Config file takes highest priority, then code, then defaults.

Core Deployment Settings

| Setting | Purpose | Default |
|---|---|---|
| name | Deployment name (must match code) | Class/function name |
| num_replicas | Fixed replica count, or "auto" for autoscaling | 1 |
| max_ongoing_requests | Max concurrent requests per replica | 5 |
| max_queued_requests | Max queued requests per caller (experimental) | -1 (no limit) |
| user_config | JSON-serializable config passed to reconfigure() | None |
| logging_config | Per-deployment logging override | Global config |

Ray Actor Options

deployments:
  - name: LLMDeployment
    num_replicas: 2
    ray_actor_options:
      num_cpus: 4
      num_gpus: 1
      accelerator_type: A100
      memory: 34359738368  # 32 GiB in bytes

Key settings: num_cpus, num_gpus, memory, accelerator_type, resources, runtime_env. ray_actor_options is replaced as a whole dict in config (not merged with code). Graceful shutdown defaults: graceful_shutdown_wait_loop_s: 2, graceful_shutdown_timeout_s: 20.

Health Checks

Implement check_health() on a deployment — Serve calls it every health_check_period_s (default 10s):

@serve.deployment
class MyModel:
    def check_health(self):
        if not self.model_loaded:
            raise RuntimeError("Model not loaded")
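The check cadence is tunable per deployment via the health_check_period_s and health_check_timeout_s deployment settings (values here are illustrative):

```yaml
deployments:
  - name: MyModel
    health_check_period_s: 10    # how often Serve calls check_health()
    health_check_timeout_s: 30   # mark the replica unhealthy if a check hangs this long
```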

Autoscaling Configuration

Set num_replicas: "auto" or provide explicit autoscaling_config:

autoscaling_config Settings

Key settings:

  • min_replicas (default 1)
  • max_replicas (default 1; 100 with num_replicas: "auto")
  • initial_replicas
  • target_ongoing_requests (default 2)
  • metrics_interval_s (10s)
  • look_back_period_s (30s)
  • smoothing_factor (1.0), upscale_smoothing_factor, downscale_smoothing_factor
  • upscale_delay_s (30s), downscale_delay_s (600s), downscale_to_zero_delay_s
  • aggregation_function (MEAN | MAX)

deployments:
  - name: LLMDeployment
    max_ongoing_requests: 10
    autoscaling_config:
      min_replicas: 1
      max_replicas: 8
      target_ongoing_requests: 3
      upscale_delay_s: 10
      downscale_delay_s: 300
      upscale_smoothing_factor: 2.0      # aggressive upscale
      downscale_smoothing_factor: 0.5    # conservative downscale
      metrics_interval_s: 5
      look_back_period_s: 15

Tuning guidelines:

  • target_ongoing_requests — lower = lower latency, higher = higher throughput
  • upscale_delay_s — lower for bursty traffic, higher for steady traffic
  • downscale_delay_s — keep high (300-600s) to avoid thrashing
  • smoothing_factor > 1 = more aggressive scaling, < 1 = more conservative
  • min_replicas: 0 — enables scale-to-zero (adds cold start latency)
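As a mental model, the autoscaler targets roughly ceil(total ongoing requests / target_ongoing_requests) replicas, with the smoothing factor scaling how far each decision moves. A simplified sketch of that arithmetic (not Serve's exact implementation):

```python
import math

def desired_replicas(total_ongoing: int, target: int, current: int,
                     min_replicas: int, max_replicas: int,
                     smoothing: float = 1.0) -> int:
    """Simplified model: aim for ceil(total_ongoing / target) replicas,
    with the smoothing factor scaling the step taken toward that goal."""
    raw = math.ceil(total_ongoing / target)
    delta = (raw - current) * smoothing
    # Round away from current so smoothing < 1 still makes progress.
    step = math.ceil(delta) if delta > 0 else math.floor(delta)
    return max(min_replicas, min(max_replicas, current + step))
```

For example, 24 in-flight requests with target_ongoing_requests: 3 drives toward 8 replicas; a downscale smoothing factor of 0.5 halves each downward step.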

Request Batching (@serve.batch)

| Setting | Purpose | Default |
|---|---|---|
| max_batch_size | Max requests per batch | 10 |
| batch_wait_timeout_s | Max wait (s) for a full batch | 0.01 |
| max_concurrent_batches | Max batches running concurrently | 1 |
| batch_size_fn | Custom function to compute batch size | None |

@serve.deployment
class BatchModel:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1, max_concurrent_batches=2)
    async def handle_batch(self, requests: list[str]) -> list[str]:
        # Process entire batch at once (e.g., batched GPU inference)
        return self.model.predict(requests)

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["text"])

Tuning: Set max_batch_size to your model's optimal batch size. Set batch_wait_timeout_s low for latency-sensitive workloads and higher for throughput-sensitive ones. Increase max_concurrent_batches if the GPU can overlap multiple batches.
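To see how max_batch_size and batch_wait_timeout_s interact, here is a toy asyncio model of the batching semantics — flush when the batch fills or the timer fires, whichever comes first (illustrative only, not Serve's implementation):

```python
import asyncio

class MicroBatcher:
    """Toy model of @serve.batch: queue requests, flush when max_batch_size
    is reached or after batch_wait_timeout_s, whichever comes first."""
    def __init__(self, handler, max_batch_size=32, batch_wait_timeout_s=0.1):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.timeout = batch_wait_timeout_s
        self.queue = []        # (item, future) pairs awaiting a flush
        self.flush_task = None

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((item, fut))
        if len(self.queue) >= self.max_batch_size:
            self._flush()      # batch is full: flush immediately
        elif self.flush_task is None:
            # First item in a new batch: start the wait timer.
            self.flush_task = asyncio.get_running_loop().call_later(
                self.timeout, self._flush)
        return await fut

    def _flush(self):
        if self.flush_task is not None:
            self.flush_task.cancel()
            self.flush_task = None
        batch, self.queue = self.queue, []
        if not batch:
            return
        results = self.handler([item for item, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)
```

Concurrent callers each get back their own result, even though the handler ran once over the whole batch.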

user_config (Dynamic Reconfiguration)

Update deployment behavior without restarting replicas:

@serve.deployment
class Model:
    def reconfigure(self, config: dict):
        """Called on initial deploy and every config update."""
        self.model_path = config["model_path"]
        self.temperature = config.get("temperature", 1.0)
        self.model = load_model(self.model_path)

deployments:
  - name: Model
    user_config:
      model_path: meta-llama/Llama-3.1-8B
      temperature: 0.7

Update user_config in the config file and re-apply — replicas call reconfigure() without restart. Useful for: model version swaps, A/B test weights, feature flags, hyperparameters.
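Because reconfigure() is the delivery hook for every user_config (initial start and each update), it pays to make it idempotent so cheap field tweaks don't trigger expensive reloads. A sketch, with a load counter standing in for a hypothetical load_model call:

```python
class ReloadGuardedModel:
    """Sketch: apply cheap fields unconditionally, but reload the model
    only when model_path actually changes."""
    def __init__(self):
        self.model_path = None
        self.temperature = 1.0
        self.loads = 0

    def reconfigure(self, config: dict):
        self.temperature = config.get("temperature", 1.0)  # cheap: always apply
        path = config["model_path"]
        if path != self.model_path:   # expensive: reload only on change
            self.model_path = path
            self.loads += 1           # stand-in for self.model = load_model(path)
```

With this guard, re-applying a config that only changes temperature swaps the value without touching the loaded weights.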

Model Composition (DeploymentHandle)

Chain multiple deployments in a pipeline:

from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    async def __call__(self, text: str) -> list[int]:
        return tokenize(text)

@serve.deployment
class Model:
    async def __call__(self, tokens: list[int]) -> str:
        return self.model.generate(tokens)

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor: DeploymentHandle, model: DeploymentHandle):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request) -> str:
        tokens = await self.preprocessor.remote(request.query_params["text"])
        return await self.model.remote(tokens)

app = Pipeline.bind(Preprocessor.bind(), Model.bind())

Streaming Responses

Return a StreamingResponse from a deployment for token-by-token output. Use an async generator — passing a sync iterable directly to StreamingResponse may fail with async Serve handlers:

from starlette.responses import StreamingResponse

@serve.deployment
class StreamModel:
    async def __call__(self, request):
        async def generate():
            for token in self.model.stream(request.query_params["prompt"]):
                yield token
        return StreamingResponse(generate())

Multiplexed Models (Multi-LoRA)

Load multiple model variants on the same replica:

@serve.deployment(num_replicas=2)
class MultiLoRAModel:
    def __init__(self):
        self.base_model = load_base_model()
        self.adapters = {}

    @serve.multiplexed(max_num_models_per_replica=10)
    async def get_model(self, model_id: str):
        if model_id not in self.adapters:
            self.adapters[model_id] = load_adapter(model_id)
        return self.adapters[model_id]

    async def __call__(self, request):
        model_id = serve.get_multiplexed_model_id()
        adapter = await self.get_model(model_id)
        return self.base_model.generate(request, adapter)
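Clients select the variant per request via the serve_multiplexed_model_id HTTP header, which is what serve.get_multiplexed_model_id() reads on the replica. A minimal client sketch (the URL is a placeholder):

```python
import urllib.request

def multiplexed_request(url: str, model_id: str, body: bytes) -> urllib.request.Request:
    # Serve routes the request toward a replica that already has this
    # model id loaded, evicting least-recently-used adapters past the
    # per-replica cap (max_num_models_per_replica).
    return urllib.request.Request(
        url,
        data=body,
        headers={"serve_multiplexed_model_id": model_id},
    )

req = multiplexed_request("http://localhost:8000/", "adapter-7", b"{}")
```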

Async Inference (Long-Running Workloads)

Handle batch processing, transcription, multi-step agent workflows, and other long-running tasks without client timeouts. Integrates with message brokers (Redis, SQS) for at-least-once processing.

from ray import serve
from ray.serve.schema import TaskProcessorConfig, CeleryAdapterConfig
from ray.serve.task_consumer import task_consumer, task_handler

# 1. Configure message broker
processor_config = TaskProcessorConfig(
    queue_name="inference_queue",
    adapter_config=CeleryAdapterConfig(
        broker_url="redis://localhost:6379/0",
        backend_url="redis://localhost:6379/1",
    ),
    max_retries=5,
    failed_task_queue_name="failed_tasks",
)

# 2. Define task consumer — polls queue and processes tasks
@serve.deployment
@task_consumer(task_processor_config=processor_config)
class TranscriptionConsumer:
    @task_handler(name="transcribe")
    def transcribe(self, audio_url):
        return self.model.transcribe(audio_url)

# 3. Ingress for task submission and status polling
from ray.serve.task_consumer import instantiate_adapter_from_config
from fastapi import FastAPI

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class API:
    def __init__(self, consumer_handle, config):
        self.adapter = instantiate_adapter_from_config(config)

    @app.post("/submit")
    def submit(self, request):
        task = self.adapter.enqueue_task_sync()
        return {"task_id": task.id}

    @app.get("/status/{task_id}")
    def status(self, task_id: str):
        return self.adapter.get_task_status_sync(task_id)

TaskConsumers scale via queue-depth autoscaling — set external_scaler_enabled: true or use the custom autoscaling API below.

Custom Request Routing

Override Serve's default routing to implement cache affinity, latency-aware routing, session stickiness, or content-based routing:

from ray.serve.request_router import PendingRequest, RequestRouter, RunningReplica
from typing import List, Optional

class PrefixCacheRouter(RequestRouter):
    """Route requests to replicas with hot prefix caches."""

    async def choose_replicas(
        self,
        candidate_replicas: List[RunningReplica],
        pending_request: Optional[PendingRequest] = None,
    ) -> List[List[RunningReplica]]:
        # Return ranked lists — Serve tries first list, then falls back
        prefix = pending_request.request_args.get("prefix", "")  # verify .request_args attr against RequestRouter base class for your Ray version
        scored = [(r, self._cache_score(r, prefix)) for r in candidate_replicas]
        scored.sort(key=lambda x: x[1], reverse=True)
        return [[r for r, _ in scored]]

    def on_request_routed(self, pending_request, replica):
        """Callback after routing — update cache tracking."""
        pass

Attach to a deployment:

from ray.serve.config import RequestRouterConfig

@serve.deployment(
    request_router_config=RequestRouterConfig(
        request_router_class="my_module:PrefixCacheRouter"
    ),
    num_replicas=10,
)
class MyModel:
    def record_routing_stats(self) -> dict:
        """Optional: emit stats accessible in choose_replicas."""
        return {"cache_hit_rate": self.cache_hit_rate}

Custom Autoscaling Policies

Define scaling logic driven by custom metrics instead of request counts. Useful for queue-depth scaling, GPU utilization, cross-deployment coordination, or scheduled scaling.

Step 1 — Emit custom metrics from replicas:

@serve.deployment
class GPUModel:
    def record_autoscaling_stats(self) -> dict:
        return {"gpu_util": get_gpu_utilization(), "queue_depth": self.queue.qsize()}

Step 2 — Define a policy function:

from ray.serve.config import AutoscalingContext
from ray.serve import DeploymentID  # public API (ray.serve._private.common is internal, avoid)
from typing import Dict, Tuple

def gpu_aware_policy(
    ctxs: Dict[DeploymentID, AutoscalingContext],
) -> Tuple[Dict[DeploymentID, int], Dict]:
    decisions = {}
    avg_utils = {}  # per-deployment averages, returned as policy state
    for deployment_id, ctx in ctxs.items():
        avg_util = sum(s.get("gpu_util", 0) for s in ctx.replica_stats) / max(len(ctx.replica_stats), 1)
        avg_utils[str(deployment_id)] = avg_util
        if avg_util > 0.85:
            decisions[deployment_id] = ctx.current_num_replicas + 1
        elif avg_util < 0.3 and ctx.current_num_replicas > ctx.min_replicas:
            decisions[deployment_id] = ctx.current_num_replicas - 1
        else:
            decisions[deployment_id] = ctx.current_num_replicas
    return decisions, {"avg_gpu_util": avg_utils}

Step 3 — Attach in config YAML:

applications:
  - name: my-app
    import_path: app:deployment
    autoscaling_policy:
      policy_function: my_module:gpu_aware_policy

Application-level policies receive all deployment contexts, enabling coordinated scaling (e.g., scale preprocessing in sync with model replicas).
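The coordination idea can be sketched as a pure function: because an application-level policy sees every deployment's context at once, it can pin one deployment's replica count to another's instead of letting each scale independently (the "Model"/"Preprocessor" names and 2:1 ratio are hypothetical):

```python
def coordinated_decisions(current: dict, model: str = "Model",
                          preprocessor: str = "Preprocessor",
                          ratio: int = 2, max_pre: int = 16) -> dict:
    """Sketch: keep the Preprocessor at a fixed ratio of Model replicas,
    capped at max_pre, while leaving other deployments untouched."""
    decisions = dict(current)
    decisions[preprocessor] = min(max_pre, ratio * current[model])
    return decisions
```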

Disaggregated Prefill-Decode Serving

Ray Serve provides native PD disaggregation via build_pd_openai_app, which manages separate prefill and decode vLLM instances with automatic KV cache transfer routing via NixlConnector.

Key concepts:

  • Prefill replicas process prompts and transfer KV cache (compute-bound, fewer replicas)
  • Decode replicas generate tokens using transferred KV cache (memory-bound, more replicas)
  • PDProxyServer orchestrates routing between prefill → decode automatically
  • Both configs must use the same model; kv_transfer_config is required in both

Quick start:

from ray.serve.llm import LLMConfig, build_pd_openai_app

app = build_pd_openai_app({
    "prefill_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={"tensor_parallel_size": 2,
                       "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"}},
    ),
    "decode_config": LLMConfig(
        model_loading_config={"model_id": "llama-8b", "model_source": "meta-llama/Llama-3.1-8B-Instruct"},
        accelerator_type="A100",
        engine_kwargs={"gpu_memory_utilization": 0.95,
                       "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"}},
    ),
})

For full PD configuration (YAML configs, LLMConfig field reference, engine_kwargs tuning, independent autoscaling, DP+PD, component API, and constraints), see references/disaggregated-pd.md.

See also assets/serve_pd_config.yaml for a complete deployment template.

Kubernetes Deployment (RayService)

The ServeConfigV2 is embedded in the RayService CRD:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-service
spec:
  serviceUnhealthySecondThreshold: 900    # time before marking service unhealthy
  deploymentUnhealthySecondThreshold: 300  # time before marking deployment unhealthy
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: serve_app:app
        runtime_env:
          pip: [vllm, transformers]
        deployments:
          - name: VLLMDeployment
            num_replicas: 2
            max_ongoing_requests: 24
            ray_actor_options:
              num_cpus: 4
              num_gpus: 1
            autoscaling_config:
              min_replicas: 1
              max_replicas: 4
              target_ongoing_requests: 8
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        minReplicas: 1
        maxReplicas: 4
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  limits:
                    cpu: "8"
                    memory: 32Gi
                    nvidia.com/gpu: "1"

Upgrade Behavior

Changes to serveConfigV2 apply in-place (hot update, no cluster rebuild). Changes to rayClusterConfig trigger a zero-downtime cluster upgrade. See references/upgrade-strategies.md for upgrade strategies, the serviceUnhealthySecondThreshold footgun with large model cold starts, and monitoring upgrade status.

RayService-Specific Settings

| Setting | Purpose | Default |
|---|---|---|
| serviceUnhealthySecondThreshold | Seconds before marking service unhealthy | 900 |
| deploymentUnhealthySecondThreshold | Seconds before marking deployment unhealthy | 300 |

High Availability

For HA, set max_replicas_per_node: 1 to spread replicas across nodes:

deployments:
  - name: MyDeployment
    num_replicas: 3
    max_replicas_per_node: 1

Priority of Settings

Config file > application code > defaults. ray_actor_options, user_config, and autoscaling_config are each replaced as whole dicts (not merged) when specified in the config file.
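The precedence rule can be expressed as a tiny resolver (a sketch of the documented ordering, not Serve internals):

```python
def resolve_setting(name: str, config_file: dict, code: dict, defaults: dict):
    """Config file wins over code, code wins over defaults. Dict-valued
    settings (ray_actor_options, user_config, autoscaling_config) are
    taken whole from the winning source, never merged across sources."""
    for source in (config_file, code, defaults):
        if name in source:
            return source[name]
    raise KeyError(name)
```

Note the whole-dict behavior: if the config file sets ray_actor_options at all, none of the keys from the code's ray_actor_options survive.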

References

Cross-References

  • nvidia-dynamo — Alternative distributed inference orchestration; comparison for LLM serving
  • vllm — Serve vLLM models with Ray Serve for autoscaling
  • sglang — Alternative model backend deployable behind Ray Serve
  • gateway-api-inference — K8s-native inference routing; alternative to Ray Serve's built-in routing
  • keda — Event-driven autoscaling; complements Ray Serve's built-in autoscaler for K8s
  • ray-core — Ray actors powering Serve deployments
  • kuberay — Deploy Serve on Kubernetes via RayService CRD
