LLM routing with Ray Serve

Make routing decisions explicit for online LLM applications.

LLM serving adds model size, accelerator placement, tokenizer behavior, and request routing to the usual service concerns. Ray Serve gives teams a place to express those concerns as deployments.

Routing responsibilities

A production LLM endpoint often needs more than one model replica. It may route by model name, tenant, adapter, request size, or latency target. The router below is a minimal example: it sends requests that set a "reasoning" flag to the large model and everything else to the small one.

    from ray import serve

    @serve.deployment
    class Router:
        def __init__(self, small_model, large_model):
            # Handles to the downstream model deployments.
            self.small_model = small_model
            self.large_model = large_model

        async def __call__(self, request):
            body = await request.json()
            # Route on an explicit request field rather than inferring intent.
            target = self.large_model if body.get("reasoning") else self.small_model
            return await target.remote(body)

Practical controls

- Enforce maximum input tokens.
- Emit queueing and generation latency separately.
- Track model, adapter, and prompt template versions.
- Keep fallback behavior explicit.
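Three of the controls above (the input-token limit, separate queueing and generation timings, and explicit fallback) can be sketched in a few lines. This is a framework-agnostic illustration and not Ray Serve API: `route_with_controls`, `RoutedResult`, `InputTooLong`, `count_tokens`, and `MAX_INPUT_TOKENS` are hypothetical names, the whitespace token count stands in for the model's real tokenizer, and `primary` and `fallback` are assumed to be async callables (in a Serve deployment they might wrap calls on deployment handles).

```python
import asyncio
import time
from dataclasses import dataclass

MAX_INPUT_TOKENS = 2048  # hypothetical limit; choose one per model


class InputTooLong(ValueError):
    """Raised when a prompt exceeds the configured token budget."""


def count_tokens(prompt: str) -> int:
    # Placeholder: whitespace splitting stands in for the real tokenizer.
    return len(prompt.split())


@dataclass
class RoutedResult:
    text: str
    served_by: str             # "primary" or "fallback", never implicit
    queue_seconds: float       # wait between enqueue and start of handling
    generation_seconds: float  # time in model calls, including a failed primary attempt


async def route_with_controls(body, primary, fallback, enqueued_at,
                              max_input_tokens=MAX_INPUT_TOKENS):
    # Control 1: reject oversized inputs before they reach a model.
    n_tokens = count_tokens(body["prompt"])
    if n_tokens > max_input_tokens:
        raise InputTooLong(f"{n_tokens} tokens exceeds limit {max_input_tokens}")

    # Control 2: record queueing time separately from generation time.
    started_at = time.monotonic()
    queue_seconds = started_at - enqueued_at

    # Control 3: fallback is an explicit, recorded decision.
    # A real service would likely catch a narrower set of exceptions.
    try:
        text = await primary(body)
        served_by = "primary"
    except Exception:
        text = await fallback(body)
        served_by = "fallback"

    generation_seconds = time.monotonic() - started_at
    return RoutedResult(text, served_by, queue_seconds, generation_seconds)
```

The caller passes `enqueued_at` as a `time.monotonic()` timestamp taken when the request was accepted, so both durations come from the same clock. Returning `served_by` alongside the text lets the two timings and the fallback rate be emitted as separate metrics rather than folded into one latency number.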
