Ray Serve for Online AI
LLM routing with Ray Serve
Make routing decisions explicit for online LLM applications.
LLM serving adds model size, accelerator placement, tokenizer behavior, and request routing to the usual service concerns. Ray Serve lets teams express those concerns as deployments that can be scaled and given resources independently.
Routing responsibilities
A production LLM endpoint often fronts more than one model deployment, each with its own replicas. A router in front of them may choose a target by model name, tenant, adapter, request size, or latency target.
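from ray import serve
from starlette.requests import Request


@serve.deployment
class Router:
    def __init__(self, small_model, large_model):
        # Handles to the two model deployments, passed in when the app is bound.
        self.small_model = small_model
        self.large_model = large_model

    async def __call__(self, request: Request):
        body = await request.json()
        # Requests flagged for reasoning go to the large model; the rest go to the small one.
        target = self.large_model if body.get("reasoning") else self.small_model
        return await target.remote(body)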
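The router above decides on a single flag. As a minimal sketch of how several of the criteria listed earlier could be combined, the function below keeps the decision in one pure function that can be unit-tested without a running cluster. The field names `model` and `prompt`, the deployment names, and the length threshold are assumptions made for illustration; only the `reasoning` flag comes from the router above.

```python
# Hypothetical values, chosen only for illustration.
LONG_PROMPT_CHARS = 8_000
KNOWN_MODELS = {"small", "large"}


def choose_target(body: dict) -> str:
    """Return the name of the model deployment that should handle this request."""
    # 1. An explicit, recognized model name wins.
    requested = body.get("model")
    if requested in KNOWN_MODELS:
        return requested
    # 2. Requests flagged for reasoning go to the large model.
    if body.get("reasoning"):
        return "large"
    # 3. Long prompts go to the large model (assumed here to handle them better).
    if len(body.get("prompt", "")) > LONG_PROMPT_CHARS:
        return "large"
    # 4. Everything else takes the cheaper path.
    return "small"
```

Inside `Router.__call__`, the returned name could index a dictionary of handles, for example `{"small": self.small_model, "large": self.large_model}`, so the rule order stays readable in one place rather than spread across the request handler.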
Practical controls
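- Enforce a maximum input token count and reject oversized requests before they reach a model.
- Emit queueing latency and generation latency as separate metrics, so a slow response can be attributed to waiting or to the model.
- Track model, adapter, and prompt template versions with each response.
- Keep fallback behavior explicit: define which model handles a request when the preferred one fails.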
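The controls above can be sketched in plain Python, independent of any serving framework. Everything named here is an assumption for illustration: the limit `MAX_INPUT_TOKENS`, the whitespace-based token count (a stand-in for the model's real tokenizer), the `versions` dictionary, and the shape of the returned record. `primary` and `fallback` are any async callables that take the request body.

```python
import asyncio
import time

MAX_INPUT_TOKENS = 4096  # hypothetical limit


class InputTooLong(ValueError):
    """Raised when a prompt exceeds the configured input token limit."""


def check_input_length(prompt: str, count_tokens=lambda s: len(s.split())) -> int:
    """Return the token count, or raise InputTooLong if it exceeds the limit.

    The default counter splits on whitespace; a real service would pass the
    tokenizer of the target model instead.
    """
    n = count_tokens(prompt)
    if n > MAX_INPUT_TOKENS:
        raise InputTooLong(f"{n} input tokens exceeds limit of {MAX_INPUT_TOKENS}")
    return n


async def generate_with_fallback(body, primary, fallback, enqueued_at, versions):
    """Call primary, fall back on failure, and report the two latencies separately.

    enqueued_at must come from time.monotonic() at the moment the request arrived.
    """
    started_at = time.monotonic()
    used_fallback = False
    try:
        output = await primary(body)
    except Exception:
        # Broad catch keeps the sketch short; a real service would narrow this.
        used_fallback = True
        output = await fallback(body)
    finished_at = time.monotonic()
    return {
        "output": output,
        "used_fallback": used_fallback,  # the fallback decision is visible to callers
        "queue_seconds": started_at - enqueued_at,
        "generation_seconds": finished_at - started_at,
        **versions,  # e.g. model, adapter, and prompt template versions
    }


if __name__ == "__main__":
    async def failing_primary(body):
        raise RuntimeError("primary unavailable")

    async def stub_fallback(body):
        return "fallback answer"

    record = asyncio.run(generate_with_fallback(
        {"prompt": "hello"}, failing_primary, stub_fallback,
        time.monotonic(), {"model_version": "v1"},
    ))
    print(record["used_fallback"], record["output"])
```

In a Ray Serve router, `primary` and `fallback` could plausibly be the `remote` methods of two deployment handles, since the response they return is awaitable. Returning `used_fallback` and the version fields in the record makes it straightforward to log them or attach them to metrics.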