Ray Serve for Online AI
Split routing, preprocessing, inference, and postprocessing cleanly.
Ray Serve deployment graph
Ray Serve models an online application as a set of deployments that can be composed into a graph. Because each deployment has its own replica count and resource settings, routing, preprocessing, model inference, and postprocessing can be split into separate pieces that scale independently.
Minimal deployment
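from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)
class SentimentModel:
    def __init__(self):
        # load_model() stands in for your own model-loading code.
        self.model = load_model()

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"label": self.model.predict(payload["text"])}


# bind() builds the application; serve.run() deploys it.
app = SentimentModel.bind()
serve.run(app)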
Scaling knobs
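- Replica count (`num_replicas`) sets how many copies of a deployment handle requests in parallel.
- Autoscaling adjusts the replica count as request load rises and falls.
- Resource annotations, such as a GPU requirement, place a model on nodes with matching hardware.
- Request batching groups concurrent requests into a single model call, which improves throughput for models that accept batched input.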
Review questions
Before production, ask whether the application has health checks, dependency pinning, request limits, rollback plans, and model version metadata.