Serve deployment graph

Split routing, preprocessing, inference, and postprocessing cleanly.

Ray Serve models online applications as deployments that can be composed into a graph. This makes it possible to split routing, preprocessing, model inference, and postprocessing into independently scalable pieces.

Minimal deployment

```python
from ray import serve


@serve.deployment(num_replicas=2)  # run two replicas of this deployment
class SentimentModel:
    def __init__(self):
        # load_model() is not defined in this snippet; supply your own loader.
        self.model = load_model()

    async def __call__(self, request):
        # Expects a JSON body with a "text" field.
        payload = await request.json()
        return {"label": self.model.predict(payload["text"])}


app = SentimentModel.bind()
serve.run(app)
```

Scaling knobs

- Replica count (`num_replicas`) sets how many copies of a deployment handle requests in parallel.
- Autoscaling adjusts the replica count as request load changes.
- Resource annotations place GPU models on GPU nodes.
- Request batching improves throughput for models that can process several inputs in one call.

Review questions

Before production, ask whether the application has health checks, dependency pinning, request limits, rollback plans, and model version metadata.
