Ray Serve for Online AI


Serve deployment graph

Split routing, preprocessing, inference, and postprocessing cleanly.

Ray Serve deployment graph

Ray Serve models online applications as deployments that can be composed into a graph. This makes it possible to split routing, preprocessing, model inference, and postprocessing into independently scalable pieces.

Minimal deployment

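from ray import serve
from starlette.requests import Request

# load_model() is not defined in this snippet; substitute your own model loader.

@serve.deployment(num_replicas=2)  # two replicas share incoming requests
class SentimentModel:
    def __init__(self):
        # Runs once per replica, so each replica holds its own copy of the model.
        self.model = load_model()

    async def __call__(self, request: Request):
        # HTTP requests arrive as Starlette Request objects.
        payload = await request.json()
        return {"label": self.model.predict(payload["text"])}

# bind() builds the application; serve.run() deploys it.
app = SentimentModel.bind()
serve.run(app)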
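
The idea of composing deployments into a graph can be illustrated with a small sketch. It assumes a recent Ray release in which bound deployments passed as constructor arguments arrive as handles, and in which awaiting handle.remote(...) yields the result. Preprocessor, KeywordModel, Pipeline, and build_app are hypothetical names invented for this illustration, and the keyword rule stands in for a real model.

```python
class Preprocessor:
    """Stage 1: normalize raw text before inference."""

    def __call__(self, text: str) -> str:
        return text.strip().lower()


class KeywordModel:
    """Stage 2: toy stand-in for a real model (hypothetical rule)."""

    def __call__(self, text: str) -> str:
        return "positive" if "good" in text else "negative"


class Pipeline:
    """Ingress: routes each request through the two stages via their handles."""

    def __init__(self, preprocessor, model):
        # At runtime Serve replaces the bound deployments with handles.
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request) -> dict:
        payload = await request.json()
        cleaned = await self.preprocessor.remote(payload["text"])
        label = await self.model.remote(cleaned)
        return {"label": label}


def build_app():
    # Imported here so the stage classes above stay testable without Ray.
    from ray import serve

    pre = serve.deployment(Preprocessor).bind()
    # Each stage scales on its own; here only the model gets two replicas.
    model = serve.deployment(KeywordModel).options(num_replicas=2).bind()
    return serve.deployment(Pipeline).bind(pre, model)
```

With Ray installed, serve.run(build_app()) would deploy all three stages. Because each stage is its own deployment, the replica count of the model can change without touching preprocessing.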

Scaling knobs

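  • Replica count (num_replicas) sets how many copies of a deployment handle requests in parallel.
  • Autoscaling (autoscaling_config) adjusts the replica count as request load rises and falls.
  • Resource annotations (ray_actor_options, for example {"num_gpus": 1}) place GPU models on GPU nodes.
  • Request batching (@serve.batch) improves throughput for models that can process several inputs in one call.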
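
The knobs above can be combined on a single deployment, as in the following sketch. The bounds, batch size, and timeout are illustrative values rather than recommendations, and classify_batch, BatchedSentiment, and build_scaled_app are hypothetical names. The toy classifier does not need a GPU; the num_gpus annotation is there only to show where the resource request goes, and it would keep replicas from starting on a cluster without GPUs.

```python
from typing import List

# Illustrative values (assumptions, not tuned settings).
AUTOSCALING = {"min_replicas": 1, "max_replicas": 4}
GPU_OPTIONS = {"num_gpus": 1}  # one GPU reserved per replica


def classify_batch(texts: List[str]) -> List[str]:
    """Toy batch model: returns one label per input, in input order."""
    return ["positive" if "good" in t.lower() else "negative" for t in texts]


def build_scaled_app():
    # Imported here so classify_batch stays testable without Ray.
    from ray import serve

    @serve.deployment(autoscaling_config=AUTOSCALING, ray_actor_options=GPU_OPTIONS)
    class BatchedSentiment:
        @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
        async def predict(self, texts: List[str]) -> List[str]:
            # Serve gathers concurrent single-item calls into this list.
            return classify_batch(texts)

        async def __call__(self, request) -> dict:
            payload = await request.json()
            # Callers pass one item; the decorator returns that item's result.
            return {"label": await self.predict(payload["text"])}

    return BatchedSentiment.bind()
```

A batched method must return exactly one result per input, in the same order, which is why classify_batch is written as an order-preserving list comprehension.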

Review questions

Before production, ask whether the application has health checks, dependency pinning, request limits, rollback plans, and model version metadata.
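
Two of these questions, health checks and model version metadata, map onto hooks that Serve looks for on a deployment class: a check_health method and a reconfigure method fed by user_config. The sketch below assumes those hooks; VersionedModel, build_checked_app, the version string, and the timing values are hypothetical. Dependency pinning, request limits, and rollback plans are not covered here.

```python
class VersionedModel:
    """Plain class; build_checked_app() wraps it as a deployment."""

    def __init__(self):
        self.version = "unknown"
        self.ready = True  # a real loader would set this after loading weights

    def reconfigure(self, config: dict) -> None:
        # Serve calls this with user_config at startup and on config updates.
        self.version = config.get("version", self.version)

    def check_health(self) -> None:
        # Serve calls this periodically; raising marks the replica unhealthy.
        if not self.ready:
            raise RuntimeError("model not loaded")

    async def __call__(self, request) -> dict:
        # Expose the version so responses can be traced to a model build.
        return {"model_version": self.version}


def build_checked_app():
    # Imported here so VersionedModel stays testable without Ray.
    from ray import serve

    deployment = serve.deployment(VersionedModel).options(
        health_check_period_s=10,   # illustrative interval
        health_check_timeout_s=30,  # illustrative timeout
        user_config={"version": "example-v1"},  # hypothetical version tag
    )
    return deployment.bind()
```

Returning the version with every response is one simple way to confirm which model build answered a request during a rollout or rollback.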
