Ray Data for Batch AI
Ray Data ingestion patterns
Turn shared storage into parallel dataset pipelines.
Ray Data is built for parallel ingest, preprocessing, and batch inference. In practice, you describe a pipeline as a chain of dataset operations and let Ray execute those operations in parallel across workers.
Read from shared storage
Start from files, tables, or object storage that workers can access directly.
import ray
ds = ray.data.read_parquet("s3://ml-platform/events/date=2026-06-18/")
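The date-partitioned path above can be built from a run date rather than hard-coded. The following is a minimal sketch, assuming the `date=YYYY-MM-DD` layout shown in the example holds for every partition. `partition_uri` and `read_partition` are hypothetical helper names, and the `columns` argument is passed through to `ray.data.read_parquet` so that only the needed columns are loaded.

```python
from datetime import date
from typing import List, Optional


def partition_uri(base_uri: str, day: date) -> str:
    """Build the URI of one date partition (hypothetical helper)."""
    # Assumes the date=YYYY-MM-DD directory layout from the example above.
    return f"{base_uri.rstrip('/')}/date={day.isoformat()}/"


def read_partition(base_uri: str, day: date, columns: Optional[List[str]] = None):
    """Read one partition as a Ray Dataset, optionally pruning columns."""
    import ray  # imported here so the path helper is usable without Ray

    return ray.data.read_parquet(partition_uri(base_uri, day), columns=columns)
```

For instance, `read_partition("s3://ml-platform/events", date(2026, 6, 18), columns=["amount"])` would target the same partition as the example while reading only the `amount` column.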
Transform in batches
Batch transforms spread per-call Python overhead across many rows and make it natural to use vectorized libraries such as pandas.
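def normalize(batch):
    # Mean and std are computed per batch here, not over the whole dataset.
    batch["amount_z"] = (batch["amount"] - batch["amount"].mean()) / batch["amount"].std()
    return batch

# batch_format="pandas" passes each batch to normalize as a pandas DataFrame.
features = ds.map_batches(normalize, batch_format="pandas")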
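Note that `normalize` computes the mean and standard deviation separately for each batch, so the resulting z-scores depend on where batch boundaries fall. If dataset-wide scores are wanted, one option is to compute the statistics once and pass them into the transform. The following is a minimal sketch, assuming the `amount` column from the example. `make_normalizer` and `normalize_dataset` are hypothetical names, and `ds` is any Ray Dataset.

```python
def make_normalizer(mean, std):
    """Return a batch transform that applies fixed, dataset-wide statistics."""
    # Fall back to a scale of 1.0 when std is missing, zero, or NaN
    # (for example, a constant column), to avoid dividing by zero.
    scale = std if std and std > 0 else 1.0

    def normalize(batch):
        # With batch_format="pandas", batch is a pandas DataFrame.
        batch = batch.copy()
        batch["amount_z"] = (batch["amount"] - mean) / scale
        return batch

    return normalize


def normalize_dataset(ds):
    """Add a dataset-wide z-score column to a Ray Dataset."""
    # Dataset.mean and Dataset.std each trigger execution of the dataset.
    mean = ds.mean("amount")
    std = ds.std("amount")
    return ds.map_batches(make_normalizer(mean, std), batch_format="pandas")
```

The trade-off is extra passes over the data to compute the statistics, in exchange for scores that do not change when the batch size does.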
Production shape
- Ingest data from worker-accessible paths.
- Normalize schemas before expensive model calls.
- Persist intermediate datasets when downstream teams need reproducibility.
- Keep dashboards focused on throughput, failed blocks, and memory pressure.
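The checklist above can be sketched as a small persistence step. This is an illustration under stated assumptions, not a reference implementation: `EXPECTED_COLUMNS`, `missing_columns`, and `persist_features` are hypothetical names, `features` is a Ray Dataset such as the one produced by `map_batches` earlier, and `output_uri` is any worker-accessible location. The schema check is a simple fail-fast guard, not a full schema normalization.

```python
# Hypothetical column contract for the feature dataset.
EXPECTED_COLUMNS = ("amount", "amount_z")


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in contract order."""
    present = set(actual)
    return [name for name in expected if name not in present]


def persist_features(features, output_uri):
    """Check the schema, write the dataset, and return Ray's stats summary."""
    # Fail fast on schema drift before any expensive downstream step runs.
    missing = missing_columns(features.schema().names)
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Persist the intermediate dataset so downstream runs can be reproduced.
    features.write_parquet(output_uri)

    # stats() returns a text summary of the executed operators, which can
    # be logged alongside throughput and memory dashboards.
    return features.stats()
```

Keeping the column check as a plain function means it can be unit-tested without starting a Ray cluster.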