Ray Data ingestion patterns

Turn shared storage into parallel dataset pipelines.

Ray Data is built for parallel ingest, preprocessing, and batch inference. The practical habit is to describe a pipeline as a set of dataset operations and let Ray execute it across workers.

Read from shared storage

Start from files, tables, or object storage that workers can access directly.

    import ray

    # Every worker must be able to reach this path directly.
    ds = ray.data.read_parquet("s3://ml-platform/events/date=2026-06-18/")

Transform in batches

Batch transforms keep Python overhead manageable and make it natural to use vectorized libraries.

    def normalize(batch):
        # batch is a pandas DataFrame; the mean and std here are computed
        # per batch, not over the whole dataset.
        batch["amount_z"] = (batch["amount"] - batch["amount"].mean()) / batch["amount"].std()
        return batch

    features = ds.map_batches(normalize, batch_format="pandas")

Production shape

- Ingest data from worker-accessible paths.
- Normalize schemas before expensive model calls.
- Persist intermediate datasets when downstream teams need reproducibility.
- Keep dashboards focused on throughput, failed blocks, and memory pressure.
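The normalize function in "Transform in batches" takes its mean and standard deviation from each batch, so the resulting z-scores depend on how the data happens to be split. If dataset-wide statistics are what you want, one possible approach is to compute them once and pass them into the batch function. The following is a minimal sketch under that assumption; make_normalizer and build_features are hypothetical names, not part of Ray, and the amount column is taken from the example above.

```python
import pandas as pd


def make_normalizer(mean: float, std: float):
    """Return a batch function that z-scores `amount` with fixed statistics."""
    # Fall back to a scale of 1.0 so a constant column does not divide by zero.
    scale = std if std and std > 0 else 1.0

    def normalize(batch: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the caller's DataFrame is left untouched.
        batch = batch.copy()
        batch["amount_z"] = (batch["amount"] - mean) / scale
        return batch

    return normalize


def build_features(path: str):
    """Hypothetical helper: read Parquet, then normalize with global statistics."""
    # Imported here so make_normalizer can be tested without Ray installed.
    import ray

    ds = ray.data.read_parquet(path)
    # Each aggregate runs over the dataset, so this costs extra passes
    # before the batch transform starts.
    mean = ds.mean("amount")
    std = ds.std("amount")
    return ds.map_batches(make_normalizer(mean, std), batch_format="pandas")
```

Calling build_features("s3://ml-platform/events/date=2026-06-18/") would return a dataset with the added amount_z column. The trade-off is extra work up front in exchange for values that do not change with batch boundaries.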
