Ray Data for Batch AI

Ray Data ingestion patterns

Turn shared storage into parallel dataset pipelines.

Ray Data is built for parallel ingest, preprocessing, and batch inference. In practice, you describe a pipeline as a chain of dataset operations and let Ray schedule and execute them across workers.

Read from shared storage

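Start from files, tables, or object storage that workers can access directly. Each worker reads its own share of the input, so the path has to resolve on every node, not only on the driver.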

import ray

ds = ray.data.read_parquet("s3://ml-platform/events/date=2026-06-18/")
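
As a minimal sketch of the read step, the helper below builds the date-partitioned path used above and reads only the columns a job needs. The names `partition_path` and `read_events` and the `user_id` column are hypothetical; `amount` is the column used later in this article. `ray` is imported inside the function so the path logic can be checked without a cluster.

```python
from datetime import date

# Bucket and prefix taken from the example above.
EVENTS_ROOT = "s3://ml-platform/events"


def partition_path(day: date, root: str = EVENTS_ROOT) -> str:
    """Return the Hive-style partition directory for one day of events."""
    return f"{root.rstrip('/')}/date={day.isoformat()}/"


def read_events(day: date, columns=("user_id", "amount")):
    """Read one day's partition, loading only the listed columns."""
    import ray  # imported here so partition_path stays testable without Ray

    # `columns` restricts the read to the named Parquet columns.
    return ray.data.read_parquet(partition_path(day), columns=list(columns))
```

On a cluster with access to the bucket, calling `read_events(date(2026, 6, 18)).schema()` is a quick way to confirm column names and types before adding transforms.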

Transform in batches

Batch transforms amortize Python call overhead over many rows at once and make it natural to use vectorized libraries such as pandas or NumPy.

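# Use dataset-wide statistics; per-batch mean/std would shift with batch boundaries.
amount_mean, amount_std = ds.mean("amount"), ds.std("amount")

def normalize(batch):
    # batch is a pandas DataFrame because of batch_format="pandas" below.
    batch["amount_z"] = (batch["amount"] - amount_mean) / amount_std
    return batch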

features = ds.map_batches(normalize, batch_format="pandas")
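
The same pattern works with other batch formats. The sketch below assumes `batch_format="numpy"`, where each batch arrives as a dict mapping column names to NumPy arrays. The `amount_log` feature and both function names are hypothetical, and the transform assumes `amount` is non-negative.

```python
import numpy as np


def add_log_amount(batch: dict) -> dict:
    """Add a log-scaled copy of `amount` using one vectorized NumPy call."""
    # With batch_format="numpy", each batch is a dict of column name -> ndarray.
    batch["amount_log"] = np.log1p(batch["amount"])
    return batch


def apply_log_amount(ds):
    """Apply the transform to a Ray Dataset that has an `amount` column."""
    return ds.map_batches(add_log_amount, batch_format="numpy")
```

Keeping the per-batch function free of Ray calls means it can be unit tested on a small in-memory dict before it runs on a cluster.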

Production shape

  • Ingest data from worker-accessible paths.
  • Normalize schemas before expensive model calls.
  • Persist intermediate datasets when downstream teams need reproducibility.
  • Keep dashboards focused on throughput, failed blocks, and memory pressure.
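
One way to sketch the schema and persistence points from the list above: coerce each batch to a fixed schema, then write the result to storage before any model stage runs. `TARGET_SCHEMA`, `normalize_schema`, `run`, and the `user_id` column are hypothetical names for illustration, and `out_path` stands for any worker-accessible location.

```python
import pandas as pd

# Hypothetical target schema: column name -> pandas dtype.
TARGET_SCHEMA = {"user_id": "int64", "amount": "float64"}


def normalize_schema(batch: pd.DataFrame) -> pd.DataFrame:
    """Return only the expected columns, in a fixed order, cast to fixed dtypes."""
    missing = [name for name in TARGET_SCHEMA if name not in batch.columns]
    if missing:
        # Fail early rather than passing a malformed batch to a costly stage.
        raise ValueError(f"missing columns: {missing}")
    return batch[list(TARGET_SCHEMA)].astype(TARGET_SCHEMA)


def run(ds, out_path: str) -> None:
    """Normalize a Ray Dataset, then persist it as Parquet at `out_path`."""
    clean = ds.map_batches(normalize_schema, batch_format="pandas")
    clean.write_parquet(out_path)
```

Writing the normalized dataset out gives downstream teams a stable artifact to read from, at the cost of extra storage and one more write.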
