Ray Train and Tune for ML Teams
Ray Train worker groups
Coordinate framework training code across workers.
Ray Train coordinates distributed training jobs while leaving framework code recognizable. The key idea is that every worker in the group runs the same training function, each with its own distributed context such as its rank, the world size, and its dataset shard.
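To make the "same function, different context" idea concrete, here is a minimal standard-library sketch. It is not how Ray Train is implemented: `WorkerContext`, `train_fn`, and `run_worker_group` are hypothetical names, and threads stand in for distributed workers. In Ray Train itself, a worker reads its rank and world size from `ray.train.get_context()`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerContext:
    """Hypothetical stand-in for the per-worker distributed context."""
    world_rank: int
    world_size: int


def train_fn(ctx: WorkerContext, samples: list) -> dict:
    # Every worker runs this same function; only the context differs.
    # Each worker takes a strided shard of the data based on its rank.
    shard = samples[ctx.world_rank::ctx.world_size]
    return {
        "rank": ctx.world_rank,
        "num_samples": len(shard),
        "shard_sum": sum(shard),
    }


def run_worker_group(num_workers: int, samples: list) -> list:
    # Launch the same function once per worker, each with its own context.
    contexts = [WorkerContext(rank, num_workers) for rank in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the order of the input contexts.
        return list(pool.map(lambda ctx: train_fn(ctx, samples), contexts))


results = run_worker_group(num_workers=4, samples=list(range(10)))
for r in results:
    print(r)
```

Across the four simulated workers, every sample is processed exactly once, which is the property a real worker group relies on when it shards a dataset.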
Training loop shape
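    import ray.train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop(config):
        # build_model and train_step are user-defined helpers (not shown here).
        model = build_model(config)
        # Each worker receives only its own shard of the "train" dataset.
        train_dataset = ray.train.get_dataset_shard("train")
        for epoch in range(config["epochs"]):
            for batch in train_dataset.iter_torch_batches(batch_size=64):
                loss = train_step(model, batch)
            # Report once per epoch, using the loss of the last batch.
            ray.train.report({"loss": float(loss)})

    trainer = TorchTrainer(
        train_loop_per_worker=train_loop,
        # Example value (assumed); supplies config["epochs"] to train_loop.
        train_loop_config={"epochs": 2},
        # train_ds is assumed to be a Ray Dataset created elsewhere.
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    result = trainer.fit()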
What the platform team owns
- Worker count and GPU selection.
- Runtime environment and dependencies.
- Checkpoint storage.
- Metrics export and failure policies.
What the model team owns
- Model code.
- Dataset and feature contracts.
- Training metrics.
- Checkpoint validation.
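One way to keep the two ownership lists above from blurring is to hold each team's settings in a separate object and merge them in one place. The sketch below uses only the standard library. Every name in it (`PlatformSettings`, `ModelSettings`, `build_job_spec`, `missing_metrics`) and every default value, including the storage path, is a hypothetical placeholder rather than part of Ray's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformSettings:
    """Hypothetical: values the platform team controls."""
    num_workers: int = 4
    use_gpu: bool = True
    storage_path: str = "/mnt/shared/checkpoints"  # placeholder path
    max_failures: int = 2  # placeholder failure policy


@dataclass(frozen=True)
class ModelSettings:
    """Hypothetical: values the model team controls."""
    epochs: int = 3
    batch_size: int = 64
    required_metrics: tuple = ("loss",)


def build_job_spec(platform: PlatformSettings, model: ModelSettings) -> dict:
    # Merge both sides into one job description; neither side edits the other's keys.
    return {
        "scaling": {"num_workers": platform.num_workers, "use_gpu": platform.use_gpu},
        "run": {
            "storage_path": platform.storage_path,
            "max_failures": platform.max_failures,
        },
        "train_loop_config": {"epochs": model.epochs, "batch_size": model.batch_size},
    }


def missing_metrics(reported: dict, model: ModelSettings) -> list:
    # Return the names of required training metrics absent from a report.
    return [name for name in model.required_metrics if name not in reported]


spec = build_job_spec(PlatformSettings(), ModelSettings())
print(spec["scaling"])
print(missing_metrics({"accuracy": 0.9}, ModelSettings()))
```

Freezing both dataclasses means a change to worker count or to the metric contract has to go through the owning team's object rather than being patched inside the training loop.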