Mount storage in training jobs

Training jobs can mount PersistentVolumeClaims (PVCs) for input datasets, model checkpoints, and output artifacts. This guide covers attaching storage when you submit a training job.

Attach PVCs to Workbench training jobs for input data and checkpoint output.

Time: About 5 minutes

You will need: A Kubernetes cluster, a PVC, and a configured training runtime

Outcome: Storage mounted in a training job

Mount a PVC as input data

Navigate to Training Jobs

Navigate to Workbench > Training Jobs and click Submit Training Job.

Configure the job

Configure the runtime, sizing, and script as usual.

Add a volume

In the Storage section, click Add Volume.

Select the PVC

Select the PVC containing your training data.

Set the mount path

Set the Mount path (for example, /data/input).

Set access mode

Optionally mark the volume as read-only if the job should not modify the source data.

Submit the job

Complete the rest of the form and click Submit.

Success looks like this: the training job is queued, the storage is mounted at the specified path, and the job can read input data during execution.

Your training script can then read data from the mount path:

data = load_dataset("/data/input/train.csv")
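If the job fails at this step, the usual cause is a volume mounted at a different path than the script expects. The sketch below uses only the Python standard library to check for the file before reading it, and to report whether the mount is writable. The function names, the default path `/data/input`, and the file name `train.csv` are illustrative assumptions taken from the example above, not part of any Workbench API.

```python
import csv
import os
from pathlib import Path


def read_training_rows(mount_path="/data/input", filename="train.csv"):
    """Read a CSV file from a mounted volume and return its rows as dicts."""
    data_file = Path(mount_path) / filename
    if not data_file.is_file():
        # A missing file usually means the volume is not mounted at this path
        # or the file name differs from what the script expects.
        raise FileNotFoundError(f"Expected training data at {data_file}")
    with data_file.open(newline="") as handle:
        return list(csv.DictReader(handle))


def is_writable(mount_path):
    """Report whether the current process can write to the mount path."""
    return os.access(mount_path, os.W_OK)
```

Calling `is_writable("/data/input")` at the start of a job is a quick way to confirm that a volume you marked as read-only really is mounted that way.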

Mount a PVC for checkpoint output

Follow the same steps, but mount a writable PVC at a checkpoint path:

Add a checkpoint volume

Add a second volume mount with the path /checkpoints.

Save checkpoints in your training script

In your training script, save checkpoints to that path:

torch.save(model.state_dict(), "/checkpoints/epoch_10.pt")

Using PVCs for checkpointing lets you resume training from the last checkpoint if a job fails or is pre-empted. Mount the same PVC on the resumed job and load the latest checkpoint file.
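A resumed job needs a way to pick the latest checkpoint. The following is a minimal sketch, assuming checkpoints follow the `epoch_<N>.pt` naming used in the example above; the helper name `find_latest_checkpoint` is hypothetical. It compares epoch numbers numerically, so `epoch_10.pt` ranks above `epoch_2.pt`.

```python
import re
from pathlib import Path

# Assumed naming scheme: epoch_<number>.pt, as in the example above.
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def find_latest_checkpoint(checkpoint_dir="/checkpoints"):
    """Return the checkpoint path with the highest epoch number, or None."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_epoch, best_path = -1, None
    for entry in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(entry.name)
        # Compare epochs as integers; string order would rank 2 above 10.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), entry
    return best_path
```

If the function returns a path, a PyTorch script can pass it to `torch.load` and then to `model.load_state_dict`. If it returns `None`, the job starts from scratch.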

Difference from Slurm job mounting

Slurm jobs reference storage via file paths on the cluster's shared file system (typically NFS at /nfs). Training jobs use Kubernetes PVC mounts instead, which are namespace-scoped and attached at the pod level.
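To make the pod-level attachment concrete, the sketch below builds the two pod spec entries that a PVC mount typically corresponds to in Kubernetes: one item under `volumes` and one under a container's `volumeMounts`. The field names (`persistentVolumeClaim`, `claimName`, `mountPath`, `readOnly`) are standard Kubernetes fields. The helper name, the volume names, and the claim names are hypothetical, and the manifest Workbench actually generates may differ.

```python
def pvc_volume_fragment(volume_name, claim_name, mount_path, read_only=False):
    """Build the `volumes` and `volumeMounts` entries for one PVC."""
    volume = {
        "name": volume_name,
        "persistentVolumeClaim": {"claimName": claim_name, "readOnly": read_only},
    }
    mount = {"name": volume_name, "mountPath": mount_path, "readOnly": read_only}
    return volume, mount


# Hypothetical claim names; substitute the PVCs in your own namespace.
input_volume, input_mount = pvc_volume_fragment(
    "input-data", "training-data", "/data/input", read_only=True
)
ckpt_volume, ckpt_mount = pvc_volume_fragment(
    "checkpoints", "training-checkpoints", "/checkpoints"
)

pod_spec_fragment = {
    "volumes": [input_volume, ckpt_volume],
    "containers": [{"name": "trainer", "volumeMounts": [input_mount, ckpt_mount]}],
}
```

Because a PVC is namespace-scoped, the claim named in `claimName` must exist in the same namespace as the pod that mounts it.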

For mounting storage in Slurm jobs, see Mount storage in jobs.
