
Mount storage in training jobs

Attach PVCs to Workbench training jobs for input data and checkpoint output.

Training jobs can mount PersistentVolumeClaims for input datasets, model checkpoints, and output artifacts. This guide covers attaching storage during training job submission.

Prerequisites

Mount a PVC as input data

  1. Navigate to Workbench > Training Jobs and click Submit Training Job.
  2. Configure the runtime, sizing, and script as usual.
  3. In the Storage section, click Add Volume.
  4. Select the PVC containing your training data.
  5. Set the Mount path (for example, /data/input).
  6. Optionally mark it as read-only if the job should not modify the source data.
  7. Complete the rest of the form and click Submit.

Your training script can then read data from the mount path:

data = load_dataset("/data/input/train.csv")
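If the PVC is not attached or is mounted at a different path, the read fails at runtime. The sketch below is one way a script might check the mount before loading, assuming the input PVC is mounted at /data/input as in the example above. The function name `read_training_rows` and the `DATA_DIR` environment variable are hypothetical and not part of Workbench.

```python
import csv
import os
from pathlib import Path

# Assumption: the input PVC is mounted at /data/input, as in the example above.
DEFAULT_INPUT_DIR = "/data/input"


def read_training_rows(input_dir=None, filename="train.csv"):
    """Return the rows of a CSV file under the mounted input directory.

    DATA_DIR is a hypothetical override so the script can also run
    outside a training job, for example on a laptop.
    """
    root = Path(input_dir or os.environ.get("DATA_DIR", DEFAULT_INPUT_DIR))
    if not root.is_dir():
        # Fail early with a clear message if the volume was not attached.
        raise FileNotFoundError(f"Input mount not found: {root}")
    with (root / filename).open(newline="") as handle:
        return list(csv.DictReader(handle))
```

Each row is returned as a dictionary keyed by the CSV header, so this only reads the data and works the same way on a read-only mount.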

Mount a PVC for checkpoint output

Follow the same steps, but mount a writable PVC at a checkpoint path:

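  1. In the Storage section, add a second volume, select the PVC that should hold checkpoints, and set its Mount path to /checkpoints. Do not mark it as read-only, because the job must write to it.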
  2. In your training script, save checkpoints to that path:
torch.save(model.state_dict(), "/checkpoints/epoch_10.pt")
Tip

Using PVCs for checkpointing lets you resume training from the last checkpoint if a job fails or is pre-empted. Mount the same PVC on the resumed job and load the latest checkpoint file.
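A resumed job needs a way to pick the latest checkpoint file. The following is a minimal sketch, assuming checkpoints follow the `epoch_<N>.pt` naming used in the example above and that the PVC is mounted at /checkpoints. The helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: checkpoints are named epoch_<N>.pt, matching the example filename.
CHECKPOINT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir="/checkpoints"):
    """Return (path, epoch) for the highest-numbered checkpoint, or None."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    best = None
    for entry in root.iterdir():
        match = CHECKPOINT_PATTERN.match(entry.name)
        if match and entry.is_file():
            # Compare epochs numerically so epoch_10 sorts after epoch_2.
            epoch = int(match.group(1))
            if best is None or epoch > best[1]:
                best = (entry, epoch)
    return best
```

In a resumed job, the script could call `latest_checkpoint()` and, if it returns a result, pass the path to `torch.load` and the loaded state to `model.load_state_dict` before continuing from the next epoch. If it returns `None`, the script starts training from scratch.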

Difference from Slurm job mounting

Slurm jobs reference storage via file paths on the cluster's shared file system (typically NFS at /nfs). Training jobs use Kubernetes PVC mounts instead, which are namespace-scoped and attached at the pod level.
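To illustrate what "attached at the pod level" means, the sketch below builds the two entries that a Kubernetes pod spec uses to mount a PVC: a `volumes` entry that references the claim by name, and a `volumeMounts` entry on the container that sets the path. The field names are standard Kubernetes ones. The helper `pvc_mount_fragment` and the claim name `training-data` are hypothetical, and the pod spec that Workbench actually generates is not shown in this guide and may differ.

```python
def pvc_mount_fragment(claim_name, mount_path, volume_name, read_only=False):
    """Build the volume and volumeMount entries a pod needs to mount one PVC."""
    # Pod-level entry: references a PVC in the pod's own namespace by name.
    volume = {
        "name": volume_name,
        "persistentVolumeClaim": {"claimName": claim_name, "readOnly": read_only},
    }
    # Container-level entry: where the volume appears in the filesystem.
    volume_mount = {
        "name": volume_name,
        "mountPath": mount_path,
        "readOnly": read_only,
    }
    return volume, volume_mount


# Hypothetical claim name; the mount path matches the input example above.
volume, volume_mount = pvc_mount_fragment(
    "training-data", "/data/input", volume_name="input-data", read_only=True
)
```

Because the claim is looked up by name within the pod's namespace, a PVC in a different namespace cannot be referenced this way, unlike a path on a shared file system such as /nfs.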

For mounting storage in Slurm jobs, see Mount storage in jobs.
