Configure a training runtime
A training runtime is a pre-built environment that combines a deep learning framework, a container image, and a parallelism strategy. Runtimes are prerequisites for submitting training jobs.
Prerequisites
- A Kubernetes cluster in ready status
- Admin permissions on the workspace
Available runtimes
Vantage ships with several built-in runtimes:
| Runtime | Framework | Use case |
|---|---|---|
| torch-distributed | PyTorch + torchrun | General multi-GPU training |
| deepspeed-zero3 | DeepSpeed | Memory-bound LLMs with ZeRO-3 sharding |
| mlx-distributed | MLX | Apple-silicon training |
| huggingface-trainer | Transformers | Standard HF Trainer workflows |
View runtimes
- Click Workbench in the left sidebar.
- Navigate to Training Jobs > Runtimes (or find runtimes in the Training section).
- The list shows each runtime with its framework, image, and status.
Create a custom runtime
- On the Runtimes page, click Create Runtime.
- Configure the runtime:
  - Name -- a short identifier (for example, jax-tpu or custom-pytorch)
  - Framework -- the ML framework (PyTorch, DeepSpeed, JAX, or custom)
  - Container image -- the Docker image URI (for example, pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime)
  - Default entrypoint -- the command or script that runs when a job starts (for example, torchrun)
- Click Create.
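The four form fields above can be thought of as a small configuration record. The following is a minimal sketch, not a Vantage API: `RuntimeSpec`, `validate`, and the naming rule in `NAME_PATTERN` are hypothetical and exist only to illustrate what a complete runtime definition contains. The framework choices and example values are taken from the form description.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Framework choices listed in the Create Runtime form.
FRAMEWORKS = {"PyTorch", "DeepSpeed", "JAX", "custom"}

# Assumed naming rule (not confirmed by Vantage): lowercase letters,
# digits, and single hyphens between segments.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


@dataclass(frozen=True)
class RuntimeSpec:
    """Hypothetical record mirroring the four Create Runtime form fields."""

    name: str
    framework: str
    image: str
    entrypoint: str


def validate(spec: RuntimeSpec) -> list[str]:
    """Return a list of problems; an empty list means the spec looks complete."""
    problems = []
    if not NAME_PATTERN.match(spec.name):
        problems.append("name must be a short lowercase identifier such as 'jax-tpu'")
    if spec.framework not in FRAMEWORKS:
        problems.append(f"framework must be one of {sorted(FRAMEWORKS)}")
    if not spec.image.strip() or any(ch.isspace() for ch in spec.image):
        problems.append("image must be a non-empty URI without whitespace")
    if not spec.entrypoint.strip():
        problems.append("entrypoint must not be empty")
    return problems


# Example values taken from the form description above.
example = RuntimeSpec(
    name="custom-pytorch",
    framework="PyTorch",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    entrypoint="torchrun",
)
```

Calling `validate(example)` returns an empty list, which in this sketch means every field is filled in with a plausible value. Checking the values locally before opening the form can help catch a mistyped image URI early.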
Edit a runtime
- Click the runtime name to open the detail page.
- Click Edit and modify the image, entrypoint, or other settings.
- Click Save.
Set a default runtime
Click the Set as default action on a runtime row. The default runtime is pre-selected whenever you create a new training job.
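The pre-selection behavior can be pictured as a simple fallback rule: an explicitly chosen runtime wins, and the default is used otherwise. The sketch below is illustrative only. `WorkspaceRuntimes`, `set_default`, and `resolve` are hypothetical names, and the rule that a workspace holds at most one default at a time is an assumption rather than documented Vantage behavior.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class WorkspaceRuntimes:
    """Hypothetical in-memory model of the runtimes in one workspace."""

    names: Set[str] = field(default_factory=set)
    default: Optional[str] = None

    def set_default(self, name: str) -> None:
        # Assumption: a workspace has at most one default, so a new
        # choice replaces the previous one.
        if name not in self.names:
            raise KeyError(f"unknown runtime: {name}")
        self.default = name

    def resolve(self, requested: Optional[str] = None) -> str:
        """Pick the runtime for a new job: explicit choice first, else the default."""
        chosen = requested if requested is not None else self.default
        if chosen is None:
            raise ValueError("no runtime requested and no default is set")
        if chosen not in self.names:
            raise KeyError(f"unknown runtime: {chosen}")
        return chosen


# Two of the built-in runtime names from the table of available runtimes.
workspace = WorkspaceRuntimes(names={"torch-distributed", "deepspeed-zero3"})
workspace.set_default("torch-distributed")
```

In this model, `workspace.resolve()` falls back to the default, while `workspace.resolve("deepspeed-zero3")` honors the explicit choice.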
Note: Runtimes are workspace-scoped. Admins can publish runtimes that are available to all users in the workspace.