Configure a training runtime

Create and manage training runtimes that define the framework, image, and parallelism strategy for training jobs.

A training runtime is a pre-built environment that combines a deep learning framework, a container image, and a parallelism strategy. A runtime must exist before you can submit a training job.

Prerequisites

  • A Kubernetes cluster in ready status
  • Admin permissions on the workspace

Available runtimes

Vantage ships with several built-in runtimes:

Runtime              Framework           Use case
torch-distributed    PyTorch + torchrun  General multi-GPU training
deepspeed-zero3      DeepSpeed           Memory-bound LLMs with ZeRO-3 sharding
mlx-distributed      MLX                 Apple-silicon training
huggingface-trainer  Transformers        Standard HF Trainer workflows

View runtimes

  1. Click Workbench in the left sidebar.
  2. Navigate to Training Jobs > Runtimes (or find runtimes in the Training section).
  3. The list shows each runtime with its framework, image, and status.

Create a custom runtime

  1. On the Runtimes page, click Create Runtime.
  2. Configure the runtime:
    • Name -- a short identifier (for example, jax-tpu or custom-pytorch)
    • Framework -- the ML framework (PyTorch, DeepSpeed, JAX, or custom)
    • Container image -- the Docker image URI (for example, pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime)
    • Default entrypoint -- the command or script that runs when a job starts (for example, torchrun)
  3. Click Create.
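
Before filling in the form, it can help to check the four fields offline. The following is a minimal sketch, not a Vantage API: the RuntimeSpec class, the FRAMEWORKS set, and every validation rule (lowercase hyphenated names, a pinned image tag or digest) are assumptions made for illustration, and the product may accept or reject different values.

```python
import re
from dataclasses import dataclass

# Hypothetical mirror of the framework choices in the Create Runtime form.
FRAMEWORKS = {"PyTorch", "DeepSpeed", "JAX", "custom"}


@dataclass(frozen=True)
class RuntimeSpec:
    """Hypothetical model of the form fields; not part of any Vantage SDK."""
    name: str
    framework: str
    image: str
    entrypoint: str

    def validate(self):
        """Return a list of problems; an empty list means no issue was found."""
        problems = []
        # Assumed naming rule: lowercase letters, digits, and single hyphens.
        if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", self.name):
            problems.append("name should be a short lowercase identifier")
        if self.framework not in FRAMEWORKS:
            problems.append(f"unknown framework: {self.framework}")
        # An image without a tag or digest resolves to 'latest',
        # which makes training runs harder to reproduce.
        last_segment = self.image.rsplit("/", 1)[-1]
        if ":" not in last_segment and "@" not in last_segment:
            problems.append("image should pin a tag or digest")
        if not self.entrypoint.strip():
            problems.append("entrypoint must not be empty")
        return problems


spec = RuntimeSpec(
    name="custom-pytorch",
    framework="PyTorch",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    entrypoint="torchrun",
)
print(spec.validate())
```

An empty list means the values passed these assumed checks. It does not guarantee that the image can be pulled or that the entrypoint exists inside it.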

Edit a runtime

  1. Click the runtime name to open the detail page.
  2. Click Edit and modify the image, entrypoint, or other settings.
  3. Click Save.

Set a default runtime

Click the Set as default action on a runtime row. The default runtime is pre-selected when creating new training jobs.
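
The default behaves like a single flag that moves between runtimes. The sketch below models that with an in-memory dictionary. The runtimes data, set_default, and preselected are hypothetical names, and the rule that exactly one runtime holds the flag at a time is an assumption drawn from the wording "the default runtime".

```python
# Hypothetical in-memory stand-in for the Runtimes list; not a Vantage API.
runtimes = {
    "torch-distributed": {"framework": "PyTorch + torchrun", "default": True},
    "deepspeed-zero3": {"framework": "DeepSpeed", "default": False},
}


def set_default(runtimes, name):
    """Mark one runtime as default and clear the flag on all others."""
    if name not in runtimes:
        raise KeyError(f"no such runtime: {name}")
    for key, entry in runtimes.items():
        entry["default"] = (key == name)


def preselected(runtimes):
    """Return the name a new job form would start with, or None."""
    return next((k for k, v in runtimes.items() if v["default"]), None)


set_default(runtimes, "deepspeed-zero3")
print(preselected(runtimes))
```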

Note

Runtimes are workspace-scoped. Admins can publish runtimes that are available to all users in the workspace.
