Reference

Training job fields, sub-job counters, and observability hooks.

Wizard inputs

| Field | Description |
| --- | --- |
| runtime | The pre-built training environment (framework + image + parallelism strategy). |
| sizing | Number of nodes, CPU / memory / GPU per node, and processes-per-node. |
| initializers | Optional pre-training fetch of a dataset and base model (S3, HuggingFace, PVC, or model registry). |
| overrides | Optional custom command, args, and env to override the runtime's default entrypoint. |
| output_destination | S3 bucket+prefix or PVC where final checkpoints are written. |
| ttl_success | How long completed-success pods are retained for log retrieval (default 1d). |
| ttl_failure | How long failed pods are retained for log retrieval (default 7d). |
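
As a rough illustration of how the wizard fields fit together, the sketch below models them as a Python dataclass. Only the field names and the two TTL defaults come from the table above. The class name `TrainingJobInputs`, the types, the keys inside `sizing`, and the example values are assumptions made for illustration; this is not the product's API or its on-disk format.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class TrainingJobInputs:
    """Hypothetical container mirroring the wizard fields (not a real API)."""

    runtime: str                         # name of a pre-built training environment
    sizing: dict                         # nodes, per-node resources, processes-per-node
    output_destination: str              # S3 bucket+prefix or PVC for final checkpoints
    initializers: Optional[dict] = None  # optional dataset / base model fetch
    overrides: Optional[dict] = None     # optional command, args, env
    ttl_success: timedelta = timedelta(days=1)  # documented default: 1d
    ttl_failure: timedelta = timedelta(days=7)  # documented default: 7d


# Example values are placeholders, not real runtimes or buckets.
job = TrainingJobInputs(
    runtime="example-runtime",
    sizing={"nodes": 2, "gpus_per_node": 4, "processes_per_node": 4},
    output_destination="s3://example-bucket/checkpoints/",
)
```

Leaving `initializers` and `overrides` unset matches their optional status in the table; the job then relies on the runtime's default entrypoint.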

Sub-job rollout

Distributed jobs run as a fleet made up of a master and workers. The detail page shows per-sub-job counters (active, ready, succeeded, failed, suspended) so you can spot a stuck rank quickly.
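
One way to read these counters programmatically is sketched below. The five counter names come from the text. The `SubJobCounters` class, the `needs_attention` heuristic, and the sample values are assumptions: the sketch treats a sub-job as suspicious when it has any failed pods or has more active pods than ready ones. A pod that is still starting up will also be active but not yet ready, so a flag is only meaningful if it persists.

```python
from dataclasses import dataclass


@dataclass
class SubJobCounters:
    """Hypothetical snapshot of one sub-job's counters from the detail page."""

    name: str
    active: int = 0
    ready: int = 0
    succeeded: int = 0
    failed: int = 0
    suspended: int = 0


def needs_attention(c: SubJobCounters) -> bool:
    """Flag a sub-job that has failures or active pods that are not ready."""
    return c.failed > 0 or c.active > c.ready


# Sample values for illustration only.
fleet = [
    SubJobCounters("master", active=1, ready=1),
    SubJobCounters("workers", active=3, ready=2),
]

flagged = [c.name for c in fleet if needs_attention(c)]
print(flagged)  # ['workers']
```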

Logs and observability

The Logs tab streams from the trainer master pod by default; the dropdown lets you switch container (init, dataset-init, model-init, trainer) or rank. The Observability link drops you into a Grafana dashboard pre-filtered to this job's pods.
