Submitting a job

The wizard has six steps: runtime, sizing, initializers, trainer overrides, output destination, and TTL.

Pick a runtime. The wizard filters compute profiles to those compatible with the runtime's framework.
Set sizing. Number of nodes; CPU, memory, and GPU per node; and processes per node (usually auto, which means one process per GPU).
Initializers (optional). Tell Workbench to fetch a dataset and a base model before training starts, from S3, Hugging Face, a PVC, or your model registry. Prefetching cuts startup time and avoids cold pulls inside the training loop.
Trainer overrides (optional). Custom command, args, env. Useful when the runtime's default entrypoint isn't quite right.
Output destination. Where final checkpoints go: an S3 bucket and prefix, or a PVC. Defaults to s3://{workspace}/trainjobs/{name}/.
TTL. How long completed pods are kept for log retrieval. Defaults: 1 day on success, 7 days on failure.
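
The defaults described in the steps above can be sketched in a few lines of Python. This is only an illustration of how the documented defaults resolve, not Workbench's actual API: the class names (Sizing, TrainJobSpec), field names, and the example workspace, job, and runtime values are all hypothetical, and the fallback of one process on a node with no GPUs is an assumption.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional


@dataclass
class Sizing:
    """Hypothetical model of the sizing step."""
    nodes: int = 1
    gpus_per_node: int = 0
    procs_per_node: Optional[int] = None  # None stands for "auto"

    def resolved_procs_per_node(self) -> int:
        # An explicit value wins over auto.
        if self.procs_per_node is not None:
            return self.procs_per_node
        # Auto is documented as one process per GPU. Falling back to a
        # single process on CPU-only nodes is an assumption of this sketch.
        return max(self.gpus_per_node, 1)


@dataclass
class TrainJobSpec:
    """Hypothetical model of a submitted job; not the real Workbench schema."""
    workspace: str
    name: str
    runtime: str
    sizing: Sizing = field(default_factory=Sizing)
    output: Optional[str] = None  # None stands for "use the default"
    ttl_success: timedelta = timedelta(days=1)  # documented default
    ttl_failure: timedelta = timedelta(days=7)  # documented default

    def output_destination(self) -> str:
        # Documented default: s3://{workspace}/trainjobs/{name}/
        if self.output is not None:
            return self.output
        return f"s3://{self.workspace}/trainjobs/{self.name}/"

    def ttl(self, succeeded: bool) -> timedelta:
        # Failed jobs keep their pods longer so logs can be inspected.
        return self.ttl_success if succeeded else self.ttl_failure


# Example values below are made up for illustration.
job = TrainJobSpec(
    workspace="team-a",
    name="bert-ft",
    runtime="pytorch",
    sizing=Sizing(nodes=2, gpus_per_node=4),
)
print(job.output_destination())              # s3://team-a/trainjobs/bert-ft/
print(job.sizing.resolved_procs_per_node())  # 4
print(job.ttl(succeeded=False).days)         # 7
```

Leaving output unset and procs_per_node as None mirrors accepting the wizard's defaults; setting either field corresponds to overriding that step.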