Lifecycle

Pending → Running → Succeeded/Failed and the actions at each phase.

| Phase | Meaning |
| --- | --- |
| Pending | Initializers running, pods scheduling. |
| Running | Trainer pods are up and at least one is making progress. |
| Suspended | You hit Suspend. Pods removed, state preserved. |
| Succeeded | All trainer ranks exited 0. Output destination has your checkpoints. |
| Failed | One or more ranks errored. Logs tab + Activity page have the why. |

Lifecycle actions: Suspend, Resume, Retry (clones the job and re-runs from scratch), Delete.
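
As a rough illustration, the phases and actions above can be modelled as a small lookup table. This is only a sketch: the phase and action names come from this page, but the `Phase` enum, the `ALLOWED_ACTIONS` mapping, and `can_apply` are hypothetical, and which action is available in which phase is an assumption, since the page does not spell it out.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    SUSPENDED = "Suspended"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Phases in which the job has finished and will not progress on its own.
TERMINAL = {Phase.SUCCEEDED, Phase.FAILED}

# ASSUMPTION: the phase-to-action mapping below is an illustrative guess,
# not something stated by the documentation.
ALLOWED_ACTIONS = {
    Phase.PENDING: {"Suspend", "Delete"},
    Phase.RUNNING: {"Suspend", "Delete"},
    Phase.SUSPENDED: {"Resume", "Delete"},
    Phase.SUCCEEDED: {"Retry", "Delete"},
    Phase.FAILED: {"Retry", "Delete"},
}


def can_apply(phase: Phase, action: str) -> bool:
    """Return True if the (assumed) mapping allows `action` in `phase`."""
    return action in ALLOWED_ACTIONS[phase]
```

For example, `can_apply(Phase.SUSPENDED, "Resume")` is true under this mapping, while `can_apply(Phase.SUCCEEDED, "Suspend")` is not.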

Warning

If a job shows active=N but ready<N for more than about 5 minutes, at least one worker pod isn't passing its readiness probe. Check that pod's logs through the Logs tab's container dropdown.
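
The check described in the warning could be automated along these lines. This is a minimal sketch, assuming you already collect `(timestamp, active, ready)` samples from somewhere. The function name `readiness_stalled` and the sample format are hypothetical, and the 300-second threshold is just the "about 5 minutes" from the warning.

```python
from typing import Iterable, Optional, Tuple

STALL_SECONDS = 5 * 60  # the "about 5 minutes" threshold from the warning


def readiness_stalled(
    samples: Iterable[Tuple[float, int, int]],
    threshold: float = STALL_SECONDS,
) -> bool:
    """Detect a readiness stall.

    `samples` holds (timestamp_seconds, active, ready) tuples in
    chronological order. Returns True if ready < active held continuously
    for longer than `threshold` seconds.
    """
    stall_start: Optional[float] = None
    for ts, active, ready in samples:
        if ready < active:
            if stall_start is None:
                stall_start = ts  # first sample of the current stall
            if ts - stall_start > threshold:
                return True
        else:
            stall_start = None  # all active pods are ready; reset the timer
    return False
```

A stall that clears before the threshold resets the timer, so a brief dip in `ready` during startup does not trigger the check.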
