Partitions

Slurm job queues — configure, manage, and autoscale.

Partitions are job queues on a Slurm cluster. Each partition targets a pool of nodes with specific characteristics — instance type, GPU count, max run time — and applies rules about who can submit jobs and at what priority.

When you submit a job through Vantage, you name a partition, and Slurm places the job on a qualifying node within that partition.
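
As a rough illustration of what naming a partition means at the Slurm level, the sketch below builds an `sbatch` command line using Slurm's standard `--partition` option. How Vantage submits jobs internally is not described on this page, so the helper name `build_sbatch_command`, the script name, and the partition name are all hypothetical.

```python
import shlex
from typing import List, Optional


def build_sbatch_command(script_path: str, partition: Optional[str] = None) -> List[str]:
    """Return an sbatch argument list (hypothetical helper, not a Vantage API)."""
    command = ["sbatch"]
    if partition is not None:
        # --partition is Slurm's option for choosing the target partition.
        command.append(f"--partition={partition}")
    command.append(script_path)
    return command


# "train.sh" and "gpu" are placeholder values for illustration only.
print(shlex.join(build_sbatch_command("train.sh", partition="gpu")))
```

Omitting `partition` leaves the choice to Slurm, which then falls back to the cluster's default partition.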

How partitions work

Think of a partition as a named queue that routes jobs to a specific pool of compute resources:

Slurm concept | Maps to
PartitionName | Your partition name (e.g., gpu, cpu, high-mem)
Default       | Whether this is the default partition for jobs that don't specify one
MaxTime       | Maximum wall time for jobs in this partition (in seconds)
Nodes         | All instances in the compute node pool assigned to this partition
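
The four fields above correspond to parameters of a partition definition in `slurm.conf`. The sketch below shows one way such a line could be rendered from a small data class. It is an illustration only: the class and function names are hypothetical, the node expression is a placeholder, and this page does not document how Vantage actually generates Slurm configuration. Because `slurm.conf` accepts `MaxTime` in a `days-hours:minutes:seconds` form, the sketch converts a value held in seconds into that form.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    nodes: str              # Slurm node expression, e.g. "gpu-[1-4]" (placeholder)
    max_time_seconds: int   # wall-time limit, held in seconds as in the table above
    default: bool = False


def format_max_time(seconds: int) -> str:
    """Convert seconds to Slurm's days-hours:minutes:seconds time format."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


def to_slurm_conf_line(partition: Partition) -> str:
    """Render a slurm.conf-style partition line (hypothetical helper)."""
    default = "YES" if partition.default else "NO"
    return (
        f"PartitionName={partition.name} Nodes={partition.nodes} "
        f"Default={default} MaxTime={format_max_time(partition.max_time_seconds)}"
    )


print(to_slurm_conf_line(Partition("gpu", "gpu-[1-4]", 3600, default=True)))
```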

Cloud Slurm partitions

For cloud-provisioned Slurm clusters (AWS), you configure partitions during cluster creation. Each partition:

  • Gets a unique Partition Name.
  • Targets a Compute Node instance type selected through the instance browser.
  • Has a Maximum node count that Vantage's autoscaler uses as a ceiling.

Creating partitions during setup

In the wizard's partition step:

  1. A default partition named compute is pre-filled.
  2. Click Select Compute Node to choose the instance type for worker nodes in that partition.
  3. Set the Maximum node count — the autoscaler scales up to this limit when jobs queue up.
  4. Click Add Partition to create additional partitions (e.g., a gpu partition with p3.2xlarge nodes alongside a cpu partition with t3.large nodes).
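
The wizard collects three values per partition: a unique name, an instance type, and a maximum node count. A minimal sketch of that data and a sanity check on it might look like the following. The `PartitionSpec` class and `validate_partitions` function are hypothetical, and the rule that the maximum must be at least 1 is an assumption rather than documented Vantage behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartitionSpec:
    name: str
    instance_type: str
    max_nodes: int


def validate_partitions(specs: List[PartitionSpec]) -> List[str]:
    """Return a list of problems; an empty list means the specs look consistent."""
    problems = []
    seen = set()
    for spec in specs:
        if spec.name in seen:
            problems.append(f"duplicate partition name: {spec.name}")
        seen.add(spec.name)
        # Assumption: a partition needs room for at least one node.
        if spec.max_nodes < 1:
            problems.append(f"{spec.name}: maximum node count must be at least 1")
    return problems


specs = [
    PartitionSpec("gpu", "p3.2xlarge", max_nodes=4),
    PartitionSpec("cpu", "t3.large", max_nodes=10),
]
print(validate_partitions(specs))
```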

Editing partitions post-creation

  1. Open the cluster detail page and go to the Partitions tab.
  2. Click Edit to change the instance type or max node count of an existing partition.
  3. Click Add Partition to create a new one.

Edits are applied live. The autoscaler adjusts node counts based on the new limits.

Tip: Start with a low max node count per partition. You can raise it later as workload demands grow. Idle provisioned nodes bill at full rate.

Non-AWS Slurm partitions

For Slurm clusters on Azure, GCP, Cudo Compute, or on-premises, you manage partitions from the cluster detail page after the cluster is created. Because these providers use Vantage-managed defaults for node sizing and networking, the partition interface is simpler.

Autoscaling

Cloud Slurm partitions use Vantage's autoscaler to manage node counts:

  • When jobs enter a partition's queue, the autoscaler provisions nodes up to the partition's max.
  • When the queue drains and nodes sit idle, the autoscaler terminates them until the partition is back at its minimum node count, which can be zero.
  • The autoscaler respects the instance type configured for each partition — a gpu partition only provisions GPU instances.
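
The scale-up and scale-down rules above amount to clamping demand between a partition's minimum and maximum node counts. The sketch below illustrates that idea under a strong simplifying assumption, namely that every queued job needs exactly one node. The function name and parameters are hypothetical, and Vantage's actual autoscaling algorithm is not documented here.

```python
def target_node_count(busy_nodes: int, queued_jobs: int,
                      min_nodes: int, max_nodes: int) -> int:
    """Clamp node demand to the partition's [min_nodes, max_nodes] range.

    Assumes one node per queued job, which is a simplification.
    """
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    demand = busy_nodes + queued_jobs
    return max(min_nodes, min(demand, max_nodes))


# A backlog larger than the ceiling is capped at max_nodes.
print(target_node_count(busy_nodes=2, queued_jobs=10, min_nodes=0, max_nodes=8))
# An empty queue with no busy nodes falls back to min_nodes.
print(target_node_count(busy_nodes=0, queued_jobs=0, min_nodes=0, max_nodes=8))
```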

Best practices

  • Separate workloads by resource profile — Create CPU and GPU partitions so batch preprocessing doesn't block GPU nodes for training.
  • Set max node counts conservatively — Idle nodes cost money. Start low and raise when you've validated your workload patterns.
  • Use the default partition — Mark one partition as default so jobs that don't specify a partition still get scheduled.
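
To make the role of the default partition concrete, here is a small sketch of how a job's partition could be resolved when none is requested. The `resolve_partition` helper is hypothetical, and requiring exactly one default is a design choice of this sketch rather than a statement about how Slurm or Vantage resolves conflicts.

```python
from typing import Dict, Optional


def resolve_partition(requested: Optional[str], partitions: Dict[str, bool]) -> str:
    """Pick a partition for a job.

    `partitions` maps each partition name to whether it is marked as default.
    """
    if requested is not None:
        if requested not in partitions:
            raise ValueError(f"unknown partition: {requested}")
        return requested
    defaults = [name for name, is_default in partitions.items() if is_default]
    if len(defaults) != 1:
        raise ValueError("exactly one partition should be marked as default")
    return defaults[0]


partitions = {"cpu": True, "gpu": False}
print(resolve_partition(None, partitions))   # falls back to the default partition
print(resolve_partition("gpu", partitions))  # an explicit request is honored
```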