Partitions
Partitions are job queues on a Slurm cluster. Each partition targets a pool of nodes with specific characteristics — instance type, GPU count, max run time — and applies rules about who can submit jobs and at what priority.
When you submit a job through Vantage, you name a partition, and Slurm places the job on a qualifying node within that partition.
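As a rough illustration of what naming a partition means at the Slurm level, the sketch below builds an `sbatch` command line. It assumes direct access to the Slurm CLI, which is not necessarily how Vantage submits jobs. The helper `build_sbatch_command` and the script names are hypothetical; `--partition` is sbatch's standard flag.

```python
import shlex
from typing import List, Optional


def build_sbatch_command(script: str, partition: Optional[str] = None) -> List[str]:
    """Build an sbatch argument list, optionally targeting a named partition."""
    cmd = ["sbatch"]
    if partition is not None:
        # --partition names the partition the job should be queued in.
        cmd.append(f"--partition={partition}")
    # When the flag is omitted, Slurm uses the partition marked as default.
    cmd.append(script)
    return cmd


# Hypothetical job scripts, used only for illustration.
gpu_cmd = build_sbatch_command("train.sh", partition="gpu")
default_cmd = build_sbatch_command("preprocess.sh")

print(shlex.join(gpu_cmd))
print(shlex.join(default_cmd))
```

The function only assembles the command and does not run it, so it can be tried on a machine without Slurm installed.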
How partitions work
Think of a partition as a named queue that routes jobs to a specific pool of compute resources:
| Slurm concept | Maps to |
|---|---|
| PartitionName | Your partition name (e.g., gpu, cpu, high-mem) |
| Default | Whether this is the default partition for jobs that don't specify one |
| MaxTime | Maximum wall time for jobs in this partition (in seconds) |
| Nodes | All instances in the compute node pool assigned to this partition |
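To make the mapping concrete, here is a minimal sketch that renders these four fields as a `slurm.conf`-style partition line. The `Partition` class, the node list `gpu-[1-4]`, and the two-day limit are illustrative assumptions, and the exact configuration Vantage generates may differ. The seconds value is converted to Slurm's `days-hours:minutes:seconds` time format so that the unit is unambiguous.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    nodes: str             # Slurm node list expression (hypothetical node names)
    default: bool
    max_time_seconds: int  # wall-time limit, in seconds as in the table above


def format_max_time(seconds: int) -> str:
    """Format a duration in seconds as days-hours:minutes:seconds."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


def to_slurm_conf_line(p: Partition) -> str:
    """Render one PartitionName=... line from the partition fields."""
    default = "YES" if p.default else "NO"
    return (
        f"PartitionName={p.name} Nodes={p.nodes} "
        f"Default={default} MaxTime={format_max_time(p.max_time_seconds)}"
    )


gpu = Partition(name="gpu", nodes="gpu-[1-4]", default=False, max_time_seconds=172800)
line = to_slurm_conf_line(gpu)
print(line)
```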
Cloud Slurm partitions
For cloud-provisioned Slurm clusters (AWS), you configure partitions during cluster creation. Each partition:
- Gets a unique Partition Name.
- Targets a Compute Node instance type selected through the instance browser.
- Has a Maximum node count that Vantage's autoscaler uses as a ceiling.
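The three properties above can be pictured as a small record per partition. The sketch below is a hypothetical model, not Vantage's schema: `PartitionSpec` and `validate_partitions` are invented names, and the rule that the maximum node count must be at least 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartitionSpec:
    name: str           # must be unique within the cluster
    instance_type: str  # compute node instance type, e.g. "p3.2xlarge"
    max_nodes: int      # ceiling used by the autoscaler


def validate_partitions(specs: List[PartitionSpec]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen = set()
    for spec in specs:
        if spec.name in seen:
            errors.append(f"duplicate partition name: {spec.name}")
        seen.add(spec.name)
        if not spec.instance_type:
            errors.append(f"{spec.name}: no instance type selected")
        if spec.max_nodes < 1:  # assumed lower bound
            errors.append(f"{spec.name}: maximum node count must be at least 1")
    return errors


specs = [
    PartitionSpec("gpu", "p3.2xlarge", 4),
    PartitionSpec("cpu", "t3.large", 8),
]
print(validate_partitions(specs))
```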
Creating partitions during setup
In the wizard's partition step:
- A default partition named compute is pre-filled.
- Click Select Compute Node to choose the instance type for worker nodes in that partition.
- Set the Maximum node count. The autoscaler scales up to this limit when jobs queue up.
- Click Add Partition to create additional partitions (e.g., a gpu partition with p3.2xlarge nodes alongside a cpu partition with t3.large nodes).
Editing partitions post-creation
- Open the cluster detail page and go to the Partitions tab.
- Click Edit to change the instance type or max node count of an existing partition.
- Click Add Partition to create a new one.
Edits are applied live. The autoscaler adjusts node counts based on the new limits.
Start with a low max node count per partition. You can raise it later as workload demands grow. Idle provisioned nodes bill at full rate.
Non-AWS Slurm partitions
For Slurm clusters on Azure, GCP, Cudo Compute, or on-premises, partitions are managed post-creation from the cluster detail page. These providers use Vantage-managed defaults for node sizing and networking, so the partition interface is simpler.
Autoscaling
Cloud Slurm partitions use Vantage's autoscaler to manage node counts:
- When jobs enter a partition's queue, the autoscaler provisions nodes up to the partition's max.
- When the queue drains and nodes are idle, the autoscaler terminates nodes down to the partition's minimum node count, which may be zero.
- The autoscaler respects the instance type configured for each partition: a gpu partition only provisions GPU instances.
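The scale-up and scale-down rules above amount to clamping demand between a partition's minimum and maximum. The function below is a simplified model under that assumption, not Vantage's actual autoscaling algorithm; the function name, its parameters, and the idea of measuring demand in whole nodes are all illustrative.

```python
def target_node_count(
    queued_nodes_needed: int,
    busy_nodes: int,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Return how many nodes a partition should have provisioned."""
    # Demand is what running jobs already hold plus what queued jobs need.
    demand = busy_nodes + queued_nodes_needed
    # Never exceed the partition's max, never drop below its min.
    return max(min_nodes, min(demand, max_nodes))


# Queue needs 10 nodes but the partition is capped at 4.
print(target_node_count(queued_nodes_needed=10, busy_nodes=0, min_nodes=0, max_nodes=4))
# Queue has drained and nothing is running: scale to the minimum.
print(target_node_count(queued_nodes_needed=0, busy_nodes=0, min_nodes=0, max_nodes=4))
```

A real autoscaler would also account for provisioning delays and idle timeouts, which this sketch leaves out.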
Best practices
- Separate workloads by resource profile — Create CPU and GPU partitions so batch preprocessing doesn't block GPU nodes for training.
- Set max node counts conservatively — Idle nodes cost money. Start low and raise when you've validated your workload patterns.
- Use the default partition — Mark one partition as default so jobs that don't specify a partition still get scheduled.