Manage a Slurm cluster
Status lifecycle
A Slurm cluster moves through these statuses:
| Status | Meaning |
|---|---|
| preparing | Provisioning in progress. Vantage is creating cloud resources (if applicable) and waiting for nodes to register. |
| ready | Cluster is connected, Slurm configuration is uploaded, and the cluster is accepting jobs. |
| failed | Provisioning or runtime error. Check creation_status_details on the detail page for the specific error. |
| deleting | Cluster teardown in progress. Vantage is deprovisioning cloud resources and removing the database record. |
Transitions:
- preparing → ready: All nodes registered and Slurm config uploaded.
- preparing → failed: CloudFormation error, provisioning timeout, or node registration failure.
- ready → deleting: User initiated deletion.
- failed → deleting: User initiated deletion.
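The transitions above can be sketched as a small lookup table, which is handy when client-side tooling needs to decide whether an action such as deletion makes sense for a given status. This is a minimal illustration: the status names come from the table above, but `ALLOWED_TRANSITIONS`, `can_transition`, and `can_delete` are hypothetical names and not part of any Vantage API. Only the transitions listed in this section are encoded.

```python
# Hypothetical sketch of the documented status lifecycle.
# The status names come from this page; the structure and helper
# functions are illustrative assumptions, not Vantage's implementation.
ALLOWED_TRANSITIONS = {
    "preparing": {"ready", "failed"},
    "ready": {"deleting"},
    "failed": {"deleting"},
    # No outgoing transition is documented for "deleting".
    "deleting": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the documented lifecycle allows current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def can_delete(status: str) -> bool:
    """Deletion is documented only for clusters in ready or failed status."""
    return can_transition(status, "deleting")
```

For example, `can_delete("failed")` is true, while `can_delete("preparing")` is false because this page lists no preparing → deleting transition.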
Detail page tabs
Click a cluster row in the Clusters list to open the detail page:
| Tab | What you'll find |
|---|---|
| Overview | Cluster name, status, provider, region, creation time, and quick actions. |
| Nodes | Live node list — hostname, status (idle/allocated/down), CPU and memory utilization, and which partitions each node belongs to. |
| Partitions | Job queues with node counts, max time limits, and default partition status. Add, edit, or remove partitions. |
| Queue | Live job queue — pending, running, and completed jobs with priority, node count, and wall time. |
| Configuration | Slurm configuration parameters and cluster metadata. |
| Monitoring | Per-cluster Grafana dashboard with live utilization, accumulated cost, and node-level metrics. |
Monitoring
The Monitoring tab embeds a Grafana dashboard scoped to your cluster. Use it to:
- Track CPU and memory utilization across all nodes.
- Monitor job queue depth and wait times.
- View accumulated spend and current burn rate.
- Identify underutilized partitions for cost optimization.
Grafana is configured automatically when the cluster is created. No additional setup is required.
Editing partitions
Cloud Slurm clusters let you add and edit partitions from the UI:
- Open the cluster detail page and go to the Partitions tab.
- Click Edit on an existing partition to change its node type or max node count.
- Click Add Partition to create a new partition with a different instance type or scaling limits.
Partition changes are applied live — no restart required. Nodes scale up as jobs are submitted to the partition.
For on-premises Slurm clusters, partitions are managed through your existing Slurm configuration. The Vantage UI reflects the partition data reported by the scheduler.
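For on-premises clusters, a partition is typically defined by a `PartitionName` line in `slurm.conf`. The sketch below formats such a line from the same fields the Partitions tab reports (nodes, max time limit, default status). `PartitionName`, `Nodes`, `Default`, `MaxTime`, and `State` are standard `slurm.conf` partition parameters; the helper `partition_line` and the example partition and node names are hypothetical, and your site's configuration may set additional options.

```python
def partition_line(name: str, nodes: str,
                   max_time: str = "INFINITE",
                   default: bool = False) -> str:
    """Format one partition entry for slurm.conf (illustrative helper).

    nodes is a Slurm hostlist expression, e.g. "node[01-04]".
    max_time uses Slurm's time format, e.g. "1-00:00:00" for one day.
    """
    fields = [
        f"PartitionName={name}",
        f"Nodes={nodes}",
        f"Default={'YES' if default else 'NO'}",
        f"MaxTime={max_time}",
        "State=UP",
    ]
    return " ".join(fields)


# Example with made-up partition and node names.
print(partition_line("batch", "node[01-04]", max_time="1-00:00:00", default=True))
# PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=1-00:00:00 State=UP
```

After editing `slurm.conf`, the change usually has to be picked up by the scheduler (for example with `scontrol reconfigure`) before the new partition data is reported.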
Deleting a cluster
- Open the cluster detail page.
- Click the overflow menu (three dots) in the top-right corner.
- Select Delete Cluster and confirm.
Vantage tears down everything associated with the cluster: the CloudFormation stack (AWS), VMs (Cudo), database records, and Keycloak clients. Deletion is irreversible.