Manage a Slurm cluster

Status lifecycle, detail page, and day-to-day management.

Status lifecycle

A Slurm cluster moves through these statuses:

| Status | Meaning |
| --- | --- |
| preparing | Provisioning in progress. Vantage is creating cloud resources (if applicable) and waiting for nodes to register. |
| ready | Cluster is connected, Slurm configuration is uploaded, and the cluster is accepting jobs. |
| failed | Provisioning or runtime error. Check creation_status_details on the detail page for the specific error. |
| deleting | Cluster teardown in progress. Vantage is deprovisioning cloud resources and removing the database record. |

Transitions:

  • preparing → ready: All nodes registered and Slurm config uploaded.
  • preparing → failed: CloudFormation error, provisioning timeout, or node registration failure.
  • ready → deleting: User initiated deletion.
  • failed → deleting: User initiated deletion.
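
The transitions listed above can be read as a small state machine. The following is a minimal Python sketch of that reading, not Vantage's implementation: `ClusterStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical names introduced for illustration, and only the four transitions documented here are encoded.

```python
from enum import Enum


class ClusterStatus(Enum):
    """The four statuses from the table above (hypothetical enum)."""
    PREPARING = "preparing"
    READY = "ready"
    FAILED = "failed"
    DELETING = "deleting"


# Only the transitions documented on this page; anything else is rejected.
ALLOWED_TRANSITIONS = {
    ClusterStatus.PREPARING: {ClusterStatus.READY, ClusterStatus.FAILED},
    ClusterStatus.READY: {ClusterStatus.DELETING},
    ClusterStatus.FAILED: {ClusterStatus.DELETING},
    ClusterStatus.DELETING: set(),
}


def can_transition(current: ClusterStatus, target: ClusterStatus) -> bool:
    """Return True if moving from `current` to `target` is a listed transition."""
    return target in ALLOWED_TRANSITIONS[current]
```

A check like this can be useful in scripts that watch a cluster: once a cluster reports `failed` or `deleting`, the listed transitions never lead back to `ready`, so there is no point in continuing to wait for it.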

Detail page tabs

Click a cluster row in the Clusters list to open the detail page:

| Tab | What you'll find |
| --- | --- |
| Overview | Cluster name, status, provider, region, creation time, and quick actions. |
| Nodes | Live node list: hostname, status (idle/allocated/down), CPU and memory utilization, and which partitions each node belongs to. |
| Partitions | Job queues with node counts, max time limits, and default partition status. Add, edit, or remove partitions. |
| Queue | Live job queue: pending, running, and completed jobs with priority, node count, and wall time. |
| Configuration | Slurm configuration parameters and cluster metadata. |
| Monitoring | Per-cluster Grafana dashboard with live utilization, accumulated cost, and node-level metrics. |

Monitoring

The Monitoring tab embeds a Grafana dashboard scoped to your cluster. Use it to:

  • Track CPU and memory utilization across all nodes.
  • Monitor job queue depth and wait times.
  • View accumulated spend and current burn rate.
  • Identify underutilized partitions for cost optimization.

Grafana is configured automatically when the cluster is created; no additional setup is required.

Editing partitions

Cloud Slurm clusters let you add and edit partitions from the UI:

  1. Open the cluster detail page and go to the Partitions tab.
  2. Click Edit on an existing partition to change its node type or max node count.
  3. Click Add Partition to create a new partition with a different instance type or scaling limits.

Partition changes are applied live, with no restart required. Nodes scale up as jobs are submitted to the partition.

For on-premises Slurm clusters, partitions are managed through your existing Slurm configuration. The Vantage UI reflects the partition data reported by the scheduler.
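
To see roughly the same partition data the scheduler reports on an on-premises cluster, you can query Slurm directly. The sketch below is an illustration, not part of Vantage: it assumes `sinfo` is on the PATH, uses the standard `sinfo` format fields `%P` (partition name, with `*` marking the default), `%D` (node count), and `%l` (time limit), and the `Partition` class and function names are hypothetical. Node counts are summed per partition in case `sinfo` prints more than one line for the same partition.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    is_default: bool
    node_count: int
    time_limit: str  # as printed by sinfo, e.g. "1-00:00:00" or "infinite"


def parse_sinfo(output: str) -> list[Partition]:
    """Parse lines produced by: sinfo -h -o "%P %D %l"."""
    partitions: dict[str, Partition] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        raw_name, nodes, limit = fields
        name = raw_name.rstrip("*")  # sinfo marks the default partition with "*"
        if name not in partitions:
            partitions[name] = Partition(name, raw_name.endswith("*"), 0, limit)
        partitions[name].node_count += int(nodes)
    return list(partitions.values())


def list_partitions() -> list[Partition]:
    """Run sinfo on the local machine and parse its output."""
    result = subprocess.run(
        ["sinfo", "-h", "-o", "%P %D %l"],
        capture_output=True, text=True, check=True,
    )
    return parse_sinfo(result.stdout)
```

Calling `list_partitions()` on a login node returns one entry per partition, which you can compare against the node counts, time limits, and default flag shown on the Partitions tab.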

Deleting a cluster

  1. Open the cluster detail page.
  2. Click the overflow menu (three dots) in the top-right corner.
  3. Select Delete Cluster and confirm.

Vantage tears down everything associated with the cluster: the CloudFormation stack (AWS), VMs (Cudo), database records, and Keycloak clients. Deletion is irreversible.
