Slurm clusters
Slurm is a workload manager for high-performance computing — batch job scheduling, parallel job execution, and queue management. Vantage provisions Slurm clusters on your choice of infrastructure: public cloud (AWS, Azure, GCP), cost-efficient GPU cloud (Cudo Compute), partner HPC providers, or your own hardware.
Vantage handles Slurm controller setup, node registration, and autoscaling. You interact with the cluster through the Vantage UI, CLI, SDK, or API; day-to-day use requires no direct SSH access to the head node.
How it works
When you create a Slurm cluster, Vantage:
- Validates your input and checks your subscription limits.
- Creates a Keycloak OAuth2 client for the cluster (used for authentication by Vantage Agent and integrations).
- Inserts the cluster record and partition configuration.
- Provisions infrastructure on your chosen provider: a CloudFormation stack on AWS, direct provider API calls on other clouds, or no cloud provisioning for on-premises clusters.
- Registers nodes as they come online via Vantage Agent.
- Transitions the cluster to ready once all nodes are registered and the Slurm configuration is uploaded.
Provisioning is asynchronous. The cluster enters preparing status immediately and transitions to ready or failed once infrastructure is set up.
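Because provisioning is asynchronous, a client typically polls the cluster status until it leaves preparing. The sketch below shows one way to do that. It is not the Vantage SDK: `wait_for_cluster` and the `get_status` callable are hypothetical names, and you would supply your own function that returns the current status string through whichever interface you use (CLI, SDK, or API). The default timeout and interval are arbitrary assumptions.

```python
import time
from typing import Callable

# Statuses taken from the lifecycle described above; "preparing" is the only
# non-terminal status this sketch knows about.
TERMINAL_STATUSES = {"ready", "failed"}


def wait_for_cluster(
    get_status: Callable[[], str],  # hypothetical: returns the current cluster status
    timeout_s: float = 1800.0,      # assumed default, not a documented limit
    interval_s: float = 15.0,       # assumed default, not a documented limit
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns 'ready' or 'failed', or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

A caller would treat a returned "failed" as an error to investigate rather than retrying the same poll, since both ready and failed are terminal for the provisioning step.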
Provider comparison
| Aspect | AWS | Azure / GCP | Cudo Compute | On-premises / LXD |
|---|---|---|---|---|
| Provisioning | CloudFormation | Vantage-managed defaults | Vantage-managed defaults | Agent-based (you provide infrastructure) |
| Instance selection | EC2 instance type browser | Vantage-managed defaults | Vantage-managed defaults | Your existing hardware |
| Partitions | Configured during creation | Configured post-creation | Configured post-creation | Configured post-creation |
| SSH key required | Yes (EC2 key pair name) | No | No | No |
| Custom networking | VPC, subnet, security group | No | No | No |
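The table can also be read as a set of per-provider rules. The following sketch encodes three of its rows as data and checks a creation request against them. Everything here is illustrative: the provider keys, the `ProviderProfile` type, and request field names such as `key_pair_name` and `vpc_id` are assumptions made for this example, not Vantage API fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderProfile:
    partitions_at_creation: bool  # "Partitions" row
    ssh_key_required: bool        # "SSH key required" row
    custom_networking: bool       # "Custom networking" row


# Values transcribed from the comparison table; the dictionary keys are hypothetical.
_MANAGED = ProviderProfile(False, False, False)
PROVIDERS = {
    "aws": ProviderProfile(True, True, True),
    "azure": _MANAGED,
    "gcp": _MANAGED,
    "cudo": _MANAGED,
    "on_prem": _MANAGED,
}

# Hypothetical request field names for the AWS-only networking options.
NETWORK_FIELDS = ("vpc_id", "subnet_id", "security_group_id")


def check_request(provider: str, request: dict) -> list[str]:
    """Return a list of problems with a cluster creation request (empty if none)."""
    profile = PROVIDERS[provider]
    problems = []
    if profile.ssh_key_required and not request.get("key_pair_name"):
        problems.append("key_pair_name is required")
    if not profile.custom_networking:
        for name in NETWORK_FIELDS:
            if name in request:
                problems.append(f"{name} is not supported for {provider}")
    return problems
```

For example, `check_request("aws", {})` reports the missing key pair name, while the same empty request passes for the other providers.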
Next steps
- Create a Slurm cluster — Provider-specific walkthroughs
- Manage a Slurm cluster — Status lifecycle, detail page, monitoring
- Partitions — Job queues and node pools
- Reference — Fields, limits, error codes