Concepts
Cluster types
Vantage supports three cluster types. Your choice determines the scheduler, the tabs shown on the cluster detail page, and which workloads you can run.
| Type | Scheduler | Best for |
|---|---|---|
| Slurm | Slurm (batch jobs) | Traditional HPC — simulations, MPI workloads, batch pipelines |
| Kubernetes | Kubernetes | Workbench sessions, ML training, containerized apps |
| Slurm on Kubernetes | Slurm inside K8s | HPC workloads on cloud-native, auto-scaled infrastructure |
Slurm and Slurm-on-Kubernetes clusters appear under the Slurm list in the sidebar. Kubernetes clusters appear under Kubernetes.
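The grouping above can be written down as a small lookup. This is only an illustrative sketch: `ClusterType`, `SIDEBAR_LIST`, and `sidebar_list` are hypothetical names chosen for this example, not part of any Vantage API.

```python
from enum import Enum


class ClusterType(Enum):
    """The three cluster types from the table above (hypothetical enum)."""
    SLURM = "Slurm"
    KUBERNETES = "Kubernetes"
    SLURM_ON_KUBERNETES = "Slurm on Kubernetes"


# Sidebar list each cluster type appears under, as described in the text.
SIDEBAR_LIST = {
    ClusterType.SLURM: "Slurm",
    ClusterType.SLURM_ON_KUBERNETES: "Slurm",
    ClusterType.KUBERNETES: "Kubernetes",
}


def sidebar_list(cluster_type: ClusterType) -> str:
    """Return the name of the sidebar list a cluster of this type is shown under."""
    return SIDEBAR_LIST[cluster_type]
```

For example, `sidebar_list(ClusterType.SLURM_ON_KUBERNETES)` returns `"Slurm"`, because Slurm-on-Kubernetes clusters are grouped with Slurm clusters rather than with Kubernetes clusters.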
Status lifecycle
Every cluster moves through the same set of phases, regardless of type or provider:
| Status | Meaning |
|---|---|
| Preparing | Infrastructure provisioning in progress. Vantage is creating cloud resources, installing software, and waiting for the cluster to phone home. |
| Ready | Cluster is connected and accepting workloads. |
| Failed | Provisioning encountered an error. Check the cluster detail page for the error details. |
| Deleting | Vantage is tearing down infrastructure. |
Provisioning time varies by provider and cluster type:
- AWS Slurm — A few minutes (CloudFormation stack)
- AWS K8s — 10-15 minutes (boto3 provisioning + cloud-init)
- Cudo K8s — 10-25 minutes (VM provisioning + cloud-init)
- Azure / GCP — Varies by region and quota
- On-premises — Immediate after agent connects (infrastructure is yours)
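Because a new cluster stays in Preparing until it becomes Ready or Failed, automation around cluster creation usually has to wait with a timeout. The sketch below shows one way to do that. It assumes you supply your own `get_status` callable that returns the status string; `wait_until_ready` and `MAX_PROVISION_MINUTES` are hypothetical names, and the timeouts are simply the upper ends of the ranges listed above.

```python
import time
from typing import Callable

# Hypothetical timeouts in minutes, taken from the upper end of the ranges above.
# Providers without a stated range (Azure, GCP, on-premises) are left out on purpose.
MAX_PROVISION_MINUTES = {
    ("aws", "kubernetes"): 15,
    ("cudo", "kubernetes"): 25,
}


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the cluster leaves "Preparing", then return the new status.

    Raises TimeoutError if the cluster is still "Preparing" after timeout_s
    seconds of waiting. The caller decides how to react to "Failed".
    """
    waited = 0.0
    while True:
        status = get_status()
        if status != "Preparing":
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still Preparing after {waited:.0f} s")
        sleep(poll_s)
        waited += poll_s
```

A caller might pass `timeout_s=MAX_PROVISION_MINUTES[("aws", "kubernetes")] * 60` and treat a returned `"Failed"` as a signal to check the cluster detail page. The `sleep` parameter exists only so the loop can be exercised without real delays.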
Partitions and node groups
Partitions and node groups are how you organize compute resources:
- Partitions (Slurm): Job queues with rules for max run time, allowed users, and priority. Each partition targets a pool of nodes.
- Node groups (Kubernetes): Pools of identically sized machines that the cluster autoscaler manages. Equivalent to partitions in purpose, but K8s-native.
Both concepts let you isolate workloads by resource requirements — for example, a GPU partition for ML training and a CPU partition for batch preprocessing.
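As a rough illustration of that isolation, the sketch below models a partition or node group as a named pool and routes a workload by whether it needs GPUs. The `Pool` class, the `pick_pool` function, and the node sizes are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pool:
    """A Slurm partition or Kubernetes node group (hypothetical model)."""
    name: str
    gpus_per_node: int
    cpus_per_node: int


def pick_pool(pools: Sequence[Pool], needs_gpu: bool) -> Optional[Pool]:
    """Return the first pool whose nodes match the workload's GPU requirement.

    GPU workloads go to pools with GPUs; CPU-only workloads go to pools
    without them, so they do not occupy GPU nodes.
    """
    for pool in pools:
        if needs_gpu == (pool.gpus_per_node > 0):
            return pool
    return None


# Example sizes are made up for illustration.
pools = [
    Pool(name="cpu-batch", gpus_per_node=0, cpus_per_node=64),
    Pool(name="gpu-train", gpus_per_node=8, cpus_per_node=96),
]
```

With these pools, `pick_pool(pools, needs_gpu=True)` selects `gpu-train` and `pick_pool(pools, needs_gpu=False)` selects `cpu-batch`.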
Cost
Every provisioned node accumulates spend regardless of utilization. Cloud clusters with autoscaling can scale down to zero idle nodes — but only if the node group or partition minimum is set to zero.
The Monitoring tab on every cluster detail page shows live utilization and accumulated cost. Idle GPU nodes are the most common preventable expense.
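The effect of the minimum setting is simple arithmetic. The sketch below is a simplified model, not how Vantage computes cost: it assumes the autoscaler removes every idle node above the configured minimum, and the function name and the hourly price are hypothetical.

```python
def idle_floor_cost(
    min_nodes: int, busy_nodes: int, hourly_price: float, hours: float
) -> float:
    """Spend on nodes that stay provisioned only because of the minimum.

    Simplifying assumption: the autoscaler scales down to
    max(min_nodes, busy_nodes), so only nodes held by the minimum sit idle.
    """
    idle = max(min_nodes - busy_nodes, 0)
    return idle * hourly_price * hours


# A GPU pool with a minimum of 2 nodes, no running work,
# at a made-up price of 30.0 per node-hour, over 24 hours.
with_minimum = idle_floor_cost(min_nodes=2, busy_nodes=0, hourly_price=30.0, hours=24)
scale_to_zero = idle_floor_cost(min_nodes=0, busy_nodes=0, hourly_price=30.0, hours=24)
```

Under these assumptions the pool with a minimum of 2 accrues 1440.0 per idle day, while the pool with a minimum of 0 accrues nothing.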
Compute providers
Providers supply the infrastructure that Vantage provisions clusters on.
| Provider | What it's for |
|---|---|
| Public clouds (AWS, Azure, GCP) | Elastic capacity, global regions, spot pricing |
| Cudo Compute | Cost-efficient GPU cloud |
| On-premises / LXD | Your own hardware, maximum control |
| Vantage partners (atNorth, BuzzHPC, RCI) | Pre-integrated managed colocation and HPC |
Regions and availability
Cloud clusters run in the region you select during creation. Slurm clusters can span multiple availability zones within a region. On-premises clusters report their location as configured by your admin.
Some providers, such as Cudo Compute, let you select a data center per node group, which enables geo-distributed worker pools.