Concepts

Mental models that show up across every cluster type.

Cluster types

Vantage supports three cluster types. Your choice determines the scheduler, the detail tabs, and which workloads you can run.

| Type | Scheduler | Best for |
| --- | --- | --- |
| Slurm | Slurm (batch jobs) | Traditional HPC: simulations, MPI workloads, batch pipelines |
| Kubernetes | Kubernetes | Workbench sessions, ML training, containerized apps |
| Slurm on Kubernetes | Slurm inside K8s | HPC workloads on cloud-native, auto-scaled infrastructure |

Slurm and Slurm-on-Kubernetes clusters appear under the Slurm list in the sidebar. Kubernetes clusters appear under Kubernetes.
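
As a minimal sketch, the sidebar grouping described above can be written as a lookup table. The names `ClusterType`, `SIDEBAR_LIST`, and `sidebar_list` are hypothetical and used only for illustration; they are not part of any Vantage API.

```python
from enum import Enum


class ClusterType(Enum):
    """The three cluster types named in this section (hypothetical enum)."""
    SLURM = "Slurm"
    KUBERNETES = "Kubernetes"
    SLURM_ON_KUBERNETES = "Slurm on Kubernetes"


# Both Slurm variants share the Slurm list; Kubernetes clusters get their own.
SIDEBAR_LIST = {
    ClusterType.SLURM: "Slurm",
    ClusterType.SLURM_ON_KUBERNETES: "Slurm",
    ClusterType.KUBERNETES: "Kubernetes",
}


def sidebar_list(cluster_type: ClusterType) -> str:
    """Return the sidebar list a cluster of this type appears under."""
    return SIDEBAR_LIST[cluster_type]
```

Under this sketch, `sidebar_list(ClusterType.SLURM_ON_KUBERNETES)` resolves to the Slurm list, which is where to look for a Slurm-on-Kubernetes cluster.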

Status lifecycle

Every cluster moves through the same set of phases, regardless of type or provider:

| Status | Meaning |
| --- | --- |
| Preparing | Infrastructure provisioning in progress. Vantage is creating cloud resources, installing software, and waiting for the cluster to phone home. |
| Ready | Cluster is connected and accepting workloads. |
| Failed | Provisioning encountered an error. Check the detail page for status details. |
| Deleting | Vantage is tearing down infrastructure. |
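
The lifecycle can be modeled as a small state machine. The sketch below is illustrative only: the four status names come from the table above, but the `ALLOWED_TRANSITIONS` map is an assumption about how the phases connect, and `ClusterStatus`, `accepts_workloads`, and `can_transition` are hypothetical names rather than Vantage API.

```python
from enum import Enum


class ClusterStatus(Enum):
    """Cluster statuses listed in the lifecycle table."""
    PREPARING = "Preparing"
    READY = "Ready"
    FAILED = "Failed"
    DELETING = "Deleting"


# ASSUMPTION: these transitions are inferred from the status descriptions,
# not taken from a documented Vantage state machine.
ALLOWED_TRANSITIONS = {
    ClusterStatus.PREPARING: {
        ClusterStatus.READY,
        ClusterStatus.FAILED,
        ClusterStatus.DELETING,
    },
    ClusterStatus.READY: {ClusterStatus.DELETING},
    ClusterStatus.FAILED: {ClusterStatus.DELETING},
    ClusterStatus.DELETING: set(),
}


def accepts_workloads(status: ClusterStatus) -> bool:
    """Only a Ready cluster is connected and accepting workloads."""
    return status is ClusterStatus.READY


def can_transition(current: ClusterStatus, target: ClusterStatus) -> bool:
    """Check a move against the assumed transition map."""
    return target in ALLOWED_TRANSITIONS[current]
```

A client that submits jobs might gate on `accepts_workloads` and treat every other status as "not yet" or "no longer" available.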

Provisioning time varies by provider and cluster type:

  • AWS Slurm — A few minutes (CloudFormation stack)
  • AWS K8s — 10-15 minutes (boto3 provisioning + cloud-init)
  • Cudo K8s — 10-25 minutes (VM provisioning + cloud-init)
  • Azure / GCP — Varies by region and quota
  • On-premises — Immediate after agent connects (infrastructure is yours)

Partitions and node groups

Partitions and node groups are how you organize compute resources:

  • Partitions (Slurm): Job queues with rules for max run time, allowed users, and priority. Each partition targets a pool of nodes.
  • Node groups (Kubernetes): Pools of identically sized machines that the cluster autoscaler manages. Equivalent to partitions in purpose, but K8s-native.

Both concepts let you isolate workloads by resource requirements — for example, a GPU partition for ML training and a CPU partition for batch preprocessing.
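
To make the isolation idea concrete, here is a minimal sketch that routes a workload to a pool based on whether it needs GPUs. The pool names `"gpu"` and `"cpu"` and the `Workload` and `choose_pool` names are hypothetical; real partition or node group names are whatever you configure on your cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A job or pod described only by the resources it requests."""
    name: str
    gpus: int = 0


def choose_pool(workload: Workload, gpu_pool: str = "gpu", cpu_pool: str = "cpu") -> str:
    """Send GPU workloads to the GPU pool and everything else to the CPU pool."""
    return gpu_pool if workload.gpus > 0 else cpu_pool
```

For example, `choose_pool(Workload("train", gpus=4))` selects the GPU pool, while `choose_pool(Workload("preprocess"))` selects the CPU pool. The same rule applies whether the pool is a Slurm partition or a Kubernetes node group.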

Cost

Every provisioned node accumulates spend regardless of utilization. Cloud clusters with autoscaling can scale down to zero idle nodes — but only if the node group or partition minimum is set to zero.

The Monitoring tab on every cluster detail page shows live utilization and accumulated cost. Idle GPU nodes are the most common preventable expense.
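
A rough worked example shows why the minimum matters. The sketch below assumes the autoscaler shrinks a pool to its configured minimum while idle and ignores any scale-down delay; the hourly rate is an invented figure, not a real price, and `idle_spend` is a hypothetical helper.

```python
def idle_spend(min_nodes: int, hourly_rate: float, idle_hours: float) -> float:
    """Spend accumulated by nodes that stay provisioned during an idle period.

    Assumes the pool sits at its minimum size for the whole idle period.
    """
    return min_nodes * hourly_rate * idle_hours


# Hypothetical GPU node price, for illustration only.
GPU_RATE = 2.50  # currency units per node-hour

overnight_min_two = idle_spend(min_nodes=2, hourly_rate=GPU_RATE, idle_hours=12)
overnight_min_zero = idle_spend(min_nodes=0, hourly_rate=GPU_RATE, idle_hours=12)

print(overnight_min_two)   # 60.0
print(overnight_min_zero)  # 0.0
```

With a minimum of two nodes the pool keeps spending through the idle period; with a minimum of zero it does not.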

Compute providers

Providers are the physical infrastructure Vantage provisions clusters on.

| Provider | What it's for |
| --- | --- |
| Public clouds (AWS, Azure, GCP) | Elastic capacity, global regions, spot pricing |
| Cudo Compute | Cost-efficient GPU cloud |
| On-premises / LXD | Your own hardware, maximum control |
| Vantage partners (atNorth, BuzzHPC, RCI) | Pre-integrated managed colocation and HPC |

Regions and availability

Cloud clusters run in the region you select during creation. Slurm clusters can span multiple availability zones within a region. On-premises clusters report their location as configured by your admin.

Some providers (Cudo Compute) allow per-node-group data center selection, enabling geo-distributed worker pools.
