Kubernetes clusters

Kubernetes clusters are Vantage's platform clusters. They run MicroK8s with a Vantage-managed control plane and power every higher-level Vantage product: Workbench sessions, training jobs, model endpoints, pipelines, and sweeps. They also serve as parent clusters for Slurm-on-Kubernetes deployments.

Vantage handles the full lifecycle: cloud provisioning (VPC, IAM, control plane instance), Kubernetes installation, node autoscaling, and integration deployment. You interact with the cluster through Vantage; day-to-day use requires no kubectl or direct cluster access.

How it works

When you create a Kubernetes cluster, Vantage:

Validates input: Checks cluster name, cloud account credentials, instance types, and subscription limits.
Creates database records: Inserts the cluster record with status = preparing.
Creates a Keycloak client: Registers an OAuth2 client for the cluster.
Provisions infrastructure (background thread): This step varies by provider.
- AWS: Assumes the IAM role, creates the VPC, subnets, and security groups (or uses existing ones), creates IAM roles and instance profiles, and launches a control plane EC2 instance whose cloud-init script installs MicroK8s and sets up LUKS encryption and Vault KMS.
- Azure / GCP: Uses Vantage-managed defaults for the control plane.
- On-premises: No cloud provisioning; Vantage waits for the connector.
Transitions to ready: The control plane's cloud-init script calls markClusterReady once MicroK8s, encryption, and Vault are set up.
Deploys integrations: vdeployer-web deploys the cluster autoscaler, tunnel client, and any enabled integrations (JupyterHub, Grafana, Ray, MLflow).
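
The creation flow above can be sketched as a small status model. This is a minimal illustration, not Vantage's implementation: the `ClusterRecord` class, the event names, and every function here are hypothetical stand-ins, and only the `preparing` and `ready` status values and the order of the steps come from the description above. Provisioning is simulated, so readiness is signalled immediately instead of by a real cloud-init script.

```python
import threading
from dataclasses import dataclass, field

# Hypothetical provider identifiers used only in this sketch.
PROVIDERS = {"aws", "azure", "gcp", "on-premises"}


@dataclass
class ClusterRecord:
    """Hypothetical in-memory stand-in for the cluster database record."""
    name: str
    provider: str
    status: str = "preparing"  # records are inserted with status = preparing
    events: list = field(default_factory=list)


def validate_input(name: str, provider: str) -> None:
    # Illustrative checks only; the real validation also covers cloud
    # credentials, instance types, and subscription limits.
    if not name:
        raise ValueError("cluster name is required")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")


def mark_cluster_ready(cluster: ClusterRecord) -> None:
    # Stand-in for the markClusterReady call made by the control plane.
    cluster.status = "ready"
    cluster.events.append("ready")


def provision(cluster: ClusterRecord) -> None:
    # Simulated provisioning: no cloud calls are made in this sketch.
    cluster.events.append(f"provisioned:{cluster.provider}")
    mark_cluster_ready(cluster)
    # Integrations are deployed only after the cluster is ready.
    cluster.events.append("integrations-deployed")


def create_cluster(name: str, provider: str):
    validate_input(name, provider)
    cluster = ClusterRecord(name=name, provider=provider)
    cluster.events.append("record-created")
    cluster.events.append("keycloak-client-created")
    # Provisioning runs in a background thread, as in the flow above.
    worker = threading.Thread(target=provision, args=(cluster,))
    worker.start()
    return cluster, worker
```

Calling `create_cluster("demo", "aws")` returns the record immediately along with the worker thread; after `worker.join()`, the record's status is `ready` and its events list shows the steps in order.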

Provider comparison

| Aspect | AWS | Azure / GCP | On-Premises |
|---|---|---|---|
| Control plane | EC2 instance (boto3) | Vantage-managed | Connector-based (Multipass and Juju are Slurm-only) |
| Instance selection | EC2 type browser | Vantage-managed defaults | Your hardware or local VMs |
| VPC / networking | VPC + subnets (auto or existing) | Vantage-managed | Your network |
| GPU support | GPU instance types | GPU instance types | Your GPUs |
| Custom networking | VPC, subnet, security group | No | No |
| Slurm on K8s supported | Yes | No | No |

Multipass and Juju on-premises clusters only support Slurm, not Kubernetes. For on-premises Kubernetes, see On-Premises clusters and choose the Kubernetes tab.

Slurm on Kubernetes

You can deploy a Slurm scheduler on top of an existing Kubernetes cluster, combining HPC batch scheduling with cloud-native autoscaling. This is available for:

AWS K8s parent clusters: Compute pools use EC2 instance types selected from the instance browser. Non-AWS public cloud K8s clusters (Azure, GCP) do not currently support Slurm-on-Kubernetes.
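
As a rough illustration of this restriction, the check below encodes the "Slurm on K8s supported" row of the provider comparison as a lookup. The provider keys and the function names are assumptions made for this sketch; they are not part of any Vantage API.

```python
# Hypothetical provider keys; the values mirror the "Slurm on K8s supported"
# row of the provider comparison.
SLURM_ON_K8S_SUPPORT = {
    "aws": True,
    "azure": False,
    "gcp": False,
    "on-premises": False,
}


def can_host_slurm(provider: str) -> bool:
    """Return True if a K8s cluster on this provider can be a Slurm parent."""
    # Unknown providers are treated as unsupported.
    return SLURM_ON_K8S_SUPPORT.get(provider.strip().lower(), False)


def require_slurm_parent(provider: str) -> None:
    """Raise ValueError when the provider cannot host Slurm-on-Kubernetes."""
    if not can_host_slurm(provider):
        raise ValueError(
            f"Slurm-on-Kubernetes is not currently supported on {provider!r}"
        )
```

A client-side check like this would only be a convenience; the authoritative validation happens in Vantage when the cluster is created.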

For details, see creating a Slurm-on-Kubernetes cluster.

Next steps

Create a Kubernetes cluster: Provider-specific walkthroughs
Manage a Kubernetes cluster: Status lifecycle, detail page, monitoring
Compute pools: Compute pools and autoscaling
Integrations: JupyterHub, Grafana, Ray, MLflow
Reference: Fields, limits, error codes
On-Premises clusters: Connect your own Kubernetes infrastructure
