Create a Kubernetes cluster
Prerequisites
- A Vantage account and organization.
- A configured Cloud Account for your chosen provider — see Compute Providers.
- AWS only: An SSH key pair created in the target AWS region. Vantage uses the key pair name when creating the control plane EC2 instance.
AWS
Vantage provisions AWS K8s clusters with direct boto3 API calls (not CloudFormation). It creates the VPC, IAM roles, and security groups, then launches a control plane EC2 instance with MicroK8s pre-configured.
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Kubernetes and click Continue.
- Configure the cluster:
  - Enter a Cluster Name (max 27 characters, must be unique) and optionally a Description.
  - Select your AWS Cloud Account.
  - Pick a Region. The dropdown loads after you select the cloud account.
  - Click Select Control Plane to choose an EC2 instance type for the cluster management nodes. Browse by vCPU, GPU, and price.
  - Select an SSH Key Name. The list loads after you pick a region. If it's empty, create a key pair in the AWS EC2 console first.
- Advanced networking (optional): Click Advanced Options to specify:
  - VPC ID: Deploy into an existing VPC. A new VPC is created if omitted.
  - Subnet ID: Existing subnet in the VPC. Resolved automatically if omitted.
- Select platform integrations: Choose which tools to install on the cluster. See Integrations for details.

  | Integration | Purpose | Default |
  |---|---|---|
  | Notebook | JupyterHub for interactive sessions | Enabled |
  | Grafana + Prometheus | Cluster monitoring and observability | Enabled |
  | Ray | Distributed ML training framework | Disabled |
  | MLflow | ML experiment tracking | Disabled |
  | Slurm on Kubernetes | Deploy a Slurm scheduler on this cluster later | Disabled |

- Submit: Click Prepare Cluster. The cluster enters `preparing` status. AWS provisioning typically takes 10-15 minutes.
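The form constraints in the steps above can be checked before you open the wizard. The following is a minimal sketch, not part of Vantage: the `AwsClusterRequest` class and its field names are hypothetical and only mirror the wizard fields. The 27-character name limit and the SSH key requirement come from the steps above; name uniqueness can only be checked by Vantage itself.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CLUSTER_NAME_LENGTH = 27  # limit stated in the wizard


@dataclass
class AwsClusterRequest:
    """Hypothetical container mirroring the wizard fields (not a Vantage type)."""

    name: str
    cloud_account: str
    region: str
    control_plane_instance_type: str
    ssh_key_name: str
    description: str = ""
    vpc_id: Optional[str] = None     # None: a new VPC is created
    subnet_id: Optional[str] = None  # None: the subnet is resolved automatically

    def problems(self) -> List[str]:
        """Return a list of problems; an empty list means the form looks complete."""
        found = []
        if not self.name:
            found.append("cluster name is required")
        elif len(self.name) > MAX_CLUSTER_NAME_LENGTH:
            found.append(
                f"cluster name exceeds {MAX_CLUSTER_NAME_LENGTH} characters"
            )
        if not self.ssh_key_name:
            found.append(
                "no SSH key selected; create a key pair in the target region first"
            )
        return found


request = AwsClusterRequest(
    name="ml-training",
    cloud_account="my-aws-account",
    region="us-east-1",
    control_plane_instance_type="t3.medium",
    ssh_key_name="my-key",
)
print(request.problems())
```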
What Vantage provisions on AWS
| Resource | Details |
|---|---|
| VPC | 10.0.0.0/16 CIDR (auto-created if not provided) |
| Subnets | Public + private subnets |
| Internet Gateway + NAT Gateway | For outbound and inbound connectivity |
| Security groups | Default VPC security group for inter-node communication |
| IAM Role | vantage-{client_id}-node-role with EC2 trust policy |
| Instance Profile | vantage-{client_id}-instance-profile linked to the IAM role |
| IAM Policies | AmazonEBSCSIDriverPolicy, AmazonEFSCSIDriverPolicy, AmazonFSxFullAccess, plus a custom inline policy for EC2 Fleet management and launch template operations |
| Control plane EC2 instance | vantage-{client_id}-control-plane with Ubuntu 24.04, MicroK8s, LUKS encryption, Vault KMS, and Vantage Agent |
| Launch Templates | Created by the autoscaler at runtime (one per node group) |
| EC2 Fleet instances | Tagged vantage-cluster={client_id}, managed by the autoscaler |
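The naming patterns in the table make it possible to locate a cluster's resources from its client ID. The sketch below is illustrative only: the helper names are hypothetical, and the trust policy shows the standard shape of an EC2 trust policy rather than the exact document Vantage creates. The filter follows the `tag:<key>` format that EC2 describe calls accept.

```python
import json


def aws_resource_names(client_id: str) -> dict:
    """Derive the resource names listed in the table from a client ID."""
    return {
        "iam_role": f"vantage-{client_id}-node-role",
        "instance_profile": f"vantage-{client_id}-instance-profile",
        "control_plane": f"vantage-{client_id}-control-plane",
    }


def ec2_trust_policy() -> dict:
    """Standard trust policy that lets EC2 instances assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def fleet_instance_filters(client_id: str) -> list:
    """Filter matching instances tagged vantage-cluster={client_id}."""
    return [{"Name": "tag:vantage-cluster", "Values": [client_id]}]


print(aws_resource_names("abc123")["control_plane"])
print(json.dumps(ec2_trust_policy(), indent=2))
```

If boto3 is installed and configured, the filter list can be passed as the `Filters` argument of an EC2 client's `describe_instances` call to list the autoscaler-managed nodes.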
Cudo Compute
For Cudo Compute K8s clusters, Vantage provisions a control plane VM through the Cudo Compute REST API. Unlike AWS, compute is specified by raw resources (`vcpus` and `memory_gib`) rather than instance types, and each node group has its own data center.
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Kubernetes and click Continue.
- Configure the cluster:
  - Enter a Cluster Name (max 27 characters, must be unique) and optionally a Description.
  - Select your Cudo Compute Cloud Account.
- Select platform integrations: Same options as AWS (JupyterHub, Grafana, Ray, MLflow).
- Submit: Click Prepare Cluster. Cudo provisioning typically takes 10-25 minutes (longer than AWS due to VM provisioning + cloud-init).
What Vantage provisions on Cudo
| Resource | Details |
|---|---|
| Storage disk | {vm_id}-storage, 100 GiB |
| Control plane VM | {cluster_name}-control-0-{timestamp}, Ubuntu 24.04, MicroK8s, LUKS, Vault KMS |
| VM metadata | vantage-cluster: {client_id}, vantage-role: control-plane |
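The patterns in the table can be used to recognize a cluster's Cudo resources. This is a small sketch with hypothetical helper names. The `{timestamp}` format is not documented here, so the Unix-epoch-seconds default below is an assumption.

```python
import time
from typing import Optional


def cudo_control_plane_name(cluster_name: str, timestamp: Optional[int] = None) -> str:
    """Build a name of the form {cluster_name}-control-0-{timestamp}.

    Assumption: the timestamp is Unix epoch seconds; the real format may differ.
    """
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{cluster_name}-control-0-{ts}"


def cudo_storage_disk_name(vm_id: str) -> str:
    """Name of the 100 GiB storage disk attached to a VM."""
    return f"{vm_id}-storage"


def cudo_vm_metadata(client_id: str) -> dict:
    """Metadata set on the control plane VM."""
    return {"vantage-cluster": client_id, "vantage-role": "control-plane"}


def is_control_plane(metadata: dict, client_id: str) -> bool:
    """True when VM metadata marks the control plane of the given cluster."""
    expected = cudo_vm_metadata(client_id)
    return all(metadata.get(key) == value for key, value in expected.items())


vm_id = cudo_control_plane_name("demo", timestamp=1700000000)
print(vm_id, cudo_storage_disk_name(vm_id))
```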
Azure
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Kubernetes and click Continue.
- Configure:
  - Enter a Cluster Name and optional Description.
  - Select your Azure Cloud Account.
- Select platform integrations: Same options as other providers.
- Submit: Azure Kubernetes clusters use Vantage-managed defaults for node sizing and networking. Review your cloud account's regional quota before submitting.
GCP
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Kubernetes and click Continue.
- Configure:
  - Enter a Cluster Name and optional Description.
  - Select your GCP Cloud Account.
- Select platform integrations: Same options as other providers.
- Submit: GCP Kubernetes clusters use Vantage-managed defaults. Verify your project's quota before submitting.
On-premises / LXD
On-premises Kubernetes clusters connect through a lightweight agent, same as on-premises Slurm clusters. Vantage does not provision cloud resources.
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Kubernetes and click Continue.
- Configure:
  - Enter a Cluster Name.
  - Select your On-Premises or LXD cloud account.
- Get the agent command: The wizard shows the agent installation command. Copy and run it on your cluster's head node. The agent establishes an outbound HTTPS connection to Vantage.
Slurm on Kubernetes
You can deploy a Slurm scheduler on top of an existing Kubernetes cluster (AWS or Cudo Compute). This gives you HPC batch scheduling on cloud-native, auto-scaled infrastructure.
Prerequisites
- An existing Kubernetes cluster with Ready status — see AWS or Cudo Compute above.
- The parent K8s cluster must have Slurm on Kubernetes enabled in its integrations.
Steps
- Open Clusters: Click Clusters, then Prepare Cluster.
- Choose type: Select Slurm on Kubernetes and click Continue.
- Select parent K8s cluster: A grid shows your available Kubernetes clusters. Click the target cluster to select it, then click Configure Slurm Cluster.
- Configure the Slurm cluster:

  Cluster Identity:
  - Slurm Cluster Name: Must start with a lowercase letter and use only lowercase letters, numbers, and dashes (no trailing dash).
  - Parent K8s Cluster: Pre-filled from the previous step (read-only).

  Node Groups: Two node groups are pre-configured, Control Plane and Compute Group. Node group names are auto-generated based on the cluster name (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`).

  | Field | Default | Notes |
  |---|---|---|
  | Profile | None | Select a profile. There is no default; a selection is required. |
  | Max Nodes | 1 (Control Plane) / 10 (Compute) | Minimum 1 |

  The Profile field adapts based on the parent K8s cluster's provider:
  - AWS parent: Opens an instance type browser dialog. Select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`).
  - Cudo Compute parent: A dropdown with three pre-defined profiles:

    | Profile | vCPU | Memory |
    |---|---|---|
    | Small | 4 | 8 GiB |
    | Medium | 8 | 16 GiB |
    | Large | 16 | 32 GiB |

  Click + Add Compute Group to add additional compute node groups. At least one control plane group and one compute group are required.

  Partitions: A default partition named `compute` is pre-configured. Partitions route jobs to a specific node group. Add more partitions as needed.
- Submit: Click Create Slurm Cluster. The wizard shows a progress stepper:
  - Registering cluster: Creates the Slurm cluster record and provisions a Keycloak client.
  - Creating node groups: Provisions each node group sequentially on the parent K8s cluster (control plane, then compute groups) via vdeployer.
  - Creating Slurm cluster: vdeployer triggers Helm chart installation of `slurmctld`, `slurmdbd`, `slurmrestd`, `slurmd`, and optionally `slurm-bridge`.

The Slurm cluster enters `preparing` status and transitions to `ready` once all Slurm pods are running.
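The Slurm cluster naming rule and the auto-generated node group names can be expressed in a few lines. This is a sketch under stated assumptions, not Vantage's validator: the function names are hypothetical, and the numbering of compute groups beyond `-1` is extrapolated from the single example in the steps above.

```python
import re
from typing import List

# Starts with a lowercase letter; then lowercase letters, digits, or dashes;
# must not end with a dash.
SLURM_NAME_PATTERN = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_slurm_cluster_name(name: str) -> bool:
    """Check a name against the rule described for Slurm Cluster Name."""
    return SLURM_NAME_PATTERN.fullmatch(name) is not None


def node_group_names(name: str, compute_groups: int = 1) -> List[str]:
    """Return the control plane group name followed by the compute group names.

    Assumption: additional compute groups continue the -1, -2, ... sequence.
    """
    if compute_groups < 1:
        raise ValueError("at least one compute group is required")
    names = [f"slurm-control-{name}"]
    names += [f"slurm-compute-{name}-{i}" for i in range(1, compute_groups + 1)]
    return names


for candidate in ("hpc-1", "Hpc", "hpc-", "1hpc"):
    print(candidate, is_valid_slurm_cluster_name(candidate))
print(node_group_names("hpc-1", compute_groups=2))
```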
Azure and GCP K8s clusters do not support Slurm on Kubernetes. Only AWS and Cudo Compute parents can host a Slurm on Kubernetes deployment.
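The provider restriction and the provider-dependent Profile field can be modeled as a small lookup. The sketch below is illustrative: the provider identifiers (`"aws"`, `"cudo"`) and the returned dictionary shapes are assumptions, while the three Cudo profiles and their sizes are taken from the profile table in the steps above.

```python
# Cudo Compute profiles from the wizard: vCPU count and memory in GiB.
CUDO_PROFILES = {
    "Small": {"vcpus": 4, "memory_gib": 8},
    "Medium": {"vcpus": 8, "memory_gib": 16},
    "Large": {"vcpus": 16, "memory_gib": 32},
}

# Hypothetical provider identifiers for parents that can host Slurm.
SLURM_CAPABLE_PROVIDERS = frozenset({"aws", "cudo"})


def can_host_slurm(provider: str) -> bool:
    """True when a parent K8s cluster on this provider can host Slurm."""
    return provider in SLURM_CAPABLE_PROVIDERS


def resolve_profile(provider: str, profile: str) -> dict:
    """Translate a Profile selection into a compute specification."""
    if not can_host_slurm(provider):
        raise ValueError(f"{provider} clusters cannot host Slurm on Kubernetes")
    if provider == "aws":
        # On AWS the profile is an EC2 instance type chosen in the browser dialog.
        return {"instance_type": profile}
    try:
        return dict(CUDO_PROFILES[profile])
    except KeyError:
        raise ValueError(f"unknown Cudo Compute profile: {profile}") from None


print(resolve_profile("aws", "c5n.4xlarge"))
print(resolve_profile("cudo", "Medium"))
```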
What happens after submission
After submission, the cluster enters `preparing` status. A background thread handles all provisioning:
- STS AssumeRole (AWS only): Assumes the IAM role from the cloud account for temporary credentials.
- Network setup: Creates VPC, subnets, and security groups (or validates existing ones).
- IAM resources (AWS only): Creates instance roles and policies.
- Control plane launch: Creates the EC2 instance or VM with cloud-init that installs MicroK8s, LUKS encryption, and Vault KMS.
- `markClusterReady`: Cloud-init calls this mutation when setup completes. The cluster transitions to `ready`.
- vdeployer deploy: Deploys autoscaler, tunnel client, and enabled integrations.
Poll the cluster status every 30-60 seconds. AWS provisioning typically takes 10-15 minutes; Cudo Compute takes 10-25 minutes.
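A polling loop for the status check might look like the sketch below. It makes no assumption about how the status is fetched: `fetch_status` is a hypothetical callable you would implement against whatever API or CLI you use, and the 30-minute default timeout is an arbitrary choice that leaves headroom over the 10-25 minute range above.

```python
import time
from typing import Callable


def wait_while_preparing(
    fetch_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the cluster leaves `preparing` and return the new status.

    Raises TimeoutError if the cluster is still preparing after timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status != "preparing":
            return status
        if clock() >= deadline:
            raise TimeoutError("cluster is still preparing")
        sleep(interval_s)


# Demonstration with a canned status sequence and no real waiting.
statuses = iter(["preparing", "preparing", "ready"])
result = wait_while_preparing(lambda: next(statuses), sleep=lambda _: None)
print(result)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting; in normal use the defaults apply and the caller checks whether the returned status is `ready`.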