Create a Kubernetes cluster

Step-by-step guides for creating Kubernetes clusters on every supported provider, and for deploying Slurm on Kubernetes.

Prerequisites

  • A Vantage account and organization.
  • A configured Cloud Account for your chosen provider — see Compute Providers.
  • AWS only: An SSH key pair created in the target AWS region. Vantage uses the key pair name when creating the control plane EC2 instance.
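To confirm the AWS prerequisite before opening the wizard, you can list the key pairs in the target region. The following is a minimal sketch, assuming you pass in a boto3 EC2 client for that region; the helper names `list_key_pair_names` and `key_pair_exists` are hypothetical and are not part of Vantage.

```python
def list_key_pair_names(ec2_client):
    """Return the key pair names visible to the client's region, sorted."""
    response = ec2_client.describe_key_pairs()
    return sorted(kp["KeyName"] for kp in response.get("KeyPairs", []))


def key_pair_exists(ec2_client, key_name):
    """True if a key pair called `key_name` exists in the client's region."""
    return key_name in list_key_pair_names(ec2_client)


# Example usage (requires boto3 and AWS credentials; region is illustrative):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(key_pair_exists(ec2, "my-key"))
```

Key pairs are regional, so the client must be created for the same region you plan to select in the wizard.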

AWS

Vantage provisions AWS K8s clusters with direct boto3 API calls (not CloudFormation). It creates the VPC, IAM roles, and security groups, then launches a control plane EC2 instance with MicroK8s pre-configured.

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Kubernetes and click Continue.

  3. Configure the cluster:

    • Enter a Cluster Name (max 27 characters, must be unique) and optionally a Description.
    • Select your AWS Cloud Account.
    • Pick a Region — the dropdown loads after you select the cloud account.
    • Click Select Control Plane to choose an EC2 instance type for the cluster management nodes. Browse by vCPU, GPU, and price.
    • Select an SSH Key Name — the list loads after you pick a region. If it's empty, create a key pair in the AWS EC2 console first.
  4. Advanced networking (optional) — Click Advanced Options to specify:

    • VPC ID — Deploy into an existing VPC. A new VPC is created if omitted.
    • Subnet ID — Existing subnet in the VPC. Resolved automatically if omitted.
  5. Select platform integrations — Choose which tools to install on the cluster. See Integrations for details.

    | Integration | Purpose | Default |
    | --- | --- | --- |
    | Notebook | JupyterHub for interactive sessions | Enabled |
    | Grafana + Prometheus | Cluster monitoring and observability | Enabled |
    | Ray | Distributed ML training framework | Disabled |
    | MLflow | ML experiment tracking | Disabled |
    | Slurm on Kubernetes | Deploy a Slurm scheduler on this cluster later | Disabled |
  6. Submit — Click Prepare Cluster. The cluster enters preparing status. AWS provisioning typically takes 10-15 minutes.
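The wizard inputs above can be checked locally before you submit. This is a sketch only: `AwsClusterForm` and its field names are hypothetical stand-ins for the wizard fields, not a Vantage API type. The 27-character limit and the integration defaults come from the steps above. Name uniqueness cannot be checked offline and is left to Vantage.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_CLUSTER_NAME_LENGTH = 27  # limit stated in the Configure step

# Default integration toggles as listed in the integrations step.
DEFAULT_INTEGRATIONS = {
    "Notebook": True,
    "Grafana + Prometheus": True,
    "Ray": False,
    "MLflow": False,
    "Slurm on Kubernetes": False,
}


@dataclass
class AwsClusterForm:
    """Hypothetical container for the wizard fields (not a Vantage type)."""

    name: str
    cloud_account: str
    region: str
    control_plane_instance_type: str
    ssh_key_name: str
    description: str = ""
    vpc_id: Optional[str] = None  # a new VPC is created if omitted
    subnet_id: Optional[str] = None  # resolved automatically if omitted
    integrations: dict = field(default_factory=lambda: dict(DEFAULT_INTEGRATIONS))

    def problems(self):
        """Return a list of problems; an empty list means the form looks complete."""
        found = []
        if not self.name:
            found.append("Cluster Name is required")
        elif len(self.name) > MAX_CLUSTER_NAME_LENGTH:
            found.append(f"Cluster Name exceeds {MAX_CLUSTER_NAME_LENGTH} characters")
        required = {
            "AWS Cloud Account": self.cloud_account,
            "Region": self.region,
            "Control Plane instance type": self.control_plane_instance_type,
            "SSH Key Name": self.ssh_key_name,
        }
        found += [f"{label} is required" for label, value in required.items() if not value]
        return found
```

Calling `problems()` on a filled-in form returns an empty list. A name longer than 27 characters or a missing required field produces one message per issue.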

What Vantage provisions on AWS

| Resource | Details |
| --- | --- |
| VPC | 10.0.0.0/16 CIDR (auto-created if not provided) |
| Subnets | Public + private subnets |
| Internet Gateway + NAT Gateway | For outbound and inbound connectivity |
| Security groups | Default VPC security group for inter-node communication |
| IAM Role | vantage-{client_id}-node-role with EC2 trust policy |
| Instance Profile | vantage-{client_id}-instance-profile linked to the IAM role |
| IAM Policies | AmazonEBSCSIDriverPolicy, AmazonEFSCSIDriverPolicy, AmazonFSxFullAccess, plus a custom inline policy for EC2 Fleet management and launch template operations |
| Control plane EC2 instance | vantage-{client_id}-control-plane with Ubuntu 24.04, MicroK8s, LUKS encryption, Vault KMS, and Vantage Agent |
| Launch Templates | Created by the autoscaler at runtime (one per node group) |
| EC2 Fleet instances | Tagged vantage-cluster={client_id}, managed by the autoscaler |
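The naming and tagging conventions listed above make it possible to locate a cluster's resources in your AWS account. The sketch below derives the expected names from a `client_id` and lists instances carrying the `vantage-cluster` tag. The function names are hypothetical, and the code assumes a boto3 EC2 client is passed in; it only reads from AWS.

```python
def aws_resource_names(client_id):
    """Expected resource names for a cluster, following the documented patterns."""
    return {
        "iam_role": f"vantage-{client_id}-node-role",
        "instance_profile": f"vantage-{client_id}-instance-profile",
        "control_plane_instance": f"vantage-{client_id}-control-plane",
    }


def cluster_tag_filters(client_id):
    """EC2 filter list matching instances tagged vantage-cluster={client_id}."""
    return [{"Name": "tag:vantage-cluster", "Values": [client_id]}]


def tagged_instance_ids(ec2_client, client_id):
    """Return the IDs of all instances that carry the cluster tag."""
    paginator = ec2_client.get_paginator("describe_instances")
    ids = []
    for page in paginator.paginate(Filters=cluster_tag_filters(client_id)):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.append(instance["InstanceId"])
    return ids
```

With real credentials you would create the client with `boto3.client("ec2", region_name=...)` for the cluster's region and pass it to `tagged_instance_ids`.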

Cudo Compute

Cudo Compute K8s clusters provision a control plane VM through the Cudo Compute REST API. Unlike AWS, compute is specified by raw resources (vcpus + memory_gib) rather than instance types, and each node group has its own data center.

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Kubernetes and click Continue.

  3. Configure the cluster:

    • Enter a Cluster Name (max 27 characters, must be unique) and optionally a Description.
    • Select your Cudo Compute Cloud Account.
  4. Select platform integrations — Same options as AWS (JupyterHub, Grafana, Ray, MLflow).

  5. Submit — Click Prepare Cluster. Cudo provisioning typically takes 10-25 minutes (longer than AWS due to VM provisioning + cloud-init).

What Vantage provisions on Cudo

| Resource | Details |
| --- | --- |
| Storage disk | {vm_id}-storage, 100 GiB |
| Control plane VM | {cluster_name}-control-0-{timestamp}, Ubuntu 24.04, MicroK8s, LUKS, Vault KMS |
| VM metadata | vantage-cluster: {client_id}, vantage-role: control-plane |

Azure

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Kubernetes and click Continue.

  3. Configure:

    • Enter a Cluster Name and optional Description.
    • Select your Azure Cloud Account.
  4. Select platform integrations — Same options as other providers.

  5. Submit — Click Prepare Cluster. Azure Kubernetes clusters use Vantage-managed defaults for node sizing and networking. Review your cloud account's regional quota before submitting.

GCP

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Kubernetes and click Continue.

  3. Configure:

    • Enter a Cluster Name and optional Description.
    • Select your GCP Cloud Account.
  4. Select platform integrations — Same options as other providers.

  5. Submit — Click Prepare Cluster. GCP Kubernetes clusters use Vantage-managed defaults. Verify your project's quota before submitting.

On-premises / LXD

On-premises Kubernetes clusters connect through a lightweight agent, the same way on-premises Slurm clusters do. Vantage does not provision cloud resources.

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Kubernetes and click Continue.

  3. Configure:

    • Enter a Cluster Name.
    • Select your On-Premises or LXD cloud account.
  4. Get the agent command — The wizard shows the agent installation command. Copy and run it on your cluster's head node. The agent establishes an outbound HTTPS connection to Vantage.

Slurm on Kubernetes

You can deploy a Slurm scheduler on top of an existing Kubernetes cluster (AWS or Cudo Compute). This gives you HPC batch scheduling on cloud-native, auto-scaled infrastructure.

Prerequisites

  • An existing Kubernetes cluster with Ready status — see AWS or Cudo Compute above.
  • The parent K8s cluster must have Slurm on Kubernetes enabled in its integrations.

Steps

  1. Open Clusters — Click Clusters, then Prepare Cluster.

  2. Choose type — Select Slurm on Kubernetes and click Continue.

  3. Select parent K8s cluster — A grid shows your available Kubernetes clusters. Click the target cluster to select it, then click Configure Slurm Cluster.

  4. Configure the Slurm cluster:

    Cluster Identity:

    • Slurm Cluster Name — Must start with a lowercase letter and use only lowercase letters, numbers, and dashes (no trailing dash).
    • Parent K8s Cluster — Pre-filled from the previous step (read-only).

    Node Groups: Two node groups are pre-configured — Control Plane and Compute Group. Node group names are auto-generated based on the cluster name (e.g., slurm-control-{name} and slurm-compute-{name}-1).

    | Field | Default | Notes |
    | --- | --- | --- |
    | Profile | (none) | Select a profile. No default; a selection is required. |
    | Max Nodes | 1 (Control Plane) / 10 (Compute) | Minimum 1 |

    The Profile field adapts based on the parent K8s cluster's provider:

    • AWS parent — Opens an instance type browser dialog. Select any EC2 instance type (e.g., t3.medium, c5n.4xlarge).

    • Cudo Compute parent — A dropdown with three pre-defined profiles:

      | Profile | vCPU | Memory |
      | --- | --- | --- |
      | Small | 4 | 8 GiB |
      | Medium | 8 | 16 GiB |
      | Large | 16 | 32 GiB |

    Click + Add Compute Group to add additional compute node groups. At least one control plane group and one compute group are required.

    Partitions: A default partition named compute is pre-configured. Partitions route jobs to a specific node group. Add more partitions as needed.

  5. Submit — Click Create Slurm Cluster. The wizard shows a progress stepper:

    1. Registering cluster — Creates the Slurm cluster record and provisions a Keycloak client.
    2. Creating node groups — Provisions each node group sequentially on the parent K8s cluster (control plane, then compute groups) via vdeployer.
    3. Creating Slurm cluster — vdeployer triggers Helm chart installation: slurmctld, slurmdbd, slurmrestd, slurmd, and optionally slurm-bridge.

The Slurm cluster enters preparing status and transitions to ready once all Slurm pods are running.
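The Slurm Cluster Name rule from step 4 can be expressed as a regular expression, which is handy if you generate names in a script. This is a sketch under stated assumptions: the function names are hypothetical, and the numbering of additional compute groups (`-2`, `-3`, and so on) is assumed from the documented `slurm-compute-{name}-1` example rather than confirmed.

```python
import re

# Starts with a lowercase letter; then lowercase letters, digits, or dashes;
# must not end with a dash.
_SLURM_NAME = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_slurm_cluster_name(name):
    """True if `name` satisfies the documented Slurm Cluster Name rule."""
    return _SLURM_NAME.fullmatch(name) is not None


def default_node_group_names(name, compute_groups=1):
    """Auto-generated node group names for one control plane and N compute groups.

    Numbering beyond the first compute group is an assumption.
    """
    names = [f"slurm-control-{name}"]
    names += [f"slurm-compute-{name}-{i}" for i in range(1, compute_groups + 1)]
    return names
```

For example, `is_valid_slurm_cluster_name("hpc-1")` is true, while `"1hpc"`, `"HPC"`, and `"hpc-"` are rejected.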

Tip: Azure and GCP K8s clusters do not support Slurm on Kubernetes. Only AWS and Cudo Compute parents can host a Slurm on Kubernetes deployment.
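The provider rules for Slurm on Kubernetes can be summarized in a small lookup. The sketch below is illustrative only: the provider keys (`"aws"`, `"cudo"`) and the function `resolve_profile` are hypothetical. The Cudo Compute resource values come from the profile table in the configuration step, using the `vcpus` and `memory_gib` field names mentioned in the Cudo Compute section.

```python
# Providers that can host a Slurm on Kubernetes deployment (keys are assumed).
SLURM_CAPABLE_PROVIDERS = {"aws", "cudo"}

# Pre-defined Cudo Compute profiles from the Profile table.
CUDO_PROFILES = {
    "Small": {"vcpus": 4, "memory_gib": 8},
    "Medium": {"vcpus": 8, "memory_gib": 16},
    "Large": {"vcpus": 16, "memory_gib": 32},
}


def resolve_profile(provider, profile):
    """Translate a node group Profile selection into provider-specific sizing."""
    if provider not in SLURM_CAPABLE_PROVIDERS:
        raise ValueError(f"{provider!r} parents cannot host Slurm on Kubernetes")
    if provider == "aws":
        # On AWS the profile is an EC2 instance type, e.g. "t3.medium".
        return {"instance_type": profile}
    if profile not in CUDO_PROFILES:
        raise ValueError(f"unknown Cudo Compute profile: {profile!r}")
    return dict(CUDO_PROFILES[profile])
```

An AWS parent passes the instance type through unchanged, while a Cudo Compute parent maps the profile name to raw resources.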

What happens after submission

After submission, the cluster enters preparing status. A background thread then handles provisioning in the following order:

  1. STS AssumeRole (AWS only) — Assumes the IAM role from the cloud account for temporary credentials.
  2. Network setup — Creates VPC, subnets, and security groups (or validates existing ones).
  3. IAM resources (AWS only) — Creates instance roles and policies.
  4. Control plane launch — Creates the EC2 instance or VM with cloud-init that installs MicroK8s, LUKS encryption, and Vault KMS.
  5. markClusterReady — Cloud-init calls this mutation when setup completes. The cluster transitions to ready.
  6. vdeployer deploy — Deploys autoscaler, tunnel client, and enabled integrations.

Tip: Poll the cluster status every 30-60 seconds. AWS provisioning typically takes 10-15 minutes; Cudo Compute takes 10-25 minutes.
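The polling advice above can be turned into a small loop. This sketch assumes you supply your own `get_status` callable that returns the cluster's current status string; how that status is fetched is not covered here, so no Vantage API call is shown. Only the `preparing` and `ready` statuses named on this page are recognized, and any other value is treated as "keep waiting" until the timeout.

```python
import time


def wait_until_ready(
    get_status,
    interval_s=30,          # poll every 30-60 seconds
    timeout_s=25 * 60,      # upper end of the Cudo Compute estimate
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll `get_status()` until it returns "ready" or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"cluster still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without waiting. In normal use you would leave them at their defaults and pick a timeout that matches your provider.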
