Kubernetes clusters

Managed platform clusters for Workbench sessions, ML training, and containerized workloads.

Kubernetes clusters are Vantage's platform clusters. They run MicroK8s with a Vantage-managed control plane and power every higher-level Vantage product: Workbench sessions, training jobs, model endpoints, pipelines, and sweeps. They also serve as parent clusters for Slurm-on-Kubernetes deployments.

Vantage handles the full lifecycle: cloud provisioning (VPC, IAM, control plane instance), Kubernetes installation, node autoscaling, and integration deployment. You interact with the cluster through Vantage; day-to-day use requires no kubectl or direct cluster access.

How it works

When you create a Kubernetes cluster, Vantage:

  1. Validates input — Checks cluster name, cloud account credentials, instance types, and subscription limits.
  2. Creates database records — Inserts the cluster record with status = preparing.
  3. Creates a Keycloak client — Registers an OAuth2 client for the cluster.
  4. Provisions infrastructure (background thread) — This step varies by provider:
    • AWS: Assumes the IAM role, creates VPC/subnets/security groups (or uses existing ones), creates IAM roles and instance profiles, and launches a control plane EC2 instance whose cloud-init script installs MicroK8s and sets up LUKS encryption and Vault KMS.
    • Cudo Compute: Discovers machine types in the data center, creates a storage disk, provisions a control plane VM with cloud-init.
    • Azure / GCP: Uses Vantage-managed defaults for the control plane.
    • On-premises: No cloud provisioning — waits for the agent.
  5. Transitions to ready — The control plane's cloud-init script calls markClusterReady once MicroK8s, encryption, and Vault are set up.
  6. Deploys integrations — vdeployer-web deploys the cluster autoscaler, tunnel client, and any enabled integrations (JupyterHub, Grafana, Ray, MLflow).
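
The status handling in steps 1, 2, 4, and 5 above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not Vantage's implementation: the `Cluster` class, the `create_cluster` and `provision` functions, and the provider keys are hypothetical names. Only the `preparing` and `ready` statuses and the `markClusterReady` callback name come from the text above. Keycloak registration and integration deployment are left out.

```python
import threading
from dataclasses import dataclass, field

# Hypothetical provider keys; the step descriptions paraphrase the list above.
PROVISION_STEPS = {
    "aws": [
        "assume IAM role",
        "create or reuse VPC, subnets, security groups",
        "create IAM roles and instance profiles",
        "launch control plane EC2 instance",
    ],
    "cudo": [
        "discover machine types",
        "create storage disk",
        "provision control plane VM",
    ],
    "azure": ["apply Vantage-managed defaults"],
    "gcp": ["apply Vantage-managed defaults"],
    "on_prem": [],  # no cloud provisioning; the cluster waits for the agent
}


@dataclass
class Cluster:
    name: str
    provider: str
    status: str = "preparing"  # initial status of a new cluster record
    completed_steps: list = field(default_factory=list)


def mark_cluster_ready(cluster: Cluster) -> None:
    """Stand-in for the markClusterReady call made by cloud-init."""
    cluster.status = "ready"


def provision(cluster: Cluster) -> None:
    """Run the provider-specific steps, then flip the status."""
    for step in PROVISION_STEPS[cluster.provider]:
        cluster.completed_steps.append(step)
    # How on-premises clusters become ready is not described in the source,
    # so this sketch leaves them in "preparing".
    if cluster.provider != "on_prem":
        mark_cluster_ready(cluster)


def create_cluster(name: str, provider: str):
    """Validate input, create the record, and provision in a background thread."""
    if not name:
        raise ValueError("cluster name is required")
    if provider not in PROVISION_STEPS:
        raise ValueError(f"unknown provider: {provider}")
    cluster = Cluster(name=name, provider=provider)
    worker = threading.Thread(target=provision, args=(cluster,), daemon=True)
    worker.start()
    return cluster, worker
```

Calling `create_cluster("demo", "aws")` returns immediately with a record in `preparing`; joining the returned thread and re-reading `cluster.status` shows the transition to `ready`.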

Provider comparison

| Aspect | AWS | Cudo Compute | Azure / GCP | On-premises / LXD |
| Control plane | EC2 instance (boto3) | VM (Cudo API) | Vantage-managed | Agent-based |
| Instance selection | EC2 type browser | Resource profiles (vcpus + memory) | Vantage-managed defaults | Your hardware |
| VPC / networking | VPC + subnets (auto or existing) | Data center + machine type | Vantage-managed | Your network |
| GPU support | GPU instance types | Explicit gpus + gpu_model fields | GPU instance types | Your GPUs |
| Custom networking | VPC, subnet, security group | Per-group data center | No | No |
| Slurm on K8s supported | Yes | Yes | No | No |
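
For tooling that needs to branch on provider, the yes/no rows of the comparison can be encoded as data. The sketch below is an assumption-laden illustration: `ProviderCapabilities`, `CAPABILITIES`, `providers_supporting`, and the provider keys are hypothetical names, and only the values themselves are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    control_plane: str
    custom_networking: bool
    slurm_on_k8s: bool


# Values copied from the comparison table; the keys are hypothetical.
CAPABILITIES = {
    "aws": ProviderCapabilities("EC2 instance", True, True),
    "cudo": ProviderCapabilities("VM", True, True),
    "azure": ProviderCapabilities("Vantage-managed", False, False),
    "gcp": ProviderCapabilities("Vantage-managed", False, False),
    "on_prem": ProviderCapabilities("Agent-based", False, False),
}


def providers_supporting(feature: str) -> list[str]:
    """Return the sorted provider keys whose boolean `feature` flag is True."""
    return sorted(
        name
        for name, caps in CAPABILITIES.items()
        if getattr(caps, feature) is True
    )
```

For example, `providers_supporting("slurm_on_k8s")` selects the providers that can act as Slurm parent clusters.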

Slurm on Kubernetes

You can deploy a Slurm scheduler on top of an existing Kubernetes cluster — combining HPC batch scheduling with cloud-native autoscaling. This is available for:

  • AWS K8s parent clusters — Node groups use EC2 instance types selected from the instance browser.
  • Cudo Compute K8s parent clusters — Node groups use pre-defined profiles (Small, Medium, Large).

Azure, GCP, and on-premises K8s clusters do not currently support Slurm-on-Kubernetes.
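
The node group rules above can be checked before a Slurm-on-Kubernetes request is submitted. The following is a minimal sketch, not a Vantage API: `NodeGroup`, `validate_node_group`, and the provider keys are hypothetical, and the only values taken from the text are the Cudo profile names and the two supported parent providers.

```python
from dataclasses import dataclass

CUDO_PROFILES = {"Small", "Medium", "Large"}   # pre-defined profiles named above
SLURM_PARENT_PROVIDERS = {"aws", "cudo"}       # hypothetical provider keys


@dataclass
class NodeGroup:
    name: str
    size: str  # EC2 instance type on AWS, profile name on Cudo Compute


def validate_node_group(parent_provider: str, group: NodeGroup) -> None:
    """Raise ValueError if the group cannot be used with this parent cluster."""
    if parent_provider not in SLURM_PARENT_PROVIDERS:
        raise ValueError(
            f"Slurm-on-Kubernetes is not supported on {parent_provider!r} parents"
        )
    if parent_provider == "cudo" and group.size not in CUDO_PROFILES:
        raise ValueError(
            f"Cudo node groups must use one of {sorted(CUDO_PROFILES)}"
        )
    if parent_provider == "aws" and not group.size:
        raise ValueError("AWS node groups need an EC2 instance type")
```

The AWS branch only checks that an instance type is present, because the set of valid EC2 types comes from the instance browser and is not listed here.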

For details, see creating a Slurm-on-Kubernetes cluster.
