Kubernetes clusters
Kubernetes clusters are Vantage's platform clusters — they run MicroK8s with a Vantage-managed control plane and power every higher-level Vantage product: Workbench sessions, training jobs, model endpoints, pipelines, and sweeps. They also serve as parent clusters for Slurm-on-Kubernetes deployments.
Vantage handles the full lifecycle: cloud provisioning (VPC, IAM, control plane instance), K8s installation, node autoscaling, and integration deployment. You interact with the cluster through Vantage — no kubectl or direct cluster access required for day-to-day use.
How it works
When you create a Kubernetes cluster, Vantage:
- Validates input — Checks cluster name, cloud account credentials, instance types, and subscription limits.
- Creates database records — Inserts the cluster record with status = preparing.
- Creates a Keycloak client — Registers an OAuth2 client for the cluster.
- Provisions infrastructure (background thread) — This step varies by provider:
  - AWS: Assumes the IAM role, creates the VPC, subnets, and security groups (or uses existing ones), creates IAM roles and instance profiles, and launches a control plane EC2 instance whose cloud-init installs MicroK8s and sets up LUKS encryption and Vault KMS.
  - Cudo Compute: Discovers machine types in the data center, creates a storage disk, and provisions a control plane VM with cloud-init.
  - Azure / GCP: Uses Vantage-managed defaults for the control plane.
  - On-premises: No cloud provisioning; Vantage waits for the agent.
- Transitions to ready — The control plane's cloud-init script calls markClusterReady once MicroK8s, encryption, and Vault are set up.
- Deploys integrations — vdeployer-web deploys the cluster autoscaler, tunnel client, and any enabled integrations (JupyterHub, Grafana, Ray, MLflow).
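The creation flow above can be pictured as a small state machine with a background worker. The following is a minimal sketch, not Vantage's implementation: the names `Cluster`, `ClusterStatus`, `create_cluster`, `provision`, and `mark_cluster_ready`, as well as the provider identifiers, are assumptions made for illustration, and the database, Keycloak, and integration steps are omitted.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum

# Provider identifiers are illustrative, not Vantage's actual values.
PROVIDERS = {"aws", "cudo", "azure", "gcp", "on-premises"}


class ClusterStatus(Enum):
    PREPARING = "preparing"
    READY = "ready"


@dataclass
class Cluster:
    name: str
    provider: str
    status: ClusterStatus = ClusterStatus.PREPARING
    events: list = field(default_factory=list)


def mark_cluster_ready(cluster: Cluster) -> None:
    """Hypothetical stand-in for the markClusterReady callback."""
    if cluster.status is not ClusterStatus.PREPARING:
        raise ValueError(f"cannot mark {cluster.status.value} cluster ready")
    cluster.status = ClusterStatus.READY
    cluster.events.append("ready")


def provision(cluster: Cluster) -> None:
    """Placeholder for the provider-specific provisioning step."""
    cluster.events.append(f"provisioning on {cluster.provider}")
    # In the described flow, cloud-init on the control plane reports
    # readiness; this sketch calls the transition directly.
    mark_cluster_ready(cluster)


def create_cluster(name: str, provider: str):
    """Validate input, record the cluster, and provision in the background."""
    if not name:
        raise ValueError("cluster name is required")
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    cluster = Cluster(name=name, provider=provider)  # status = preparing
    worker = threading.Thread(target=provision, args=(cluster,), daemon=True)
    worker.start()
    return cluster, worker


if __name__ == "__main__":
    cluster, worker = create_cluster("demo", "aws")
    worker.join()  # wait for the background provisioning thread
    print(cluster.status.value, cluster.events)
```

Returning the record immediately while provisioning continues in a thread mirrors the description: the caller sees a cluster in the preparing state, and only the readiness callback moves it to ready.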
Provider comparison
| Aspect | AWS | Cudo Compute | Azure / GCP | On-premises / LXD |
|---|---|---|---|---|
| Control plane | EC2 instance (boto3) | VM (Cudo API) | Vantage-managed | Agent-based |
| Instance selection | EC2 type browser | Resource profiles (vcpus + memory) | Vantage-managed defaults | Your hardware |
| VPC / networking | VPC + subnets (auto or existing) | Data center + machine type | Vantage-managed | Your network |
| GPU support | GPU instance types | Explicit gpus + gpu_model fields | GPU instance types | Your GPUs |
| Custom networking | VPC, subnet, security group | Per-group data center | No | No |
| Slurm on K8s supported | Yes | Yes | No | No |
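For tooling that needs to branch on provider, part of the comparison table can be encoded as data. This is a sketch under assumptions: the `ProviderCapabilities` type, the lowercase provider keys, and the helper functions are hypothetical and are not part of any Vantage API. Only the values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    control_plane: str
    custom_networking: tuple  # empty tuple means "No" in the table
    slurm_on_k8s: bool


# Keys are illustrative identifiers; values mirror the comparison table.
CAPABILITIES = {
    "aws": ProviderCapabilities(
        "EC2 instance", ("VPC", "subnet", "security group"), True
    ),
    "cudo": ProviderCapabilities(
        "VM (Cudo API)", ("per-group data center",), True
    ),
    "azure": ProviderCapabilities("Vantage-managed", (), False),
    "gcp": ProviderCapabilities("Vantage-managed", (), False),
    "on-premises": ProviderCapabilities("Agent-based", (), False),
}


def supports_slurm_on_k8s(provider: str) -> bool:
    """Return True if the provider can host a Slurm-on-Kubernetes cluster."""
    try:
        return CAPABILITIES[provider.lower()].slurm_on_k8s
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None


def slurm_parent_providers() -> list:
    """List providers usable as Slurm-on-Kubernetes parents, sorted by name."""
    return sorted(p for p, c in CAPABILITIES.items() if c.slurm_on_k8s)


if __name__ == "__main__":
    print(slurm_parent_providers())
```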
Slurm on Kubernetes
You can deploy a Slurm scheduler on top of an existing Kubernetes cluster — combining HPC batch scheduling with cloud-native autoscaling. This is available for:
- AWS K8s parent clusters — Node groups use EC2 instance types selected from the instance browser.
- Cudo Compute K8s parent clusters — Node groups use pre-defined profiles (Small, Medium, Large).
Azure, GCP, and on-premises / LXD K8s clusters do not currently support Slurm-on-Kubernetes.
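The difference in how node groups are selected on each parent provider can be illustrated with a small validation helper. This is a hedged sketch: `validate_node_group`, the provider keys, and the returned dictionary shape are hypothetical. The EC2 check only tests for the usual `family.size` naming pattern and does not confirm that an instance type exists.

```python
# Profile names come from the text above; everything else is illustrative.
CUDO_PROFILES = {"Small", "Medium", "Large"}
SLURM_PARENT_PROVIDERS = {"aws", "cudo"}


def validate_node_group(parent_provider: str, selection: str) -> dict:
    """Check a node group selection against the parent cluster's provider."""
    provider = parent_provider.lower()
    if provider not in SLURM_PARENT_PROVIDERS:
        raise ValueError(
            f"{parent_provider} clusters do not support Slurm-on-Kubernetes"
        )
    if provider == "cudo":
        # Cudo Compute parents use pre-defined profiles.
        if selection not in CUDO_PROFILES:
            raise ValueError(f"unknown Cudo profile: {selection}")
        return {"provider": "cudo", "profile": selection}
    # AWS parents use EC2 instance types, which are named family.size.
    family, dot, size = selection.partition(".")
    if not (family and dot and size):
        raise ValueError(f"not an EC2 instance type: {selection}")
    return {"provider": "aws", "instance_type": selection}


if __name__ == "__main__":
    print(validate_node_group("aws", "g5.xlarge"))
    print(validate_node_group("cudo", "Medium"))
```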
For details, see creating a Slurm-on-Kubernetes cluster.
Next steps
- Create a Kubernetes cluster — Provider-specific walkthroughs
- Manage a Kubernetes cluster — Status lifecycle, detail page, monitoring
- Node groups — Compute pools and autoscaling
- Integrations — JupyterHub, Grafana, Ray, MLflow
- Reference — Fields, limits, error codes