Kubernetes clusters
Kubernetes clusters are Vantage's platform clusters — they run MicroK8s with a Vantage-managed control plane and power every higher-level Vantage product: Workbench sessions, training jobs, model endpoints, pipelines, and sweeps. They also serve as parent clusters for Slurm-on-Kubernetes deployments.
Vantage handles the full lifecycle: cloud provisioning (VPC, IAM, control plane instance), K8s installation, node autoscaling, and integration deployment. You interact with the cluster through Vantage — no kubectl or direct cluster access required for day-to-day use.
How it works
When you create a Kubernetes cluster, Vantage:
- Validates input — Checks cluster name, cloud account credentials, instance types, and subscription limits.
- Creates database records — Inserts the cluster record with `status = preparing`.
- Creates a Keycloak client — Registers an OAuth2 client for the cluster.
- Provisions infrastructure (background thread) — This step varies by provider:
  - AWS: Assumes the IAM role, creates the VPC, subnets, and security groups (or reuses existing ones), creates IAM roles and instance profiles, and launches a control plane EC2 instance whose cloud-init script installs MicroK8s and sets up LUKS encryption and Vault KMS.
  - Azure / GCP: Uses Vantage-managed defaults for the control plane.
  - On-premises: Performs no cloud provisioning and waits for the connector.
- Transitions to ready — The control plane's cloud-init script calls `markClusterReady` once MicroK8s, encryption, and Vault are set up.
- Deploys integrations — vdeployer-web deploys the cluster autoscaler, tunnel client, and any enabled integrations (JupyterHub, Grafana, Ray, MLflow).
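The sequence above can be read as a small state machine. The following is a minimal sketch of that flow, not Vantage's implementation: every name in it (`Cluster`, `Provider`, `create_cluster`, `mark_cluster_ready`, `provisioning_plan`) is hypothetical, and the steps are recorded as plain strings rather than executed against any cloud API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ClusterStatus(Enum):
    PREPARING = "preparing"
    READY = "ready"


class Provider(Enum):
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    ON_PREMISES = "on_premises"


@dataclass
class Cluster:
    """Hypothetical stand-in for the cluster database record."""
    name: str
    provider: Provider
    status: ClusterStatus = ClusterStatus.PREPARING
    events: List[str] = field(default_factory=list)


def provisioning_plan(provider: Provider) -> List[str]:
    """Return the provider-specific provisioning steps described in the list above."""
    if provider is Provider.AWS:
        return [
            "assume IAM role",
            "create or reuse VPC, subnets, security groups",
            "create IAM roles and instance profiles",
            "launch control plane EC2 instance with cloud-init",
        ]
    if provider in (Provider.AZURE, Provider.GCP):
        return ["apply Vantage-managed control plane defaults"]
    return ["wait for the connector"]  # on-premises: no cloud provisioning


def create_cluster(name: str, provider: Provider) -> Cluster:
    """Validate input, create the record, and plan provisioning."""
    if not name:
        raise ValueError("cluster name is required")
    cluster = Cluster(name=name, provider=provider)  # status starts as preparing
    cluster.events.append("register OAuth2 client")
    cluster.events.extend(provisioning_plan(provider))
    return cluster


def mark_cluster_ready(cluster: Cluster) -> None:
    """Illustrative analogue of markClusterReady: preparing -> ready, then integrations."""
    if cluster.status is not ClusterStatus.PREPARING:
        raise RuntimeError("cluster is not in the preparing state")
    cluster.status = ClusterStatus.READY
    cluster.events.append("deploy integrations")
```

In this sketch the ready transition is a separate call because, per the steps above, it is triggered by the control plane's cloud-init script rather than by the request that created the cluster.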
Provider comparison
| Aspect | AWS | Azure / GCP | On-Premises |
|---|---|---|---|
| Control plane | EC2 instance (boto3) | Vantage-managed | Connector-based (Multipass and Juju are Slurm-only) |
| Instance selection | EC2 type browser | Vantage-managed defaults | Your hardware or local VMs |
| VPC / networking | VPC + subnets (auto or existing) | Vantage-managed | Your network |
| GPU support | GPU instance types | GPU instance types | Your GPUs |
| Custom networking | VPC, subnet, security group | No | No |
| Slurm on K8s supported | Yes | No | No |
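If you script cluster requests, it can help to encode the last two rows of the comparison as data and check a request before submitting it. The sketch below is illustrative only: `ProviderCapabilities`, `CAPABILITIES`, and `check_request` are hypothetical names, not part of any Vantage SDK, and the values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProviderCapabilities:
    control_plane: str
    custom_networking: bool
    slurm_on_k8s: bool


# Values mirror the provider comparison table; the keys are assumed identifiers.
CAPABILITIES: Dict[str, ProviderCapabilities] = {
    "aws": ProviderCapabilities("EC2 instance", True, True),
    "azure": ProviderCapabilities("Vantage-managed", False, False),
    "gcp": ProviderCapabilities("Vantage-managed", False, False),
    "on_premises": ProviderCapabilities("Connector-based", False, False),
}


def check_request(provider: str,
                  wants_custom_networking: bool = False,
                  wants_slurm: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the request fits the table."""
    caps = CAPABILITIES.get(provider)
    if caps is None:
        return [f"unknown provider: {provider}"]
    problems = []
    if wants_custom_networking and not caps.custom_networking:
        problems.append(f"{provider} does not support custom networking")
    if wants_slurm and not caps.slurm_on_k8s:
        problems.append(f"{provider} does not support Slurm on Kubernetes")
    return problems
```

For example, `check_request("gcp", wants_slurm=True)` returns one problem, while `check_request("aws", wants_custom_networking=True, wants_slurm=True)` returns an empty list.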
Slurm on Kubernetes
You can deploy a Slurm scheduler on top of an existing Kubernetes cluster — combining HPC batch scheduling with cloud-native autoscaling. This is available for:
- AWS K8s parent clusters — Compute pools use EC2 instance types selected from the instance browser.
Non-AWS public cloud K8s clusters (Azure, GCP) do not currently support Slurm-on-Kubernetes.
For details, see creating a Slurm-on-Kubernetes cluster.
Next steps
- Create a Kubernetes cluster — Provider-specific walkthroughs
- Manage a Kubernetes cluster — Status lifecycle, detail page, monitoring
- Compute pools — Compute pools and autoscaling
- Integrations — JupyterHub, Grafana, Ray, MLflow
- Reference — Fields, limits, error codes
- On-Premises clusters — Connect your own Kubernetes infrastructure