Manage a Kubernetes cluster
Status lifecycle
A Kubernetes cluster moves through these statuses:
| Status | Meaning |
|---|---|
| `preparing` | Infrastructure provisioning in progress. Vantage is creating the control plane, installing MicroK8s, and waiting for cloud-init to complete. |
| `ready` | Cluster is connected, vdeployer has deployed the autoscaler and integrations, and the cluster is accepting workloads. |
| `failed` | Provisioning error. Check the detail page for status details. |
| `deleting` | Cluster teardown in progress. Vantage is deprovisioning cloud resources and removing database records. |
Transitions:
- `preparing` → `ready`: `markClusterReady` is called by cloud-init and vdeployer has deployed integrations.
- `preparing` → `failed`: AWS API error, IAM role failure, VM creation timeout, or Keycloak error.
- `ready` → `deleting`: user-initiated deletion.
- `failed` → `deleting`: user-initiated deletion.
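The transitions above form a small state machine. The sketch below is only an illustration of that lifecycle, not Vantage's implementation: the names `ClusterStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical, and treating `deleting` as a terminal state is an assumption based on the list above.

```python
from enum import Enum


class ClusterStatus(Enum):
    PREPARING = "preparing"
    READY = "ready"
    FAILED = "failed"
    DELETING = "deleting"


# Allowed transitions, copied from the list above.
# Assumption: `deleting` has no outgoing transitions.
ALLOWED_TRANSITIONS = {
    ClusterStatus.PREPARING: {ClusterStatus.READY, ClusterStatus.FAILED},
    ClusterStatus.READY: {ClusterStatus.DELETING},
    ClusterStatus.FAILED: {ClusterStatus.DELETING},
    ClusterStatus.DELETING: set(),
}


def can_transition(current: ClusterStatus, target: ClusterStatus) -> bool:
    """Return True if the documented lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```

For example, `can_transition(ClusterStatus.FAILED, ClusterStatus.DELETING)` is true, while a failed cluster cannot move back to `ready`; it has to be deleted and recreated.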
Detail page tabs
Click a cluster row in the Clusters list to open the detail page:
| Tab | What you'll find |
|---|---|
| Overview | Cluster name, status, provider, region, creation time, client ID, and quick actions. |
| Nodes | Live node list with instance type, status, CPU/memory utilization, and node group membership. |
| Node Groups | Compute pool definitions — name, role (control/worker), instance type, min/max size, labels, and taints. Add, edit, or remove node groups. |
| Integrations | Status of installed platform integrations (JupyterHub, Grafana, Ray, MLflow). Enable or disable integrations. |
| Monitoring | Per-cluster Grafana dashboard with live utilization, accumulated cost, and node-level metrics. |
| Configuration | Cluster metadata and configuration parameters. |
Node groups
Node groups are pools of identically-sized machines that the cluster autoscaler manages. Each node group has:
- A role: `control` for cluster management nodes, `worker` for compute nodes.
- An instance type or profile: determines the compute resources per node.
- Min/max size: the autoscaling bounds.
- Labels and taints: used for workload scheduling (e.g., GPU-only nodes, spot instances).
For a full guide, see Node groups.
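To make the shape of a node group concrete, here is a minimal sketch of the properties listed above as a Python data class with basic validation. The class, its field names, and the example values (instance type, label, taint) are hypothetical and chosen for illustration; the actual Vantage schema may differ.

```python
from dataclasses import dataclass, field

VALID_ROLES = {"control", "worker"}


@dataclass
class NodeGroup:
    # Hypothetical field names; not taken from the Vantage API.
    name: str
    role: str
    instance_type: str
    min_size: int
    max_size: int
    labels: dict[str, str] = field(default_factory=dict)
    taints: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the group looks valid."""
        problems = []
        if self.role not in VALID_ROLES:
            problems.append(f"role must be one of {sorted(VALID_ROLES)}, got {self.role!r}")
        if self.min_size < 0:
            problems.append("min_size must not be negative")
        if self.max_size < self.min_size:
            problems.append("max_size must be >= min_size")
        return problems


# Example: a GPU-only worker pool that can scale down to zero nodes.
gpu_workers = NodeGroup(
    name="gpu-workers",
    role="worker",
    instance_type="g5.xlarge",
    min_size=0,
    max_size=4,
    labels={"accelerator": "gpu"},
    taints=["gpu=true:NoSchedule"],
)
```

Calling `gpu_workers.validate()` returns an empty list, whereas a group with an unknown role or with `max_size` below `min_size` returns one message per problem.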
Monitoring
The Monitoring tab embeds a Grafana dashboard scoped to your cluster. Use it to:
- Track CPU, memory, and GPU utilization across all nodes.
- Monitor cluster autoscaler activity and node group scaling events.
- View accumulated spend and current burn rate.
- Identify underutilized nodes for cost optimization.
Grafana is deployed automatically when the cluster is created, because the Grafana integration is enabled by default.
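The cost-related checks in the list above come down to simple arithmetic over per-node metrics. The sketch below assumes you have exported utilization and hourly price per node yourself; `NodeUsage`, `underutilized`, `burn_rate`, the 20% threshold, and the sample values are all made up for illustration and do not reflect how the Vantage dashboard computes its panels.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    name: str
    cpu_utilization: float     # fraction between 0.0 and 1.0
    memory_utilization: float  # fraction between 0.0 and 1.0
    hourly_cost_usd: float


def underutilized(nodes, threshold=0.2):
    """Return nodes whose CPU and memory utilization are both below the threshold."""
    return [
        n for n in nodes
        if n.cpu_utilization < threshold and n.memory_utilization < threshold
    ]


def burn_rate(nodes):
    """Sum the hourly cost of all nodes, in USD per hour."""
    return sum(n.hourly_cost_usd for n in nodes)


# Illustrative sample data, not real measurements.
nodes = [
    NodeUsage("worker-1", 0.72, 0.65, 0.50),
    NodeUsage("worker-2", 0.05, 0.10, 0.50),
    NodeUsage("gpu-1", 0.10, 0.45, 1.25),
]

idle = [n.name for n in underutilized(nodes)]  # only worker-2 is below 20% on both metrics
```

Requiring both metrics to be low is a deliberate choice here: `gpu-1` has low CPU use but moderate memory use, so it is not flagged.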
Adding node groups post-creation
- Open the cluster detail page and go to the Node Groups tab.
- Click Add Node Group.
- Configure the group: name, role, instance type (or profile for non-AWS), min/max size, and optional labels/taints.
- Click Save. The autoscaler picks up the new group and provisions nodes as needed.
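The min/max size you set in the steps above bounds how far the group can scale. As a simplified sketch of that rule only (the function `clamp_node_count` is hypothetical, and a real autoscaler also weighs pending workloads and other signals), a desired node count is kept inside the configured range like this:

```python
def clamp_node_count(desired: int, min_size: int, max_size: int) -> int:
    """Clamp a desired node count into the group's [min_size, max_size] range."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired, max_size))
```

With a group configured for 1 to 4 nodes, a demand for 10 nodes yields 4, and a demand for 0 nodes yields 1.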
Enabling or disabling integrations
- Open the cluster detail page and go to the Integrations tab.
- Toggle integrations on or off.
- Changes take effect after vdeployer redeploys the affected components.
Deleting a cluster
- Open the cluster detail page.
- Click the overflow menu (three dots) in the top-right corner.
- Select Delete Cluster and confirm.
Vantage tears down all cloud resources — EC2 instances (or Cudo VMs), VPC, subnets, IAM roles, security groups, Launch Templates — along with database records and Keycloak clients. Deletion is irreversible.
Deleting a parent K8s cluster also deletes any Slurm-on-Kubernetes clusters deployed on it.