Manage a Kubernetes cluster

Status lifecycle

A Kubernetes cluster moves through these statuses:

Status	Meaning
`preparing`	Infrastructure provisioning in progress. Vantage is creating the control plane, installing MicroK8s, and waiting for cloud-init to complete.
`ready`	Cluster is connected, vdeployer has deployed the autoscaler and integrations, and the cluster is accepting workloads.
`failed`	Provisioning failed. Open the cluster detail page to see the error details.
`deleting`	Cluster teardown in progress. Vantage is deprovisioning cloud resources and removing database records.

Transitions:

preparing → ready: cloud-init calls markClusterReady and vdeployer has deployed the integrations.
preparing → failed: AWS API error, IAM role failure, VM creation timeout, or Keycloak error.
ready → deleting: user-initiated deletion.
failed → deleting: user-initiated deletion.
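
The lifecycle above can be read as a small state machine. The following is a minimal sketch in Python, not Vantage's implementation: the names `ClusterStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical, and the table encodes only the four transitions listed in this section.

```python
from enum import Enum


class ClusterStatus(Enum):
    """The four statuses from the table above (hypothetical enum)."""
    PREPARING = "preparing"
    READY = "ready"
    FAILED = "failed"
    DELETING = "deleting"


# Only the transitions documented in this section; the real system
# may allow others that are not listed here.
ALLOWED_TRANSITIONS = {
    ClusterStatus.PREPARING: {ClusterStatus.READY, ClusterStatus.FAILED},
    ClusterStatus.READY: {ClusterStatus.DELETING},
    ClusterStatus.FAILED: {ClusterStatus.DELETING},
    ClusterStatus.DELETING: set(),
}


def can_transition(current: ClusterStatus, target: ClusterStatus) -> bool:
    """Return True if the documented lifecycle allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]
```

A client that polls cluster status could use a check like this to flag an unexpected change, for example a cluster that reports `preparing` after it has already been `ready`.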

Detail page tabs

Click a cluster row in the Clusters list to open the detail page:

Tab	What you'll find
Overview	Cluster name, status, provider, region, creation time, client ID, and quick actions.
Nodes	Live node list with instance type, status, CPU/memory utilization, and node group membership.
Node Groups	Compute pool definitions — name, role (control/worker), instance type, min/max size, labels, and taints. Add, edit, or remove node groups.
Integrations	Status of installed platform integrations (JupyterHub, Grafana, Ray, MLflow). Enable or disable integrations.
Monitoring	Per-cluster Grafana dashboard with live utilization, accumulated cost, and node-level metrics.
Configuration	Cluster metadata and configuration parameters.

Node groups

Node groups are pools of identically-sized machines that the cluster autoscaler manages. Each node group has:

A role — control for cluster management nodes, worker for compute nodes.
An instance type or profile — Determines the compute resources per node.
Min/max size — Autoscaling bounds.
Labels and taints — Used for workload scheduling (e.g., GPU-only nodes, spot instances).

For a full guide, see Node groups.
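
To make the fields concrete, here is a minimal Python sketch of a node group definition with basic sanity checks. It is illustrative only: the `NodeGroup` class, its field names, and the validation rules are assumptions, not Vantage's schema. The taint string follows the usual Kubernetes `key=value:Effect` notation.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    """Hypothetical in-memory model of a node group (not Vantage's schema)."""
    name: str
    role: str            # "control" or "worker"
    instance_type: str   # or a profile name on non-AWS providers
    min_size: int
    max_size: int
    labels: dict[str, str] = field(default_factory=dict)
    taints: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if self.role not in ("control", "worker"):
            problems.append(f"unknown role: {self.role!r}")
        if self.min_size < 0:
            problems.append("min_size must not be negative")
        if self.max_size < self.min_size:
            problems.append("max_size must be >= min_size")
        return problems


# Example: a GPU-only worker pool that can scale down to zero nodes.
gpu_pool = NodeGroup(
    name="gpu-workers",
    role="worker",
    instance_type="g5.xlarge",
    min_size=0,
    max_size=4,
    labels={"accelerator": "gpu"},
    taints=["accelerator=gpu:NoSchedule"],
)
```

Keeping `min_size` at 0 for expensive pools lets the autoscaler remove all nodes in the group when no workload needs them, assuming the platform permits an empty group.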

Monitoring

The Monitoring tab embeds a Grafana dashboard scoped to your cluster. Use it to:

Track CPU, memory, and GPU utilization across all nodes.
Monitor cluster autoscaler activity and node group scaling events.
View accumulated spend and current burn rate.
Identify underutilized nodes for cost optimization.

Grafana is enabled by default in the cluster's integrations, so it is deployed automatically when the cluster is created.
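
As a rough illustration of the last point, identifying underutilized nodes comes down to filtering on utilization thresholds. The sketch below assumes you have copied per-node utilization (as fractions between 0 and 1) out of the dashboard into a list of dictionaries. The data shape, the function name, and the 20% default thresholds are all assumptions made for illustration, not a Vantage or Grafana API.

```python
def underutilized_nodes(nodes, cpu_threshold=0.2, memory_threshold=0.2):
    """Return names of nodes whose CPU and memory utilization are both
    below the given thresholds (fractions between 0 and 1)."""
    return [
        node["name"]
        for node in nodes
        if node["cpu"] < cpu_threshold and node["memory"] < memory_threshold
    ]


# Hypothetical sample data; real values would come from the dashboard.
sample = [
    {"name": "worker-1", "cpu": 0.72, "memory": 0.65},
    {"name": "worker-2", "cpu": 0.05, "memory": 0.11},
    {"name": "worker-3", "cpu": 0.10, "memory": 0.40},
]

candidates = underutilized_nodes(sample)
```

Requiring both metrics to be low avoids flagging nodes that are idle on CPU but hold a large in-memory workload, such as `worker-3` in the sample.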

Adding node groups post-creation

Open the cluster detail page and go to the Node Groups tab.
Click Add Node Group.
Configure the group: name, role, instance type (or profile for non-AWS providers), min/max size, and optional labels/taints.
Click Save. The autoscaler picks up the new group and provisions nodes as needed.

Enabling or disabling integrations

Open the cluster detail page and go to the Integrations tab.
Toggle integrations on or off.
Changes take effect after vdeployer redeploys the affected components.

Deleting a cluster

Open the cluster detail page.
Click the overflow menu (three dots) in the top-right corner.
Select Delete Cluster and confirm.

Vantage tears down all cloud resources — EC2 instances (or Cudo VMs), VPC, subnets, IAM roles, security groups, Launch Templates — along with database records and Keycloak clients. Deletion is irreversible.

Deleting a parent Kubernetes cluster also deletes any Slurm-on-Kubernetes clusters deployed on it.
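
Because deletion cascades and cannot be undone, it can help to list what a delete would remove before confirming. The following Python sketch works on a hypothetical inventory you maintain yourself: the `id`, `kind`, and `parent_id` fields and the `slurm-on-k8s` value are assumptions for illustration and do not describe a Vantage API.

```python
def clusters_removed_by_delete(cluster_id, clusters):
    """Return the IDs removed when cluster_id is deleted: the cluster
    itself plus any Slurm-on-Kubernetes clusters deployed on it."""
    removed = [cluster_id]
    for cluster in clusters:
        is_child = cluster.get("parent_id") == cluster_id
        if is_child and cluster.get("kind") == "slurm-on-k8s":
            removed.append(cluster["id"])
    return removed


# Hypothetical inventory of clusters.
inventory = [
    {"id": "k8s-prod", "kind": "kubernetes", "parent_id": None},
    {"id": "slurm-a", "kind": "slurm-on-k8s", "parent_id": "k8s-prod"},
    {"id": "slurm-b", "kind": "slurm-on-k8s", "parent_id": "k8s-dev"},
]

affected = clusters_removed_by_delete("k8s-prod", inventory)
```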
