Manage a Kubernetes cluster

Status lifecycle, detail page, node groups, and day-to-day management.

Status lifecycle

A Kubernetes cluster moves through these statuses:

| Status | Meaning |
|---|---|
| preparing | Infrastructure provisioning in progress. Vantage is creating the control plane, installing MicroK8s, and waiting for cloud-init to complete. |
| ready | Cluster is connected, vdeployer has deployed the autoscaler and integrations, and the cluster is accepting workloads. |
| failed | Provisioning error. Check the detail page for status details. |
| deleting | Cluster teardown in progress. Vantage is deprovisioning cloud resources and removing database records. |

Transitions:

  • preparing → ready: markClusterReady was called by cloud-init and vdeployer has deployed integrations.
  • preparing → failed: AWS API error, IAM role failure, VM creation timeout, or Keycloak error.
  • ready → deleting: User-initiated deletion.
  • failed → deleting: User-initiated deletion.
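
The transitions above can be sketched as a small lookup table. This is only an illustration of the lifecycle as documented here, not Vantage's actual implementation; the names `ClusterStatus`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical.

```python
from enum import Enum


class ClusterStatus(str, Enum):
    """The four statuses listed in the table above."""
    PREPARING = "preparing"
    READY = "ready"
    FAILED = "failed"
    DELETING = "deleting"


# Hypothetical transition table, copied from the documented transitions.
# "deleting" has no outgoing transitions because deletion is final.
ALLOWED_TRANSITIONS = {
    ClusterStatus.PREPARING: {ClusterStatus.READY, ClusterStatus.FAILED},
    ClusterStatus.READY: {ClusterStatus.DELETING},
    ClusterStatus.FAILED: {ClusterStatus.DELETING},
    ClusterStatus.DELETING: set(),
}


def can_transition(current: ClusterStatus, target: ClusterStatus) -> bool:
    """Return True if the documented lifecycle allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


if __name__ == "__main__":
    print(can_transition(ClusterStatus.PREPARING, ClusterStatus.READY))
    print(can_transition(ClusterStatus.READY, ClusterStatus.PREPARING))
```

One consequence worth noting: under these transitions a failed cluster cannot return to preparing, so the only documented way out of failed is deletion.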

Detail page tabs

Click a cluster row in the Clusters list to open the detail page:

| Tab | What you'll find |
|---|---|
| Overview | Cluster name, status, provider, region, creation time, client ID, and quick actions. |
| Nodes | Live node list with instance type, status, CPU/memory utilization, and node group membership. |
| Node Groups | Compute pool definitions: name, role (control/worker), instance type, min/max size, labels, and taints. Add, edit, or remove node groups. |
| Integrations | Status of installed platform integrations (JupyterHub, Grafana, Ray, MLflow). Enable or disable integrations. |
| Monitoring | Per-cluster Grafana dashboard with live utilization, accumulated cost, and node-level metrics. |
| Configuration | Cluster metadata and configuration parameters. |

Node groups

Node groups are pools of identically sized machines that the cluster autoscaler manages. Each node group has:

  • A role: control for cluster management nodes, worker for compute nodes.
  • An instance type or profile: Determines the compute resources per node.
  • Min/max size: Autoscaling bounds.
  • Labels and taints: Used for workload scheduling (e.g., GPU-only nodes, spot instances).

For a full guide, see Node groups.
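
To make the fields above concrete, here is a minimal sketch of a node group as a data structure, with the min/max bounds applied to a requested node count. The `NodeGroup` class, its field names, and the `clamp` helper are assumptions for illustration and do not reflect Vantage's API. The example values (`g5.xlarge`, the `accelerator` label, the `gpu=true:NoSchedule` taint string) are likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    """Hypothetical model of the node group fields described above."""
    name: str
    role: str            # "control" or "worker"
    instance_type: str   # or a profile name on non-AWS providers
    min_size: int
    max_size: int
    labels: dict[str, str] = field(default_factory=dict)
    taints: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject roles other than the two named in the documentation.
        if self.role not in ("control", "worker"):
            raise ValueError(f"unknown role: {self.role!r}")
        # The autoscaling bounds must form a valid, non-negative range.
        if not 0 <= self.min_size <= self.max_size:
            raise ValueError("expected 0 <= min_size <= max_size")

    def clamp(self, desired: int) -> int:
        """Limit a requested node count to the group's autoscaling bounds."""
        return max(self.min_size, min(desired, self.max_size))


if __name__ == "__main__":
    gpu = NodeGroup(
        name="gpu-workers",
        role="worker",
        instance_type="g5.xlarge",
        min_size=0,
        max_size=4,
        labels={"accelerator": "gpu"},
        taints=["gpu=true:NoSchedule"],
    )
    print(gpu.clamp(10))  # 4: capped at max_size
```

The clamp step reflects the general idea of autoscaling bounds: however many nodes the pending workloads would call for, the group never shrinks below its minimum or grows past its maximum.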

Monitoring

The Monitoring tab embeds a Grafana dashboard scoped to your cluster. Use it to:

  • Track CPU, memory, and GPU utilization across all nodes.
  • Monitor cluster autoscaler activity and node group scaling events.
  • View accumulated spend and current burn rate.
  • Identify underutilized nodes for cost optimization.

Grafana is automatically deployed when the cluster is created (enabled by default in integrations).
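
As a rough illustration of two of the dashboard's uses, the sketch below computes a current burn rate and flags underutilized nodes from a list of per-node samples. It does not describe how the Grafana dashboard computes its panels. The `NodeSample` type, its fields, the per-node hourly cost, and the 20% threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hypothetical per-node snapshot; field names are assumptions."""
    name: str
    cpu_utilization: float     # fraction between 0.0 and 1.0
    memory_utilization: float  # fraction between 0.0 and 1.0
    hourly_cost_usd: float


def burn_rate(nodes: list[NodeSample]) -> float:
    """Current spend per hour: the sum of each running node's hourly cost."""
    return sum(n.hourly_cost_usd for n in nodes)


def underutilized(nodes: list[NodeSample], threshold: float = 0.2) -> list[str]:
    """Names of nodes whose CPU and memory usage are both below the threshold."""
    return [
        n.name
        for n in nodes
        if n.cpu_utilization < threshold and n.memory_utilization < threshold
    ]


if __name__ == "__main__":
    samples = [
        NodeSample("worker-1", 0.05, 0.10, 0.5),
        NodeSample("worker-2", 0.80, 0.60, 1.5),
    ]
    print(burn_rate(samples))       # 2.0
    print(underutilized(samples))   # ['worker-1']
```

Requiring both CPU and memory to be low avoids flagging a node that is idle on CPU but holding a memory-heavy workload.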

Adding node groups post-creation

  1. Open the cluster detail page and go to the Node Groups tab.
  2. Click Add Node Group.
  3. Configure the group: name, role, instance type (or profile for non-AWS), min/max size, and optional labels/taints.
  4. Click Save. The autoscaler picks up the new group and provisions nodes as needed.

Enabling or disabling integrations

  1. Open the cluster detail page and go to the Integrations tab.
  2. Toggle integrations on or off.
  3. Changes take effect after vdeployer redeploys the affected components.

Deleting a cluster

  1. Open the cluster detail page.
  2. Click the overflow menu (three dots) in the top-right corner.
  3. Select Delete Cluster and confirm.

Vantage tears down all cloud resources — EC2 instances (or Cudo VMs), VPC, subnets, IAM roles, security groups, Launch Templates — along with database records and Keycloak clients. Deletion is irreversible.

Deleting a parent K8s cluster also deletes any Slurm-on-Kubernetes clusters deployed on it.
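
Because deletion cascades, it can help to list what a delete would take with it before confirming. The following is a minimal sketch under the assumption that you keep your own mapping from each Slurm-on-Kubernetes cluster to its parent. The function name and the mapping are hypothetical and are not part of Vantage.

```python
def clusters_removed_by_delete(target: str, parent_of: dict[str, str]) -> list[str]:
    """List the target cluster plus every Slurm cluster deployed on it.

    parent_of maps a Slurm-on-Kubernetes cluster name to the name of its
    parent Kubernetes cluster (a hypothetical, user-maintained mapping).
    """
    children = sorted(
        child for child, parent in parent_of.items() if parent == target
    )
    return [target, *children]


if __name__ == "__main__":
    parents = {"slurm-a": "k8s-prod", "slurm-b": "k8s-dev", "slurm-c": "k8s-prod"}
    print(clusters_removed_by_delete("k8s-prod", parents))
    # ['k8s-prod', 'slurm-a', 'slurm-c']
```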
