Node groups
A node group is the Kubernetes equivalent of a Slurm partition: a pool of identically sized machines that the cluster autoscaler scales up and down automatically. You define the minimum and maximum size when you create the group; Vantage handles provisioning through the cloud provider.
Each node group is a billing unit — you pay for every provisioned node, even if it isn't running a workload.
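As a rough illustration of why idle nodes matter for billing, the sketch below multiplies idle nodes by a per-node hourly rate. The function name, the rate, and the node counts are all placeholders chosen for the example; they are not Vantage or cloud-provider prices.

```python
def idle_cost(provisioned_nodes, busy_nodes, hourly_rate, hours):
    """Cost of nodes that are provisioned but not running a workload."""
    idle_nodes = max(provisioned_nodes - busy_nodes, 0)
    return idle_nodes * hourly_rate * hours

# Placeholder numbers: 4 nodes provisioned, 1 busy,
# 0.5 currency units per node-hour, over one day.
print(idle_cost(provisioned_nodes=4, busy_nodes=1, hourly_rate=0.5, hours=24))  # 36.0
```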
Roles
Every node group has a role:
| Role | Purpose |
|---|---|
| `control` | Cluster management: runs the K8s control plane components, etcd, and Vantage system services. The node group is typically named `control-plane`. |
| `worker` | Compute: runs user workloads (Workbench sessions, training jobs, model serving). The node group name is user-defined. |
A cluster must have at least one control plane group and at least one worker group.
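The role requirement can be checked mechanically. The following is a minimal sketch that assumes node groups are available as plain dictionaries with `name` and `role` keys; that shape is hypothetical and not a confirmed Vantage schema.

```python
from collections import Counter

# Hypothetical in-memory description of a cluster's node groups.
node_groups = [
    {"name": "control-plane", "role": "control"},
    {"name": "gpu-training", "role": "worker"},
]

def missing_roles(groups):
    """Return the required roles that no node group in `groups` provides."""
    counts = Counter(g["role"] for g in groups)
    return [role for role in ("control", "worker") if counts[role] == 0]

print(missing_roles(node_groups))  # [] when both roles are present
```

A cluster definition with only worker groups would yield `["control"]`, which a caller could turn into a validation error before provisioning.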
Instance selection
How you select compute resources depends on the cloud provider:
AWS: Instance type browser
AWS node groups use native EC2 instance types selected through the Vantage instance type browser:
- Search by family (e.g., `t3`, `c5n`, `p3`, `g5`).
- Filter by vCPU count, memory, and GPU count.
- Select from a curated list of supported instance types.
- Specify multiple instance types per group if needed; the autoscaler uses the EC2 Fleet API to diversify across them for best availability.
| AWS node group field | Notes |
|---|---|
| `instance_types` | List of EC2 instance types (e.g., `["t3.xlarge"]`). Multiple types enable Fleet-based diversification. |
| `allocation_strategy` | EC2 Fleet allocation strategy: `lowest-price` (default), `diversified`, or `capacity-optimized`. |
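To make the two fields concrete, here is a small sketch that models them as a Python dataclass and rejects unknown strategies. The `AwsNodeGroup` class is hypothetical, and the rule that `instance_types` must be non-empty is an assumption; only the field names, the three strategy values, and the default come from the table above.

```python
from dataclasses import dataclass

# The three strategies listed in the table above.
ALLOCATION_STRATEGIES = ("lowest-price", "diversified", "capacity-optimized")

@dataclass
class AwsNodeGroup:
    """Hypothetical container for the two AWS node group fields."""
    instance_types: list
    allocation_strategy: str = "lowest-price"  # documented default

    def __post_init__(self):
        # Assumption: a group needs at least one instance type.
        if not self.instance_types:
            raise ValueError("instance_types must list at least one EC2 instance type")
        if self.allocation_strategy not in ALLOCATION_STRATEGIES:
            raise ValueError(f"unknown allocation_strategy: {self.allocation_strategy!r}")

# Two instance types from the same family give the fleet more than one option.
group = AwsNodeGroup(
    instance_types=["g5.xlarge", "g5.2xlarge"],
    allocation_strategy="capacity-optimized",
)
print(group)
```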
Non-AWS: Pre-defined profiles
Azure, GCP, and Cudo Compute node groups use pre-defined profiles instead of instance types:
| Profile | vCPU | Memory | Notes |
|---|---|---|---|
| Small | 4 | 8 GiB | Lightweight workers |
| Medium | 8 | 16 GiB | General purpose |
| Large | 16 | 32 GiB | Memory-intensive |
Cudo Compute node groups have additional fields beyond profiles:
| Cudo node group field | Notes |
|---|---|
| `id` | Node group identifier (uses `id` instead of `name`). |
| `vcpus` | Number of vCPUs per node (instead of instance type). |
| `memory_gib` | Memory in GiB per node. |
| `boot_disk_size_gib` | Boot disk size in GiB. Configurable per group. |
| `data_center_id` | Cudo data center location (e.g., `us-dallas-1`). Per-group, not global. |
| `machine_type` | Hardware family (e.g., `intel-broadwell`, `intel-broadwell-v100`). |
| `gpus` | Number of GPUs per node (explicit, not inferred from instance type). |
| `gpu_model` | GPU model (e.g., `V100`, `A100`). |
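Put together, a Cudo node group might look like the dictionary below. This is an illustrative sketch: the field names come from the table above, but the values are made up, and treating every field as required is an assumption made only for the example check.

```python
# Hypothetical Cudo node group; values are illustrative only.
cudo_node_group = {
    "id": "gpu-workers",
    "vcpus": 8,
    "memory_gib": 32,
    "boot_disk_size_gib": 100,
    "data_center_id": "us-dallas-1",
    "machine_type": "intel-broadwell-v100",
    "gpus": 1,
    "gpu_model": "V100",
}

# Assumption for this sketch: all documented fields must be present.
EXPECTED_FIELDS = {
    "id", "vcpus", "memory_gib", "boot_disk_size_gib",
    "data_center_id", "machine_type", "gpus", "gpu_model",
}

def missing_fields(group):
    """Return the documented field names absent from `group`, sorted."""
    return sorted(EXPECTED_FIELDS - set(group))

print(missing_fields(cudo_node_group))  # [] when every field is set
```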
Labels and taints
Labels and taints control workload scheduling onto node groups:
- Labels: Key-value pairs attached to every node in the group. Workloads use `nodeSelector` or affinity rules to target specific labels.
- Taints: Prevent pods from scheduling onto a node unless they tolerate the taint. Used for GPU-only nodes, spot instances, or dedicated infrastructure.
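The interaction between labels and taints can be sketched in a few lines of Python. The model below is deliberately simplified: it treats every taint as blocking and ignores effects such as `PreferNoSchedule`, so it is not a reimplementation of the Kubernetes scheduler. The node and pod values are hypothetical; `nodeSelector` and `tolerations` are the standard Kubernetes pod spec fields.

```python
# A node as the scheduler sees it: labels plus taints (hypothetical values).
gpu_node = {
    "labels": {"workload-type": "ml-training"},
    "taints": [{"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}],
}

# Pod spec fragments using the standard nodeSelector and tolerations fields.
training_pod = {
    "nodeSelector": {"workload-type": "ml-training"},
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
    ],
}
cpu_pod = {"nodeSelector": {}, "tolerations": []}

def tolerates(toleration, taint):
    """Simplified match: same key and effect, and either Exists or an equal value."""
    if toleration.get("key") != taint["key"]:
        return False
    # An unset effect on the toleration matches any taint effect.
    if toleration.get("effect") not in (None, "", taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def can_schedule(pod, node):
    """True if the node has every selected label and the pod tolerates every taint."""
    labels_match = all(
        node["labels"].get(key) == value
        for key, value in pod.get("nodeSelector", {}).items()
    )
    taints_tolerated = all(
        any(tolerates(tol, taint) for tol in pod.get("tolerations", []))
        for taint in node["taints"]
    )
    return labels_match and taints_tolerated

print(can_schedule(training_pod, gpu_node))  # True
print(can_schedule(cpu_pod, gpu_node))       # False: the taint is not tolerated
```

Note that the label alone would not keep `cpu_pod` off the node; it is the taint that excludes pods without a matching toleration.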
Auto-injected labels
The autoscaler injects a `vc.pool: <node-group-name>` label on every node in the group, enabling Slurm-on-Kubernetes pod scheduling affinity.
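A workload can pin itself to one node group through this label. The sketch below builds a standard Kubernetes required node affinity on `vc.pool`; the helper name `pool_affinity` and the group name `gpu-training` are hypothetical, and whether Vantage generates such affinities itself is not stated here.

```python
def pool_affinity(node_group_name):
    """Build a Kubernetes node affinity that requires the given vc.pool label value."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": "vc.pool", "operator": "In", "values": [node_group_name]}
                        ]
                    }
                ]
            }
        }
    }

# "gpu-training" is a hypothetical node group name.
pod_spec = {"affinity": pool_affinity("gpu-training")}
print(pod_spec)
```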
Common label patterns
| Label | Value | Purpose |
|---|---|---|
| `node-role.kubernetes.io/control-plane` | `""` | Control plane nodes |
| `node-role.kubernetes.io/worker` | `""` | Worker nodes |
| `nvidia.com/gpu` | `"true"` | GPU-accelerated nodes |
| `workload-type` | `"ml-training"` | Training job affinity |
Autoscaling
Every node group has autoscaling bounds:
- Min size — The floor. Set to 0 to allow scale-to-zero when idle.
- Max size — The ceiling. The autoscaler never provisions above this limit.
The autoscaler watches for pending pods that cannot be scheduled on the existing nodes and scales the group up to make room for them. It scales down when nodes stay underutilized for a sustained period.
For AWS clusters, the autoscaler manages EC2 Fleet instances tagged with `vantage-cluster={client_id}`. For Cudo clusters, it uses Cudo-specific scaling profiles.
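The effect of the min and max bounds can be illustrated with a short calculation. This sketch is not the autoscaler's actual algorithm: it assumes a fixed, hypothetical `pods_per_node` packing factor and only shows how a target size gets clamped to the group's bounds.

```python
import math

def desired_size(current_nodes, pending_pods, pods_per_node, min_size, max_size):
    """Nodes needed to place the pending pods, clamped to the group's bounds."""
    extra_nodes = math.ceil(pending_pods / pods_per_node)
    return max(min_size, min(max_size, current_nodes + extra_nodes))

# 2 nodes running, 5 pending pods, 2 pods fit per node, bounds 0..4:
# 3 more nodes would be needed, but the ceiling caps the group at 4.
print(desired_size(current_nodes=2, pending_pods=5, pods_per_node=2, min_size=0, max_size=4))  # 4

# Nothing running and nothing pending with min_size=0: the group can sit at zero.
print(desired_size(current_nodes=0, pending_pods=0, pods_per_node=2, min_size=0, max_size=4))  # 0
```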
Best practices
- Separate workload types into different node groups: GPU training jobs and CPU preprocessing should use different groups so they don't compete for resources.
- Set min sizes conservatively: Idle nodes cost money. Start with min=0 and adjust once you understand your workload patterns.
- Use labels and taints for scheduling control: Label GPU node groups with `nvidia.com/gpu: "true"` so GPU workloads can target them, and add a matching taint so other workloads stay off them.
- Multiple instance types improve availability: On AWS, specifying multiple instance types per group gives EC2 Fleet more options during capacity constraints.