Troubleshooting
Cluster stays in "preparing" status
Cloud provisioning is asynchronous. If a cluster remains in preparing for longer than expected:
- AWS clusters — Check the CloudFormation console in your AWS account for the stack named after your cluster. Look for failed events.
- Cudo Compute clusters — Verify the data center has capacity for your requested machine type. Some data centers have limited GPU availability.
- All cloud clusters — Check that the cloud account credentials (IAM role ARN, API key) are still valid. An invalid cloud account prevents provisioning from completing.
If provisioning times out, the cluster transitions to failed. Inspect creation_status_details on the cluster detail page for the specific error.
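For the AWS check above, failed stack events can also be filtered programmatically instead of scanning the console. The following is a minimal sketch, assuming events in the shape returned by CloudFormation's DescribeStackEvents API (`LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`). The helper name `failed_events` and the sample data are hypothetical and not part of Vantage.

```python
from __future__ import annotations

from typing import Iterable


def failed_events(events: Iterable[dict]) -> list[tuple[str, str, str]]:
    """Return (logical id, status, reason) for every event whose status ends in _FAILED."""
    failures = []
    for event in events:
        status = event.get("ResourceStatus", "")
        if status.endswith("_FAILED"):
            failures.append((
                event.get("LogicalResourceId", "unknown"),
                status,
                event.get("ResourceStatusReason", "no reason given"),
            ))
    return failures


# Hypothetical sample data in the DescribeStackEvents response shape.
sample_events = [
    {"LogicalResourceId": "HeadNode", "ResourceStatus": "CREATE_IN_PROGRESS"},
    {
        "LogicalResourceId": "HeadNode",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Insufficient capacity",
    },
    {"LogicalResourceId": "Vpc", "ResourceStatus": "CREATE_COMPLETE"},
]

for logical_id, status, reason in failed_events(sample_events):
    print(f"{logical_id}: {status} ({reason})")
```

With boto3 installed and AWS credentials configured, real events could be obtained from `boto3.client("cloudformation").describe_stack_events(StackName=...)["StackEvents"]`, passing the stack named after your cluster.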
Cluster creation fails immediately
If the create mutation returns an error and the cluster never enters preparing:
| Error | Likely cause |
|---|---|
| ClusterNameInUse | A cluster with that name already exists in your organization. |
| InvalidInput | Missing required field or invalid combination (e.g., SSH key name not provided for AWS). |
| SubscriptionLimitReached | Your tier's cluster limit has been reached. Contact your admin to upgrade. |
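If you create clusters from a script, the error names in the table can be mapped to next steps. This is an illustrative sketch only: it assumes the error name reaches your code as a plain string, which the table does not confirm, and `explain_error` is a hypothetical helper rather than part of any Vantage client.

```python
# Error names come from the table above; the remediation text paraphrases it.
REMEDIATION = {
    "ClusterNameInUse": "Choose a cluster name that is not already used in your organization.",
    "InvalidInput": "Check for a missing required field or an invalid combination, "
                    "such as no SSH key name for AWS.",
    "SubscriptionLimitReached": "Your tier's cluster limit has been reached; "
                                "ask your admin to upgrade.",
}


def explain_error(code: str) -> str:
    """Map an error name to a next step; unknown names get a generic hint."""
    return REMEDIATION.get(code, f"Unrecognized error '{code}'; check the full error message.")


print(explain_error("ClusterNameInUse"))
```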
Cannot select an SSH key
If the SSH key name dropdown is empty:
- Ensure you have selected a cloud account and region first — the list loads based on those selections.
- Create an SSH key pair in the target AWS region through the EC2 console. Vantage only needs the key pair name, not the private key.
Node group or partition shows zero nodes
- Cloud clusters — Verify the minimum size is set to at least 1 if you expect nodes to always be present. Autoscaling scales to zero when min_size = 0.
- Slurm on Kubernetes — Ensure the parent K8s cluster has sufficient capacity and the autoscaler is enabled. The AWS autoscaler uses EC2 Fleet to provision instances.
- On-premises clusters — Nodes must be registered manually on your infrastructure. Run the agent installation command on each node.
Cluster not appearing in the list
- Verify you're in the correct workspace (check the workspace picker in the top-right corner).
- Check the Slurm and Kubernetes tabs — clusters only appear under their type's tab.
- On-premises clusters require the agent to establish an outbound connection to Vantage. Verify port 443 outbound is not blocked by a firewall.
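The outbound port 443 check can be run from the node itself. Below is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and the host to test is whichever Vantage endpoint your agent is configured to reach, which this page does not specify.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Call it as `can_connect("<your Vantage endpoint>")`. A result of `False` suggests a firewall, proxy, or DNS problem on the path. A result of `True` only confirms that the TCP connection opens; it does not prove that TLS traffic passes an inspecting proxy.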
vdeployer or integration deployment fails
Integrations (JupyterHub, Grafana, Ray, MLflow) are deployed by vdeployer-web after the cluster reaches ready. If integrations fail:
- Confirm the parent cluster's HTTPS tunnel is active.
- Check the cluster detail page for error messages in the integration status section.
- Retry from the cluster detail page's integrations tab.
Still stuck?
Contact Vantage support with your cluster name and organization ID. Include any error messages from the cluster detail page.