Federation concepts
Federations answer a simple question: what if you could submit a job without deciding which cluster should run it?
A federation groups multiple Vantage clusters into a single logical compute pool. Instead of choosing a cluster at submission time, you choose the federation, and Vantage routes the job to a member cluster that has the capacity to run it. The individual clusters keep their own schedulers, partitions, compute pools, and configuration — the federation is a layer above them, not a replacement for them.
Why federations exist
Without federations, every job submission requires an explicit cluster choice. That works when you have one or two clusters, but it breaks down as infrastructure grows:
- Capacity guesswork. Users must know which cluster has free resources before submitting. If they guess wrong, the job waits in a queue while another cluster sits idle.
- Manual failover. If a cluster goes down or hits a quota limit, someone has to resubmit the job somewhere else.
- Rigid boundaries. On-premises hardware and cloud clusters live in separate worlds. There is no built-in way to overflow from one to the other.
Federations remove these burdens by introducing an abstraction layer between the user and the underlying clusters. The user picks a federation; Vantage picks the cluster.
Mental model: the virtual cluster
Think of a federation as a virtual cluster that owns no compute resources itself. It is a dispatch layer that forwards each job to one of its member clusters.
This model has a few important properties:
- Clusters remain independent. Each cluster keeps its own scheduler type, partitions or compute pools, autoscaling rules, and provider configuration. Joining a federation does not change how a cluster operates internally.
- Membership is additive. A cluster can be added to or removed from a federation at any time through the CLI. Existing jobs on that cluster are not affected.
- A cluster belongs to at most one federation. This avoids conflicting routing decisions across overlapping federations.
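The membership rules above can be illustrated with a small in-memory model. This is only a sketch: `FederationRegistry`, its methods, and the cluster and federation names are hypothetical and are not part of the Vantage CLI or API. It shows the one-federation-per-cluster invariant and that a cluster can be moved once it has been removed.

```python
class FederationRegistry:
    """Toy model of federation membership (hypothetical, not the Vantage API)."""

    def __init__(self) -> None:
        # Maps each cluster name to the single federation it belongs to.
        self._federation_of: dict[str, str] = {}

    def add(self, federation: str, cluster: str) -> None:
        current = self._federation_of.get(cluster)
        if current is not None and current != federation:
            # A cluster belongs to at most one federation.
            raise ValueError(f"{cluster} already belongs to {current}")
        self._federation_of[cluster] = federation

    def remove(self, federation: str, cluster: str) -> None:
        # Removing a cluster only changes membership; the model never touches jobs.
        if self._federation_of.get(cluster) == federation:
            del self._federation_of[cluster]

    def members(self, federation: str) -> list[str]:
        return sorted(c for c, f in self._federation_of.items() if f == federation)


registry = FederationRegistry()
registry.add("research", "onprem-a")
registry.add("research", "aws-east")
registry.remove("research", "aws-east")  # leave one federation...
registry.add("burst", "aws-east")        # ...before joining another
```

Keying the mapping by cluster name makes the invariant structural: a cluster cannot be recorded in two federations at once.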
How routing works
When a job targets a federation, Vantage evaluates the member clusters and selects one based on capacity. The routing process considers:
| Factor | What Vantage checks |
|---|---|
| Cluster status | Only clusters in the Ready state are eligible. Clusters that are Preparing, Failed, or Deleting are skipped. |
| Available capacity | Vantage looks at the cluster's current resource utilization — free nodes, available GPUs, and queue depth — to determine whether the job can start promptly. |
| Partition and pool availability | The job's resource requirements (CPUs, memory, GPUs, time limit) must match at least one partition or compute pool on the candidate cluster. |
If multiple clusters can accept the job, Vantage selects the one with the most available capacity. If no cluster can accept the job immediately, it enters the federation queue and is dispatched as soon as a member cluster has room.
After routing, the submission detail page in the Vantage UI shows which cluster the job was assigned to. From that point on, the job behaves exactly as if it had been submitted directly to that cluster.
Routing is a one-time decision made at dispatch. A job does not migrate between clusters after it has been assigned.
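The routing rules described above can be sketched as a filter followed by a selection. This is a simplified model, not Vantage's implementation: the `Cluster`, `Partition`, and `Job` types are hypothetical, and available capacity is reduced to a single `free_nodes` number, whereas the real evaluation also weighs available GPUs and queue depth.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    max_cpus: int
    max_memory_gb: int
    max_gpus: int
    max_hours: float


@dataclass
class Cluster:
    name: str
    status: str        # "Ready", "Preparing", "Failed", or "Deleting"
    free_nodes: int    # assumption: a single-number stand-in for capacity
    partitions: list[Partition]


@dataclass
class Job:
    cpus: int
    memory_gb: int
    gpus: int
    hours: float


def fits(job: Job, p: Partition) -> bool:
    """True if the job's requirements fit within the partition's limits."""
    return (job.cpus <= p.max_cpus and job.memory_gb <= p.max_memory_gb
            and job.gpus <= p.max_gpus and job.hours <= p.max_hours)


def route(job: Job, members: list[Cluster]) -> Optional[Cluster]:
    """Pick the eligible cluster with the most free capacity, or None to queue."""
    eligible = [
        c for c in members
        if c.status == "Ready"                        # cluster status
        and c.free_nodes > 0                          # available capacity
        and any(fits(job, p) for p in c.partitions)   # partition or pool match
    ]
    if not eligible:
        return None  # the job would wait in the federation queue
    return max(eligible, key=lambda c: c.free_nodes)


members = [
    Cluster("onprem", "Ready", 2, [Partition(64, 256, 0, 48)]),
    Cluster("aws-east", "Ready", 10, [Partition(96, 384, 8, 24)]),
    Cluster("gcp-west", "Preparing", 50, [Partition(96, 384, 8, 24)]),
]

gpu_job = Job(cpus=32, memory_gb=128, gpus=1, hours=12)
print(route(gpu_job, members).name)  # aws-east
```

In this example `gcp-west` has the most free nodes but is skipped because it is not Ready, and `onprem` is skipped because none of its partitions offers a GPU. The function is called once per job, which mirrors the one-time nature of the routing decision.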
Mixing cluster types
Federations are not limited to a single scheduler or provider. A single federation can include:
- Slurm and Kubernetes clusters running different workload types
- Cloud clusters across AWS, Azure, and GCP in different regions
- On-premises clusters alongside cloud clusters
- Clusters provisioned through different methods (CloudFormation, Terraform, Ansible, Juju, manual connector)
This flexibility is what makes federations useful for hybrid and multi-cloud strategies. The federation does not need its member clusters to be identical — it only needs them to be registered in Vantage and in a Ready state.
However, keep in mind that a Slurm job script is not interchangeable with a Kubernetes workload definition. If a federation contains both Slurm and Kubernetes clusters, routing will only consider clusters whose scheduler type matches the submitted job type.
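As a rough illustration of that scheduler-type filter, the snippet below narrows a mixed federation to the clusters that can run a given job type. The `Scheduler` enum, the `candidates_for` helper, and the cluster names are hypothetical and only model the behavior described here.

```python
from enum import Enum


class Scheduler(Enum):
    SLURM = "slurm"
    KUBERNETES = "kubernetes"


def candidates_for(job_type: Scheduler, members: dict[str, Scheduler]) -> list[str]:
    """Return the member clusters whose scheduler matches the job type."""
    return [name for name, scheduler in members.items() if scheduler is job_type]


# A mixed federation: two Slurm clusters and one Kubernetes cluster.
members = {
    "onprem-slurm": Scheduler.SLURM,
    "aws-k8s": Scheduler.KUBERNETES,
    "gcp-slurm": Scheduler.SLURM,
}

print(candidates_for(Scheduler.SLURM, members))       # ['onprem-slurm', 'gcp-slurm']
print(candidates_for(Scheduler.KUBERNETES, members))  # ['aws-k8s']
```

A practical consequence is that a mixed federation behaves like two smaller pools, one per scheduler type, so capacity on the Kubernetes clusters does not help a queued Slurm job.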
When to use a federation
Federations add value when any of the following apply:
You have more clusters than your users want to think about. If your organization runs five clusters across three regions, forcing every user to pick the right one for each job creates friction. A federation lets users submit to a single endpoint and trust that the platform will make a reasonable placement decision.
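You need cloud bursting. A common pattern is to run day-to-day workloads on a fixed on-premises cluster and overflow to cloud clusters during peak demand. Place the on-premises cluster and one or more cloud clusters in a federation. When the on-premises cluster fills up, new jobs automatically route to the cloud. Because routing is capacity-based, jobs can also land on a cloud cluster before the on-premises cluster is full if the cloud cluster has more available capacity.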
You want multi-region redundancy. If a cluster in one region becomes unavailable (provider outage, quota exhaustion, maintenance window), jobs submitted to the federation are routed to healthy clusters in other regions without user intervention.
You are migrating between clusters. When replacing an old cluster with a new one, put both in the same federation. New jobs gradually land on the new cluster as its capacity increases, while the old cluster drains naturally. Once migration is complete, remove the old cluster from the federation.
When not to use a federation
Federations are not always the right choice:
- Single cluster. If you only have one cluster, a federation adds an unnecessary layer of indirection.
- Strict placement requirements. If every job must run on a specific cluster (for data locality, compliance, or licensing reasons), submit directly to that cluster instead of relying on federation routing.
- Workloads that need cross-cluster coordination. Federations route individual jobs; they do not provide cross-cluster MPI, shared filesystems, or inter-node communication between clusters. Each job runs entirely within one member cluster.
Relationship to other Vantage concepts
| Concept | Relationship to federations |
|---|---|
| Clusters | Federations group clusters. Each member cluster retains its own configuration and can still accept direct job submissions. |
| Partitions and compute pools | These exist within individual clusters. Federation routing checks partition and pool availability on each candidate cluster before dispatching. |
| Jobs | A job targets either a specific cluster or a federation. The submission detail page shows the final cluster assignment for federated jobs. |
| Cloud accounts | Clusters in a federation can belong to different cloud accounts and providers. The federation itself is provider-agnostic. |
Limitations and considerations
- CLI-only management. Federations are created, updated, and deleted through the Vantage CLI (`vantage cluster federation create|list|get|update|delete`). There is no web UI for federation administration. See the how-to guides and the CLI reference for details.
- No job migration. Once a job is routed to a cluster, it stays there. If that cluster fails after dispatch, the job must be resubmitted.
- No shared storage. Federations do not provide a shared filesystem across member clusters. If a job depends on input data, that data must be accessible from whichever cluster the job lands on.
- One federation per cluster. A cluster cannot belong to multiple federations simultaneously.
- Routing is capacity-based, not policy-based. You cannot currently define routing rules like "prefer on-prem" or "use cloud only as overflow." Vantage routes to the cluster with the most available capacity among eligible members.