Slurm vs. Kubernetes clusters
Vantage provisions and manages both Slurm and Kubernetes clusters. They solve different problems, target different workloads, and expose different abstractions. This page explains the core differences so you can pick the right cluster type for your situation — or combine both through a federation.
Two scheduling philosophies
Slurm and Kubernetes grew out of different communities and optimize for different things.
Slurm is a batch-job scheduler born in the HPC world. You write a shell script with #SBATCH directives that declare resources (CPUs, GPUs, memory, wall time), submit it to a partition, and Slurm queues the job until nodes that satisfy the request are available. The scheduler is partition-centric: each partition is a named queue backed by a pool of nodes with shared properties such as instance type, maximum run time, priority class, and access controls. Jobs run directly on the host operating system and share a filesystem; multi-node jobs typically communicate over high-speed interconnects using MPI.
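To make the batch-script model concrete, here is a minimal sketch that renders a script with #SBATCH directives from a small job description. The `JobSpec` class, the partition name `gpu`, and the command are illustrative assumptions, not part of Vantage; the directives themselves are standard `sbatch` options.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Resources a batch job declares up front (all values are illustrative)."""
    name: str
    partition: str       # hypothetical partition name
    nodes: int
    cpus_per_task: int
    mem_gb: int          # memory per node, in gigabytes
    gpus: int            # GPUs per node; 0 means no GPU request
    wall_time: str       # HH:MM:SS
    command: str


def render_sbatch(spec: JobSpec) -> str:
    """Render a batch script whose #SBATCH lines declare the job's resources."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --partition={spec.partition}",
        f"#SBATCH --nodes={spec.nodes}",
        f"#SBATCH --cpus-per-task={spec.cpus_per_task}",
        f"#SBATCH --mem={spec.mem_gb}G",
        f"#SBATCH --time={spec.wall_time}",
    ]
    if spec.gpus > 0:
        # Generic-resource request; only meaningful on GPU-equipped nodes.
        lines.append(f"#SBATCH --gres=gpu:{spec.gpus}")
    lines += ["", spec.command]
    return "\n".join(lines) + "\n"


script = render_sbatch(
    JobSpec(
        name="demo",
        partition="gpu",
        nodes=1,
        cpus_per_task=8,
        mem_gb=32,
        gpus=1,
        wall_time="02:00:00",
        command="srun python train.py",
    )
)
print(script)
```

Saving the printed text to a file and passing it to `sbatch` would submit the job to the named partition, assuming that partition exists on the cluster.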
Kubernetes is a container orchestrator that came from the cloud-native world. Workloads are packaged as container images with resource requests (CPU, memory, GPU), and the scheduler bin-packs pods onto nodes in compute pools. Scheduling is declarative: you describe the desired state and Kubernetes reconciles the cluster toward it. Compute pools (the Kubernetes counterpart to Slurm partitions) are groups of identically sized machines that the cluster autoscaler scales up and down automatically.
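The declarative model can be sketched as data rather than a script. The example below builds a minimal Pod manifest with CPU and memory requests and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name and image URL are placeholders chosen for illustration.

```python
import json


def pod_manifest(name: str, image: str, cpu: str, memory: str) -> dict:
    """Build a minimal Pod manifest describing the desired state."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run to completion once, like a batch task.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # The scheduler uses requests to decide which node has room.
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ],
        },
    }


# Placeholder name and image, for illustration only.
manifest = pod_manifest("train-demo", "registry.example.com/train:latest", "4", "16Gi")
print(json.dumps(manifest, indent=2))
```

Nothing here says how or where to run the container; the manifest only states what should exist, and the cluster works out the rest.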
The mental model is different. In Slurm, you think in jobs and queues. In Kubernetes, you think in containers and desired state.
Feature comparison
| Aspect | Slurm | Kubernetes |
|---|---|---|
| Job model | Batch scripts with #SBATCH directives | Container images with resource specs |
| Scheduling unit | Job (single or array) | Pod (one or more containers) |
| Resource grouping | Partitions: named queues with priority, time limits, and access controls | Compute pools: auto-scaled groups of identically sized machines |
| Scaling | Node count set per partition; on cloud, the autoscaler adjusts it within configured bounds | Compute pools scale from zero to max based on pending pod demand |
| Storage | Shared filesystems (NFS) mounted across all nodes | Persistent Volume Claims (PVCs) and CSI drivers; NFS available as an integration |
| Networking | MPI over InfiniBand or high-speed cloud interconnects | CNI plugins (Calico, Flannel); service mesh optional |
| GPU scheduling | Partition-level — submit to a GPU partition | Node selectors, labels, and taints target GPU compute pools |
| Workbench support | Not available | Full support — sessions, notebooks, training jobs, sweeps, model serving, pipelines |
| Execution environment | Host OS (bare metal or VM) | Container (OCI image) |
| Multi-tenancy | Partition access controls, Slurm accounts | Kubernetes namespaces, RBAC, resource quotas |
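The Kubernetes side of the GPU scheduling row can be sketched as three additions to a Pod spec: a node selector, a toleration, and a GPU limit. The label key `example.com/compute-pool`, the pool name `gpu-a100`, and the taint key are hypothetical; the real values depend on how the cluster's GPU nodes are labeled and tainted. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed.

```python
import copy
import json

# Hypothetical names: substitute the label and taint your cluster actually uses.
POOL_LABEL = "example.com/compute-pool"
GPU_TAINT_KEY = "nvidia.com/gpu"


def target_gpu_pool(pod: dict, pool: str, gpus: int) -> dict:
    """Return a copy of `pod` pinned to a GPU pool and requesting `gpus` GPUs."""
    out = copy.deepcopy(pod)
    spec = out["spec"]
    # Node selector: only nodes carrying this label are eligible.
    spec["nodeSelector"] = {POOL_LABEL: pool}
    # Toleration: lets the pod land on nodes tainted to repel non-GPU work.
    spec["tolerations"] = [
        {"key": GPU_TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}
    ]
    # Extended resources such as GPUs are requested under `limits`.
    resources = spec["containers"][0].setdefault("resources", {})
    resources.setdefault("limits", {})["nvidia.com/gpu"] = gpus
    return out


base_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-demo"},
    "spec": {
        "containers": [
            {"name": "main", "image": "registry.example.com/train:latest"}
        ]
    },
}

gpu_pod = target_gpu_pool(base_pod, "gpu-a100", 1)
print(json.dumps(gpu_pod, indent=2))
```

In Slurm the equivalent step is choosing the partition at submit time; here the targeting travels with the workload definition.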
When to choose Slurm
Slurm is the stronger choice when the workload or the team leans toward traditional HPC:
- MPI-heavy simulations — Slurm's native MPI integration and tight coupling with high-speed interconnects (InfiniBand, EFA on AWS) make it the natural fit for tightly-coupled parallel codes. Running MPI across Kubernetes pods is possible but adds complexity.
- Existing HPC codebases — If your team already has batch scripts, Slurm job arrays, and queue-management workflows, a Slurm cluster lets you bring them into Vantage without rewriting anything.
- Partition-based priority and preemption — Slurm's partition model gives fine-grained control over job priority classes, maximum wall times, and user-level access per queue. This matters for shared clusters with competing workload classes.
- Bare-metal performance — Jobs run directly on the host OS, avoiding the small overhead of container runtimes and network overlays.
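As an illustration of the job-array workflows mentioned above, the sketch below shows how one task of an array might pick its share of the work. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables Slurm sets for array jobs; the file names and the assumption that the array was submitted with zero-based, contiguous indices (for example `--array=0-3`) are illustrative.

```python
import os
from typing import Mapping, Sequence


def array_task_from_env(env: Mapping[str, str] = os.environ) -> tuple:
    """Read this task's index and the array size, defaulting to a single task."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(env.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return task_id, task_count


def shard_for_task(items: Sequence, task_id: int, task_count: int) -> list:
    """Round-robin split: task i takes items i, i + count, i + 2*count, ...

    Assumes array indices run from 0 to task_count - 1.
    """
    return list(items[task_id::task_count])


# Hypothetical input files; outside a Slurm job this processes all of them.
inputs = [f"case_{i:02d}.dat" for i in range(10)]
task_id, task_count = array_task_from_env()
for path in shard_for_task(inputs, task_id, task_count):
    print(f"task {task_id}/{task_count} would process {path}")
```

Run outside Slurm, the defaults make the script behave as a single task that handles every input, which keeps it easy to test locally.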
When to choose Kubernetes
Kubernetes is the stronger choice when the workload is containerized, ML-focused, or needs Vantage's higher-level products:
- Workbench — Sessions (JupyterLab, VS Code, RStudio), training jobs, hyperparameter sweeps, model serving endpoints, and pipelines all require a Kubernetes cluster. If you need any of these, Kubernetes is the only option.
- Containerized applications — If your workloads are already packaged as Docker/OCI images with defined entrypoints, Kubernetes runs them natively. No wrapper scripts or environment-module gymnastics.
- Auto-scaling to zero — Kubernetes compute pools can scale down to zero nodes when idle and scale back up when work arrives. This keeps costs low for bursty or intermittent workloads. Slurm clusters on the cloud also autoscale, but the model is partition-bound rather than pod-demand-driven.
- ML/AI pipelines — Training, evaluation, and deployment stages map naturally to Kubernetes-native constructs. Vantage layers runtimes, sweeps, and serving on top.
- Integrations — JupyterHub, Grafana, Ray, and MLflow deploy as Kubernetes integrations on the cluster.
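The scale-to-zero behavior described in the list above can be approximated with a few lines of arithmetic. This is a deliberately simplified sketch, not the cluster autoscaler's actual algorithm (which simulates scheduling pod by pod): it sums the CPU requests of pending pods, divides by per-node capacity, and clamps the result to the pool's maximum. All parameter names are assumptions.

```python
import math
from typing import Sequence


def desired_nodes(
    pending_cpu_requests: Sequence[float],
    busy_nodes: int,
    cpus_per_node: int,
    max_nodes: int,
) -> int:
    """Estimate the node count a pool should scale to.

    Simplification: treats CPU as the only resource and assumes pending
    pods pack perfectly onto new nodes.
    """
    extra = math.ceil(sum(pending_cpu_requests) / cpus_per_node)
    return min(max_nodes, busy_nodes + extra)


# Idle pool with nothing pending: scales to zero.
print(desired_nodes([], busy_nodes=0, cpus_per_node=8, max_nodes=10))
# Three pending pods requesting 10 CPUs in total on 8-CPU nodes.
print(desired_nodes([4, 4, 2], busy_nodes=0, cpus_per_node=8, max_nodes=10))
# Demand beyond the pool's ceiling is capped at max_nodes.
print(desired_nodes([8] * 20, busy_nodes=0, cpus_per_node=8, max_nodes=10))
```

The key property is the first case: with no pending pods and no busy nodes, the target is zero, so an idle pool costs nothing.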
Decision flowchart
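The criteria from the two preceding sections can be condensed into a small decision function. This is one possible reading of those criteria, not an official selection rule; the `Workload` fields and the order in which they are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Yes/no answers to the questions raised in the sections above."""
    needs_workbench: bool = False       # sessions, sweeps, serving, pipelines
    tightly_coupled_mpi: bool = False   # MPI-heavy simulations
    has_slurm_scripts: bool = False     # existing batch scripts and job arrays
    containerized: bool = False         # already packaged as OCI images
    needs_scale_to_zero: bool = False   # bursty or intermittent demand


def recommend(w: Workload) -> str:
    """Return 'slurm', 'kubernetes', 'federation', or 'either'."""
    wants_slurm = w.tightly_coupled_mpi or w.has_slurm_scripts
    wants_kubernetes = (
        w.needs_workbench or w.containerized or w.needs_scale_to_zero
    )
    if wants_slurm and wants_kubernetes:
        # Both sets of needs apply: run one cluster of each type together.
        return "federation"
    if wants_slurm:
        return "slurm"
    if wants_kubernetes:
        return "kubernetes"
    # None of the listed criteria apply, so they do not decide the question.
    return "either"


print(recommend(Workload(tightly_coupled_mpi=True)))
print(recommend(Workload(needs_workbench=True)))
print(recommend(Workload(tightly_coupled_mpi=True, needs_workbench=True)))
```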
Combining both with federations
You do not have to choose one cluster type for your entire organization. A federation links multiple clusters into a single logical pool of compute (distinct from a Kubernetes compute pool). Submit a job to the federation, and Vantage routes it to whichever cluster has capacity.
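The idea of routing to "whichever cluster has capacity" can be illustrated with a toy selector. Vantage's actual routing policy is not described on this page, so treat this purely as a sketch: the `Cluster` class, the use of free CPUs as the only capacity measure, and the most-free-first tie-break are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Cluster:
    name: str
    kind: str        # "slurm" or "kubernetes"
    free_cpus: int   # hypothetical capacity measure


def route(clusters: Sequence[Cluster], cpus_needed: int) -> Optional[Cluster]:
    """Pick the cluster with the most free CPUs that can fit the job.

    Returns None when no member cluster currently has room.
    """
    candidates = [c for c in clusters if c.free_cpus >= cpus_needed]
    return max(candidates, key=lambda c: c.free_cpus, default=None)


federation = [
    Cluster("hpc-east", "slurm", free_cpus=16),
    Cluster("ml-west", "kubernetes", free_cpus=64),
]

chosen = route(federation, cpus_needed=32)
print(chosen.name if chosen else "no capacity")
```

A real policy would also need to account for whether the job is a batch script or a container, since not every member cluster can run every workload type.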
This means you can run a Slurm cluster for MPI workloads alongside a Kubernetes cluster for ML training and model serving, and manage them under one roof. Teams that straddle both worlds — running simulations on Slurm and post-processing or serving on Kubernetes — benefit from this hybrid approach without maintaining separate toolchains.
Slurm on Kubernetes: a hybrid option
Vantage also supports deploying a Slurm scheduler on top of an existing Kubernetes cluster. This gives you HPC-style batch scheduling (partitions, job arrays, priority queues) while inheriting Kubernetes auto-scaling and cloud-native infrastructure management. It is currently available on AWS Kubernetes clusters only. See Slurm on Kubernetes for details.
Cross-references
- Compute and clusters — Cluster types, compute profiles, and providers
- Clusters overview — Supported providers and getting-started guides
- Slurm clusters — Creating and managing Slurm clusters
- Kubernetes clusters — Creating and managing Kubernetes clusters
- Partitions — How Slurm partitions work
- Compute pools — How Kubernetes compute pools work
- Federations — Linking clusters into a single compute pool