Slurm vs. Kubernetes clusters

Vantage provisions and manages both Slurm and Kubernetes clusters. They solve different problems, target different workloads, and expose different abstractions. This page explains the core differences so you can pick the right cluster type for your situation, or combine both through a federation.

Two scheduling philosophies

Slurm and Kubernetes grew out of different communities and optimize for different things.

Slurm is a batch-job scheduler born in the HPC world. You write a shell script with #SBATCH directives that declare resources (CPUs, GPUs, memory, wall time), submit it to a partition, and Slurm queues the job until matching nodes are available. The scheduler is partition-centric: each partition is a named queue backed by a pool of nodes with shared properties such as instance type, maximum run time, priority class, and access controls. Jobs run directly on the host operating system, share a filesystem, and, for parallel codes, communicate over high-speed interconnects using MPI.
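As a rough illustration of the batch-script model, the sketch below renders a script with #SBATCH directives from a few parameters. The directives are standard sbatch options; the function name, the partition names, and the training command are hypothetical and not part of Vantage.

```python
def render_batch_script(job_name, partition, cpus, gpus, mem_gb, wall_time, command):
    """Return the text of a Slurm batch script that declares its resources."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # named queue the job is submitted to
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={wall_time}",        # wall-time limit, HH:MM:SS
    ]
    if gpus > 0:
        # Request GPUs as a generic resource only when the job needs them.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command]
    return "\n".join(lines) + "\n"


# Hypothetical job: one GPU, eight CPUs, two hours on a partition named "gpu".
script = render_batch_script(
    "demo", "gpu", 8, 1, 32, "02:00:00", "srun python train.py"
)
print(script)
```

Saved to a file, a script like this would typically be submitted with `sbatch job.sh`, after which it waits in the partition's queue until resources are free.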

Kubernetes is a container orchestrator that came from the cloud-native world. Workloads are packaged as container images with resource requests (CPU, memory, GPU), and the scheduler bin-packs pods onto nodes in compute pools. Scheduling is declarative: you describe the desired state and Kubernetes reconciles it. Compute pools (the Kubernetes counterpart to Slurm partitions) are groups of identically sized machines that the cluster autoscaler scales up and down automatically.
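For comparison, a declarative workload can be sketched as a Pod manifest built in Python and serialized to JSON. The manifest fields are standard Kubernetes; the helper name and image URL are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```python
import json


def pod_manifest(name, image, cpu, memory, gpus=0):
    """Build a minimal Pod manifest with resource requests as a plain dict."""
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus:
        # Extended resources such as GPUs are declared under limits.
        resources["limits"] = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "resources": resources}
            ],
        },
    }


# Hypothetical image; replace with a real registry path.
manifest = pod_manifest("train", "registry.example.com/train:latest", "4", "16Gi", gpus=1)
print(json.dumps(manifest, indent=2))
```

The output describes the desired state only. Applying it (for example with `kubectl apply -f pod.json`) leaves placement to the scheduler.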

The mental model is different. In Slurm, you think in jobs and queues. In Kubernetes, you think in containers and desired state.

Feature comparison

Aspect	Slurm	Kubernetes
Job model	Batch scripts with `#SBATCH` directives	Container images with resource specs
Scheduling unit	Job (single or array)	Pod (one or more containers)
Resource grouping	Partitions: named queues with priority, time limits, and access controls	Compute pools: autoscaled groups of identically sized machines
Scaling	Fixed node count per partition; on cloud, the autoscaler adjusts it within bounds	Compute pools scale from zero to a maximum based on pending pod demand
Storage	Shared filesystems (NFS) mounted across all nodes	Persistent Volume Claims (PVCs) and CSI drivers; NFS available as an integration
Networking	MPI over InfiniBand or high-speed cloud interconnects	CNI plugins (Calico, Flannel); service mesh optional
GPU scheduling	Partition-level: submit to a GPU partition	Node selectors, labels, and taints target GPU compute pools
Workbench support	Not available	Full support: sessions, notebooks, training jobs, sweeps, model serving, pipelines
Execution environment	Host OS (bare metal or VM)	Container (OCI image)
Multi-tenancy	Partition access controls, Slurm accounts	Kubernetes namespaces, RBAC, resource quotas
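To make the GPU scheduling row concrete: on Slurm, targeting GPUs usually means submitting to a GPU partition (for example `--partition=gpu --gres=gpu:1`), while on Kubernetes the pod spec carries a node selector and a toleration. The sketch below shows the Kubernetes side. The label key `example.com/pool` and the taint key are assumptions for illustration; the actual labels and taints on a Vantage compute pool may differ.

```python
import copy
import json


def target_gpu_pool(pod, label_key, label_value, taint_key):
    """Return a copy of the pod that selects labeled nodes and tolerates their taint."""
    patched = copy.deepcopy(pod)
    spec = patched["spec"]
    # Only nodes carrying this label are eligible for the pod.
    spec["nodeSelector"] = {label_key: label_value}
    # Allow scheduling onto nodes tainted to keep non-GPU pods away.
    spec.setdefault("tolerations", []).append(
        {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
    )
    return patched


base_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train"},
    "spec": {"containers": [{"name": "train", "image": "registry.example.com/train:latest"}]},
}

# Hypothetical label and taint keys.
gpu_pod = target_gpu_pool(base_pod, "example.com/pool", "gpu", "nvidia.com/gpu")
print(json.dumps(gpu_pod["spec"], indent=2))
```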

When to choose Slurm

Slurm is the stronger choice when the workload or the team leans toward traditional HPC:

MPI-heavy simulations: Slurm's native MPI integration and its support for high-speed interconnects (InfiniBand, EFA on AWS) make it the natural fit for tightly coupled parallel codes. Running MPI across Kubernetes pods is possible but adds complexity.
Existing HPC codebases: If your team already has batch scripts, Slurm job arrays, and queue-management workflows, a Slurm cluster lets you bring them into Vantage without rewriting anything.
Partition-based priority and preemption: Slurm's partition model gives fine-grained control over job priority classes, maximum wall times, and user-level access per queue. This matters for shared clusters with competing workload classes.
Bare-metal performance: Jobs run directly on the host OS, avoiding the small overhead of container runtimes and network overlays.
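Job arrays are one of the existing-workflow features mentioned above. As a minimal sketch, the function below renders an array job that runs the same command once per index. `--array`, the `%` concurrency limit, the `%A`/`%a` filename patterns, and `SLURM_ARRAY_TASK_ID` are standard Slurm features; the partition name and the `./simulate` binary are hypothetical.

```python
def render_array_script(partition, first, last, max_parallel, command):
    """Return a batch script that runs `command` once per array index."""
    if last < first:
        raise ValueError("last index must be >= first index")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        # Indices first..last, with at most max_parallel tasks running at once.
        f"#SBATCH --array={first}-{last}%{max_parallel}",
        # %A expands to the array's job ID, %a to the task index.
        "#SBATCH --output=case_%A_%a.out",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Each task reads its own index from the environment at run time.
array_script = render_array_script(
    "compute", 0, 15, 4, "./simulate --case $SLURM_ARRAY_TASK_ID"
)
print(array_script)
```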

When to choose Kubernetes

Kubernetes is the stronger choice when the workload is containerized, ML-focused, or needs Vantage's higher-level products:

Workbench: Sessions (JupyterLab, VS Code, RStudio), training jobs, hyperparameter sweeps, model serving endpoints, and pipelines all require a Kubernetes cluster. If you need any of these, Kubernetes is the only option.
Containerized applications: If your workloads are already packaged as Docker/OCI images with defined entrypoints, Kubernetes runs them natively. No wrapper scripts or environment-module gymnastics.
Auto-scaling to zero: Kubernetes compute pools can scale down to zero nodes when idle and scale back up when work arrives. This keeps costs low for bursty or intermittent workloads. Slurm clusters on the cloud also autoscale, but the model is partition-bound rather than pod-demand-driven.
ML/AI pipelines: Training, evaluation, and deployment stages map naturally to Kubernetes-native constructs. Vantage layers runtimes, sweeps, and serving on top.
Integrations: JupyterHub, Grafana, Ray, and MLflow deploy as Kubernetes integrations on the cluster.
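The scale-to-zero behaviour can be pictured with a deliberately simplified model: the pool needs enough nodes to cover the CPU requests of the pods assigned to it, capped at the pool maximum, and none when there are no pods. This is an illustrative assumption, not the cluster autoscaler's actual algorithm, which also accounts for memory, GPUs, per-pod fit, and scale-down delays.

```python
import math


def desired_nodes(pod_cpu_requests, cpus_per_node, max_nodes):
    """Estimate how many identically sized nodes a pool needs for its pods."""
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    demand = sum(pod_cpu_requests)
    # Round up to whole nodes, then cap at the pool's maximum size.
    needed = math.ceil(demand / cpus_per_node)
    return min(needed, max_nodes)


print(desired_nodes([], cpus_per_node=8, max_nodes=10))         # idle pool
print(desired_nodes([4, 4, 2], cpus_per_node=8, max_nodes=10))  # small burst
print(desired_nodes([8] * 20, cpus_per_node=8, max_nodes=10))   # demand above the cap
```

With these inputs the function returns 0, 2, and 10: an idle pool costs nothing, and demand beyond the cap leaves the remaining pods pending.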

Decision flowchart
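The selection criteria in the two preceding sections can be read as a short decision procedure. The sketch below is one possible ordering of those criteria, with hypothetical parameter names; treat it as a starting point for discussion rather than an official rule.

```python
def recommend_cluster(needs_workbench, mpi_heavy, existing_batch_scripts, containerized):
    """Suggest a cluster type from a few yes/no workload properties."""
    hpc_leaning = mpi_heavy or existing_batch_scripts
    if needs_workbench and hpc_leaning:
        # Workbench requires Kubernetes, but the HPC side still favours Slurm.
        return "both (federation)"
    if needs_workbench:
        return "kubernetes"
    if hpc_leaning:
        return "slurm"
    if containerized:
        return "kubernetes"
    return "either"


print(recommend_cluster(needs_workbench=True, mpi_heavy=False,
                        existing_batch_scripts=False, containerized=True))
print(recommend_cluster(needs_workbench=False, mpi_heavy=True,
                        existing_batch_scripts=True, containerized=False))
```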

Combining both with federations

You do not have to choose one cluster type for your entire organization. A federation links multiple clusters into a single logical pool of compute (not to be confused with a Kubernetes compute pool). Submit a job to the federation, and Vantage routes it to whichever cluster has capacity.

This means you can run a Slurm cluster for MPI workloads alongside a Kubernetes cluster for ML training and model serving, and manage them under one roof. Teams that straddle both worlds, running simulations on Slurm and post-processing or serving on Kubernetes, benefit from this hybrid approach without maintaining separate toolchains.
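The idea of routing a job to whichever cluster has capacity can be sketched with a first-fit policy. The `Cluster` type, its fields, and the first-fit rule are all assumptions made for illustration; Vantage's actual routing logic is not described here and may weigh other factors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Cluster:
    name: str
    kind: str        # "slurm" or "kubernetes"
    free_cpus: int


def route(job_cpus: int, clusters: Sequence[Cluster]) -> Optional[Cluster]:
    """Return the first cluster with enough free CPUs, or None if all are full."""
    for cluster in clusters:
        if cluster.free_cpus >= job_cpus:
            return cluster
    return None


# Hypothetical federation with one cluster of each type.
federation = [
    Cluster("hpc-east", "slurm", free_cpus=16),
    Cluster("ml-west", "kubernetes", free_cpus=128),
]

target = route(64, federation)
print(target.name if target else "no capacity")
```

Here the 64-CPU job skips `hpc-east`, which has only 16 free CPUs, and lands on `ml-west`.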

Slurm on Kubernetes: a hybrid option

Vantage also supports deploying a Slurm scheduler on top of an existing Kubernetes cluster. This gives you HPC-style batch scheduling (partitions, job arrays, priority queues) while inheriting Kubernetes auto-scaling and cloud-native infrastructure management. It is currently available on AWS Kubernetes clusters only. See Slurm on Kubernetes for details.

Cross-references

Compute and clusters: Cluster types, compute profiles, and providers
Clusters overview: Supported providers and getting-started guides
Slurm clusters: Creating and managing Slurm clusters
Kubernetes clusters: Creating and managing Kubernetes clusters
Partitions: How Slurm partitions work
Compute pools: How Kubernetes compute pools work
Federations: Linking clusters into a single compute pool
