Runtimes
An inference runtime is a pre-built serving environment that combines a serving framework, a base container image, and a serving strategy. Each endpoint references a runtime, and that runtime determines how the model is loaded, how requests are handled, and how resources are allocated.
Cluster vs. workspace runtimes
| Scope | Who manages it | Where it's available |
|---|---|---|
| Cluster runtime | Platform admin | Every workspace on the cluster. Read-only in the UI. |
| Workspace runtime | Workspace admin or user | Only in the workspace where it was created. Can be created, viewed, edited, and deleted there. |
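
The scoping rules in the table can be read as two simple predicates. The sketch below is illustrative only: `Scope`, `Runtime`, `is_visible`, `is_editable_in_ui`, and `runtimes_for` are hypothetical names, not part of the platform's API, and the runtime and workspace names are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Scope(Enum):
    CLUSTER = "cluster"
    WORKSPACE = "workspace"


@dataclass(frozen=True)
class Runtime:
    # Hypothetical model of a runtime record, for illustration only.
    name: str
    scope: Scope
    workspace: Optional[str] = None  # owning workspace; None for cluster runtimes


def is_visible(runtime: Runtime, workspace: str) -> bool:
    """Cluster runtimes are visible in every workspace; workspace runtimes only in their own."""
    if runtime.scope is Scope.CLUSTER:
        return True
    return runtime.workspace == workspace


def is_editable_in_ui(runtime: Runtime, workspace: str) -> bool:
    """Cluster runtimes are read-only in the UI; workspace runtimes are editable in their own workspace."""
    return runtime.scope is Scope.WORKSPACE and runtime.workspace == workspace


def runtimes_for(workspace: str, all_runtimes: List[Runtime]) -> List[Runtime]:
    """Return the runtimes a given workspace would see, preserving input order."""
    return [r for r in all_runtimes if is_visible(r, workspace)]


if __name__ == "__main__":
    shared = Runtime("vllm-shared", Scope.CLUSTER)
    custom = Runtime("triton-custom", Scope.WORKSPACE, "team-a")
    for ws in ("team-a", "team-b"):
        print(ws, [r.name for r in runtimes_for(ws, [shared, custom])])
```

In this model, a workspace other than the owner never sees the workspace runtime, and nobody edits the cluster runtime through the UI.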
The runtimes list
The list view shows every runtime available to your workspace: all cluster runtimes plus the runtimes created in this workspace. Each row displays:
- Name — the runtime's identifier.
- Framework — serving framework (e.g. Triton, vLLM, KServe).
- Scope — Cluster or Workspace.
- Description — what this runtime is for.
Click a runtime to see its full specification, ML policy, and which endpoints are using it.
Creating a workspace runtime
Click Create runtime to define a custom serving environment. You'll specify:
- Name — a unique identifier for the runtime.
- Framework — the serving framework.
- Base image — the container image to use.
- ML policy — parallelism, resource requirements, and serving configuration.
- Compute pool — optional target pool.
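
To make the shape of these fields concrete, here is a minimal sketch of a runtime definition with basic validation. It is not the platform's schema: `RuntimeSpec`, `MLPolicy`, `validate`, and every field name inside `MLPolicy` are assumptions chosen for illustration, and the image references are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class MLPolicy:
    # Hypothetical fields standing in for "parallelism, resource
    # requirements, and serving configuration".
    parallelism: int = 1
    resources: Dict[str, str] = field(default_factory=dict)
    serving_config: Dict[str, str] = field(default_factory=dict)


@dataclass
class RuntimeSpec:
    name: str                           # unique identifier
    framework: str                      # serving framework, e.g. "vLLM"
    base_image: str                     # container image reference
    ml_policy: MLPolicy = field(default_factory=MLPolicy)
    compute_pool: Optional[str] = None  # optional target pool


def validate(spec: RuntimeSpec, existing_names: Set[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the spec looks valid."""
    errors: List[str] = []
    if not spec.name.strip():
        errors.append("name is required")
    elif spec.name in existing_names:
        errors.append(f"name '{spec.name}' is already in use")
    if not spec.framework.strip():
        errors.append("framework is required")
    if not spec.base_image.strip():
        errors.append("base image is required")
    if spec.ml_policy.parallelism < 1:
        errors.append("parallelism must be at least 1")
    return errors


if __name__ == "__main__":
    spec = RuntimeSpec(
        name="vllm-custom",
        framework="vLLM",
        base_image="registry.example.com/vllm:latest",  # placeholder image
    )
    print(validate(spec, existing_names={"triton-default"}))
```

The compute pool defaults to `None` here to mirror its optional status in the form; the other three top-level fields are treated as required, which is an assumption rather than documented behavior.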
Next steps
- Deploying an endpoint — how runtimes are used
- Presets — pre-configured endpoint defaults