Runtimes

An inference runtime is a pre-built serving environment that combines a serving framework, a base image, and a serving strategy. Each endpoint references a runtime, which determines how the model is loaded, how requests are handled, and how resources are allocated.

Cluster vs. workspace runtimes

Scope	Who manages it	Where it's available
Cluster runtime	Platform admin	Every workspace on the cluster. Read-only in the UI.
Workspace runtime	Workspace admin or user	Only in the workspace where it was created. Full CRUD.
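
The scoping rules in the table can be sketched in a few lines of Python. This is an illustrative model only, not the platform's API: the `Runtime` class, the `Scope` enum, and both helper functions are hypothetical names introduced here to make the visibility and edit rules concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    CLUSTER = "Cluster"
    WORKSPACE = "Workspace"


@dataclass(frozen=True)
class Runtime:
    # Hypothetical model of a runtime; field names are assumptions.
    name: str
    scope: Scope
    workspace: Optional[str] = None  # owning workspace; None for cluster runtimes


def is_visible(runtime: Runtime, workspace: str) -> bool:
    """Cluster runtimes are visible in every workspace; workspace runtimes only in their own."""
    if runtime.scope is Scope.CLUSTER:
        return True
    return runtime.workspace == workspace


def is_editable_in_ui(runtime: Runtime, workspace: str) -> bool:
    """Cluster runtimes are read-only in the UI; workspace runtimes allow full CRUD where they were created."""
    return runtime.scope is Scope.WORKSPACE and runtime.workspace == workspace


# Example: a cluster runtime is visible everywhere but cannot be edited from a workspace.
shared = Runtime("shared-vllm", Scope.CLUSTER)
local = Runtime("team-triton", Scope.WORKSPACE, workspace="team-a")
print(is_visible(shared, "team-b"), is_editable_in_ui(shared, "team-b"))
print(is_visible(local, "team-b"), is_editable_in_ui(local, "team-a"))
```

The sketch covers only what the UI exposes; how a platform admin manages cluster runtimes outside the UI is not modeled.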

The runtimes list

The list view shows every runtime available to your workspace. Each row displays:

Name: the runtime's identifier.
Framework: serving framework (e.g. Triton, vLLM, KServe).
Scope: Cluster or Workspace.
Description: what this runtime is for.

Click a runtime to see its full specification, ML policy, and which endpoints are using it.

Creating a workspace runtime

Click Create runtime to define a custom serving environment. You'll specify:

Name: a unique identifier for the runtime.
Framework: the serving framework.
Base image: the container image to use.
ML policy: parallelism, resource requirements, and serving configuration.
Compute pool: an optional compute pool to target.
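
To show how these fields fit together, here is a minimal Python sketch of a runtime definition with a few sanity checks. It is not the platform's schema: the `RuntimeSpec` and `MLPolicy` classes, their field names, the example image URL, and the validation rules (including the assumption that parallelism must be at least 1) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MLPolicy:
    # Hypothetical shape for the ML policy; the real schema may differ.
    parallelism: int = 1
    resources: Dict[str, str] = field(default_factory=dict)
    serving_config: Dict[str, str] = field(default_factory=dict)


@dataclass
class RuntimeSpec:
    name: str
    framework: str
    base_image: str
    ml_policy: MLPolicy = field(default_factory=MLPolicy)
    compute_pool: Optional[str] = None  # optional target pool


def validate(spec: RuntimeSpec, existing_names: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the spec passes these checks."""
    problems = []
    if not spec.name:
        problems.append("name is required")
    elif spec.name in existing_names:
        problems.append(f"name '{spec.name}' is already in use")
    if not spec.framework:
        problems.append("framework is required")
    if not spec.base_image:
        problems.append("base image is required")
    if spec.ml_policy.parallelism < 1:  # assumed rule, not from the platform docs
        problems.append("parallelism must be at least 1")
    return problems


spec = RuntimeSpec(
    name="team-vllm",
    framework="vLLM",
    base_image="registry.example.com/serving/vllm:latest",  # placeholder image
    ml_policy=MLPolicy(parallelism=2, resources={"gpu": "2"}),
)
print(validate(spec, existing_names=["shared-triton"]))
```

Leaving `compute_pool` unset mirrors the form, where the target pool is optional.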

Next steps

Deploying an endpoint: how runtimes are used
Presets: pre-configured endpoint defaults
