Deploying an endpoint
A seven-step wizard takes you from model selection to a live URL; illustrative sketches of the resulting configuration follow the list.
- Pick a kind — Predictive or LLM.
- Pick a model source — from your registry, from HuggingFace, from S3, from a PVC, or from a custom container image.
- Pick a runtime — Workbench filters the runtime list by your chosen kind and by whether each runtime requires a GPU.
- Set sizing — CPU, memory, GPU, replica bounds (min/max), and the autoscaling metric (concurrency / RPS / CPU).
- Pick a compute profile — usually a GPU profile for LLMs and a smaller CPU-oriented profile for predictive models.
- (LLM only) Configure parallelism — tensor / pipeline / data sharding to fit the model across your GPUs; a back-of-envelope sizing check follows this list.
- Submit. The endpoint goes Pending while images pull and pods schedule, then Running once at least one replica passes its readiness check; a status-polling sketch follows below.
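The wizard steps above map onto a single endpoint specification. The sketch below is a minimal illustration of how those fields could fit together; every key name and value, and the `endpoint_spec` structure itself, are assumptions for illustration, not Workbench's documented schema.

```python
# Hypothetical endpoint spec mirroring the wizard fields (steps 1-6).
# All key names and values are illustrative assumptions, not a documented API.
endpoint_spec = {
    "name": "llama-chat",
    "kind": "llm",                     # step 1: "predictive" or "llm"
    "source": {                        # step 2: registry / huggingface / s3 / pvc / image
        "type": "huggingface",
        "repo": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "runtime": "vllm",                 # step 3: filtered by kind and GPU requirement
    "sizing": {                        # step 4
        "cpu": "4",
        "memory": "32Gi",
        "gpu": 1,
        "replicas": {"min": 1, "max": 4},
        "autoscaling": {"metric": "concurrency", "target": 8},
    },
    "compute_profile": "gpu-a100-1x",  # step 5
    "parallelism": {"tensor": 1, "pipeline": 1, "data": 1},  # step 6, LLM only
}
```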
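For step 6, the goal is a parallelism degree large enough that each GPU's shard of the weights, plus activation and KV-cache overhead, fits in GPU memory. A back-of-envelope check, where the 20% overhead factor and the power-of-two rounding are rough assumptions rather than any runtime's actual accounting:

```python
import math

def min_tensor_parallel(params_billions: float, bytes_per_param: float,
                        gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Rough lower bound on tensor-parallel degree: total weight bytes,
    padded by an assumed overhead factor for activations and KV cache,
    divided evenly across GPUs."""
    weights_gb = params_billions * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    needed = math.ceil(weights_gb * overhead / gpu_mem_gb)
    # Tensor-parallel degrees are typically powers of two, so round up.
    return 1 << (needed - 1).bit_length() if needed > 1 else 1

# Example: a 70B-parameter model in fp16 (2 bytes/param) on 80 GB GPUs.
print(min_tensor_parallel(70, 2, 80))  # -> 4
```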
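Step 7's Pending → Running transition can also be watched programmatically. The sketch below assumes a hypothetical REST status endpoint; the URL, path, and JSON response shape are invented for illustration and will differ on a real deployment:

```python
import json
import time
import urllib.request

# Hypothetical status URL; the real path and response shape are assumptions.
STATUS_URL = "https://workbench.example.com/api/endpoints/llama-chat"

def wait_until_running(timeout_s: int = 900, interval_s: int = 10) -> None:
    """Poll the endpoint status until it reports Running, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(STATUS_URL) as resp:
            status = json.load(resp)["status"]  # assumed: "Pending" | "Running"
        if status == "Running":
            print("Endpoint is live.")
            return
        print(f"Status: {status} (images may still be pulling, pods scheduling)")
        time.sleep(interval_s)
    raise TimeoutError("Endpoint did not reach Running before the timeout.")

wait_until_running()
```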