Deploying an endpoint
A seven-step wizard takes you from model selection to a live URL; illustrative sketches of the resulting configuration follow the list.
- Pick a kind — Predictive or LLM.
- Pick a model source — from your registry, from HuggingFace, from S3, from a PVC, or from a custom container image.
- Pick a runtime — Workbench filters the runtime list by your chosen kind and by whether each runtime requires a GPU.
- Set sizing — CPU, memory, GPU, replica bounds (min/max), and the autoscaling metric (concurrency / RPS / CPU).
- Pick a compute profile — usually a GPU profile for LLMs and a smaller CPU-oriented profile for predictive models.
- (LLM only) Configure parallelism — tensor / pipeline / data sharding to fit the model across your GPUs; a back-of-envelope sizing check follows this list.
- Submit. The endpoint goes Pending while images pull and pods schedule, then Running once at least one replica passes its readiness check; a status-polling sketch follows below.
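The wizard steps above map onto a single endpoint specification. The sketch below is a minimal illustration of how those fields could fit together; every key name and value, and the `endpoint_spec` structure itself, are assumptions for illustration, not Workbench's documented schema.

```python
# Hypothetical endpoint spec mirroring the wizard fields (steps 1-6).
# All key names and values are illustrative assumptions, not a documented API.
endpoint_spec = {
    "name": "llama-chat",
    "kind": "llm",                     # step 1: "predictive" or "llm"
    "source": {                        # step 2: registry / huggingface / s3 / pvc / image
        "type": "huggingface",
        "repo": "meta-llama/Llama-3.1-8B-Instruct",
    },
    "runtime": "vllm",                 # step 3: filtered by kind and GPU requirement
    "sizing": {                        # step 4
        "cpu": "4",
        "memory": "32Gi",
        "gpu": 1,
        "replicas": {"min": 1, "max": 4},
        "autoscaling": {"metric": "concurrency", "target": 8},
    },
    "compute_profile": "gpu-a100-1x",  # step 5
    "parallelism": {"tensor": 1, "pipeline": 1, "data": 1},  # step 6, LLM only
}
```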
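For step 6, the goal is a parallelism degree large enough that each GPU's shard of the weights, plus activation and KV-cache overhead, fits in GPU memory. A back-of-envelope check, where the 20% overhead factor and the power-of-two rounding are rough assumptions rather than any runtime's actual accounting:

```python
import math

def min_tensor_parallel(params_billions: float, bytes_per_param: float,
                        gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Rough lower bound on tensor-parallel degree: total weight bytes,
    padded by an assumed overhead factor for activations and KV cache,
    divided evenly across GPUs."""
    weights_gb = params_billions * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    needed = math.ceil(weights_gb * overhead / gpu_mem_gb)
    # Tensor-parallel degrees are typically powers of two, so round up.
    return 1 << (needed - 1).bit_length() if needed > 1 else 1

# Example: a 70B-parameter model in fp16 (2 bytes/param) on 80 GB GPUs.
print(min_tensor_parallel(70, 2, 80))  # -> 4
```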
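Step 7's Pending → Running transition can also be watched programmatically. The sketch below assumes a hypothetical REST status endpoint; the URL, path, and JSON response shape are invented for illustration and will differ on a real deployment:

```python
import json
import time
import urllib.request

# Hypothetical status URL; the real path and response shape are assumptions.
STATUS_URL = "https://workbench.example.com/api/endpoints/llama-chat"

def wait_until_running(timeout_s: int = 900, interval_s: int = 10) -> None:
    """Poll the endpoint status until it reports Running, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with urllib.request.urlopen(STATUS_URL) as resp:
            status = json.load(resp)["status"]  # assumed: "Pending" | "Running"
        if status == "Running":
            print("Endpoint is live.")
            return
        print(f"Status: {status} (images may still be pulling, pods scheduling)")
        time.sleep(interval_s)
    raise TimeoutError("Endpoint did not reach Running before the timeout.")

wait_until_running()
```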