Endpoints

Authenticated, autoscaling inference services.

Endpoints serve a registered model behind an HTTP URL. They autoscale on traffic, support canary rollouts for safe updates, and can scale to zero when idle.
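
Because an endpoint is an authenticated HTTP service, calling it comes down to a request that carries a credential. The sketch below only builds such a request and does not send it. The URL, the bearer-token scheme, and the payload shape are all assumptions made for illustration; use the values shown for your own endpoint.

```python
import json
import urllib.request

# Hypothetical values: substitute the URL and token issued for your endpoint.
ENDPOINT_URL = "https://workbench.example.com/endpoints/my-model/predict"
API_TOKEN = "replace-me"


def build_request(url, token, payload):
    """Build an authenticated JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumed auth scheme; the endpoint may expect a different header.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Assumed payload shape: one row of numeric features.
request = build_request(ENDPOINT_URL, API_TOKEN, {"inputs": [[5.1, 3.5, 1.4, 0.2]]})

# To actually call the endpoint:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```

If the endpoint has scaled to zero, the first request after an idle period may take longer while a pod starts, so a generous client timeout is a reasonable default.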

Predictive vs LLM

Workbench distinguishes two endpoint kinds because their tuning surfaces differ. Predictive endpoints are optimized for classical models with fast, single-input inference. LLM endpoints are built for generative models, where batching, context length, and parallelism matter.

Endpoint concepts

Inferences: the endpoint services themselves, with URLs, scaling policy, and deployment status.
Runtimes: pre-built serving environments (framework + image + strategy) that endpoints reference.
Pods: the running containers behind your endpoints, managed by the autoscaler.
Presets: reusable endpoint configurations that pre-fill the deployment form.
Secrets: sensitive data (API keys, tokens, credentials) mounted into endpoint pods.
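
To make the relationship between traffic and pods concrete, here is a minimal sketch of target-based scaling, a common approach in which the pod count follows the number of in-flight requests. It is not Workbench's actual autoscaling algorithm, and the parameter names (`target_per_pod`, `min_pods`, `max_pods`) are hypothetical.

```python
import math


def desired_pods(in_flight, target_per_pod, max_pods, min_pods=0):
    """Return a pod count that keeps each pod near its target concurrency.

    Illustrative only: all parameter names are hypothetical, and a real
    autoscaler would also smooth its input over a time window.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    needed = math.ceil(in_flight / target_per_pod)
    # Clamp to the configured bounds; min_pods=0 permits scale-to-zero.
    return max(min_pods, min(max_pods, needed))
```

With `min_pods=0`, an idle endpoint (`in_flight == 0`) resolves to zero pods, which corresponds to scaling to zero. Setting `min_pods=1` keeps one pod warm instead.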

Next steps

Predictive vs LLM: which kind to use
Deploying an endpoint: step-by-step deployment
Autoscaling: how pod count adjusts to traffic
Canary rollouts: safe model updates
Runtimes: serving environments
Pods: running inference containers
Presets: pre-configured endpoint defaults
Secrets: managing sensitive data
Reference: every field, every default
