Endpoints

Authenticated, autoscaling inference services.

Endpoints serve a registered model behind an HTTP URL. They autoscale on traffic, support canary rollouts for safe updates, and can scale to zero when idle.
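As a rough sketch of what calling such a service looks like, the snippet below builds an authenticated JSON request to an endpoint URL. The URL, token, header names, and payload schema here are illustrative assumptions, not the documented Workbench API; consult your endpoint's details page for the real values.

```python
# Hypothetical sketch of invoking an endpoint over HTTP.
# URL, token, and payload field names are assumptions, not the
# documented Workbench schema.
import json
import urllib.request

ENDPOINT_URL = "https://example-endpoint.workbench.example/invoke"  # hypothetical
API_TOKEN = "wb-token-123"  # hypothetical; substitute your credential

def build_request(inputs: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST to the endpoint."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request({"feature_a": 1.0, "feature_b": 2.5})
# urllib.request.urlopen(req) would send it; omitted here since the
# endpoint URL above is a placeholder.
```

Because endpoints can scale to zero when idle, the first request after a quiet period may see a cold-start delay while a replica spins up.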

Predictive vs LLM

Workbench distinguishes two endpoint kinds because their tuning surfaces differ. Predictive endpoints are optimized for classical models with fast, single-input inference. LLM endpoints are built for generative models, where batching, context length, and parallelism matter.
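The difference shows up in the request shape each kind expects. The payloads below are illustrative only; every field name is an assumption rather than the documented Workbench schema.

```python
# Hypothetical payload shapes for the two endpoint kinds.
# All field names are assumptions, not documented Workbench schemas.

# Predictive: a batch of numeric feature vectors, one fast pass each.
predictive_payload = {"inputs": [[0.2, 1.7, 3.1]]}

# LLM: a prompt plus generation controls that only make sense for
# generative models (output length, sampling randomness).
llm_payload = {
    "prompt": "Summarize the release notes.",
    "max_tokens": 256,     # bound on generated output length
    "temperature": 0.2,    # low value for more deterministic output
}
```

The extra knobs on the LLM side are exactly the tuning surface the paragraph above describes: generation length and sampling interact with server-side batching and parallelism, whereas a predictive request is fully described by its input features.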

Next steps
