Endpoints

Authenticated, autoscaling inference services.

Endpoints serve a registered model behind an HTTP URL. They autoscale on traffic, support canary rollouts for safe updates, and can scale to zero when idle.
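As a rough illustration of calling an endpoint over HTTP, the sketch below builds an authenticated POST request with Python's standard library. The URL, the bearer-token scheme, and the JSON payload shape are all assumptions made for illustration; the real URL, auth header, and request schema depend on your endpoint.

```python
import json
import urllib.request


def build_inference_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumption: the endpoint accepts a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode a JSON response body.

    Assumption: the endpoint replies with JSON.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical URL, token, and payload; replace with your endpoint's values.
request = build_inference_request(
    "https://example.invalid/endpoints/my-model/predict",
    "YOUR_TOKEN",
    {"inputs": [[1.0, 2.0, 3.0]]},
)
```

Calling `send(request)` would perform the actual call. The generous default timeout is a deliberate choice: an endpoint that has scaled to zero may need extra time to start a pod before it can answer the first request.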

Predictive vs LLM

Workbench distinguishes two kinds of endpoint because each exposes different tuning options. Predictive endpoints are optimized for classical models with fast, single-input inference. LLM endpoints are built for generative models, where batching, context length, and parallelism matter.
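One way to picture the difference is as two separate settings types with no fields in common. The sketch below is purely illustrative: every class name, field name, and default value is hypothetical and does not reflect the actual Workbench deployment form.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictiveSettings:
    # Hypothetical knobs for fast, single-input inference.
    request_timeout_ms: int = 1000
    max_concurrent_requests: int = 8


@dataclass(frozen=True)
class LLMSettings:
    # Hypothetical knobs for batching, context length, and parallelism.
    max_batch_size: int = 16
    max_context_tokens: int = 4096
    tensor_parallel_size: int = 1


def tuning_fields(settings) -> list:
    """Return the sorted names of the tunable fields on a settings object."""
    return sorted(asdict(settings))


predictive_fields = tuning_fields(PredictiveSettings())
llm_fields = tuning_fields(LLMSettings())
```

Keeping the two kinds as distinct types, rather than one type with many optional fields, mirrors the idea that a setting such as context length has no meaning for a classical model.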

Endpoint concepts

  • Inferences — the endpoint services themselves, with URLs, scaling policy, and deployment status.
  • Runtimes — pre-built serving environments (framework + image + strategy) that endpoints reference.
  • Pods — the running containers behind your endpoints, managed by the autoscaler.
  • Presets — reusable endpoint configurations that pre-fill the deployment form.
  • Secrets — sensitive data (API keys, tokens, credentials) mounted into endpoint pods.
