Use Workbench observability

Monitor GPU utilization, spend, and active alerts across your Workbench

The Observability page gives you cluster-wide rollups of utilization, spend, idle GPU hours, and live alerts. It is the first page to check when reviewing resource usage or diagnosing cost spikes.

Prerequisites

  • A Kubernetes cluster in ready status with Prometheus and Grafana enabled (default for Kubernetes clusters)

Open the Observability dashboard

  1. Click Workbench in the left sidebar.
  2. Click Observability at the bottom of the Workbench sidebar.

KPI tiles

The top of the page shows six tiles:

| Tile | What it shows |
| --- | --- |
| Active sessions | Count of running sessions, with delta versus the previous period |
| Active endpoints | Count of running endpoints, with warm endpoints called out |
| Running training jobs | Count plus total GPU-hours consumed today |
| Spend today | Current burn rate compared to the daily ceiling |
| Spend MTD | Month-to-date spend with an end-of-month forecast |
| Idle GPU hours | Hours where GPU utilization was below 5%; the target to drive down |

If Idle GPU hours exceeds 20% of Spend today, sessions are likely being left open without active use.
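The idle-hours definition and the 20% rule of thumb can be sketched in Python as follows. This is an illustration only, not how Workbench computes the tile. It assumes utilization samples collected at a fixed interval, and because idle hours and spend are in different units, it assumes a per-GPU hourly rate to convert idle hours into cost before comparing. The names `GpuSample`, `idle_gpu_hours`, `idle_spend_exceeds_threshold`, and `hourly_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Thresholds taken from the page text above.
IDLE_UTILIZATION_THRESHOLD_PCT = 5.0   # below this, a GPU counts as idle
IDLE_SPEND_RATIO_WARNING = 0.20        # 20% of Spend today


@dataclass
class GpuSample:
    """One utilization reading for one GPU (hypothetical shape)."""
    gpu_id: str
    utilization_pct: float


def idle_gpu_hours(samples: Iterable[GpuSample], sample_interval_seconds: float) -> float:
    """Sum the time covered by samples whose utilization is below 5%."""
    idle_samples = sum(
        1 for s in samples if s.utilization_pct < IDLE_UTILIZATION_THRESHOLD_PCT
    )
    return idle_samples * sample_interval_seconds / 3600.0


def idle_spend_exceeds_threshold(
    idle_hours: float, hourly_rate: float, spend_today: float
) -> bool:
    """Return True when idle cost is more than 20% of today's spend.

    Assumption: idle cost = idle_hours * hourly_rate, with a single flat rate.
    """
    if spend_today <= 0:
        return False
    return (idle_hours * hourly_rate) / spend_today > IDLE_SPEND_RATIO_WARNING
```

For example, three idle samples taken 15 minutes apart amount to 0.75 idle GPU hours. With mixed GPU SKUs, a per-SKU rate would be needed in place of the single flat rate used here.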

Charts

Three Grafana panels provide detailed breakdowns:

  1. Cluster GPU utilization -- percentage allocated and active by GPU SKU over time
  2. Spend breakdown -- stacked area chart by resource type (sessions, endpoints, training jobs, pipelines, sweeps)
  3. Active alerts -- scheduling failures, image pull errors, OOM kills, and idle warnings

Change the time range

Use the time range dropdown to switch between 1 hour, 24 hours, 7 days, and 30 days. The KPI deltas re-baseline against the matching prior window.
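The re-baselining behavior can be pictured with a short Python sketch. It assumes that "matching prior window" means the window of equal length immediately before the selected one; the page does not spell this out. The names `TIME_RANGES`, `current_and_prior_windows`, and `kpi_delta` are hypothetical and not part of any Workbench API.

```python
from datetime import datetime, timedelta
from typing import Tuple

# The four options offered by the time range dropdown.
TIME_RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}

Window = Tuple[datetime, datetime]


def current_and_prior_windows(now: datetime, range_key: str) -> Tuple[Window, Window]:
    """Return the selected window and the equal-length window just before it."""
    span = TIME_RANGES[range_key]
    current = (now - span, now)
    prior = (now - 2 * span, now - span)  # assumption: immediately preceding window
    return current, prior


def kpi_delta(current_value: float, prior_value: float) -> float:
    """Absolute change of a KPI between the prior and current windows."""
    return current_value - prior_value
```

Under this assumption, selecting 7 days compares the last seven days against the seven days before that, so a tile showing 12 active sessions against 9 in the prior window would report a delta of +3.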
