Use Workbench observability

The Observability page gives you cluster-wide rollups of utilization, spend, idle GPU hours, and live alerts. It is the first page to check when reviewing resource usage or diagnosing cost spikes.

Monitor GPU utilization, spend, and active alerts across your Workbench.

Time: About 2 minutes

You will need: A Kubernetes cluster with Prometheus and Grafana enabled

Outcome: Metrics visible in the observability dashboard

Click Workbench in the left sidebar, then click Observability near the top of the Workbench sidebar.

Metrics and logs are visible in the observability dashboard, and you can monitor session and job performance.

KPI tiles

The top of the page shows six tiles:

Tile	What it shows
Active sessions	Count of running sessions, with delta versus the previous period
Active endpoints	Count of running endpoints, with warm endpoints called out
Running training jobs	Count plus total GPU-hours consumed today
Spend today	Current burn rate compared to the daily ceiling
Spend MTD	Month-to-date spend with an end-of-month forecast
Idle GPU hours	GPU hours where utilization was below 5%; the metric to drive down
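
The two spend tiles compare a running total against a limit or a projection. The page does not say how the end-of-month forecast is calculated, so the following is only a minimal sketch that assumes a straight-line extrapolation of the daily average. The function names and the sample figures are hypothetical and are not part of Workbench.

```python
import calendar
from datetime import date


def forecast_month_end_spend(spend_mtd: float, today: date) -> float:
    """Extrapolate month-to-date spend to month end (assumed linear model)."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # Average daily spend so far, counting today as a full day.
    daily_average = spend_mtd / today.day
    return daily_average * days_in_month


def ceiling_used_fraction(spend_today: float, daily_ceiling: float) -> float:
    """Fraction of the daily ceiling already consumed by today's spend."""
    if daily_ceiling <= 0:
        raise ValueError("daily_ceiling must be positive")
    return spend_today / daily_ceiling


if __name__ == "__main__":
    # Hypothetical figures: 1,500 spent by 15 April, 80 spent today against a ceiling of 200.
    print(forecast_month_end_spend(1500.0, date(2024, 4, 15)))
    print(ceiling_used_fraction(80.0, 200.0))
```

A linear forecast overstates or understates the month when usage is bursty, so treat the result as a rough guide rather than a match for the tile.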

If Idle GPU hours exceeds 20% of Spend today, sessions are likely being left open without active use.
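
To make the idle rule concrete, here is a minimal sketch of how the check could be reproduced from exported per-hour utilization samples. It assumes that the 20% comparison is made on the cost of idle hours, since hours and spend are different units; the page does not confirm this. The `GpuHourSample` record, its `hourly_cost` field, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

IDLE_UTILIZATION_THRESHOLD = 0.05  # "below 5%" from the Idle GPU hours tile
IDLE_SPEND_RATIO_LIMIT = 0.20      # the 20% rule of thumb from the text


@dataclass
class GpuHourSample:
    """One GPU observed for one hour (hypothetical export format)."""
    gpu_id: str
    utilization: float  # mean utilization for the hour, 0.0 to 1.0
    hourly_cost: float  # assumed cost of that GPU for one hour


def idle_gpu_hours(samples: Iterable[GpuHourSample]) -> int:
    """Count GPU hours whose utilization was below the idle threshold."""
    return sum(1 for s in samples if s.utilization < IDLE_UTILIZATION_THRESHOLD)


def idle_cost(samples: Iterable[GpuHourSample]) -> float:
    """Sum the assumed cost of all idle GPU hours."""
    return sum(s.hourly_cost for s in samples
               if s.utilization < IDLE_UTILIZATION_THRESHOLD)


def sessions_likely_left_open(samples: list[GpuHourSample],
                              spend_today: float) -> bool:
    """Apply the 20% rule, comparing idle cost with today's spend."""
    if spend_today <= 0:
        return False
    return idle_cost(samples) / spend_today > IDLE_SPEND_RATIO_LIMIT
```

The threshold is strict, so an hour at exactly 5% utilization is not counted as idle in this sketch.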

Charts

Three Grafana panels provide detailed breakdowns:

Cluster GPU utilization -- percentage allocated and active by GPU SKU over time
Spend breakdown -- stacked area chart by resource type (sessions, endpoints, training jobs, pipelines, sweeps)
Active alerts -- scheduling failures, image pull errors, OOM kills, and idle warnings

Change the time range

Use the time range dropdown to switch between 1 hour, 24 hours, 7 days, and 30 days. The KPI deltas re-baseline against the matching prior window.
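
The re-baselining described above can be pictured as comparing the selected window with the window of equal length that immediately precedes it. The sketch below illustrates that idea under the assumption that both windows are contiguous and end at the current time. The range keys and function names are hypothetical.

```python
from datetime import datetime, timedelta

# The four ranges offered by the dropdown, keyed by hypothetical short names.
TIME_RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def comparison_windows(range_key: str, now: datetime):
    """Return (current, prior) windows as (start, end) tuples of equal length."""
    span = TIME_RANGES[range_key]
    current = (now - span, now)
    prior = (now - 2 * span, now - span)  # ends where the current window starts
    return current, prior


def kpi_delta(current_value: float, prior_value: float) -> float:
    """Difference between a KPI in the current window and in the prior window."""
    return current_value - prior_value


if __name__ == "__main__":
    current, prior = comparison_windows("7d", datetime(2024, 5, 8, 12, 0))
    print("current:", current)
    print("prior:  ", prior)
    print("delta:  ", kpi_delta(14, 11))
```

With the 7 day range selected at noon on 8 May, the current window starts at noon on 1 May and the prior window covers 24 April to 1 May.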
