Use Workbench observability
The Observability page gives you cluster-wide rollups of utilization, spend, idle GPU hours, and live alerts. It is the first page to check when reviewing resource usage or diagnosing cost spikes.
Prerequisites
- A Kubernetes cluster in ready status with Prometheus and Grafana enabled (both are enabled by default for Kubernetes clusters)
Open the Observability dashboard
- Click Workbench in the left sidebar.
- Click Observability at the bottom of the Workbench sidebar.
KPI tiles
The top of the page shows six tiles:
| Tile | What it shows |
|---|---|
| Active sessions | Count of running sessions, with delta versus the previous period |
| Active endpoints | Count of running endpoints, with warm endpoints called out |
| Running training jobs | Count of running training jobs, plus total GPU hours consumed today |
| Spend today | Current burn rate compared to the daily ceiling |
| Spend MTD | Month-to-date spend with an end-of-month forecast |
| Idle GPU hours | Hours during which GPU utilization was below 5%; this is the number to drive down |
If the cost of Idle GPU hours exceeds 20% of Spend today, sessions are likely being left open without active use.
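To make the idle check concrete, here is a minimal Python sketch of how idle GPU hours and their share of daily spend might be computed from utilization samples. It reads the 20% guideline as a comparison of idle cost against Spend today. The sample format, the `hourly_rate` parameter, and the function names are all hypothetical and only illustrate the arithmetic; the dashboard's actual calculation may differ.

```python
from dataclasses import dataclass

IDLE_UTILIZATION_THRESHOLD = 5.0   # percent, from the Idle GPU hours tile definition
IDLE_SPEND_SHARE_THRESHOLD = 0.20  # the 20% guideline, read as a share of spend (assumption)


@dataclass
class UtilizationSample:
    """Average utilization of one GPU over a fixed interval (hypothetical format)."""
    gpu_id: str
    utilization_pct: float
    interval_hours: float


def idle_gpu_hours(samples):
    """Sum the hours in which a GPU's utilization was below the idle threshold."""
    return sum(
        s.interval_hours
        for s in samples
        if s.utilization_pct < IDLE_UTILIZATION_THRESHOLD
    )


def idle_spend_share(idle_hours, hourly_rate, spend_today):
    """Return the fraction of today's spend attributable to idle GPU hours."""
    if spend_today <= 0:
        return 0.0
    return (idle_hours * hourly_rate) / spend_today


# Illustrative data only: four one-hour samples across two GPUs.
samples = [
    UtilizationSample("gpu-0", 82.0, 1.0),
    UtilizationSample("gpu-0", 3.0, 1.0),
    UtilizationSample("gpu-1", 1.5, 1.0),
    UtilizationSample("gpu-1", 4.9, 1.0),
]
hours = idle_gpu_hours(samples)
share = idle_spend_share(hours, hourly_rate=2.0, spend_today=20.0)
needs_review = share > IDLE_SPEND_SHARE_THRESHOLD
print(f"idle hours={hours}, idle share={share:.0%}, review sessions={needs_review}")
```

In this example three of the four sampled hours fall below the 5% threshold, so the idle share comes out above the 20% guideline and the sessions would be worth reviewing.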
Charts
Three Grafana panels provide detailed breakdowns:
- Cluster GPU utilization -- percentage allocated and active by GPU SKU over time
- Spend breakdown -- stacked area chart by resource type (sessions, endpoints, training jobs, pipelines, sweeps)
- Active alerts -- scheduling failures, image pull errors, OOM kills, and idle warnings
Change the time range
Use the time range dropdown to switch between 1 hour, 24 hours, 7 days, and 30 days. The KPI deltas re-baseline against the matching prior window.
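The re-baselining can be pictured with a short Python sketch. It assumes that "matching prior window" means the window of equal length immediately preceding the selected one; the range keys and function names are hypothetical and do not come from the product.

```python
from datetime import datetime, timedelta, timezone

# The four ranges offered by the time range dropdown (keys are illustrative).
TIME_RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def windows(range_key, now):
    """Return ((current_start, current_end), (prior_start, prior_end)).

    The prior window has the same length as the current one and ends
    exactly where the current one begins (an assumption).
    """
    span = TIME_RANGES[range_key]
    current = (now - span, now)
    prior = (now - 2 * span, now - span)
    return current, prior


def delta(current_value, prior_value):
    """Absolute change in a KPI between the current and prior windows."""
    return current_value - prior_value


now = datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc)
current, prior = windows("7d", now)
print("current:", current[0].isoformat(), "->", current[1].isoformat())
print("prior:  ", prior[0].isoformat(), "->", prior[1].isoformat())
print("delta:  ", delta(12, 9))
```

Under this reading, switching from 24 hours to 7 days changes both the window being summarized and the baseline it is compared against, so a delta shown at one range is not directly comparable to a delta shown at another.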
Related
- Manage sessions to pause idle sessions
- Configure autoscaling for endpoint cost control