Observability
Where individual detail pages show metrics for one resource, the Observability tab rolls up the entire workspace. It's the page you check first thing in the morning.
The KPI strip
Six tiles across the top:
- Active sessions — count, with delta vs yesterday.
- Active endpoints — count, with the warm ones called out separately.
- Running training jobs — count + total GPU-hours consumed today.
- Spend today — burn rate vs your daily ceiling.
- Spend MTD — month-to-date, with a forecast for end-of-month.
- Idle GPU hours — paid-for time where GPU util stayed under 5%. The number you want to drive down.
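To make the Idle GPU hours definition concrete, here is a minimal sketch of how such a figure could be derived from utilization samples. The `UtilSample` record, the fixed sampling interval, and the per-sample treatment of "stayed under 5%" are assumptions made for illustration. Only the 5% threshold comes from the tile definition, and this is not a description of how Vantage computes the tile internally.

```python
from dataclasses import dataclass

IDLE_UTIL_THRESHOLD = 5.0  # percent; taken from the tile definition


@dataclass
class UtilSample:
    """Hypothetical record: one utilization reading for one paid-for GPU."""
    gpu_id: str
    util_percent: float


def idle_gpu_hours(samples, interval_seconds=60):
    """Sum the time covered by samples whose utilization is under the threshold.

    Assumes every sample stands for `interval_seconds` of paid-for GPU time.
    """
    idle_samples = sum(1 for s in samples if s.util_percent < IDLE_UTIL_THRESHOLD)
    return idle_samples * interval_seconds / 3600.0


# Illustrative data: one-minute samples across two GPUs.
samples = (
    [UtilSample("gpu-0", 2.0)] * 90      # 90 idle minutes
    + [UtilSample("gpu-0", 80.0)] * 30   # 30 busy minutes
    + [UtilSample("gpu-1", 4.9)] * 30    # 30 idle minutes on a second GPU
)
idle_hours = idle_gpu_hours(samples)
```

With this data, 120 of the 150 one-minute samples fall under the threshold, so `idle_hours` comes out to 2.0.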
Charts
Below the KPIs, three Grafana panels:
- Cluster GPU utilization — % allocated and % active, by GPU SKU, over time.
- Spend breakdown — stacked area chart by resource type (sessions / endpoints / training / pipelines / sweeps).
- Active alerts — anything Vantage's alert rules have fired in the last hour: scheduling failures, image pull errors, OOM kills, idle-too-long warnings.
Time range
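The dropdown in the upper-right corner switches the whole page between the last 1h, 24h, 7d, and 30d. The KPI deltas re-baseline against the prior matching window: with 7d selected, for example, each delta compares the last 7 days against the 7 days before that.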
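The re-baselining behaviour can be sketched as follows. This is an illustration under stated assumptions, not Vantage's code: the `windows` and `kpi_delta` helpers are hypothetical names, and the delta is assumed to be a simple difference between the current window and the window of equal length immediately before it.

```python
from datetime import datetime, timedelta

# The four ranges offered by the dropdown.
RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def windows(range_key, now):
    """Return ((cur_start, cur_end), (prev_start, prev_end)) for a range.

    The previous window has the same length and ends where the current
    one begins.
    """
    span = RANGES[range_key]
    current = (now - span, now)
    previous = (now - 2 * span, now - span)
    return current, previous


def kpi_delta(current_value, previous_value):
    """Absolute change of a KPI versus the prior matching window (assumed)."""
    return current_value - previous_value


# Illustrative timestamp only.
now = datetime(2024, 5, 15, 9, 0)
current, previous = windows("7d", now)
```

Here `current` spans 8 May to 15 May and `previous` spans 1 May to 8 May, so a tile showing 12 active sessions against 9 in the prior window would report a delta of 3.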
Warning
If the cost of your Idle GPU hours is more than 20% of Spend today, you have a sessions-left-open problem. Sort the Sessions list by elapsed time, descending, then ping the owners of anything older than about 24 hours.
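The triage rule above could be scripted roughly like this. The `Session` record and both helper functions are hypothetical, and the sketch assumes idle GPU time has already been converted to a cost in the same currency as Spend today. Treat it as a starting point under those assumptions rather than a Vantage API.

```python
from dataclasses import dataclass
from datetime import timedelta

IDLE_SHARE_LIMIT = 0.20             # 20% of Spend today
STALE_AFTER = timedelta(hours=24)   # "older than about 24h"


@dataclass
class Session:
    """Hypothetical row from the Sessions list."""
    name: str
    owner: str
    elapsed: timedelta


def idle_share_exceeded(idle_cost, spend_today):
    """True when idle cost is more than 20% of today's spend."""
    if spend_today <= 0:
        return False
    return idle_cost / spend_today > IDLE_SHARE_LIMIT


def stale_sessions(sessions):
    """Sort by elapsed time, longest first, and keep sessions past the cutoff."""
    ordered = sorted(sessions, key=lambda s: s.elapsed, reverse=True)
    return [s for s in ordered if s.elapsed > STALE_AFTER]


# Illustrative data only.
sessions = [
    Session("notebook-a", "alice", timedelta(hours=3)),
    Session("notebook-b", "bob", timedelta(hours=50)),
    Session("notebook-c", "carol", timedelta(hours=26)),
]
owners_to_ping = []
if idle_share_exceeded(idle_cost=30.0, spend_today=100.0):
    owners_to_ping = [s.owner for s in stale_sessions(sessions)]
```

With this data the idle share is 30%, so the check trips and `owners_to_ping` lists bob and carol, longest-running session first.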
See Reference for every KPI definition and time-range option.