Observability

Cluster-wide rollups of utilization, spend, idle GPU hours, and live alerts.

Where individual detail pages show metrics for one resource, the Observability tab rolls up the entire workspace. It's the page you check first thing in the morning.

The KPI strip

Six tiles across the top:

  • Active sessions — count, with delta vs yesterday.
  • Active endpoints — count, with the warm ones called out separately.
  • Running training jobs — count + total GPU-hours consumed today.
  • Spend today — burn rate vs your daily ceiling.
  • Spend MTD — month-to-date, with a forecast for end-of-month.
  • Idle GPU hours — paid-for time where GPU util stayed under 5%. The number you want to drive down.
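The Idle GPU hours definition can be illustrated with a minimal Python sketch. The 5% threshold comes from the tile description; the `UtilSample` shape and the `idle_gpu_hours` helper are hypothetical names used for illustration only, and the way Vantage actually samples utilization is not documented here.

```python
from dataclasses import dataclass

IDLE_UTIL_THRESHOLD = 5.0  # percent, taken from the tile definition


@dataclass
class UtilSample:
    """One billed interval for one GPU (hypothetical shape, not a Vantage type)."""
    gpu_id: str
    hours: float         # length of the billed interval, in hours
    util_percent: float  # average GPU utilization over that interval


def idle_gpu_hours(samples: list[UtilSample],
                   threshold: float = IDLE_UTIL_THRESHOLD) -> float:
    """Sum the billed hours whose utilization stayed under the threshold."""
    return sum(s.hours for s in samples if s.util_percent < threshold)


samples = [
    UtilSample("gpu-0", 1.0, 82.0),  # busy, not counted
    UtilSample("gpu-0", 1.0, 3.0),   # idle, counted
    UtilSample("gpu-1", 0.5, 0.0),   # idle, counted
    UtilSample("gpu-1", 2.0, 5.0),   # exactly 5% is not "under 5%", not counted
]
print(idle_gpu_hours(samples))  # 1.5
```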

Charts

Below the KPIs, three Grafana panels:

  1. Cluster GPU utilization — % allocated and % active, by GPU SKU, over time.
  2. Spend breakdown — stacked area chart by resource type (sessions / endpoints / training / pipelines / sweeps).
  3. Active alerts — anything Vantage's alert rules have fired in the last hour: scheduling failures, image pull errors, OOM kills, idle-too-long warnings.

Time range

The dropdown in the upper-right switches the whole page between the last 1h, 24h, 7d, or 30d. The KPI deltas re-baseline against the prior matching window: with 7d selected, for example, each delta compares the last 7 days to the 7 days before that.
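One way to read "prior matching window" is a window of the same length that ends where the current one starts. The sketch below works under that assumption; `RANGES` and `windows` are hypothetical names, not part of any Vantage API.

```python
from datetime import datetime, timedelta, timezone

# The four options offered by the dropdown.
RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def windows(range_key: str, now: datetime):
    """Return ((current_start, current_end), (prior_start, prior_end)).

    Assumption: the prior window has the same length as the current one
    and ends exactly where the current one begins.
    """
    span = RANGES[range_key]
    current = (now - span, now)
    prior = (now - 2 * span, now - span)
    return current, prior


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
current, prior = windows("7d", now)
print(current[0].isoformat())  # 2024-01-08T12:00:00+00:00
print(prior[0].isoformat())    # 2024-01-01T12:00:00+00:00
```

A KPI delta is then the value measured over `current` minus the value measured over `prior`.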

Warning

If the cost of your idle GPU hours is more than 20% of Spend today, you have a sessions-left-open problem. Sort the Sessions list by elapsed time descending, then ping the owners of anything older than ~24h.
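The triage described above can be sketched in Python. The `Session` fields and both helpers are hypothetical, and the sketch assumes idle GPU hours have already been converted to a cost so they can be compared against spend.

```python
from dataclasses import dataclass
from datetime import timedelta

IDLE_SHARE_LIMIT = 0.20                # the 20% rule of thumb
MAX_SESSION_AGE = timedelta(hours=24)  # the ~24h cutoff


@dataclass
class Session:
    """Hypothetical stand-in for one row of the Sessions list."""
    name: str
    owner: str
    elapsed: timedelta


def idle_share(idle_cost: float, spend_today: float) -> float:
    """Fraction of today's spend that went to idle GPU time."""
    if spend_today <= 0:
        return 0.0
    return idle_cost / spend_today


def stale_sessions(sessions: list[Session],
                   max_age: timedelta = MAX_SESSION_AGE) -> list[Session]:
    """Sort by elapsed time, longest first, and keep those older than max_age."""
    ordered = sorted(sessions, key=lambda s: s.elapsed, reverse=True)
    return [s for s in ordered if s.elapsed > max_age]


sessions = [
    Session("nb-a", "ana", timedelta(hours=30)),
    Session("nb-b", "bo", timedelta(hours=2)),
    Session("nb-c", "cy", timedelta(hours=72)),
]

if idle_share(idle_cost=30.0, spend_today=120.0) > IDLE_SHARE_LIMIT:
    for s in stale_sessions(sessions):
        print(f"ping {s.owner} about {s.name}")
# ping cy about nb-c
# ping ana about nb-a
```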

See Reference for every KPI definition and time-range option.