Autoscaling
Replica bounds, autoscaling metrics, and scale-to-zero.
Endpoints scale between min_replicas and max_replicas based on a metric:
- concurrency — number of in-flight requests per replica (good default for synchronous APIs).
- rps — requests-per-second per replica.
- cpu — CPU saturation as a percentage.
Set min_replicas to 0 and turn on scale to zero for a development endpoint that costs nothing when idle. Cold-start latency goes up; the trade is yours.