Knative: Operator’s Handbook

Request-based autoscaling

Knative sees every request that reaches a Pod it deploys, because it injects a sidecar proxy container called queue-proxy into each Pod. This lets Knative autoscale applications without any intervention or configuration.

Recommended: Watch this talk by Tara Gu to learn more about how request-based autoscaling works. It's a good deep dive.

By default, for each Service, Knative uses concurrency (“in-flight requests”) as the autoscaling metric. This metric is collected in near real time, so Knative detects and responds to traffic spikes quickly.

By default, Knative takes into account:

  1. Target concurrency (“in-flight requests” per Pod), which defaults to 100.[1] This is a soft limit.

  2. Concurrency limit configured on the Service/Revision. This is a hard limit.
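As a rough illustration of how these two values interact, the sketch below estimates a Pod count from the observed in-flight requests. It is a simplification under stated assumptions, not the autoscaler's actual code: the function name `desired_pods` and its parameters are hypothetical, a hard limit of 0 is treated as "no limit", and it ignores utilization targets, averaging windows, and scale-rate caps.

```python
import math


def desired_pods(in_flight_requests: float,
                 target_per_pod: float = 100,
                 hard_limit: int = 0) -> int:
    """Estimate how many Pods are needed for the observed in-flight requests.

    target_per_pod is the soft target concurrency (assumed default: 100).
    hard_limit is the per-Pod concurrency limit; 0 means "no limit" in
    this sketch (an assumption made for illustration).
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # A Pod can never be asked to serve more than its hard limit, so the
    # effective per-Pod target is the smaller of the two values.
    effective = target_per_pod if hard_limit <= 0 else min(target_per_pod, hard_limit)
    return math.ceil(in_flight_requests / effective)


print(desired_pods(250))                 # soft target of 100 per Pod
print(desired_pods(250, hard_limit=10))  # hard limit of 10 per Pod
```

With only the soft target, a Pod may briefly serve more than 100 requests while new Pods start. With a hard limit, the excess requests have to wait instead.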

When Knative receives more requests than the existing Pods can handle, it holds on to the excess requests and sends them to new Pod instances once they are ready.
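The hold-and-forward behavior can be pictured with a small queue. This is a toy model under assumptions, not how Knative implements buffering: the `route` function and the notion of `free_slots` (spare capacity across ready Pods) are hypothetical names introduced for illustration.

```python
from collections import deque


def route(requests, free_slots):
    """Dispatch as many requests as ready Pods can take; hold the rest.

    Returns a (dispatched, held) pair of lists, preserving arrival order.
    """
    pending = deque(requests)
    dispatched = []
    while pending and free_slots > 0:
        dispatched.append(pending.popleft())
        free_slots -= 1
    return dispatched, list(pending)


# Three requests arrive, but existing Pods only have room for two.
sent, held = route(["r1", "r2", "r3"], free_slots=2)

# Later, a new Pod becomes ready and the held request is forwarded.
sent_later, held_later = route(held, free_slots=100)
```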

By default, the Knative autoscaler will:


  1. This is the container-concurrency-target-default setting.

  2. This is the stable-window setting.

  3. This is the panic-threshold-percentage setting.

  4. This is the max-scale-up-rate setting.

  5. This is the tick-interval setting.