Knative sees every request that reaches a Pod it deploys, because it injects a sidecar proxy, the queue-proxy container, into each Pod. This lets Knative autoscale applications without any extra intervention or configuration.
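To make the injection concrete, here is a minimal sketch of a Knative Service (the `hello` name and sample image are placeholders); for every Pod created from this Service, Knative adds the queue-proxy container next to the user container so it can observe each request.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                      # placeholder name
spec:
  template:
    spec:
      containers:
        # The container you declare becomes the "user-container";
        # Knative injects a second "queue-proxy" container into the Pod.
        - image: gcr.io/knative-samples/helloworld-go   # placeholder image
```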
Recommended: Watch this talk by Tara Gu to learn more about how request-based autoscaling works. It's a good deep dive.
By default, for each Service, Knative uses the number of concurrent in-flight requests (concurrency) as the autoscaling metric; requests per second is available as an alternative. This metric is collected in near real time, so Knative detects and responds to traffic spikes quickly.
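The metric and per-Pod target can also be set per Revision with annotations on the Service's template; a minimal sketch with illustrative values:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                      # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Scale on in-flight concurrent requests (the default);
        # "rps" switches to requests per second.
        autoscaling.knative.dev/metric: "concurrency"
        # Aim for 50 concurrent requests per Pod instead of the default 100.
        autoscaling.knative.dev/target: "50"
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go   # placeholder image
```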
By default, Knative also takes into account the concurrency limit configured on the Service/Revision (the `containerConcurrency` field). This is a hard limit.
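A minimal sketch of setting that hard limit via the `containerConcurrency` field (name and image are placeholders):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello                      # placeholder name
spec:
  template:
    spec:
      # Hard limit: each Pod serves at most 10 requests at a time;
      # anything beyond that is queued or routed to additional Pods.
      containerConcurrency: 10
      containers:
        - image: gcr.io/knative-samples/helloworld-go   # placeholder image
```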
When Knative receives more requests than the existing Pods can handle, it holds onto those requests and forwards them to new Pod instances once they are ready.
By default, the Knative autoscaler will (the corresponding settings are sketched in the `config-autoscaler` ConfigMap after this list):
- Target 100 concurrent requests per Pod, unless the Service/Revision overrides it.
- Calculate this average concurrency over a 60-second window.
- Go into a "panic mode" when concurrency exceeds 2x this target; in this mode it reacts more quickly to handle the spiky traffic.
- Increase the Pod count by at most 10x every 2 seconds.
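These knobs live in the `config-autoscaler` ConfigMap in the `knative-serving` namespace. A hedged sketch with values that mirror the defaults described above (check your installed Knative version for the exact shipped defaults):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Target concurrent requests per Pod (the 100 figure above).
  container-concurrency-target-default: "100"
  # Window over which the average concurrency is calculated (the 60 seconds above).
  stable-window: "60s"
  # Enter panic mode when concurrency reaches 200% (2x) of the target.
  panic-threshold-percentage: "200.0"
  # Cap on how much the Pod count may grow per evaluation (the 10x figure above).
  max-scale-up-rate: "10.0"
```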