GPU autoscaling on Kubernetes with KEDA: Building an external scaler
A deep technical tutorial on GPU autoscaling with KEDA on Kubernetes, directly applicable to AI infrastructure.
KEDA cannot scale on GPU metrics natively: it is compiled with CGO_ENABLED=0, and the go-nvml bindings need cgo to reach NVML, so the library is inaccessible from inside KEDA. The workaround is a custom external scaler deployed as a DaemonSet, so that one pod runs on each GPU node. Each pod reads local hardware metrics via go-nvml and exposes them over gRPC, which lets KEDA drive Horizontal Pod Autoscaler (HPA) decisions from GPU utilization, memory, temperature, or power draw. Pre-built profiles cover common workloads: vLLM inference scales on memory usage with scale-to-zero, Triton scales on utilization, and training jobs scale on utilization with scale-down disabled.
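The profile behaviour described above can be sketched in Go. This is only an illustration under stated assumptions, not the scaler's actual code: the `Profile` struct, the `GPUReader` interface, the metric names, and every threshold value are hypothetical. A real scaler would read values through go-nvml and serve them via KEDA's external scaler gRPC methods (`IsActive`, `GetMetricSpec`, `GetMetrics`), leaving the replica calculation to the HPA. The `desiredReplicas` function below only mirrors the standard HPA formula so the effect of each profile is visible.

```go
package main

import (
	"fmt"
	"math"
)

// Metric names a GPU signal. The names are hypothetical.
type Metric string

const (
	Utilization Metric = "gpu_utilization_percent"
	Memory      Metric = "gpu_memory_used_percent"
)

// Profile is a hypothetical shape for a workload profile.
type Profile struct {
	Metric         Metric
	Target         float64 // per-replica target value (illustrative)
	Activation     float64 // at or below this value the workload counts as idle
	ScaleToZero    bool
	AllowScaleDown bool
}

// Metric choice and scale-to-zero / no-scale-down follow the summary;
// the numeric thresholds are made up for illustration.
var profiles = map[string]Profile{
	"vllm":     {Metric: Memory, Target: 80, Activation: 5, ScaleToZero: true, AllowScaleDown: true},
	"triton":   {Metric: Utilization, Target: 70, Activation: 5, AllowScaleDown: true},
	"training": {Metric: Utilization, Target: 90, Activation: 5, AllowScaleDown: false},
}

// GPUReader abstracts the hardware read. A real implementation would
// wrap go-nvml; fakeReader stands in so the sketch runs without a GPU.
type GPUReader interface {
	Read(m Metric) (float64, error)
}

type fakeReader map[Metric]float64

func (f fakeReader) Read(m Metric) (float64, error) {
	v, ok := f[m]
	if !ok {
		return 0, fmt.Errorf("metric %q not available", m)
	}
	return v, nil
}

// desiredReplicas applies the HPA rule ceil(current * value / target),
// then layers the profile's scale-to-zero and no-scale-down behaviour on top.
func desiredReplicas(p Profile, r GPUReader, current int) (int, error) {
	v, err := r.Read(p.Metric)
	if err != nil {
		return current, err
	}
	if v <= p.Activation && p.ScaleToZero {
		return 0, nil // idle workload that is allowed to scale to zero
	}
	base := current
	if base < 1 {
		base = 1 // waking up from zero: compute as if one replica exists
	}
	want := int(math.Ceil(float64(base) * v / p.Target))
	if want < 1 {
		want = 1
	}
	if !p.AllowScaleDown && want < current {
		want = current // e.g. training jobs: never shrink a running job
	}
	return want, nil
}

func main() {
	idle, _ := desiredReplicas(profiles["vllm"], fakeReader{Memory: 2}, 3)
	busy, _ := desiredReplicas(profiles["triton"], fakeReader{Utilization: 95}, 2)
	train, _ := desiredReplicas(profiles["training"], fakeReader{Utilization: 30}, 4)
	fmt.Println(idle, busy, train) // prints "0 3 4"
}
```

Keeping the hardware read behind an interface is a design choice for the sketch: the decision logic can be exercised on a machine without a GPU, and only the reader would need cgo and NVML at build time.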
- Deploy the keda-gpu-scaler DaemonSet via Helm and configure ScaledObjects with workload-specific profiles to autoscale GPU deployments on real GPU metrics, including scale-to-zero for idle vLLM instances.
For engineers running LLM serving or agentic inference on Kubernetes, this approach can reduce GPU waste, inference latency, and energy costs by autoscaling on accelerator-level signals rather than CPU and memory proxies.