GPU autoscaling on Kubernetes with KEDA: Building an external scaler

A deep technical tutorial on GPU autoscaling with KEDA on Kubernetes, directly applicable to AI infrastructure.

2026-05-27 AI/ML cncf.io

Summary

KEDA cannot natively scale on GPU metrics: it is compiled with CGO_ENABLED=0, and the NVML bindings need cgo, so NVML is inaccessible from the KEDA binary. A custom external scaler, deployed as a DaemonSet so that one instance runs on each GPU node, reads local hardware metrics via go-nvml and exposes them over gRPC. KEDA can then drive HPA decisions from GPU utilization, memory, temperature, or power draw. Pre-built profiles cover common workloads: vLLM inference scales on memory usage with scale-to-zero, Triton scales on utilization, and training jobs scale on utilization without scale-down.

Key Takeaways

Deploy the keda-gpu-scaler DaemonSet via Helm and configure ScaledObjects with workload-specific profiles to autoscale GPU deployments on real GPU metrics, including scale-to-zero for idle vLLM instances.

Why it matters

For engineers running LLM serving or agentic inference on Kubernetes, autoscaling on accelerator-level signals rather than CPU or memory proxies can reduce GPU waste, inference latency, and energy costs.

Author

epower