
GPU autoscaling on Kubernetes with KEDA: Building an external scaler

Relevance: 9.4
Score breakdown: technical depth 9, novelty 8, actionability 9, community 8, strategic 6, personal 9


A deep technical tutorial on GPU autoscaling with KEDA on Kubernetes, directly applicable to AI infrastructure.

2026-05-27 AI/ML cncf.io
Summary

KEDA cannot natively scale on GPU metrics because it is compiled with CGO_ENABLED=0, making NVML inaccessible. A custom external scaler deployed as a DaemonSet on each GPU node reads local hardware metrics via go-nvml and exposes them over gRPC, enabling KEDA to trigger HPA decisions based on GPU utilization, memory, temperature, or power draw. Pre-built profiles cover common workloads: vLLM inference scales on memory usage with scale-to-zero, Triton on utilization, and training jobs on utilization without scale-down.
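
To make the profile behaviour concrete, here is a minimal Go sketch of how a per-workload profile could turn one GPU reading into a replica count. It is an illustration under stated assumptions, not the scaler's actual code: the `Profile` type, its field names, the profile keys, and all target values are hypothetical. Only the rule `ceil(currentReplicas * currentValue / targetValue)` is the standard Kubernetes HPA calculation. The sketch omits NVML and gRPC and takes the metric value as a plain number.

```go
package main

import (
	"fmt"
	"math"
)

// Profile is a hypothetical stand-in for a pre-built workload profile.
// Field names and values are assumptions made for this sketch.
type Profile struct {
	Metric         string  // GPU signal that drives scaling
	Target         float64 // target value per replica, in percent
	ScaleToZero    bool    // whether an idle workload may drop to 0 replicas
	AllowScaleDown bool    // whether the replica count may ever decrease
}

// Illustrative profiles mirroring the three workloads in the summary.
// The target percentages are invented for the example.
var profiles = map[string]Profile{
	"vllm":     {Metric: "memory_used_percent", Target: 80, ScaleToZero: true, AllowScaleDown: true},
	"triton":   {Metric: "utilization_percent", Target: 70, ScaleToZero: false, AllowScaleDown: true},
	"training": {Metric: "utilization_percent", Target: 90, ScaleToZero: false, AllowScaleDown: false},
}

// DesiredReplicas applies the HPA rule ceil(current * value / target)
// and then the profile's constraints.
func DesiredReplicas(p Profile, current int, value float64) int {
	minReplicas := 1
	if p.ScaleToZero {
		minReplicas = 0
	}
	// When starting from zero replicas, size as if one replica existed.
	base := current
	if base < 1 {
		base = 1
	}
	desired := int(math.Ceil(float64(base) * value / p.Target))
	if desired < minReplicas {
		desired = minReplicas
	}
	// Profiles without scale-down never go below the current count.
	if !p.AllowScaleDown && desired < current {
		desired = current
	}
	return desired
}

func main() {
	cases := []struct {
		profile string
		current int
		value   float64
	}{
		{"vllm", 2, 0},      // idle inference server
		{"triton", 2, 95},   // saturated GPUs
		{"training", 4, 10}, // momentarily quiet training job
	}
	for _, c := range cases {
		p := profiles[c.profile]
		fmt.Printf("%s: %s=%.0f, replicas %d -> %d\n",
			c.profile, p.Metric, c.value, c.current,
			DesiredReplicas(p, c.current, c.value))
	}
}
```

In a real deployment the no-scale-down rule would more likely be expressed through HPA scaling behaviour than inside the scaler. It is folded into one function here only to keep the three profiles comparable.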

Key Takeaways
  • Deploy the keda-gpu-scaler DaemonSet via Helm and configure ScaledObjects with workload-specific profiles to autoscale GPU deployments on real GPU metrics, including scale-to-zero for idle vLLM instances.
Why it matters

For engineers running LLM serving or agentic inference on Kubernetes, this approach can reduce GPU waste, inference latency, and energy costs by autoscaling on accelerator-level signals rather than CPU or memory proxies.

Author

epower
