Article: Two Misconfigurations That Caused Spark OOM Failures on Kubernetes
Real-world Spark on K8s misconfigurations, highly actionable for data and platform engineers.
During a lift-and-shift migration of Spark on Kubernetes to AKS, two infrastructure misconfigurations caused executor OOM failures that appeared only during shuffle stages. First, spark.kubernetes.local.dirs.tmpfs=true made the executors' scratch directories RAM-backed, so shuffle spill consumed node memory rather than disk. Second, a hard podAffinity rule co-located all executors on a single node, concentrating that memory use on one machine. The 1Gi tmpfs-backed scratch volume proved insufficient, and because the compound effect surfaced only under production load, it looked like a Spark memory tuning issue. The fixes were to disable tmpfs, increase the scratch volumes to 10Gi and back them with disk, and switch to preferred podAntiAffinity.
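
The three fixes could be expressed roughly as follows. This is a minimal sketch and not the configuration from the article: the helper names scratch_conf and preferred_anti_affinity, the mount path /tmp/spark-local, the volume name spark-local-dir-1, and the weight of 100 are all assumptions made for illustration. It relies on Spark's documented convention that a volume whose name starts with spark-local-dir- is used as executor scratch space, and on the spark-role=executor label that Spark puts on executor pods. The affinity stanza would typically be supplied through an executor pod template.

```python
import json


def scratch_conf(size_limit: str = "10Gi",
                 mount_path: str = "/tmp/spark-local") -> dict:
    """Spark properties for a disk-backed emptyDir scratch volume.

    mount_path is a hypothetical location; size_limit follows the 10Gi
    figure given in the summary above.
    """
    # Volumes named "spark-local-dir-*" are picked up by Spark as local
    # (shuffle/spill) storage. No "options.medium" is set, so the emptyDir
    # uses the node's default storage medium rather than memory.
    prefix = "spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1"
    return {
        "spark.kubernetes.local.dirs.tmpfs": "false",  # stop backing scratch dirs with RAM
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.options.sizeLimit": size_limit,
    }


def preferred_anti_affinity(weight: int = 100) -> dict:
    """Soft anti-affinity that asks the scheduler to spread executors across nodes.

    The weight is an illustrative choice (Kubernetes accepts 1 to 100).
    """
    return {
        "podAntiAffinity": {
            # "preferred..." is a soft rule: the scheduler spreads pods when it
            # can, but still places them if only one node has capacity.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchLabels": {"spark-role": "executor"}
                        },
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    # Print the properties as spark-submit flags, then the affinity stanza
    # as JSON suitable for pasting into a pod template spec.
    for key, value in scratch_conf().items():
        print(f"--conf {key}={value}")
    print(json.dumps({"affinity": preferred_anti_affinity()}, indent=2))
```

A preferred rule is used here instead of a required one so that executors can still be scheduled when the cluster has too few nodes to spread them out, which matches the summary's choice of preferred podAntiAffinity.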