What are important data systems problems, ignored by research? (2024)
Discussion of unsolved data systems problems; directly relevant to data engineering research interests, high technical depth.
A panel at the Dutch-Belgian DataBase Day, featuring Allison Lee (Snowflake), Andy Pavlo (CMU), and Hannes Mühleisen (DuckDB/CWI), flagged variable-length strings as severely understudied. Strings make up roughly 50% of columns in Redshift, yet they receive little research attention, largely because simple benchmarks such as TPC-H treat them as fixed-size objects. The discussion also surfaced other overlooked areas, including network connection handling, query scheduling, and the single-node versus distributed processing debate, and noted that few SIGMOD/VLDB papers address these practical problems.
- Audit your data stack for string compression and variable-length string support, and treat benchmark results with skepticism: real-world workloads differ significantly from TPC-H.
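
One way to start the audit suggested above is to measure how much of a table is string data and how much the string lengths vary. The following is a minimal sketch, not a method from the panel: `profile_string_columns` and the `sample` table are hypothetical names, and it assumes a small in-memory sample of each column has already been pulled from the warehouse.

```python
from statistics import mean, pstdev


def profile_string_columns(columns):
    """Profile a hypothetical in-memory sample: dict of column name -> list of values.

    Returns the share of columns that are string-typed, plus per-column
    length statistics in UTF-8 bytes.
    """
    report = {}
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        # Count a column as a string column only if every non-null value is a str.
        if not non_null or not all(isinstance(v, str) for v in non_null):
            continue
        lengths = [len(v.encode("utf-8")) for v in non_null]
        report[name] = {
            "min_bytes": min(lengths),
            "max_bytes": max(lengths),
            "mean_bytes": mean(lengths),
            # A high spread means a fixed-width layout would waste space.
            "stdev_bytes": pstdev(lengths),
            # A low distinct ratio hints that dictionary encoding may pay off.
            "distinct_ratio": len(set(non_null)) / len(non_null),
        }
    string_share = len(report) / len(columns) if columns else 0.0
    return string_share, report


# Illustrative sample data (made up for this sketch).
sample = {
    "order_id": [1, 2, 3, 4],
    "country": ["NL", "BE", "NL", "NL"],
    "comment": ["ok", "late delivery, refund requested", None, "ok"],
    "amount": [9.5, 12.0, 3.25, 7.0],
}

share, report = profile_string_columns(sample)
print(f"string columns: {share:.0%}")
for name, stats in report.items():
    print(name, stats)
```

Columns with a wide gap between `min_bytes` and `max_bytes` are the ones a fixed-size assumption misrepresents, so they are reasonable candidates to test first when comparing engines or compression settings.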
For a solutions architect focused on data infrastructure and platform engineering, these gaps directly affect real-world query performance and storage costs, especially given the dominance of string data in analytical workloads.