What are important data systems problems, ignored by research? (2024)

6.9 relevance

Discussion of unsolved data systems problems; directly relevant to data engineering research interests, high technical depth.

2026-05-29 General databasearchitects.blogspot.com

Summary

A panel at Dutch-Belgian DataBase Day, featuring Allison Lee (Snowflake), Andy Pavlo (CMU), and Hannes Mühleisen (DuckDB/CWI), flagged variable-length strings as severely understudied, even though they make up ~50% of columns in Redshift. One reason is that simple benchmarks such as TPC-H allow strings to be treated as fixed-size objects. The discussion also surfaced other overlooked areas, including network connection handling, query scheduling, and the debate over single-node versus distributed processing; few SIGMOD/VLDB papers address these practical problems.

Key Takeaways

Audit your data stack for string compression and variable-length string support, and treat benchmark results with skepticism: real-world workloads differ significantly from TPC-H.
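One possible starting point for such an audit is sketched below. It is a minimal illustration, not a method from the panel or the blog post: the function name `audit_string_columns`, the row-as-dict input format, and the "padding waste" metric (the share of bytes that would be wasted if every value were stored at the column's maximum width) are all assumptions made for this example.

```python
from statistics import mean


def audit_string_columns(rows):
    """Summarize string columns in a sample of rows (a list of dicts).

    Hypothetical helper: reports the share of columns that hold strings
    and, per string column, how much the value lengths vary.
    """
    if not rows:
        return {"string_column_share": 0.0, "columns": {}}

    columns = rows[0].keys()
    report = {}
    for name in columns:
        values = [r[name] for r in rows if r.get(name) is not None]
        # Treat a column as a string column only if every non-null value is a str.
        if not values or not all(isinstance(v, str) for v in values):
            continue
        lengths = [len(v.encode("utf-8")) for v in values]
        max_len = max(lengths)
        fixed_width_bytes = max_len * len(lengths)
        report[name] = {
            "min_bytes": min(lengths),
            "max_bytes": max_len,
            "mean_bytes": mean(lengths),
            # Fraction of bytes lost to padding under fixed-width storage.
            "padding_waste": (
                1 - sum(lengths) / fixed_width_bytes if fixed_width_bytes else 0.0
            ),
        }
    return {
        "string_column_share": len(report) / len(columns),
        "columns": report,
    }
```

Run against a representative sample of each table, a high `string_column_share` combined with a high `padding_waste` suggests that fixed-size assumptions in a benchmark would misrepresent that workload.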

Why it matters

For a solutions architect focused on data infrastructure and platform engineering, these gaps directly affect real-world query performance and storage costs, especially given the dominance of string data in analytical workloads.

Author

Viktor Leis