I like to take time off between Christmas and New Year’s Day to recharge and reflect on the past year, as well as to plan my personal and professional priorities for the upcoming year. Over the past week or so, I have been thinking about my involvement in the DataFusion project and how I can more effectively contribute to it in 2023.
The project has gained a tremendous amount of momentum over the past couple of years, with many more contributors getting involved, including some full-time contributors from companies that are building commercial products on top of DataFusion.
I myself was lucky enough to have a chance to work on DataFusion full-time in my day job at NVIDIA for a few months this year, helping the Dask SQL project migrate from Apache Calcite to DataFusion for SQL query planning.
However, my main focus at work is contributing to the Spark RAPIDS plugin, which allows Apache Spark SQL and ETL jobs to run on GPU without code changes, so my involvement with DataFusion is still largely part-time, driven by my desire to continue gaining expertise in building query engines. It’s a fun, if addictive, hobby.
With such limited time to contribute, and given the high level of activity in the project now, it is no longer practical for me to stay on top of code reviews, so I am looking at other ways to contribute to the success of the project.
Here are the areas that I am planning on working on in 2023.
I was slow to recognize the importance of Python to the project. Python, not Scala or Rust, is the de facto language for data science and data engineering. Given that so many people are using data frameworks in Python, such as Pandas, PySpark, Dask, Dask SQL, and Polars, I believe that it is important to improve the quality of DataFusion’s Python bindings so that DataFusion is a viable alternative to some of these frameworks. My hope is that this will attract more contributors to the project.
Raw performance is not the number one priority today for many contributors because DataFusion is used as the base for other query engines that may provide their own query plan optimizations and/or execution plans, but many people will judge DataFusion based on performance for popular benchmarks.
Over the past week, I started working on two new benchmark repositories, SQLBench-H and SQLBench-DS, derived from the respective TPC benchmarks. At the time of this blog post, neither is ready for general use, but they should be by the end of January 2023.
I have also purchased a Mac Mini M1 to run these benchmarks at “laptop-scale,” meaning scale factors of 100 GB or less. My plan is to automate running these benchmarks daily against the DataFusion and Ballista master branches and publish the results online, so that we can track progress over time and catch regressions earlier.
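To make the idea of catching regressions concrete, here is a minimal sketch of how two sets of benchmark timings could be compared. It is purely illustrative and is not the actual SQLBench tooling: the function name `find_regressions`, the 10% threshold, and the timings are all assumptions made up for this example.

```python
from statistics import median


def find_regressions(baseline, latest, threshold=1.10):
    """Return queries whose median time grew by more than `threshold`.

    Both arguments map a query name to a list of run times in seconds.
    The 10% default threshold is an arbitrary choice for illustration.
    """
    regressions = {}
    for query, base_times in baseline.items():
        if query not in latest:
            continue  # query was not run in the latest batch
        base = median(base_times)
        new = median(latest[query])
        if base > 0 and new / base > threshold:
            # Record the slowdown factor, rounded for readability
            regressions[query] = round(new / base, 2)
    return regressions


# Hypothetical timings (seconds) for two queries, three runs each
baseline = {"q1": [2.0, 2.1, 1.9], "q2": [5.0, 5.2, 5.1]}
latest = {"q1": [2.0, 2.05, 1.95], "q2": [6.3, 6.1, 6.2]}

print(find_regressions(baseline, latest))  # {'q2': 1.22}
```

Using the median of several runs rather than a single timing is one simple way to reduce noise on a small machine, although a real setup would probably need something more robust.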
I also plan on adding comparisons to other open-source query engines over time. The goal is not just to compare raw numbers, but also to make it easier to compare query plans between different engines. To help with this effort, I have been experimenting with a simple Query Plan Markup Language to make it easier to produce consistent diagrams from different query engines.
A frequent complaint that I see is that DataFusion’s documentation is lacking. I think this is a fair criticism. Many contributors (myself included) are often more interested in contributing new features than in writing documentation.
As I spend more time as a user of DataFusion, I plan on contributing more to the user guide based on my experiences.
To summarize, what I want from DataFusion in 2023 is a better first-time experience for end users, and I plan on contributing to areas that I hope will help with that.
Want to learn more about query engines? Check out my book "How Query Engines Work".