DataFusion update 2/11/18

February 11, 2018

Following on from my blog post Rust is for Big Data, I announced my open source distributed data processing project, DataFusion, on Reddit last week. The announcement was probably a bit premature since the project was at such an early stage, but I was excited to have some simple queries working with quite decent performance (roughly 2x the performance of Apache Spark) and wanted to start generating some interest in the project.

That Reddit post led to some fantastic feedback, which I have been reflecting on over the past week. Since then the project has gained more than 140 followers on GitHub and I have approved five pull requests from two new contributors. It’s great to see that I’m not alone in thinking that Rust could be a good fit for Big Data and distributed computing thanks to its advantages over JVM-based platforms such as Apache Spark.

I have limited time to work on the project, but this weekend I made some good progress: it is now possible to run a standalone worker node and execute queries from a command-line SQL console. For more detail please see this new guide.

It is also now possible to use ORDER BY expressions in SQL statements, and there is a corresponding sort() method on the DataFrame trait. Sorting is in-memory only so far.
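
To give a rough idea of what an in-memory ORDER BY involves, here is a minimal sketch in Rust. It is not DataFusion's actual code: the `Value`, `SortKey` and `sort_rows` names are hypothetical and exist only for this illustration, and it assumes rows are held as plain vectors that fit entirely in memory.

```rust
use std::cmp::Ordering;

/// Hypothetical cell value for this sketch (not DataFusion's real type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

/// A row is just a vector of cell values in this sketch.
type Row = Vec<Value>;

/// Hypothetical sort key: which column to sort on and in which direction.
struct SortKey {
    column: usize,
    ascending: bool,
}

/// Compare two cells. Mixed types are ordered arbitrarily (ints first)
/// so that the comparison is total; a real engine would reject them earlier.
fn compare_values(a: &Value, b: &Value) -> Ordering {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => x.cmp(y),
        (Value::Str(x), Value::Str(y)) => x.cmp(y),
        (Value::Int(_), Value::Str(_)) => Ordering::Less,
        (Value::Str(_), Value::Int(_)) => Ordering::Greater,
    }
}

/// Sort all rows in memory, applying the keys in order like
/// `ORDER BY key1, key2, ...` would.
fn sort_rows(rows: &mut Vec<Row>, keys: &[SortKey]) {
    rows.sort_by(|a, b| {
        for key in keys {
            let ord = compare_values(&a[key.column], &b[key.column]);
            let ord = if key.ascending { ord } else { ord.reverse() };
            if ord != Ordering::Equal {
                return ord;
            }
        }
        Ordering::Equal
    });
}

fn main() {
    let mut rows: Vec<Row> = vec![
        vec![Value::Str("London".to_string()), Value::Int(3)],
        vec![Value::Str("Denver".to_string()), Value::Int(7)],
        vec![Value::Str("Boston".to_string()), Value::Int(7)],
    ];

    // Roughly equivalent to: ORDER BY col1 DESC, col0 ASC
    let keys = [
        SortKey { column: 1, ascending: false },
        SortKey { column: 0, ascending: true },
    ];
    sort_rows(&mut rows, &keys);

    for row in &rows {
        println!("{:?}", row);
    }
}
```

Because every row has to be held in one vector, this approach only works while the data fits in memory, which is the same limitation the current sort has.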

These are just more baby steps and there is a long way to go until this becomes a useful platform. The next steps are adding support for data partitioning and the ability to run multiple worker nodes in a cluster.

If you are interested in contributing to this project, please contact me.

Andy Grove