Rust is for Big Data (#rust2018)

January 28, 2018

This blog post isn’t so much about what I want from the Rust language in 2018, but more about where I see an opportunity for Rust to gain more widespread use in 2018.

I’ve been following the Rust language for a couple of years now, after a co-worker introduced me to it and mentored me in getting a simple project up and running. I was also lucky enough to attend the very first RustConf in 2016, where there was a lot of talk about the opportunities for Rust to have a big impact on the server due to its inherent memory safety (no more buffer overflow attacks) and its performance and scalability with the recently released futures and tokio crates.

Fast-forward to 2018, and I think I have identified an area where Rust is uniquely suited and can make a big impact: distributed data processing.

In my day job, I spend a lot of time building distributed data processing jobs with Apache Spark. It’s a powerful platform and it gets the job done but it could be so much better.

Apache Spark started out as a fairly simple project but suffered from some predictable performance and scalability challenges due to the use of Java serialization to transfer data between nodes and the overhead of garbage collection. Over the years some brilliant engineering has gone into Spark to address these issues. In particular, Project Tungsten made huge strides by storing data off-heap in a binary format rather than using Java objects (thus reducing the garbage collection and serialization overheads). Also, bytecode generation was employed to make job execution more efficient since it had been identified that CPU was the main bottleneck. In other words, all of this brilliant engineering was about making a JVM product make less use of the JVM.
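To make the off-heap idea more concrete, here is a loose Rust analogy: rows stored as flat, contiguous columns rather than as one heap-allocated object per row. This is only an illustrative sketch with hypothetical type and field names, not Spark or Tungsten code, but it shows why this kind of layout avoids per-row allocation and garbage collection entirely.

```rust
/// One column of 64-bit integers plus a validity bitmap for nulls.
/// The names and layout here are hypothetical, for illustration only.
struct Int64Column {
    values: Vec<i64>,    // contiguous, cache-friendly storage
    validity: Vec<bool>, // true = value present, false = null
}

/// A batch of rows stored column-by-column rather than as one object per row.
struct RecordBatch {
    amounts: Int64Column,
}

impl RecordBatch {
    /// Sum the `amounts` column, skipping nulls. Scanning a flat Vec<i64>
    /// involves no per-row allocation and no garbage collector.
    fn sum_amounts(&self) -> i64 {
        self.amounts
            .values
            .iter()
            .zip(&self.amounts.validity)
            .filter(|(_, valid)| **valid)
            .map(|(v, _)| *v)
            .sum()
    }
}

fn main() {
    let batch = RecordBatch {
        amounts: Int64Column {
            values: vec![10, 20, 30],
            validity: vec![true, false, true],
        },
    };
    assert_eq!(batch.sum_amounts(), 40);
}
```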

What if Apache Spark had been implemented in Rust?

I have a hypothesis that had Apache Spark been implemented in Rust from the beginning then the performance of even a naive implementation would have been pretty good to start with, but more importantly, would have been more predictable and reliable. No more tweaking job parameters to avoid the dreaded OutOfMemoryError.

Apache Spark has become the de facto standard for distributed data processing, but I would love to see what is possible with Rust if we (the Rust community) can come up with something even better for the future.

I have started an open source project, DataFusion, to explore this. The project is in a very early stage of development, but there are trivial working examples using a DataFrame API and a SQL API.
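For a flavour of what these two APIs look like, here is a minimal sketch using the modern Apache DataFusion crate. Note that the API shown (SessionContext, register_csv, read_csv) reflects recent DataFusion releases rather than the 2018 prototype, and it assumes the datafusion and tokio crates plus a local example.csv file with columns a and b.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // SQL API: register a CSV file as a table, then query it by name.
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;
    let df = ctx.sql("SELECT a, b FROM example WHERE a > 10").await?;
    df.show().await?;

    // DataFrame API: the same kind of query expressed programmatically.
    let df = ctx
        .read_csv("example.csv", CsvReadOptions::new())
        .await?
        .filter(col("a").gt(lit(10)))?
        .select(vec![col("a"), col("b")])?;
    df.show().await?;

    Ok(())
}
```

Both forms build a logical query plan that is then optimized and executed, so the choice between SQL and the DataFrame API is largely a matter of style.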

I’m going to continue working on this in my spare time throughout 2018, primarily as a way to become a better Rust developer, but I also think this could evolve into something very useful over time.

If you are interested in Rust and Big Data, please take a look at this project and consider getting involved!

Updates: May 2024

A lot has happened since I first published this blog post!

  • Feb 2019: DataFusion was donated to the Apache Arrow project
  • Jul 2019: I started work on Ballista for distributed query execution
  • Feb 2020: I self-published a short book “How Query Engines Work” that explains the design of DataFusion and Ballista
  • Apr 2021: Ballista was donated to the Apache Arrow project
  • Mar 2024: Apple donated DataFusion Comet, which accelerates Apache Spark using DataFusion
  • Apr 2024: I joined Apple to work on Comet
  • May 2024: DataFusion became a top-level Apache project

Want to learn more about query engines? Check out my book "How Query Engines Work".