I wanted to share some updates on Ballista for the small group of you that have been following my progress on this project. If you are reading this blog post and are not familiar with the project, Ballista is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.
I would say that Ballista's progress to date is best described as a series of interesting proofs of concept that demonstrate many possibilities but have so far failed to deliver anything of real value, other than showcasing the power of Rust and Apache Arrow for distributed compute and helping drive requirements for DataFusion (Arrow's Rust-native in-memory query engine, which Ballista depends on).
Ballista is an ambitious project, especially because I work on it in my spare time with occasional periods of intense activity separated by long periods of apparent inactivity (although I am often using this time to contribute features to the next version of DataFusion). The sporadic nature of my contributions has made it challenging to build a community around this project. With little time available to work on the project, I have tended to prioritize coding over communication or collaboration.
However, this weekend I decided to restart the Rust implementation pretty much from scratch, now that the DataFusion 3.0.0 release is imminent. This makes it a great time to reboot my efforts at building a community, and I have decided this time to blog frequently as I work on the project. The goal of these blog posts is to explain what I am working on, what challenges I am facing or have resolved, and what comes next.
At the very least, this might help people follow along with my thought process and have the opportunity to give me feedback along the way.
So let's dive in and talk about what I worked on this weekend, the current state of the project, and the immediate goals.
The new release of DataFusion has a number of improvements that make it easier to extend. This means that Ballista can now be a much simpler project and no longer needs to duplicate the DataFrame, Logical Plan, and Physical Plan from DataFusion.
Ballista will extend DataFusion to support distributed query execution of DataFusion queries by providing the following components:
- Serde code to support serialization and deserialization of logical and physical query plans in protocol buffer format (so that full or partial query plans can be sent between processes).
- Executor process implementing the Flight protocol that can receive a query plan and execute it.
- Shuffle write operator that can store the partitioned output of a query in the executor’s memory, or persist it to the file system.
- Shuffle read operator that can read shuffle partitions from other executors in the cluster.
- Distributed query planner / scheduler that will start with a DataFusion physical plan and insert shuffle read and write operators as necessary and then execute the query stages.
- Kubernetes support so that clusters can easily be created.
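To make the shuffle mechanics above a little more concrete, here is a minimal sketch of what a shuffle write might do: hash-partition rows so that each downstream query stage can fetch just the partitions it needs. All the types and names here are hypothetical stand-ins for illustration (a real implementation would operate on Arrow RecordBatches and write Arrow IPC data, not a plain `Vec<i64>`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for an Arrow RecordBatch: a single column of i64 keys.
#[derive(Debug, Clone)]
struct Batch {
    keys: Vec<i64>,
}

/// Sketch of a shuffle write: hash each row's key to decide which of the
/// `num_partitions` output partitions it belongs to. A shuffle read operator
/// on another executor would later fetch one of these partitions.
fn shuffle_write(input: &Batch, num_partitions: usize) -> Vec<Batch> {
    let mut partitions: Vec<Batch> = (0..num_partitions)
        .map(|_| Batch { keys: Vec::new() })
        .collect();
    for &k in &input.keys {
        let mut hasher = DefaultHasher::new();
        k.hash(&mut hasher);
        let p = (hasher.finish() as usize) % num_partitions;
        partitions[p].keys.push(k);
    }
    partitions
}

fn main() {
    let input = Batch { keys: (0..10).collect() };
    let parts = shuffle_write(&input, 4);
    let total: usize = parts.iter().map(|b| b.keys.len()).sum();
    println!("{} rows spread across {} partitions", total, parts.len());
}
```

The key property is that every row lands in exactly one output partition, and all rows with the same key land in the same partition, which is what lets stages like distributed hash aggregates and joins run independently per partition.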
The current status is that I renamed the rust/ballista directory to rust/ballista-old and have started the process of porting code over one piece at a time to get it working with the latest version of Apache Arrow and DataFusion.
So far, there is reasonable progress on the serde code for serializing query plans. Rather than finish implementing all of this right away, I would rather start getting some end-to-end functionality working again, such as the ability to start an executor process and submit a query plan for execution. This means that there are opportunities for others to help finish the serde module so that it supports all the operators and expressions provided by DataFusion. I have made an effort to make it obvious in the code where contributions are needed.
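To give a feel for what this serde work looks like, here is a heavily simplified sketch of the pattern: converting between a logical plan enum and a protobuf-style message, where any plan node without a conversion yet returns an error. All types here are hypothetical stand-ins, not the actual DataFusion or Ballista definitions; the error arms mark exactly the kind of gap where contributions would slot in:

```rust
/// Hypothetical, minimal stand-in for a DataFusion logical plan
/// (the real enum has many more variants).
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    TableScan { table: String },
    Limit { n: usize, input: Box<LogicalPlan> },
}

/// Hypothetical protobuf-style message, shaped like what prost might generate.
#[derive(Debug, Clone, PartialEq)]
struct PlanProto {
    node_type: String,
    table: Option<String>,
    n: Option<usize>,
    input: Option<Box<PlanProto>>,
}

/// Serialize a plan node (recursively) into its protobuf form.
/// An unsupported variant would return Err here, flagging where
/// a contribution is needed.
fn to_proto(plan: &LogicalPlan) -> Result<PlanProto, String> {
    match plan {
        LogicalPlan::TableScan { table } => Ok(PlanProto {
            node_type: "scan".into(),
            table: Some(table.clone()),
            n: None,
            input: None,
        }),
        LogicalPlan::Limit { n, input } => Ok(PlanProto {
            node_type: "limit".into(),
            table: None,
            n: Some(*n),
            input: Some(Box::new(to_proto(input)?)),
        }),
    }
}

/// Deserialize the protobuf form back into a plan node.
fn from_proto(proto: &PlanProto) -> Result<LogicalPlan, String> {
    match proto.node_type.as_str() {
        "scan" => Ok(LogicalPlan::TableScan {
            table: proto.table.clone().ok_or("scan missing table")?,
        }),
        "limit" => Ok(LogicalPlan::Limit {
            n: proto.n.ok_or("limit missing n")?,
            input: Box::new(from_proto(
                proto.input.as_ref().ok_or("limit missing input")?,
            )?),
        }),
        other => Err(format!("unsupported node type: {}", other)),
    }
}

fn main() {
    let plan = LogicalPlan::Limit {
        n: 10,
        input: Box::new(LogicalPlan::TableScan { table: "t".into() }),
    };
    let round_trip = from_proto(&to_proto(&plan).unwrap()).unwrap();
    assert_eq!(plan, round_trip);
}
```

The round trip (plan → protobuf → plan) is the essential invariant, since full or partial plans must survive being sent between the scheduler and executors.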
I’m out of time for this weekend. I’ll try to keep a regular cadence with these blog posts, publishing updates at least monthly. I don’t have comments enabled on this blog, so the best way to discuss getting involved is to join the Discord or Gitter chats.
Want to learn more about query engines? Check out my book "How Query Engines Work".