I have been talking about distributed computing in Rust for a long time now. It is more than two and a half years since my Rust is for Big Data blog post where I first talked about the prototype I was working on at the time (which eventually became DataFusion and is now part of Apache Arrow).

One year ago, over the July 4th weekend, I started again with a new project named “Ballista”. The idea was to build on the foundations provided by Apache Arrow and DataFusion and demonstrate distributed compute. I put together a neat little demo and it got a lot of interest but unfortunately, it was a hot mess of hacked-together code with no coherent architecture and it was impossible to maintain. I also did not understand Rust futures well enough at the time to make it real.

However, six months ago, I decided to pretty much start from scratch with Ballista and I am happy to announce that an alpha release of Ballista 0.3.0 is now available that truly supports distributed queries. I don’t want to oversell the capabilities of the current release. It should be viewed as a proof-of-concept still since it only supports a small number of operators and expressions, but it is sufficient to run something very close to TPCH query 1, as shown in this brief demonstration:

I’m excited about this release because it means anyone can now easily deploy a Ballista cluster into Kubernetes (or run a local test cluster using docker-compose) and try running some queries against their data. The project is also at a point where it is easier to contribute to, in order to add more functionality, such as additional operators and expressions.

The performance of distributed queries is not yet optimized and that will be one of the main areas to be improved before the full 0.3.0 release is made available in August 2020.

If you would like to try Ballista out, please checkout the user guide or the github repository for more information.