Ballista Reboot

April 18, 2020

In July 2019, I created Ballista as a small proof of concept (PoC) of parallel query execution across a number of Rust executors running in Kubernetes. The PoC generated a good discussion on Hacker News, which suggested to me that there is interest in a platform like this. Unfortunately, it was far from usable for anything real and lacked a well-designed architecture.

Over the past few months, I have had the opportunity to discuss this project with some really smart people in the industry, and this has inspired me to reboot the project with a slightly different focus.

Ballista still aims to be a modern distributed compute platform with Rust as a first-class citizen, but the emphasis is now on an architecture that is not tied to a single programming language. The reality is that many developers have existing code, tools, and language preferences for the particular types of workloads they need to run, and it shouldn’t be necessary to use a single language end-to-end. I am now of the opinion that designing a distributed compute platform around [insert favorite programming language] is a mistake, even if the language is Rust!

Supporting multiple languages from the start helps to enforce an architecture that is language-agnostic.

The foundational technologies in Ballista are:

Apache Arrow Flight protocol for efficient data transfer between processes.
Google Protocol Buffers for serializing query plans.
Docker for packaging up executors along with user-defined code.
Kubernetes for deployment and management of the executor Docker containers.

Ballista supports a number of query engines. Each one can be deployed as an executor that accepts a query plan and executes it:

Ballista JVM Executor: This is a query engine written in Kotlin, built on Apache Arrow, and based on the design in the book How Query Engines Work.
Ballista Rust Executor: This is a Rust-native query engine built on Apache Arrow and DataFusion.
Apache Spark Executor for Ballista: This is a wrapper around Apache Spark and allows Ballista queries to be executed by Apache Spark.

Ballista provides DataFrame APIs for Rust and Kotlin. The Kotlin DataFrame API can be used from Java, Kotlin, and Scala because Kotlin compiles to JVM bytecode and is fully interoperable with Java.

The goal is for any client to be able to use any query engine.

I am currently working on two examples that demonstrate the capabilities of this architecture.

Rust Parallel Query Execution

This example is based on the original PoC and demonstrates running a simple HashAggregate query in parallel across a number of Rust executors that are deployed to a Kubernetes cluster. Because Ballista does not yet have any distributed query planning or scheduling capabilities, the client in this example sends the same query to each executor in parallel, collects the partial results locally, and then runs a secondary aggregate query over them to produce the final result.

The full source code for this example can be found here.
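The two-stage aggregation described above can be sketched in a few lines of plain Rust. This is only an illustration of the idea and not Ballista code: threads stand in for remote executors, in-memory vectors stand in for data partitions, and the names `Partial`, `partial_aggregate`, `final_aggregate`, and `run_parallel` are hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

/// Partial aggregate state for one group: enough to derive SUM, COUNT, and AVG.
/// (Hypothetical type for illustration; not part of Ballista.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Partial {
    pub sum: f64,
    pub count: u64,
}

/// First stage: a hash aggregate over a single partition.
/// This stands in for the work that one executor would perform.
pub fn partial_aggregate(rows: &[(String, f64)]) -> HashMap<String, Partial> {
    let mut groups: HashMap<String, Partial> = HashMap::new();
    for (key, value) in rows {
        let entry = groups
            .entry(key.clone())
            .or_insert(Partial { sum: 0.0, count: 0 });
        entry.sum += *value;
        entry.count += 1;
    }
    groups
}

/// Second stage: runs locally and merges the per-partition states by key.
pub fn final_aggregate(partials: Vec<HashMap<String, Partial>>) -> HashMap<String, Partial> {
    let mut merged: HashMap<String, Partial> = HashMap::new();
    for partial in partials {
        for (key, state) in partial {
            let entry = merged
                .entry(key)
                .or_insert(Partial { sum: 0.0, count: 0 });
            entry.sum += state.sum;
            entry.count += state.count;
        }
    }
    merged
}

/// Runs the first stage on one thread per partition (threads are an assumption
/// standing in for remote executors), then merges the results locally.
pub fn run_parallel(partitions: Vec<Vec<(String, f64)>>) -> HashMap<String, Partial> {
    let handles: Vec<thread::JoinHandle<HashMap<String, Partial>>> = partitions
        .into_iter()
        .map(|rows| thread::spawn(move || partial_aggregate(&rows)))
        .collect();
    let partials: Vec<HashMap<String, Partial>> = handles
        .into_iter()
        .map(|handle| handle.join().expect("executor thread panicked"))
        .collect();
    final_aggregate(partials)
}
```

Calling `run_parallel` with one `Vec<(String, f64)>` per partition returns the merged state per key. Note that the second stage sums the partial sums and counts rather than re-running the original aggregate: an average, for example, has to be derived from the merged `sum` and `count`, because averaging per-partition averages gives the wrong answer when partitions differ in size.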

Rust bindings for Apache Spark

This example demonstrates a Rust client performing a query against an Apache Spark executor. The Spark executor is essentially just a Spark driver that is hosting a Flight server that accepts queries and then executes them against the Spark context. It is conceptually similar to the Spark SQL Thrift Server.

The full source code for this example can be found here.

Future Work

I plan to work on the following areas over the coming months, roughly in this order:

Expand the Rust DataFrame API and the Spark Executor so that more of Spark can be used from Rust
Add support for executing Rust code in Apache Spark
Increase the functionality supported by the Ballista Rust query engine (and DataFusion)
Increase the functionality supported by the Ballista JVM query engine
Implement a distributed query planner and scheduler

As I work through these items, I will continue to expand the content in my book How Query Engines Work, which is an introductory guide to the topic. The book explains concepts such as logical query plans versus physical query plans, as well as the query planning, optimization, and execution phases of a query engine. The book is published on the Leanpub platform, which provides free updates as more content becomes available.