July 02, 2018
I’m pleased to announce that DataFusion 0.3.0 has now been released. There have been some changes in the project goals since the last release so I’d like to take this opportunity to re-introduce DataFusion and talk about the revised roadmap.
What is DataFusion (today)?
DataFusion is a SQL query engine implemented in Rust that uses Apache Arrow for its memory model.
What data sources are supported?
DataFusion is designed to be data source agnostic, making it easy to integrate your own data sources. In this respect, DataFusion is inspired by Apache Calcite in the Java world, although the design of the query planner owes more to Apache Spark.
DataFusion currently supports the following data sources:
- Apache Parquet
- Newline-delimited JSON (ndjson)
How can I use DataFusion?
You can add DataFusion as a dependency in your Rust project and make use of the SQL Parser, Query Planner, Query Optimizer, and Query Execution components by following the DataFusion Crate Documentation. Examples can be found here.
[dependencies]
datafusion-rs = "0.3.0"
Alternatively, you can use the Docker image to run a command-line SQL console and query local data sources. For more details, including sample queries, see the Getting Started with Docker guide.
What were the goals for the 0.3.0 release?
The main goal of this release was to provide something that could run a subset of real-world queries, such as simple aggregate queries. The focus was on improving the quality of query planning and execution, for example by adding automatic type coercion. Code coverage has also been increased to around 80%, which is an acceptable level for a 0.3.0 release.
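To make "automatic type coercion" concrete, here is a minimal sketch of the idea: when a binary expression mixes two column types, the planner picks a common type that both sides can be cast to. The `DataType` enum and the `coerce` function below are hypothetical and simplified for illustration. They are not DataFusion's actual types or rules.

```rust
/// Hypothetical, simplified set of column types (not DataFusion's real type enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

/// Returns the common type that both operands of a binary expression can be
/// cast to, or `None` when this sketch defines no implicit cast for the pair.
fn coerce(left: DataType, right: DataType) -> Option<DataType> {
    use DataType::*;
    match (left, right) {
        // Identical types need no cast.
        (a, b) if a == b => Some(a),
        // Assumption: strings are never implicitly cast to or from numbers.
        (Utf8, _) | (_, Utf8) => None,
        // Anything numeric combined with a 64-bit float widens to Float64.
        (Float64, _) | (_, Float64) => Some(Float64),
        // Int64 does not fit losslessly in Float32, so widen to Float64.
        (Int64, Float32) | (Float32, Int64) => Some(Float64),
        // Int32 combined with Float32 becomes Float32.
        (Float32, _) | (_, Float32) => Some(Float32),
        // Int32 combined with Int64 widens to Int64.
        (Int64, _) | (_, Int64) => Some(Int64),
        _ => None,
    }
}
```

With rules like these, an expression comparing an `Int32` column to an `Int64` literal would have the `Int32` side cast to `Int64` during planning, so the execution code only has to handle operands of matching types.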
What is planned for the future of DataFusion?
The roadmap is pretty fluid at this point in the project and is easily influenced if people want to contribute in a particular area, but my current thoughts for the next few major releases are:
0.4.0: Add JOIN support (probably starting with simple equi joins) and ORDER BY. This would increase the range of queries that can be supported.
0.5.0: Introduce partitioning and parallel query execution (still single process but using multiple threads).
0.6.0: Add distributed execution so that queries can be run on a cluster. This will depend on the Arrow Rust implementation supporting IPC, which in turn depends on the Rust version of Google FlatBuffers being ready, so I am hoping things fall into place by the time I get to work on this.
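As a rough picture of what the partitioned, multi-threaded execution planned for 0.5.0 could look like, the sketch below computes a `SUM ... GROUP BY` over several partitions, one thread per partition, and then merges the partial results. It uses only the Rust standard library. The function names and the row representation are assumptions made for this example and do not reflect DataFusion's design.

```rust
use std::collections::HashMap;
use std::thread;

/// Sums `value` per `key` within a single partition.
/// This is the unit of work handed to one thread.
fn partial_sum(rows: &[(String, i64)]) -> HashMap<String, i64> {
    let mut acc = HashMap::new();
    for (key, value) in rows {
        *acc.entry(key.clone()).or_insert(0) += *value;
    }
    acc
}

/// Runs `partial_sum` on each partition in its own thread,
/// then merges the per-partition results into one map.
fn parallel_sum(partitions: Vec<Vec<(String, i64)>>) -> HashMap<String, i64> {
    // Spawn one worker per partition; each worker owns its partition.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|partition| thread::spawn(move || partial_sum(&partition)))
        .collect();

    // Merge step: combine partial sums that share the same key.
    let mut total = HashMap::new();
    for handle in handles {
        let partial = handle.join().expect("worker thread panicked");
        for (key, sum) in partial {
            *total.entry(key).or_insert(0) += sum;
        }
    }
    total
}
```

The same split into a per-partition step and a merge step is what would make a later move to distributed execution plausible: the partial results would travel between processes instead of between threads.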