There are a few interesting things going on with DataFusion that I wanted to share.
Support for Apache Parquet
Thanks to the great work happening with the parquet-rs crate, I have been able to add preliminary Parquet support to DataFusion. It is now possible to open local Parquet files as DataFrames and run SQL against them. There are some examples in the repo.
Currently, support is limited to flat parquet files (no nested types) and also limited to a subset of data types (INT32, INT64, FLOAT32, FLOAT64, and UTF8) but with that in place it should be easy to add other types and this seems like a good place for others to contribute.
Also, Parquet support is read-only today.
Introducing Quiver, a native file format for Apache Arrow
In order to get to a distributed version of DataFusion, it is necessary to be able to stream data between nodes and persist data to disk temporarily as needed, so we need a very efficient file format for doing this. I have started work on Quiver, which is a simple columnar file format that will be highly optimized for storing Arrow arrays.
As usual, I have more enthusiam than time, so the current Quiver implementation is very basic so far, but it is possible now to read and write Arrow datasets through the new
QuiverWriter structs. I have not yet integrated this with the DataFusion context so it is not possible to run SQL against these formats just yet.
Once Quiver is a little further along and I have a documented specification for it, I intend to publish this as a crate, separately from DataFusion.
[EDIT: Since posting this, I learned that Arrow IPC has a file format specified already using Flatbuffers. I will use Quiver for now but once the Rust implementation of Flatbuffers is available I will build a Rust implementation of Arrow IPC and use that instead.]
I’m realizing that for this project to succeed I have to do a much better job as a project manager and evangelist and put together a detailed roadmap with well written GitHub issues that contributors can pick up. I am now going to be focussing more on that.
There is now a detailed roadmap in GitHub and I will be adding more issues over the next few days, but here is a high level summary of what I am thinking:
The goal for 0.3.0 is to be able to run some useful data processing jobs on a single node.
Specifically, I woud like to see support for simple aggregates (MIN, MAX, SUM, AVG) and GROUP BY with an in-memory accumulator so that it is possible to run aggregate queries on large datasets as long as the aggregated data can fit in memory.
This would make DataFusion useful for a small subset of real world problems.
I also want to focus on quality in 0.3.0 and get code coverage in place and improve quality of unit tests.
The goal for 0.4.0 is to be able to run distributed queries on a cluster of worker nodes. Here are some of the things that need to be built:
- Implement data partitioning
- Implement data streaming between nodes (using Quiver)
- Implement basic distributed query planning and execution
- Implement mechanism for distributing UDFs (probably will require static compilation initially)
- Docker container for running worker nodes
- Web UI for worker nodes (using React)
With simple distributed queries in place, the goal for 0.5.0 is to add support for a wider range of SQL capabilities so that DataFusion can be used for a larger subset of real-world problems.
- Implement wider range of SQL capabilities (ORDER BY, JOIN, UNION)
- Implement basic query optimizations (e.g. push predicate through join)
I gave a short talk on Rust, Big Data, and DataFusion at the Boulder/Denver Rust Meetup a few days ago. Here are the slides:
I was also excited to see that someone is giving a talk on DataFusion at an upcoming meetup in Poland!.