DataFusion 0.13.0 released as part of Apache Arrow
April 02, 2019
DataFusion is an in-memory query engine for analytical queries, implemented in Rust, that uses Apache Arrow for the memory model.
DataFusion 0.13.0 is now available on crates.io. This is the first release as part of Apache Arrow, which is why the version number has jumped from 0.6.0.
Here is a high level changelog for this release:
It is now possible to run queries against Parquet files (in addition to the existing support for CSV files).
Currently, only primitive types are supported (no lists or structs).
```rust
let mut ctx = ExecutionContext::new();
ctx.register_table(
    "aggregate_test_100",
    Rc::new(ParquetTable::try_new("aggregate_test_100.parquet")?),
);
let sql = "SELECT c1, MIN(c12), MAX(c12) \
           FROM aggregate_test_100 \
           WHERE c11 > 0.1 AND c11 < 0.9 \
           GROUP BY c1";
let result_set = ctx.sql(&sql, 1024 * 1024)?;
```
Custom Data Source Support
Now that the ExecutionContext has a generic register_table method, it is possible to register custom implementations of the TableProvider trait.
Experimental DataFrame-style API
In addition to using SQL to build a logical plan, it is now possible to use a DataFrame-style API. The trait is called Table rather than DataFrame and currently only supports simple column projection and a LIMIT clause. I am hoping to get feedback on this idea and expand it further for the next release.
I am particularly interested in adding the ability to call custom code as part of the logical plan, in a way that can be supported in distributed execution.
```rust
let t = ctx.table("aggregate_test_100")?;
let example = t
    .select_columns(vec!["c1", "c2", "c11"])?
    .limit(10)?;
let plan = example.to_logical_plan();
ctx.execute(&plan, 1024 * 1024)?;
```
This release also sees the start of a real query optimizer. Two rules are currently implemented:
- Projection push-down (only load necessary columns from disk)
- Type coercion
Fixing Tech Debt
- Almost all uses of unimplemented! have now been removed from the codebase, and methods have been updated to return Result instead
- Rustdocs have been improved
My goal for the next release (approximately two months from now) is to make DataFusion truly usable for some subset of real-world use cases.
The most important feature for me is the ability to run aggregate queries on Parquet files, leveraging multiple cores. This will involve creating a physical query plan and implementing parallel query execution using threads. I have already built a proof-of-concept of this, and I believe it is achievable in the timeframe.
With this in place, it will be simple to package up DataFusion in a Docker image as a command-line SQL tool, or wrap it in a REST API using Rocket to create a standalone analytics query service.
At this point I think it will be much easier for more people to try DataFusion out, without having to go through much of a learning curve, and I hope that will drive more interest in contributing to the DataFusion codebase, and to Apache Arrow in general.
Want to learn more about query engines? Check out my book "How Query Engines Work".