DataFusion is a Rust-native in-memory query engine, which is part of the Apache Arrow project.
This blog post provides a summary of some of the highlights in DataFusion 0.17.0. The full release notes for Apache Arrow 0.17.0 can be found here.
There is a new LogicalPlanBuilder that provides a more intuitive way to build logical plans. There is also a new UnresolvedColumn expression that allows plans to refer to columns by name rather than by index. The query optimizer contains a new ResolveColumnRule that resolves these columns and replaces them with indices.
Here is some sample code for building a simple aggregate query.
let plan = LogicalPlanBuilder::scan("default", "employee.csv", &employee_schema(), None)?
    .aggregate(
        vec![col("state")],                             // group by
        vec![sum(col("salary")).alias("total_salary")], // aggregates
    )?
    .project(vec![col("state"), col("total_salary")])?
    .build()?;
The SQL query planner and the query optimizer rules have been refactored to use this new builder, resulting in code that is more consistent and concise.
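To make the name-to-index resolution step more concrete, here is a minimal, dependency-free sketch of the idea. The Expr enum, the resolve function, and the three-column schema are hypothetical stand-ins written for this illustration. They are not DataFusion's actual types or the real ResolveColumnRule implementation.

```rust
// Toy expression tree. Names are illustrative only, not DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Column referenced by name, as written by the user.
    UnresolvedColumn(String),
    /// Column referenced by its position in the input schema.
    Column(usize),
    /// An aggregate over another expression.
    Sum(Box<Expr>),
    /// An expression with an output name.
    Alias(Box<Expr>, String),
}

/// Walk the expression and replace every named column with its index in `schema`.
fn resolve(expr: &Expr, schema: &[&str]) -> Result<Expr, String> {
    match expr {
        Expr::UnresolvedColumn(name) => schema
            .iter()
            .position(|field| *field == name.as_str())
            .map(Expr::Column)
            .ok_or_else(|| format!("no column named '{}'", name)),
        Expr::Column(i) => Ok(Expr::Column(*i)),
        Expr::Sum(inner) => Ok(Expr::Sum(Box::new(resolve(inner, schema)?))),
        Expr::Alias(inner, alias) => Ok(Expr::Alias(
            Box::new(resolve(inner, schema)?),
            alias.clone(),
        )),
    }
}

fn main() {
    // Assumed schema for the example; the real employee schema may differ.
    let schema = ["id", "state", "salary"];
    let expr = Expr::Alias(
        Box::new(Expr::Sum(Box::new(Expr::UnresolvedColumn(
            "salary".to_string(),
        )))),
        "total_salary".to_string(),
    );
    let resolved = resolve(&expr, &schema).unwrap();
    println!("{:?}", resolved);
}
```

Keeping names in the plan until an optimizer pass runs means plan-building code does not need to know column positions up front, and an unknown column name surfaces as an error at resolution time.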
User-Defined Functions (UDFs)
DataFusion now supports scalar UDFs both from logical plans and from SQL. Scalar UDFs can accept zero or more arrays and produce one array as output.
pub type ScalarUdf = fn(input: &Vec<ArrayRef>) -> Result<ArrayRef>;
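As a rough illustration of what a function with this shape might look like, here is a small sketch. To keep it runnable without the arrow crate, ArrayRef and Result are replaced by simplified stand-ins (a reference-counted Vec<f64> and a String error). The sqrt_udf function is hypothetical and is not taken from the DataFusion source.

```rust
use std::sync::Arc;

// Stand-ins so the sketch compiles without external crates (assumption:
// in DataFusion, ArrayRef is a reference-counted Arrow array and Result
// uses the crate's own error type).
type ArrayRef = Arc<Vec<f64>>;
type Result<T> = std::result::Result<T, String>;

// Same shape as the signature quoted above.
type ScalarUdf = fn(input: &Vec<ArrayRef>) -> Result<ArrayRef>;

/// Hypothetical UDF: element-wise square root of its single argument.
fn sqrt_udf(input: &Vec<ArrayRef>) -> Result<ArrayRef> {
    if input.len() != 1 {
        return Err(format!("sqrt expects 1 argument, got {}", input.len()));
    }
    // Apply the standard library's f64::sqrt to every value in the column.
    let values: Vec<f64> = input[0].iter().map(|v| v.sqrt()).collect();
    Ok(Arc::new(values))
}

fn main() {
    // A plain function coerces to the function-pointer type.
    let udf: ScalarUdf = sqrt_udf;
    let column: ArrayRef = Arc::new(vec![1.0, 4.0, 9.0]);
    let result = udf(&vec![column]).unwrap();
    println!("{:?}", result);
}
```

Because the type is a plain function pointer, a UDF in this style operates on whole columns at a time and cannot capture state the way a closure could.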
UDFs can be registered via
Many math functions are now available and are registered using the new UDF support. The functions supported are
log10. The type coercion rules in the query optimizer automatically add casts to
f64 for inputs to these functions and the implementation for each function is simply a call to the equivalent Rust function on the
It is now possible to use the SQL wildcard operator, so SELECT * and SELECT COUNT(*) are supported. The SQL query planner has been significantly refactored and is now much easier to contribute to. The validation for aggregate queries has also been improved.
Two new examples are provided, demonstrating how Apache Arrow Flight can be used in Rust. The flight_client and flight_server examples demonstrate sending a SQL statement from the client to the server, and then streaming results back to the client.
A packaging issue (ARROW-8536) means that DataFusion and other Arrow crates cannot be used as dependencies without a hack to make the Flight.proto file available at a specific location on the local file system. A PR is already open to resolve this, and the fix will ship either in a 0.17.1 point release or in the next major version.
Join the discussion
Here is the /r/rust conversation for this blog post.
Want to learn more about query engines? Check out my book "How Query Engines Work".