DataFusion 0.17.0

April 24, 2020

DataFusion is a Rust-native in-memory query engine, which is part of the Apache Arrow project.

This blog post provides a summary of some of the highlights in DataFusion 0.17.0. The full release notes for Apache Arrow 0.17.0 can be found here.

LogicalPlanBuilder

There is a new LogicalPlanBuilder that provides a more intuitive method for building logical plans. There is also a new UnresolvedColumn expression that allows plans to refer to columns by name, rather than index. The query optimizer contains a new ResolveColumnRule that resolves these columns and replaces them with indices.

Here is some sample code for building a simple aggregate query.

let plan = LogicalPlanBuilder::scan("default", "employee.csv", &employee_schema(), None)?
  .aggregate(vec![col("state")], // group by
             vec![sum(col("salary")).alias("total_salary")])?, // aggregates
  .project(vec![col("state"), col("total_salary")])?
  .build()?;

The SQL query planner and the query optimizer rules have been refactored to use this new builder, resulting in code that is more consistent and concise.

User-Defined Functions (UDFs)

DataFusion now supports scalar UDFs both from logical plans and from SQL. Scalar UDFs can accept zero or more arrays and produce one array as output.

pub type ScalarUdf = fn(input: &Vec<ArrayRef>) -> Result<ArrayRef>;

UDFs can be registered via ExecutionContext.register_udf().

Math Functions

Many math functions are now available and are registered using the new UDF support. The functions supported are sqrt, cos, sin, tan, asin, acos, atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, and log10. The type coercion rules in the query optimizer automatically add casts to f64 for inputs to these functions and the implementation for each function is simply a call to the equivalent Rust function on the f64 type.

SQL Improvements

It is now possible to use the SQL wildcard operator, so SELECT * and SELECT COUNT(*) are now supported. The SQL query planner has been significantly refactored and is now much easier to contribute to. The validation for aggregate queries has also been improved.

Flight Examples

Two new examples are provided, demonstrating how Apache Arrow Flight can be used in Rust. The flight_client and flight_server examples demonstrate sending a SQL statement from the client to the server, and then streaming results back to the client.

Known Issues

There is a packaging issue ARROW-8536 which means that DataFusion and other Arrow crates cannot be used as dependencies without a hack to make the Flight.proto file available at a specific location on the local file system. There is already a PR open to resolve this, and this will either be fixed in a 0.17.1 point release, or in the next major version.

Join the discussion

Here is the /r/rust conversation for this blog post.


Want to learn more about query engines? Check out my book "How Query Engines Work".