DataFusion 0.6.0

January 21, 2019

DataFusion is an in-memory query engine implemented in Rust that uses Apache Arrow for the memory model.

DataFusion 0.6.0 is now available on crates.io and is the first release to depend on an official release of the Rust implementation of Apache Arrow. Over the past couple of months I essentially started from scratch with DataFusion, because the Rust implementation of Apache Arrow had changed significantly since I contributed the original prototype (and it is much improved now, thanks to contributions from quite a few people). I had also learned from some of the mistakes I made in the original DataFusion codebase and wanted to restart the project with a cleaner architecture.

Because of this reboot, the functionality has regressed compared to earlier releases, so instead of publishing a changelog, here is a summary of the current functionality. DataFusion is still very much a proof of concept and not yet suitable for real-world use.

DataFusion currently supports running simple SQL queries against CSV files. Previous versions of DataFusion also supported Parquet. The Rust Parquet project was recently donated to Apache Arrow and there is ongoing integration work that I want to leverage in DataFusion. I expect Parquet support will be added in the next month or so.

I have recently offered to donate DataFusion to the Apache Arrow project to become the default Rust-native query engine in Arrow. I am expecting a vote on this donation on the developer mailing list in the next week or two.

Current Features

Query plan

  • Projection (SELECT)
  • Selection (WHERE)
  • Aggregate Functions (MIN, MAX, SUM)
  • GROUP BY
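
To make the aggregate operators above concrete, here is a minimal, standard-library-only sketch of what a query such as SELECT key, MIN(value), MAX(value), SUM(value) FROM t GROUP BY key computes. The Accumulator type and the group_min_max_sum function are hypothetical names invented for this illustration; they are not DataFusion's API and say nothing about how DataFusion implements aggregation internally.

```rust
use std::collections::BTreeMap;

/// Running MIN, MAX and SUM for one group (illustrative type, not from DataFusion).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Accumulator {
    pub min: f64,
    pub max: f64,
    pub sum: f64,
}

/// Groups `(key, value)` rows by key and computes MIN, MAX and SUM per group.
/// A BTreeMap is used only so that the groups come back in sorted key order.
pub fn group_min_max_sum(rows: &[(&str, f64)]) -> BTreeMap<String, Accumulator> {
    let mut groups: BTreeMap<String, Accumulator> = BTreeMap::new();
    for &(key, value) in rows {
        groups
            .entry(key.to_string())
            // Existing group: fold the new value into the running aggregates.
            .and_modify(|acc| {
                acc.min = acc.min.min(value);
                acc.max = acc.max.max(value);
                acc.sum += value;
            })
            // First row of a group: all three aggregates start at that value.
            .or_insert(Accumulator { min: value, max: value, sum: value });
    }
    groups
}
```

Calling group_min_max_sum(&[("a", 1.0), ("b", 5.0), ("a", 3.0)]) yields two groups, one per distinct key, each holding its own running aggregates.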

Expressions

  • Identifiers (column names)
  • Literal values (primitive types only e.g. int, float, string, bool)
  • CAST(expr AS type)
  • Binary expressions (>, >=, =, <=, <, +, -, *, /, %, AND, OR)
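
As a rough illustration of how the expression kinds above compose, the following sketch models a small subset (identifiers, literals, a cast and three binary operators) as a tree and evaluates it against a single row. Every name here (Value, Expr, eval, lat_between_51_and_53) is hypothetical and chosen for this example; DataFusion's own expression types are not shown and may look quite different.

```rust
use std::collections::HashMap;

/// A value produced by evaluating an expression (toy subset of the literal types).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Float(f64),
    Bool(bool),
}

/// Toy expression tree; illustrative only, not DataFusion's own types.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(Value),
    CastToFloat(Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Evaluates `expr` against one row. Returns None for an unknown column
/// or a type mismatch instead of panicking.
pub fn eval(expr: &Expr, row: &HashMap<String, Value>) -> Option<Value> {
    match expr {
        Expr::Column(name) => row.get(name).cloned(),
        Expr::Literal(v) => Some(v.clone()),
        Expr::CastToFloat(inner) => match eval(inner, row)? {
            Value::Int(i) => Some(Value::Float(i as f64)),
            Value::Float(f) => Some(Value::Float(f)),
            Value::Bool(_) => None,
        },
        Expr::Gt(l, r) => match (eval(l, row)?, eval(r, row)?) {
            (Value::Float(a), Value::Float(b)) => Some(Value::Bool(a > b)),
            (Value::Int(a), Value::Int(b)) => Some(Value::Bool(a > b)),
            _ => None,
        },
        Expr::Lt(l, r) => match (eval(l, row)?, eval(r, row)?) {
            (Value::Float(a), Value::Float(b)) => Some(Value::Bool(a < b)),
            (Value::Int(a), Value::Int(b)) => Some(Value::Bool(a < b)),
            _ => None,
        },
        Expr::And(l, r) => match (eval(l, row)?, eval(r, row)?) {
            (Value::Bool(a), Value::Bool(b)) => Some(Value::Bool(a && b)),
            _ => None,
        },
    }
}

/// Builds the predicate `lat > 51.0 AND lat < CAST(53 AS float)`.
pub fn lat_between_51_and_53() -> Expr {
    let lat = || Box::new(Expr::Column("lat".to_string()));
    Expr::And(
        Box::new(Expr::Gt(lat(), Box::new(Expr::Literal(Value::Float(51.0))))),
        Box::new(Expr::Lt(
            lat(),
            Box::new(Expr::CastToFloat(Box::new(Expr::Literal(Value::Int(53))))),
        )),
    )
}
```

The explicit cast around the integer literal is a simplification of this sketch: the toy evaluator only compares values of the same type, so mixing a float column with an integer literal needs a cast first.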

How to use DataFusion

To use DataFusion as a crate dependency, add the following to your Cargo.toml:

[dependencies]
datafusion = "0.6.0"

Here is a brief example for running a SQL query against a CSV file. See the examples directory for full examples.

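// Module paths below are assumed from the layout of the 0.6.0 examples;
// adjust them if your version of the crates differs.
extern crate arrow;
extern crate datafusion;

use std::cell::RefCell;
use std::rc::Rc;
use std::sync::Arc;

use arrow::array::{BinaryArray, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use datafusion::execution::context::ExecutionContext;
use datafusion::execution::datasource::CsvDataSource;

fn main() {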
    // create local execution context
    let mut ctx = ExecutionContext::new();

    // define schema for data source (csv file)
    let schema = Arc::new(Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("lat", DataType::Float64, false),
        Field::new("lng", DataType::Float64, false),
    ]));

    // register csv file with the execution context
    let csv_datasource = CsvDataSource::new("test/data/uk_cities.csv", schema.clone(), 1024);
    ctx.register_datasource("cities", Rc::new(RefCell::new(csv_datasource)));

    // simple projection and selection
    let sql = "SELECT city, lat, lng FROM cities WHERE lat > 51.0 AND lat < 53";

    // execute the query
    let relation = ctx.sql(&sql).unwrap();

    // display the relation
    let mut results = relation.borrow_mut();

    while let Some(batch) = results.next().unwrap() {

        println!(
            "RecordBatch has {} rows and {} columns",
            batch.num_rows(),
            batch.num_columns()
        );

        let city = batch
            .column(0)
            .as_any()
            .downcast_ref::<BinaryArray>()
            .unwrap();

        let lat = batch
            .column(1)
            .as_any()
            .downcast_ref::<Float64Array>()
            .unwrap();

        let lng = batch
            .column(2)
            .as_any()
            .downcast_ref::<Float64Array>()
            .unwrap();

        for i in 0..batch.num_rows() {
            let city_name: String = String::from_utf8(city.get_value(i).to_vec()).unwrap();

            println!(
                "City: {}, Latitude: {}, Longitude: {}",
                city_name,
                lat.value(i),
                lng.value(i),
            );
        }
    }
}
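
The loop above copies each city name into a new String and panics if the bytes are not valid UTF-8. As a small sketch of an alternative, assuming (as in the example) that the column hands back a byte slice per row, the bytes can be borrowed as text without copying. The bytes_as_str helper is a hypothetical name for this illustration and uses only the standard library.

```rust
/// Hypothetical helper: view raw column bytes as text without copying.
/// Returns None instead of panicking when the bytes are not valid UTF-8.
pub fn bytes_as_str(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

Inside the loop this could be used as bytes_as_str(city.get_value(i)), with the caller deciding how to handle a None result, for example by skipping the row.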

Conclusion

With this reboot of DataFusion and a more realistic roadmap, this is a great time for new contributors to get involved. I’m excited to be able to offer DataFusion to the Apache Arrow project, and I hope that this will help advance Rust in the fields of data science and so-called Big Data.
