DataFusion now uses Apache Arrow

April 5, 2018

I’m excited to announce that DataFusion is now using Apache Arrow for its internal memory representation of data. It was already using columnar data structures based on Vec<T> and moving to Arrow was not that big a leap.

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, Ruby, Go, and now Rust.

The Rust implementation of Arrow is now in the main repo at https://github.com/apache/arrow-rs/tree/master/rust and currently supports a subset of data types including booleans, integers, floating point numerics, strings, and structs (nested types). Here is the Arrow representation of Arrays.

pub struct Array {
    len: i32,
    null_count: i32,
    validity_bitmap: Option<Bitmap>,
    data: ArrayData,
}

pub enum ArrayData {
    Boolean(Buffer<bool>),
    Float32(Buffer<f32>),
    Float64(Buffer<f64>),
    Int8(Buffer<i8>),
    Int16(Buffer<i16>),
    Int32(Buffer<i32>),
    Int64(Buffer<i64>),
    UInt8(Buffer<u8>),
    UInt16(Buffer<u16>),
    UInt32(Buffer<u32>),
    UInt64(Buffer<u64>),
    Utf8(List<u8>),
    Struct(Vec<Rc<Array>>),
}

The primitive array types use a Buffer<T> type, which is similar to a Vec<T> in that it stores fixed-width primitives in a contiguous region of memory but it also ensures that the data starts at a 64-byte aligned memory address as required by the Arrow specification.

There is much work ahead of course to make the Rust implementation as complete as the other languages, especially around support for IPC mechanisms so that Rust-based systems can start to co-exist with products such as Apache Spark / Parquet / Kudu etc.

I think this is another small but important step in helping Rust become widely adopted in data science and distributed computing.

DataFusion is still a very early stage proof-of-concept project but I think adopting Arrow as the native memory provides a great foundation to build on.