Refactoring Apache Arrow to use traits and generics

May 04, 2018

I am currently working on a refactor of the Rust implementation of Apache Arrow to change the way that arrays are represented. This is a relatively large change even though this is a tiny codebase so far and I thought it would be good to write up this blog post to explain why I think this is needed. I think this information will also be interesting for any Rust developer who is struggling with making the right choice between (or using the right combination of) enums, structs, generics and traits. I was inspired to write this up after reading this blog post that was posted to Reddit just a few days ago.

First, a little background about the data structures required in Apache Arrow (I will be skipping over some details here and oversimplifying Arrow for the purpose of this discussion).

The main data structure is an Array to represent a column of data, or a column within a nested type. Generally, all data in the array must be contained in a contiguous region of memory to allow zero-copy sharing with other processes as well for other efficiencies around vectorized processing.

There are multiple types of data that can be contained in an Array, but for this blog post, I’m just going to consider the following:

  • Buffer<T> - A simple list of primitive values, similar to Vec<T>
  • List<T> - A list of slices of a primitive value, similar to Vec<&[T]>
  • Struct - A nested type made up of child arrays, which can be of any of these types

Regardless of how we choose to model the Array type, there are some generic types that we need to represent the underlying primitive collections.

struct Buffer<T> where T: PrimitiveType {
	len: usize,
	data: *const T // memory aligned at 64 byte boundary
}

struct List<T> where T: PrimitiveType {
	offsets: Buffer<i32>,
	data: Buffer<T>
}

Enums

Currently, Arrow uses the Rust enum type to represent ArrayData. Here’s an example showing a subset of supported types.

enum ArrayData {
    Int8(Buffer<i8>),
    Int16(Buffer<i16>),
    Int32(Buffer<i32>),
    Int64(Buffer<i64>),
    Float32(Buffer<f32>),
    Float64(Buffer<f64>),
    ListInt8(List<i8>),
    ListInt16(List<i16>),
    ListInt32(List<i32>),
    ListInt64(List<i64>),
    Struct(Vec<Rc<Array>>)
}

struct Array {
    len: i32,
    bitmap: Bitmap,
    data: Rc<ArrayData>
}

One nice thing about this approach is that pattern matching makes it easy to deal with types dynamically at runtime. For example:

match array.data() {
    &ArrayData::Int8(ref list) => cast_from_to!(i8, data_type, list),
    &ArrayData::Int16(ref list) => cast_from_to!(i16, data_type, list),
    &ArrayData::Int32(ref list) => cast_from_to!(i32, data_type, list),
    &ArrayData::Int64(ref list) => cast_from_to!(i64, data_type, list),
    &ArrayData::Float32(ref list) => cast_from_to!(f32, data_type, list),
    &ArrayData::Float64(ref list) => cast_from_to!(f64, data_type, list),

However, the trade off is that because of the lack of generics, it is often necessary to have repeated blocks of pattern matching code with a lot of redundant code. For example to add two arrays of the same type:

fn add(a: &ArrayData, b: &ArrayData) -> Rc<ArrayData> {
    match (a,b) {
        (ArrayData::Int32(ref aa), ArrayData::Int32(ref bb)) => {
            let mut builder: Builder<i32> = Builder::new();
            for i in 0..aa.len() {
                builder.push(aa.get(i) + bb.get(i));
            }
            Rc::new(ArrayData::Int32(builder.finish()))
        }
        (ArrayData::Int64(ref aa), ArrayData::Int64(ref bb)) => {
            let mut builder: Builder<i64> = Builder::new();
            for i in 0..aa.len() {
                builder.push(aa.get(i) + bb.get(i));
            }
            Rc::new(ArrayData::Int64(builder.finish()))
        }
        ...

This repetitive type handling then leads to macros, which make the code more concise but also harder to read and maintain.

Traits, Structs, and Generics

Let’s start with the ArrayData structs to represent the Buffer<T> and List<T> types. It makes sense to have specific structs that also use generic types.

pub struct BufferArrayData<T: PrimitiveType> {
    len: usize,
    bitmap: Bitmap,
    data: Buffer<T>
}

pub struct ListArrayData<T: PrimitiveType> {
    len: usize,
    bitmap: Bitmap,
    data: List<T>
}

Because structs can contain mixed types, we can’t use generics. We now need to introduce an ArrayData trait so that the struct type can refer to other types using trait objects.

pub trait ArrayData {
    fn as_any(&self) -> &Any;
}

pub struct StructArrayData {
    len: i32,
    bitmap: Bitmap,
    data: Vec<Rc<ArrayData>>
}

The ArrayData trait has an as_any() method. This is required so that we can use downcasting to get a specfic ArrayData implementation. For example, if we know from meta data that we are dealing with an array with the type List<f64> then we can downcast as follows.

let data = array.data().as_any().downcast_ref::<ListArrayData<f64>>().unwrap();

This is slightly less convenient that pattern matching on an enumeration since we now always need to have the meta data about the arrays available.

However, revisiting our example from the previous section where we want to add two arrays of the same primitive type, we can now implement an add() function directly on the BufferArrayData<T> type just for the subset of T that supports the Add operation.

impl<T> BufferArrayData<T> where T: PrimitiveType + Add<Output=T> {

    pub fn add(&self, other: &BufferArrayData<T>) -> BufferArrayData<T> {
        let mut builder: Builder<T> = Builder::new();
        for i in 0..self.len {
            builder.push(self.data.get(i) + other.data.get(i));
        }
        BufferArrayData::from(builder.finish())
    }
}

This means that if you have two instances of BufferArrayData<T> where T is the same type, you can simply add them together using a single function call. I see this as a huge advantage over the enum approach.

However, we still have to pattern match for dynamic behavior at runtime. We can’t pattern match on the BufferArrayData itself but instead have to pattern match on the separate type metadata.

fn add(a: &ArrayData, a_type: DataType, b: &ArrayData, b_type: DataType) -> Rc<ArrayData> {
    match (a_type, b_type) {
        (DataType::Int32, DataType::Int32) => {
            let a_data = a.as_any().downcast_ref::<BufferArrayData<i32>>().unwrap();
            let b_data = b.as_any().downcast_ref::<BufferArrayData<i32>>().unwrap();
            Rc::new(a_data.add(&b_data))
        }
        (DataType::Int64, DataType::Int64) => {
            let a_data = a.as_any().downcast_ref::<BufferArrayData<i64>>().unwrap();
            let b_data = b.as_any().downcast_ref::<BufferArrayData<i64>>().unwrap();
            Rc::new(a_data.add(&b_data))
        }

This is less verbose than the enum approach, but we still need to employ macros to make this easier to work with.

macro_rules! as_buffer {
    ($ARRAY:ident, $TY:ty) => {
        $ARRAY.as_any().downcast_ref::<BufferArrayData<$TY>>().unwrap();
    }
}

fn add(a: &ArrayData, a_type: DataType, b: &ArrayData, b_type: DataType) -> Rc<ArrayData> {
    match (a_type, b_type) {
        (DataType::Int32, DataType::Int32) => Rc::new(as_buffer!(a, i32).add(as_buffer!(b, i32))),
        (DataType::Int64, DataType::Int64) => Rc::new(as_buffer!(a, i64).add(as_buffer!(b, i64))),
        ...

Conclusions?

My gut feel says that the generics approach is the way to go. In cases where code is dealing with static types then static dispatch can be used. However, dynamic runtime code needs to make use of pattern matching and downcasting, and therefore will still rely on macros.

I’m not sure I love the solution but it seems like the right way to go. I’d love to hear opinions from more experienced Rust developers.


Want to learn more about query engines? Check out my book "How Query Engines Work".