Rust Database Connectivity (RDBC)

January 10, 2020

Many years ago I wrote a commercial product that could import a database schema and then generate source code based on the schema. There were many different use cases for this product and it could be used to generate simple Data Access Object (DAO) code or even to generate fully working (although very crude) web applications for data entry. Believe it or not, some companies have schemas with more than 500 tables, so tools like this can dramatically reduce development costs. This type of product isn’t very sexy or modern but it generated decent revenue for a side project at the time, and there are still valid use cases today for this type of tool.

When I started prototyping a generic database tool in Rust a few weeks ago, I discovered that there isn’t a standard way to access databases in Rust and that I would have to write separate code for every single database, and I’m not just talking about MySQL and Postgres (two wouldn’t be so bad), I’m also thinking about distributed query engines such as PrestoDB, Hive, Impala, and Drill, to name just a few.

This is a problem that has already been solved in other ecosystems outside of Rust. In 1992, Microsoft created the Open Database Connectivity (ODBC) standard, which is an application programming interface for accessing relational databases. In 1997, Sun Microsystems released the Java Database Connectivity (JDBC) specification. I have used ODBC and JDBC drivers for much of my career and have implemented production JDBC drivers, and I just assumed that something like this would exist in Rust as well.

Rather than writing custom integrations for my prototype, I switched gears and started prototyping a Rust version of ODBC/JDBC instead, which I unimaginatively named Rust Database Connectivity, or RDBC for short.

RDBC is specifically for the use case where a developer needs the ability to execute arbitrary SQL against any database and then be able to fetch the results. Good examples of this use case are generic SQL tools, business intelligence products, and distributed query engines. If I were building a product in one of these categories, I would want my product to work with as many databases as possible, including databases that don’t exist yet. It would be important for the end user to be able to download my product and then download a database driver and expect them to work together.

This is very different from the use case where a developer is building an application from scratch that uses a single database. For that use case, I recommend using Diesel so that all data access can be expressed as idiomatic Rust code.

RDBC PoC

In order to start a discussion around creating an RDBC standard in Rust, I created a proof of concept (here) with some traits representing database connections, statements, and result sets. I created implementations (drivers) for MySQL and Postgres and I received a PR to add a driver for SQLlite as well.

The design of this PoC is blindly copied from JDBC and has quite a few design flaws:

  • The API does not support async
  • The API is not idiomatic Rust code. For example, it doesn’t use generics when accessing data.
  • It is not possible to create more than one prepared statement per connection due to some ownership issues inherited from the mysql crate (see Issue #46 for more information)
  • The API is row-based and cannot efficiently support retrieving data from columnar databases

Unfortunately I am not an expert with the Rust programming language yet, so I find myself in a situation where I’m struggling to turn this PoC into a great project that others would want to use and I would love to get some help from the community to improve this project.

Columnar support is important!

ODBC and JDBC were created quite a long time ago, so we need to be careful about blindly porting them to Rust. These standards were created specifically for relational databases at a time where databases were predominantly row-based. The API is designed around fetching one row at a time (although rows are obviously buffered in batches behind the scenes).

Today we have many columnar databases and although it is trivial to translate columnar data into rows, it is inefficient, and removes some of the benefit of using a columnar store in the first place. The short summary of the benefit of a columnar representation of data is that it supports vectorized processing and can take advantage of modern hardware features, such as SIMD and GPUs.

Although columnar databases existed before ODBC/JDBC, they didn’t really gain a lot of traction until the mid 2000’s with products like MonetDB and Vertica. Today, we have Apache Arrow, which defines a standardized memory format for columnar data and is starting to build out compute kernels and query engines around this format. Arrow is driving some standardization that allows different products to more efficiently share and process columnar data. Arrow has more recently defined the Arrow Flight protocol, which provides 20-50x better performance than ODBC/JDBC. These last two links really are worth a read to better understand why support for columnar data in RDBC is important.

The point I’m trying to get to here is that it would be very short sighted to design RDBC with just the row-based access pattern in mind (which is exactly what I did in the PoC).

Proposed RowSet Trait

Luckily, supporting both row-oriented and column-oriented data at the same time is actually pretty simple. My opinion is that the RDBC API should support fetching batches of data into row sets that implement a RowSet trait, providing both row and column-oriented access.

Because all values in a column have the same data type, we can have a type-safe trait for accessing the values.

pub trait ColumnAccessor<T> {
    fn get(&self, i: u64) -> Result<Option<T>>;
}

Rows typically contain mixed types, so here we would specify the generic type on the method rather than on the trait.

pub trait RowAccessor {
    fn get<T>(&self, i: u64) -> Result<Option<T>> where Self: Sized;
}

The RowSet trait would look something like this:

pub trait RowSet {
    /// get meta data about this row set
    fn meta_data(&self) -> Result<Box<dyn RowSetMetaData>>;
    /// Get an accessor for a row containing mixed types
    fn get_row(&self, i: u64) -> Result<Box<dyn RowAccessor>>;
    /// Get a column as a type-safe accessor
    fn get_column<T>(&self, i: u64) -> Result<Box<dyn ColumnAccessor<T>>>;
}

The data backing the RowSet will either be row-oriented or column-oriented. It will always be possible to access the data as both rows and columns but one of those methods will be more efficient than the other. This likely won’t really matter too much for most use cases, but there are use cases where this is important, so the meta-data should expose information about the native format.

pub trait RowSetMetaData {
    fn row_oriented(&self) -> bool;
    fn num_rows(&self) -> u64;
    fn num_columns(&self) -> u64;
    fn column_name(&self, i: u64) -> String;
    fn column_type(&self, i: u64) -> DataType;
}

How does this relate to ORMs

RDBC is just a low level data access and query execution API. It does not aim to replace ORMs like Diesel and although it will soon be possible to build applications using RDBC instead of Diesel, I would strongly discourage that. I have worked with many code bases that use ODBC/JDBC directly and it isn’t pretty. I have also experienced issues with more than one poorly designed ORM, but Diesel is pretty awesome.

RDBC is moving to Tokio!

Thanks to support from the tokio team, there is a an rdbc channel on the tokio discord server where we have started discussing how to asyncify the RDBC API and the github repo will shortly move to the tokio-rs organization in github.

I’m excited to start working with the Tokio team on this project, and hope to learn a lot from them about writing quality Rust code!