Recent Posts

What I Want from DataFusion in 2023

January 01, 2023

I like to take time off between Christmas and New Year’s Day to recharge and reflect on the past year, as well as to plan my personal and professional priorities for the upcoming year. Over the past week or so, I have been thinking about my involvement in the DataFusion project and how I can more effectively contribute to it in 2023.

Building a 32 core Kubernetes cluster for less than $1,000

May 15, 2021

I enjoy working with distributed systems, and I also enjoy working with embedded devices and single-board computers, so I thought it would be fun to combine these interests and build a Kubernetes cluster using some Raspberry Pi computers. This seemed like a low-cost way to have an always-on cluster that can be used to learn more about Kubernetes, and distributed computing in general.

Ballista: New approach for 2021

January 10, 2021

I wanted to share some updates on Ballista for the small group of you that have been following my progress on this project. If you are reading this blog post and are not familiar with the project, Ballista is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

Ballista Distributed Compute: One Year Later

July 26, 2020

I have been talking about distributed computing in Rust for a long time now. It has been more than two and a half years since my Rust is for Big Data blog post, where I first talked about the prototype I was working on at the time (which eventually became DataFusion and is now part of Apache Arrow).

Ballista Reboot

April 18, 2020

In July 2019, I created Ballista as a small proof-of-concept of parallel query execution across a number of Rust executors running in Kubernetes. This PoC generated a good discussion on Hacker News, which I felt demonstrated that there is interest in a platform like this. Unfortunately, it was far from usable for anything real and lacked a well-designed architecture.

How Query Engines Work

February 27, 2020

Over the past decade I’ve spent a fair bit of time either building query engines or building integrations with query engines, so I decided to write an introductory book on the subject.

Rust Database Connectivity (RDBC)

January 10, 2020

Many years ago I wrote a commercial product that could import a database schema and then generate source code based on the schema. There were many different use cases for this product and it could be used to generate simple Data Access Object (DAO) code or even to generate fully working (although very crude) web applications for data entry. Believe it or not, some companies have schemas with more than 500 tables, so tools like this can dramatically reduce development costs. This type of product isn’t very sexy or modern but it generated decent revenue for a side project at the time, and there are still valid use cases today for this type of tool.

Announcing Ballista - Distributed Compute with Rust, Apache Arrow, and Kubernetes

July 16, 2019

Eighteen months ago, I started the DataFusion project with the goal of building a distributed compute platform in Rust that could (eventually) rival Apache Spark. Unsurprisingly, this turned out to be an overly ambitious goal at the time and I fell short of achieving that. However, some very good things came out of this effort. We now have a Rust implementation of Apache Arrow with a growing community of committers, and DataFusion was donated to the Apache Arrow project as an in-memory query execution engine and is now starting to see some early adoption. I even saw the first DataFusion job listing recently, which shows that this effort is already having an impact on the industry.

DataFusion Donated to Apache Arrow

February 05, 2019

I’m excited to announce that DataFusion has now been donated to the Apache Software Foundation as a Rust-native in-memory query engine for the Apache Arrow project.

DataFusion 2019

November 04, 2018

Earlier this year I put a lot of time and energy into DataFusion with the goal of creating a platform somewhat like Apache Spark, but implemented in Rust, without all the inefficiencies of the JVM. This was quite the journey, and I learned a lot of positive things from this effort, specifically:

Refactoring Apache Arrow to use traits and generics

May 04, 2018

I am currently working on a refactor of the Rust implementation of Apache Arrow to change the way that arrays are represented. This is a relatively large change, even though the codebase is tiny so far, and I thought it would be good to write up this blog post to explain why I think it is needed. I think this information will also be interesting for any Rust developer who is struggling to make the right choice between (or use the right combination of) enums, structs, generics and traits. I was inspired to write this up after reading this blog post that was posted to Reddit just a few days ago.

DataFusion now uses Apache Arrow

April 05, 2018

I’m excited to announce that DataFusion is now using Apache Arrow for its internal memory representation of data. It was already using columnar data structures based on Vec<T>, so moving to Arrow was not that big a leap.

This Weekend in DataFusion (2/18/18)

February 18, 2018

I had limited time to work on DataFusion this weekend but have started to refactor the code base based on some feedback that I received on Reddit last week and have also been working on some benchmarks.

DataFusion update 2/11/18

February 11, 2018

Following on from my blog post Rust is for Big Data, I announced my open source distributed data processing project, DataFusion, on Reddit last week. I was probably a bit premature in announcing the project, since it was at such an early stage, but I was excited that I had some simple queries working with quite decent performance (roughly 2x the performance of Apache Spark) and wanted to start generating some interest in the project.

Rust is for Big Data (#rust2018)

January 28, 2018

This blog post isn’t so much about what I want from the Rust language in 2018, but more about where I see an opportunity for Rust to gain more widespread use in 2018.

My Goals for 2018

January 25, 2018

Instead of making new year’s resolutions, I’ve set myself some fairly specific goals for 2018 relating to family, health, finances, hobbies, and career. Many of these goals are broken down into monthly and quarterly objectives and some even have specific objectives and measurable results in true OKR style.