What I Want from DataFusion in 2023

January 01, 2023

I like to take time off between Christmas and New Year’s Day to recharge and reflect on the past year, as well as to plan my personal and professional priorities for the upcoming year. Over the past week or so, I have been thinking about my involvement in the DataFusion project and how I can more effectively contribute to it in 2023.

Building a 32 core Kubernetes cluster for less than $1,000

May 15, 2021

I enjoy working with distributed systems, and I also enjoy working with embedded devices and single-board computers, so I thought it would be fun to combine these interests and build a Kubernetes cluster using some Raspberry Pi computers. This seemed like a low-cost way to have an always-on cluster that can be used to learn more about Kubernetes, and distributed computing in general.

Ballista: New approach for 2021

January 10, 2021

I wanted to share some updates on Ballista for the small group of you that have been following my progress on this project. If you are reading this blog post and are not familiar with the project, Ballista is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

Ballista Distributed Compute: One Year Later

July 26, 2020

I have been talking about distributed computing in Rust for a long time now. It is more than two and a half years since my Rust is for Big Data blog post where I first talked about the prototype I was working on at the time (which eventually became DataFusion and is now part of Apache Arrow).

Why does musl make my Rust code so slow?

May 05, 2020

TL;DR: Stop using musl and alpine for smaller docker images!

DataFusion 0.17.0

April 24, 2020

DataFusion is a Rust-native in-memory query engine, which is part of the Apache Arrow project.

Ballista Reboot

April 18, 2020

In July 2019, I created Ballista as a small proof-of-concept of parallel query execution across a number of Rust executors running in Kubernetes. This PoC generated a good discussion on Hacker News, which I felt demonstrated that there is interest in a platform like this. Unfortunately, it was far from usable for anything real and lacked a well-designed architecture.

How Query Engines Work

February 27, 2020

Over the past decade I’ve spent a fair bit of time either building query engines or building integrations with query engines, so I decided to write an introductory book on the subject.

Rust Database Connectivity (RDBC)

January 10, 2020

Many years ago I wrote a commercial product that could import a database schema and then generate source code based on the schema. There were many different use cases for this product and it could be used to generate simple Data Access Object (DAO) code or even to generate fully working (although very crude) web applications for data entry. Believe it or not, some companies have schemas with more than 500 tables, so tools like this can dramatically reduce development costs. This type of product isn’t very sexy or modern but it generated decent revenue for a side project at the time, and there are still valid use cases today for this type of tool.

Rust 2020 - Rust needs to be boring

November 07, 2019

This blog post is a response to the call for blogs from the Rust Core Team.

New Benchmarks Page

October 20, 2019

For the latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation

DataFusion 0.15.0 Release Notes

September 22, 2019

DataFusion is an extensible query execution framework implemented in Rust that uses the Apache Arrow memory model, and is part of the Apache Arrow project.

EKS security patches cause Apache Spark jobs to fail with permissions error

August 31, 2019

Over the past couple of days at work, we started noticing Spark 2.3 and 2.4 jobs failing with a permissions error across multiple EKS clusters. Here is an example stack trace:

Announcing Ballista - Distributed Compute with Rust, Apache Arrow, and Kubernetes

July 16, 2019

Eighteen months ago, I started the DataFusion project with the goal of building a distributed compute platform in Rust that could (eventually) rival Apache Spark. Unsurprisingly, this turned out to be an overly ambitious goal at the time and I fell short of achieving that. However, some very good things came out of this effort. We now have a Rust implementation of Apache Arrow with a growing community of committers, and DataFusion was donated to the Apache Arrow project as an in-memory query execution engine and is now starting to see some early adoption. I even saw the first DataFusion job listing recently, which shows that this effort is already having an impact on the industry.

DataFusion 0.13.0 Benchmarks

April 28, 2019

Latest Benchmarks

Parallel Query Execution in Rust

April 20, 2019

I’m working on a design to create a physical execution plan for DataFusion (part of Apache Arrow) that will support parallel query execution across multiple threads.

DataFusion 0.13.0 released as part of Apache Arrow

April 02, 2019

DataFusion is an in-memory query engine for analytical queries, implemented in Rust, that uses Apache Arrow for the memory model.

DataFusion Donated to Apache Arrow

February 05, 2019

I’m excited to announce that DataFusion has now been donated to the Apache Software Foundation as a Rust-native in-memory query engine for the Apache Arrow project.

DataFusion 0.6.0

January 21, 2019

DataFusion is an in-memory query engine implemented in Rust that uses Apache Arrow for the memory model.

DataFusion 2019

November 04, 2018

Earlier this year I put a lot of time and energy into DataFusion with the goal of creating a platform somewhat like Apache Spark, but implemented in Rust, without all the inefficiencies of the JVM. This was quite the journey, and I learned a lot of positive things from this effort, specifically:

Hosting Jekyll web sites with Amazon Lightsail and Let's Encrypt SSL

May 19, 2018

After the recent fiasco where I moved my web sites from Google Cloud to GitHub Pages and then found out that it wasn’t possible to use a custom domain with SSL with both an apex domain and a www subdomain, I decided to move the sites again this weekend, this time to AWS.

DataFusion Aggregate Performance

May 15, 2018

Latest Benchmarks

How not to move your blog to GitHub pages using a custom domain and SSL

May 15, 2018

I was excited to see the recent tweet from GitHub just a couple of weeks ago, announcing:

Refactoring Apache Arrow to use traits and generics

May 04, 2018

I am currently working on a refactor of the Rust implementation of Apache Arrow to change the way that arrays are represented. This is a relatively large change, even though the codebase is still tiny, so I thought it would be good to write up this blog post to explain why I think it is needed. I think this information will also be interesting for any Rust developer who is struggling to choose between (or find the right combination of) enums, structs, generics, and traits. I was inspired to write this up after reading a blog post that was posted to Reddit just a few days ago.

DataFusion: Parquet, Quiver, and Roadmap

April 15, 2018

There are a few interesting things going on with DataFusion that I wanted to share.

DataFusion now uses Apache Arrow

April 05, 2018

I’m excited to announce that DataFusion is now using Apache Arrow for its internal memory representation of data. It was already using columnar data structures based on Vec<T> and moving to Arrow was not that big a leap.

Q1 Review & Q2 Goals

April 02, 2018

At the start of the year, I set myself some goals for Q1.

DataFusion 0.2.1 Benchmark

March 17, 2018

Latest Benchmarks

This Weekend in DataFusion (2/18/18)

February 18, 2018

I had limited time to work on DataFusion this weekend, but I have started to refactor the code base based on feedback that I received on Reddit last week, and I have also been working on some benchmarks.

DataFusion update 2/11/18

February 11, 2018

Following on from my blog post Rust is for Big Data, I announced my open source distributed data processing project, DataFusion, on Reddit last week. I was probably a bit premature in announcing the project, since it was at such an early stage, but I was excited that I had some simple queries working with quite decent performance (roughly 2x the performance of Apache Spark) and wanted to start generating some interest in the project.

What I'm Reading This Month (Jan 2018)

January 31, 2018

I’m reading the following two books at the moment and both are helping me towards my 2018 goals.

Rust is for Big Data (#rust2018)

January 28, 2018

This blog post isn’t so much about what I want from the Rust language in 2018, but more about where I see an opportunity for Rust to gain more widespread use in 2018.

My Goals for 2018

January 25, 2018

Instead of making New Year’s resolutions, I’ve set myself some fairly specific goals for 2018 relating to family, health, finances, hobbies, and career. Many of these goals are broken down into monthly and quarterly objectives and some even have specific objectives and measurable results in true OKR style.

Re-launching this site

January 25, 2018

This is my “Hello, World” blog post.

Andy Grove
