Rust Big Data Benchmarks

Rust is a systems-level programming language that offers memory safety guarantees, prevents data races, and has no garbage collector, potentially making it a good choice for distributed compute and data processing systems.

Since early 2018 I have been blogging about this topic and also working on some Rust projects to demonstrate the possibilities, including:

  • Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. (A small example of building columnar data with the Rust implementation follows this list.)
  • DataFusion is a subproject of the Rust implementation of Apache Arrow that provides an Arrow-native, extensible query engine supporting parallel query execution, using threads, against CSV and Parquet files. (A sketch of its SQL API also follows this list.)
  • Ballista is a very early-stage proof of concept (PoC) of a distributed compute platform that leverages Kubernetes and DataFusion.
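
To make the Arrow bullet concrete, here is a minimal sketch using the Rust arrow crate to build a record batch, where each column is stored as a contiguous typed array rather than as rows of structs. The column names are illustrative only:

use arrow::array::{Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use std::sync::Arc;

fn main() -> arrow::error::Result<()> {
    // Define a schema with two non-nullable columns.
    let schema = Arc::new(Schema::new(vec![
        Field::new("passenger_count", DataType::Int64, false),
        Field::new("fare_amount", DataType::Float64, false),
    ]));

    // Each column is a single contiguous, typed array (columnar layout).
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 1])),
            Arc::new(Float64Array::from(vec![10.5, 7.25, 33.0])),
        ],
    )?;

    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}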
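
And here is roughly what running a SQL query through DataFusion looks like. This sketch is based on a recent DataFusion release, so the exact API may differ from the version used in these benchmarks; the table name and file path are placeholders. It requires the datafusion and tokio crates:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a Parquet file (or a directory of partitioned files) as a table.
    let ctx = SessionContext::new();
    ctx.register_parquet("tripdata", "data/tripdata.parquet", ParquetReadOptions::default())
        .await?;

    // Execute a SQL aggregate query and print the result batches.
    let df = ctx
        .sql("SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) \
              FROM tripdata GROUP BY passenger_count")
        .await?;
    df.show().await?;
    Ok(())
}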

Benchmark Overview

I have been running benchmarks of aggregate queries against the NYC taxi data set, using Apache Spark (JVM-based) as the baseline, since it is currently a popular tool for distributed compute and one that I am familiar with.

The benchmarks involve running simple aggregate queries against the 2018 data set in both CSV and Parquet format, with one partition per month, so twelve partitions in total. The total data size is relatively small: 8.5 GB in CSV format and 2.0 GB in Parquet format. I plan to run against larger data sets once the benchmarks are automated.

All benchmarks are dockerized so that CPU and memory limits can be enforced. The tests are run on a desktop with 24 cores and 64 GB RAM, running Ubuntu. Files are stored on an SSD drive. The benchmark source code can be found here.

Latest Results (October 2019)

The following chart shows the throughput in queries per second (higher is better) when executing the following query against a partitioned Parquet file in a single process, using different numbers of cores. The query is executed a single time. Memory is constrained to 1 GB.

SELECT passenger_count, MIN(fare_amount), MAX(fare_amount) 
FROM tripdata 
GROUP BY passenger_count 

Apache Spark is being used in local mode (in-process) for this benchmark.

[Chart bench-20191019-3.png: throughput by number of cores for a single execution of the query]

The following chart shows repeated runs of the same query. Here, Spark performs better than DataFusion once the JIT compiler kicks in, although Spark seems to lose its edge at higher concurrency.

[Chart bench-20191019-4.png: throughput by number of cores for repeated executions of the query]
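
For context, the repeated-run measurement is essentially a loop that executes the same query many times and records the elapsed time of each iteration. A minimal sketch of that pattern follows; the run_query closure is a stand-in for actually executing the benchmark query, not the real benchmark harness:

use std::time::{Duration, Instant};

// Illustrative timing loop: run the same query repeatedly and report
// per-iteration latency converted to queries per second.
fn bench<F: FnMut()>(iterations: usize, mut run_query: F) {
    for i in 0..iterations {
        let start = Instant::now();
        run_query();
        let elapsed = start.elapsed();
        println!(
            "iteration {}: {:?} ({:.2} queries/sec)",
            i,
            elapsed,
            1.0 / elapsed.as_secs_f64()
        );
    }
}

fn main() {
    // Stand-in for executing the aggregate query shown above.
    bench(10, || std::thread::sleep(Duration::from_millis(50)));
}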

Observations

Observations from this and other benchmarks I am currently running:

  • DataFusion performance is very consistent
  • DataFusion can run these benchmarks with as little as 128 MB of memory, whereas Spark needs at least 1 GB
  • For one-off queries, DataFusion is faster than Spark
  • For repeated runs of the same query, DataFusion is slower than Spark
  • Spark performance improves with multiple runs, thanks to the JIT kicking in
  • Spark performance can also become unpredictable over time, thanks to GC pauses

Future Work

  • Publish results for the same benchmark using CSV files
  • Run benchmarks against larger data sets
  • Publish results for distributed queries using Ballista vs Apache Spark running in cluster mode