DataFusion Aggregate Performance
May 15, 2018
Latest Benchmarks
This blog post is from more than four years ago. For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation
Original Post
Recently I’ve been working on improving the support for aggregate queries in DataFusion. I’m currently using NYC Taxi Trip Record Data for testing.
The results on this page are from running two simple aggregate queries against data for a single month (Dec 2017) which is admittedly a pretty small data set at 800 MB in CSV format, but I have to start somewhere, and this is where I’m starting. I plan to scale up the benchmarks to run across multiple files, most likely treating each month as a single partition so I can test parallel (and eventually distributed) query execution.
I also wrote a quick Spark app to convert the CSV file to Parquet so that I could get timings for both CSV and Parquet in both DataFusion and Spark.
In all cases, the files are being read from an SSD drive. These are just single threaded tests on my desktop at this stage.
Whole Table Aggregate Query
The first query calculates a few aggregates across the whole file (no GROUP BY clause).
SELECT COUNT(1), MIN(CAST(fare_amount AS FLOAT)), MAX(CAST(fare_amount AS FLOAT))
FROM tripdata
The results of this query are:
9508276,-340.0,391911.78
Aggregates With Grouping
This query calculates the same aggregates as the previous query but grouping them by passenger_count.
SELECT passenger_count, COUNT(1), MIN(CAST(fare_amount AS FLOAT)), MAX(CAST(fare_amount AS FLOAT))
FROM tripdata
GROUP BY passenger_count
The results of this query are:
5,443643,-59.7,850
9,30,-9,99.99
2,1446874,-340,3029
7,44,-70,77.75
1,6647126,-340,391911.78
3,425351,-169,550
4,217875,-60,590
8,36,0.8,100
6,268944,-52,297.5
0,58353,0,655.35
Note: when I have more time, I really need to research that single taxi trip costing almost half a million dollars!
Results - CSV
The CSV results were quite predictable with DataFusion outperforming Spark by 2-3 times. This is expected because of the native code execution and also because the data is relatively small. The gap between DataFusion and Spark would likely close a bit with larger files.
Results - Parquet
The results for Parquet are really interesting and quite surprising to me. First of all, the performance without grouping shows DataFusion outperforming Spark by around 10x (one order of magnitude). I think this demonstrates the advantages of DataFusion’s columnar processing versus Spark’s row based processing. The results for the GROUP BY query are very close between DataFusion and Spark. Primarily this is because DataFusion currently reverts to row-based processing once grouping is involved (and I’m not sure if that can be avoided) but it also seems that Spark has some very efficient handling of Parquet files.
Conclusions
I’m pretty happy overall with how DataFusion is performing but I’d like to see if I can find a more efficient way of processing the GROUP BY queries.
Want to learn more about query engines? Check out my book "How Query Engines Work".