Andy Grove

https://andygrove.io/
Recent content on Andy Grove
Sun, 01 Jan 2023 00:00:00 +0000

What I Want from DataFusion in 2023
https://andygrove.io/2023/01/what-i-want-from-datafusion-2023/
Sun, 01 Jan 2023 00:00:00 +0000
I like to take time off between Christmas and New Year’s Day to recharge and reflect on the past year, as well as to plan my personal and professional priorities for the upcoming year. Over the past week or so, I have been thinking about my involvement in the DataFusion project and how I can more effectively contribute to it in 2023. The project has gained a tremendous amount of momentum over the past couple of years, with many more contributors getting involved, including some full-time contributors from companies that are building commercial products on top of DataFusion.

Building a 32 core Kubernetes cluster for less than $1,000
https://andygrove.io/2021/05/building-k8s-cluster-raspberry-pi/
Sat, 15 May 2021 00:00:00 +0000
I enjoy working with distributed systems, and I also enjoy working with embedded devices and single-board computers, so I thought it would be fun to combine these interests and build a Kubernetes cluster using some Raspberry Pi computers. This seemed like a low-cost way to have an always-on cluster that can be used to learn more about Kubernetes, and distributed computing in general. Here is a photo of the end result.

Ballista: New approach for 2021
https://andygrove.io/2021/01/ballista-2021/
Sun, 10 Jan 2021 00:00:00 +0000
I wanted to share some updates on Ballista for the small group of you that have been following my progress on this project. If you are reading this blog post and are not familiar with the project, Ballista is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.
I would say that Ballista progress to date can best be described as a series of interesting proof-of-concepts that demonstrate many possibilities but have so far failed to deliver anything of real value, other than showcasing the power of Rust and Apache Arrow for distributed compute and also helping drive requirements for DataFusion (Arrow’s Rust-native in-memory query engine, which Ballista depends on).

Ballista Distributed Compute: One Year Later
https://andygrove.io/2020/07/ballista-one-year-on/
Sun, 26 Jul 2020 00:00:00 +0000
I have been talking about distributed computing in Rust for a long time now. It is more than two and a half years since my Rust is for Big Data blog post where I first talked about the prototype I was working on at the time (which eventually became DataFusion and is now part of Apache Arrow). One year ago, over the July 4th weekend, I started again with a new project named “Ballista”.

Why does musl make my Rust code so slow?
https://andygrove.io/2020/05/why-musl-extremely-slow/
Tue, 05 May 2020 00:00:00 +0000
TL;DR: Stop using musl and alpine for smaller docker images! During some recent benchmarking work of the Ballista Distributed Compute project, I discovered that the Rust benchmarks were ridiculously slow. After some brief debugging, it turns out that this was due to the use of musl, and this blog post was originally asking for help with the issue, but now provides some solutions. My benchmark is packaged in Docker and I had used musl to produce a statically linked executable which was then copied into an alpine image, resulting in a small docker image.

DataFusion 0.17.0
https://andygrove.io/2020/04/datafusion-0.17.0/
Fri, 24 Apr 2020 00:00:00 +0000
DataFusion is a Rust-native in-memory query engine, which is part of the Apache Arrow project.
This blog post provides a summary of some of the highlights in DataFusion 0.17.0. The full release notes for Apache Arrow 0.17.0 can be found here.
LogicalPlanBuilder: There is a new LogicalPlanBuilder that provides a more intuitive method for building logical plans. There is also a new UnresolvedColumn expression that allows plans to refer to columns by name, rather than index.

Ballista Reboot
https://andygrove.io/2020/04/ballista-reboot/
Sat, 18 Apr 2020 00:00:00 +0000
In July 2019, I created Ballista as a small proof-of-concept of parallel query execution across a number of Rust executors running in Kubernetes. This PoC generated a good discussion on Hacker News, which I felt demonstrated that there is interest in a platform like this. Unfortunately, it was far from usable for anything real and lacked a well-designed architecture. Over the past few months, I have had the opportunity to discuss this project with some really smart people in the industry and this has inspired me to reboot the project with a slightly different focus.

How Query Engines Work
https://andygrove.io/2020/02/how-query-engines-work/
Thu, 27 Feb 2020 00:00:00 +0000
Over the past decade I’ve spent a fair bit of time either building query engines or building integrations with query engines so I decided to write an introductory book on the subject. The book walks through every step of building a SQL query engine in Kotlin with full source code available in a companion GitHub repository.
Most of the book is programming-language agnostic and Kotlin was chosen for the code examples due to its conciseness and readability.

Rust Database Connectivity (RDBC)
https://andygrove.io/2020/01/rust-database-connectivity-rdbc/
Fri, 10 Jan 2020 00:00:00 +0000
Many years ago I wrote a commercial product that could import a database schema and then generate source code based on the schema. There were many different use cases for this product and it could be used to generate simple Data Access Object (DAO) code or even to generate fully working (although very crude) web applications for data entry. Believe it or not, some companies have schemas with more than 500 tables, so tools like this can dramatically reduce development costs.

Rust 2020 - Rust needs to be boring
https://andygrove.io/2019/11/rust-2020-rust-needs-to-be-boring/
Thu, 07 Nov 2019 00:00:00 +0000
This blog post is a response to the call for blogs from the Rust Core Team. I’ve been following Rust for long enough that I remember the early days (pre 1.0) where the language would keep changing from under me and I’d have to regularly rewrite parts of my project using the latest syntax. Fun times! Of course, things have changed a lot since then.
The language has stabilized and we have Rust Editions to rely on for major releases.

New Benchmarks Page
https://andygrove.io/2019/10/new-benchmarks-page/
Sun, 20 Oct 2019 00:00:00 +0000
For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation

Rust Big Data Benchmarks
https://andygrove.io/rust_bigdata_benchmarks/
Tue, 01 Oct 2019 00:00:00 +0000
For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation

DataFusion 0.15.0 Release Notes
https://andygrove.io/2019/09/datafusion-0.15.0-release-notes/
Sun, 22 Sep 2019 00:00:00 +0000
DataFusion is an extensible query execution framework implemented in Rust that uses the Apache Arrow memory model, and is part of the Apache Arrow project. DataFusion 0.15.0 is due to be released in the next few days (as part of the Apache Arrow 0.15.0 release) and contains a preview of a new query execution implementation based on a physical query plan, as opposed to executing the logical plan directly. The main motivations for this new implementation were:

EKS security patches cause Apache Spark jobs to fail with permissions error
https://andygrove.io/2019/08/apache-spark-regressions-eks/
Sat, 31 Aug 2019 00:00:00 +0000
Over the past couple days at work we started noticing Spark 2.3 and 2.4 jobs failing with a permissions error across multiple EKS clusters.
Here is an example stack trace:
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
We eventually realized that this was due to Amazon rolling out security patches to their EKS clusters to address CVE-2019-9512 and CVE-2019-9514, causing a regression with the kubernetes client that Spark uses.

Announcing Ballista - Distributed Compute with Rust, Apache Arrow, and Kubernetes
https://andygrove.io/2019/07/announcing-ballista/
Tue, 16 Jul 2019 00:00:00 +0000
Eighteen months ago, I started the DataFusion project with the goal of building a distributed compute platform in Rust that could (eventually) rival Apache Spark. Unsurprisingly, this turned out to be an overly ambitious goal at the time and I fell short of achieving that. However, some very good things came out of this effort. We now have a Rust implementation of Apache Arrow with a growing community of committers, and DataFusion was donated to the Apache Arrow project as an in-memory query execution engine and is now starting to see some early adoption.

DataFusion 0.13.0 Benchmarks
https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/
Sun, 28 Apr 2019 00:00:00 +0000
Latest Benchmarks: This blog post is from more than three years ago.
For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation
Original Post: Over the past couple weeks I’ve been working on a couple different efforts around parallel query execution with DataFusion: Benchmarking parallel query execution by manually creating one execution context per parquet partition and running on a thread, just to get an idea of expected performance, and comparing results to Apache Spark (running in local mode).

Parallel Query Execution in Rust
https://andygrove.io/2019/04/parallel-query-execution/
Sat, 20 Apr 2019 00:00:00 +0000
I’m working on a design to create a physical execution plan for DataFusion (part of Apache Arrow) that will support parallel query execution across multiple threads. A typical use case would be executing a SQL query against a Parquet file that has already been partitioned (basically, multiple files with the same schema in a single directory). DataFusion only supports a small number of operations currently:
- Projection (filter columns)
- Selection (filter rows)
- Limit (filter rows)
- Aggregate
The first two listed here (projection and selection) are perfectly parallelizable, meaning that the projection and selection for each partition can run on its own thread.

DataFusion 0.13.0 released as part of Apache Arrow
https://andygrove.io/2019/04/datafusion-0.13.0/
Tue, 02 Apr 2019 00:00:00 +0000
DataFusion is an in-memory query engine for analytical queries, implemented in Rust, that uses Apache Arrow for the memory model. DataFusion 0.13.0 is now available on crates.io. This is the first release as part of Apache Arrow, which is why the version number has jumped from 0.6.0.
Here is a high level changelog for this release:
Parquet Support: It is now possible to run queries against Parquet files (in addition to the existing support for CSV files).

DataFusion Donated to Apache Arrow
https://andygrove.io/2019/02/datafusion-donated-to-apache-arrow/
Tue, 05 Feb 2019 00:00:00 +0000
I’m excited to announce that DataFusion has now been donated to the Apache Software Foundation as a Rust-native in-memory query engine for the Apache Arrow project. I am also honored to have been invited to join the Apache Arrow PMC (Project Management Committee). Here is a brief blog post announcing the donation: https://arrow.apache.org/blog/2019/02/05/datafusion-donation/

Apache Arrow Git Tips
https://andygrove.io/apache_arrow_git_tips/
Fri, 01 Feb 2019 00:00:00 +0000
I’m used to working with git merge and this has made life difficult for me when working with Apache Arrow because it doesn’t use a merge model and pull request branches often need to be rebased against master and force pushed. There are numerous ways to work in this model but this article documents the approach I use, based on some guidance I was given on one of my PRs. I’m documenting this for my own benefit but hopefully it helps others too.

Retro Arcade Cabinet
https://andygrove.io/projects/retro-arcade-cabinet/
Mon, 28 Jan 2019 00:00:00 +0000
I made this retro arcade cabinet based on plans from The Geek Pub.

DataFusion 0.6.0
https://andygrove.io/2019/01/datafusion-0.6.0/
Mon, 21 Jan 2019 00:00:00 +0000
DataFusion is an in-memory query engine implemented in Rust that uses Apache Arrow for the memory model. DataFusion 0.6.0 is now available on crates.io and is the first release to depend on an official release of the Rust implementation of Apache Arrow.
Over the past couple months I essentially started from scratch with DataFusion because the Rust implementation of Apache Arrow had changed significantly since I contributed the original prototype (and it is much improved now thanks to contributions from quite a few people).

DataFusion 2019
https://andygrove.io/2018/11/datafusion-2019/
Sun, 04 Nov 2018 00:00:00 +0000
Earlier this year I put a lot of time and energy into DataFusion with the goal of creating a platform somewhat like Apache Spark, but implemented in Rust, without all the inefficiencies of the JVM. This was quite the journey, and I learned a lot of positive things from this effort, specifically:
- I greatly increased my skills in the Rust programming language (I’m still no expert but I would classify myself as competent and productive at least)
- I really understood the benefits of columnar data formats for the first time.

How To Build A Modern Distributed Compute Platform
https://andygrove.io/how_to_build_a_modern_distributed_compute_platform/
Thu, 01 Nov 2018 00:00:00 +0000
February 2020 Update: I have now written a book How Query Engines Work based on the content in this article.
Introduction: I have been involved in several projects over the past decade where I have built query engines and distributed databases, the latest being my open source Ballista project. The sophistication of these projects has varied wildly and I have learned plenty of lessons the hard way. That said, this has been, and will continue to be, an exciting journey for me.

Overview of Popular Open Source Big Data Technologies
https://andygrove.io/overview_of_popular_open_source_big_data_technologies/
Thu, 01 Nov 2018 00:00:00 +0000
I’m writing this article based on multiple requests.
I am not an expert on all of the open source projects that exist for so-called Big Data but I do have knowledge of some of them at least, so I figured I could help people navigate their way around the ecosystem a little. In some cases I have simply copied and pasted the description from the project’s web site. In other cases I have added my own opinions.

Hosting Jekyll web sites with Amazon Lightsail and Let's Encrypt SSL
https://andygrove.io/2018/05/hosting-jekyll-lightsail-lets-encrypt-ssl/
Sat, 19 May 2018 00:00:00 +0000
After the recent fiasco where I moved my web sites from Google Cloud to GitHub Pages and then found out that it wasn’t possible to use a custom domain with SSL with both an apex domain and a www subdomain, I decided to move the sites again this weekend, this time to AWS. Previously I would have used EC2 but after some quick Googling, it looked like Amazon Lightsail would be a better way to go.

DataFusion Aggregate Performance
https://andygrove.io/2018/05/datafusion-aggregate-performance/
Tue, 15 May 2018 00:00:00 +0000
Latest Benchmarks: This blog post is from more than four years ago. For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation
Original Post: Recently I’ve been working on improving the support for aggregate queries in DataFusion. I’m currently using NYC Taxi Trip Record Data for testing.
The results on this page are from running two simple aggregate queries against data for a single month (Dec 2017) which is admittedly a pretty small data set at 800 MB in CSV format, but I have to start somewhere, and this is where I’m starting.

How not to move your blog to GitHub pages using a custom domain and SSL
https://andygrove.io/2018/05/github-pages-custom-domain-ssl/
Tue, 15 May 2018 00:00:00 +0000
I was excited to see the recent tweet from GitHub just a couple of weeks ago, announcing: “Today, custom domains on GitHub Pages are gaining support for HTTPS via @letsencrypt. It’s another step towards making the web more secure for everyone.” After reviewing the docs I decided to move this blog and a couple other small sites over from my Google Cloud account and host them on GitHub pages since I was already using Jekyll.

Refactoring Apache Arrow to use traits and generics
https://andygrove.io/2018/05/apache-arrow-traits-generics/
Fri, 04 May 2018 00:00:00 +0000
I am currently working on a refactor of the Rust implementation of Apache Arrow to change the way that arrays are represented. This is a relatively large change even though this is a tiny codebase so far and I thought it would be good to write up this blog post to explain why I think this is needed. I think this information will also be interesting for any Rust developer who is struggling with making the right choice between (or using the right combination of) enums, structs, generics and traits.

DataFusion: Parquet, Quiver, and Roadmap
https://andygrove.io/2018/04/datafusion-parquet-quiver-roadmap/
Sun, 15 Apr 2018 00:00:00 +0000
There are a few interesting things going on with DataFusion that I wanted to share.
Support for Apache Parquet: Thanks to the great work happening with the parquet-rs crate, I have been able to add preliminary Parquet support to DataFusion. It is now possible to open local Parquet files as DataFrames and run SQL against them. There are some examples in the repo. Currently, support is limited to flat parquet files (no nested types) and also limited to a subset of data types (INT32, INT64, FLOAT32, FLOAT64, and UTF8) but with that in place it should be easy to add other types and this seems like a good place for others to contribute.

DataFusion now uses Apache Arrow
https://andygrove.io/2018/04/datafusion-apache-arrow/
Thu, 05 Apr 2018 00:00:00 +0000
I’m excited to announce that DataFusion is now using Apache Arrow for its internal memory representation of data. It was already using columnar data structures based on Vec<T> and moving to Arrow was not that big a leap. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

Q1 Review & Q2 Goals
https://andygrove.io/2018/04/q1-review-q2-goals/
Mon, 02 Apr 2018 00:00:00 +0000
At the start of the year, I set myself some goals for Q1.
Q1 Achievements: I met the majority of my Q1 goals.
My main professional achievements (outside of work) were:
- Contributed a Rust implementation of Apache Arrow to the Apache Software Foundation
- Resurrected an open source project I have been working on for some time and reached a milestone where I could publish the project and start talking about it.

DataFusion 0.2.1 Benchmark
https://andygrove.io/2018/03/datafusion-0.2.1-benchmark/
Sat, 17 Mar 2018 00:00:00 +0000
Latest Benchmarks: This blog post is from more than four years ago. For latest information on DataFusion and Ballista benchmarks, see https://github.com/datafusion-contrib/benchmark-automation
Original Post: Over the past week or so I have been refactoring the core of DataFusion to convert it from a row-based execution engine to perform column-based processing. This was a pretty large refactoring effort but I am now back to roughly the same level of functionality as before (which is definitely still POC but capable of running some real queries).

DataFusion update 2/11/18
https://andygrove.io/2018/02/datafusion-update-02-11-18/
Sun, 11 Feb 2018 00:00:00 +0000
Following on from my blog post Rust is for Big Data, I announced my open source distributed data processing project, DataFusion, on reddit last week. I was probably a bit premature in announcing the project since it was at such an early stage but I was excited that I had some simple queries working with quite decent performance (roughly 2x the performance of Apache Spark) and wanted to start generating some interest in the project.

What I'm Reading This Month (Jan 2018)
https://andygrove.io/2018/01/what-im-reading-this-month/
Wed, 31 Jan 2018 00:00:00 +0000
I’m reading the following two books at the moment and both are helping me towards my 2018 goals.
First of all I’m reading this book on responsive web design since this is probably my weakest area professionally. Hopefully this web site will start looking a lot nicer once I put some of these ideas into practice! Secondly, I’m reading this book on Rust which should help me with my DataFusion project:

Rust is for Big Data (#rust2018)
https://andygrove.io/2018/01/rust-is-for-big-data/
Sun, 28 Jan 2018 00:00:00 +0000
This blog post isn’t so much about what I want from the Rust language in 2018, but more about where I see an opportunity for Rust to gain more widespread use in 2018. I’ve been following the Rust language for a couple of years now after a co-worker introduced me to it and mentored me in getting a simple project up and running. I was also lucky enough to attend the very first RustConf in 2016 where there was a lot of talk about the opportunities for Rust to have a big impact on the server due to its inherent security (no more buffer overflow attacks) and its performance and scalability with the recently released futures and tokio crates.

My Goals for 2018
https://andygrove.io/2018/01/goals-for-2018/
Thu, 25 Jan 2018 00:00:00 +0000
Instead of making new year’s resolutions, I’ve set myself some fairly specific goals for 2018 relating to family, health, finances, hobbies, and career. Many of these goals are broken down into monthly and quarterly objectives and some even have specific objectives and measurable results in true OKR style.
My high-level goals in terms of software engineering and career are:
- Increase my expertise in Scala & Spark, since those are my bread and butter
- Increase my expertise in Rust, because I’m excited about Rust’s future and want to be a part of it
- Become proficient at web design and web development (skills I’m sorely lacking)
- Publish at least one high quality blog post each quarter
- Re-launch Keep Calm And Learn Rust and make it a useful resource for developers who are learning Rust
- Start at least one open source side-project where I can practice my Rust and web skills
I think these goals are very achievable and the most useful part of this exercise was figuring out all the things that I’m not going to do this year so I can focus on the things that really matter.

Re-launching this site
https://andygrove.io/2018/01/relaunching-this-site/
Thu, 25 Jan 2018 00:00:00 +0000
This is my “Hello, World” blog post. I’ve previously used WordPress for my blogs but one of my goals for 2018 is to finally get hands-on with web design and web development. I’ve been a backend and data guy forever and the last time I built a production UI it was a fat client implemented in Swing. I’m probably showing my age a bit there. Later this year I will be learning React for a side project that I am working on.

Ultrasonic Pi Piano
https://andygrove.io/projects/ultrasonic-pi-piano/
Fri, 14 Apr 2017 00:00:00 +0000
The Ultrasonic Pi Piano is a piano that uses ultrasonic sensors as inputs and translates the distances into MIDI notes that are then played via a software synthesizer on the Raspberry Pi. This project has been featured on hackaday.io, Adafruit Blog, and The Raspberry Pi Foundation’s Blog. Here’s a video showing the final product. Here’s a video that shows how the project works.
Source code and documentation is available at https://github.

Full Size Dalek
https://andygrove.io/projects/full-size-dalek/
Mon, 29 Jul 2013 00:00:00 +0000
Soon after finishing my one-fifth scale Dalek Robot, I started work on a full size Dalek. This one is made from a combination of plywood, cardboard, PVC plumbing and acrylic Christmas decorations plus a good deal of duct tape and glue. Originally I made a paper mache dome but I couldn’t get it looking authentic enough, so I cheated and purchased a fiberglass dome from a Dalek builder in the UK.

About Me
https://andygrove.io/about/
Mon, 01 Jan 0001 00:00:00 +0000
I’m a software engineer with more than 30 years of professional experience in a wide range of industries, including Banking, Media, Insurance, Hardware, and Software. I have co-founded two startups (one was acquired, one failed). I have been specializing in query engines and distributed systems for the past 15 years. I am the original author of Apache DataFusion and am a PMC member of Apache Arrow and Apache DataFusion. I am also the original author of the sqlparser-rs project, which is one of the leading open-source SQL parsers for the Rust ecosystem.

Contact Me
https://andygrove.io/contact/
Mon, 01 Jan 0001 00:00:00 +0000
Best ways to contact me:
- Bluesky: https://bsky.app/profile/andygrove.io
- LinkedIn: https://www.linkedin.com/in/andygrove/
- GitHub: https://github.com/andygrove

Resume
https://andygrove.io/resume/
Mon, 01 Jan 0001 00:00:00 +0000
I’m an experienced software engineer (30 years experience). I have worked in multiple industries and with multiple tech stacks and paradigms. I have extensive experience with distributed computing and for the past decade have been involved with several projects where I have built SQL parsers, query planners and optimizers, as well as distributed query execution capabilities.
I have experience with various Hadoop related technologies such as Apache Spark, Apache Parquet, Apache Arrow, Apache Drill, HDFS, Thrift and so on.

This Weekend in DataFusion (2/18/18)
https://andygrove.io/1/01/this-weekend-in-datafusion/
Mon, 01 Jan 0001 00:00:00 +0000
I had limited time to work on DataFusion this weekend but have started to refactor the code base based on some feedback that I received on Reddit last week and have also been working on some benchmarks.
Generating Closures: DataFusion uses the Expr enum to represent the different types of expression that can be used in projections, selections, and so on. Here is the current definition of Expr, which only supports a very limited set of expressions today.
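The summary above stops before the actual definition of Expr, so here is a rough sketch of the general idea instead: an expression enum plus a function that turns an expression tree into a closure. This is not the DataFusion definition. The Value type, the variants (Column, Literal, Eq, Lt) and the compile function are all made up for illustration.

```rust
// Hypothetical sketch only: these types are NOT the DataFusion definitions.

/// A scalar value held in one column of a row (illustrative type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Bool(bool),
}

/// A very limited expression tree, in the spirit of the Expr enum described above.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to a column by its index in the input row
    Column(usize),
    /// A literal value
    Literal(Value),
    /// Equality comparison between two sub-expressions
    Eq(Box<Expr>, Box<Expr>),
    /// Less-than comparison (integers only in this sketch)
    Lt(Box<Expr>, Box<Expr>),
}

/// A compiled expression: a closure from a row to an optional value.
type CompiledExpr = Box<dyn Fn(&[Value]) -> Option<Value>>;

/// Walk the expression tree once and build a closure that can then be
/// called for every row, without matching on the enum each time.
fn compile(expr: &Expr) -> CompiledExpr {
    match expr {
        Expr::Column(i) => {
            let i = *i;
            // Out-of-range column indexes yield None rather than panicking
            Box::new(move |row: &[Value]| -> Option<Value> { row.get(i).cloned() })
        }
        Expr::Literal(v) => {
            let v = v.clone();
            Box::new(move |_row: &[Value]| -> Option<Value> { Some(v.clone()) })
        }
        Expr::Eq(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move |row: &[Value]| -> Option<Value> {
                Some(Value::Bool(l(row)? == r(row)?))
            })
        }
        Expr::Lt(l, r) => {
            let (l, r) = (compile(l), compile(r));
            Box::new(move |row: &[Value]| -> Option<Value> {
                match (l(row)?, r(row)?) {
                    (Value::Int(a), Value::Int(b)) => Some(Value::Bool(a < b)),
                    // Non-integer operands are unsupported in this sketch
                    _ => None,
                }
            })
        }
    }
}
```

With something like this, a selection could compile its predicate once and then call the resulting closure for each row, keeping only the rows where it returns Some(Value::Bool(true)).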