EKS security patches cause Apache Spark jobs to fail with permissions error
August 31, 2019
Over the past couple of days at work, we noticed Spark 2.3 and 2.4 jobs failing with a permissions error across multiple EKS clusters. Here is an example stack trace:
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We eventually realized that the cause was a set of security patches that Amazon rolled out to EKS clusters to address CVE-2019-9512 and CVE-2019-9514. The patches exposed a regression in the Kubernetes client that Spark uses. Two issues have been filed against Spark for this: SPARK-28921 and SPARK-28925.
The solution is to upgrade Spark to Kubernetes client 4.4.0 or later, which includes a patch for the issue.
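To see whether a given Spark distribution is affected, one option is to look at the client jar it bundles. The following is a minimal sketch, assuming the jar follows the kubernetes-client-X.Y.Z.jar naming used for the 4.4.2 jars listed in this post; the function names and the example path are hypothetical.

```python
import re
from pathlib import Path

# Minimum client version that contains the fix, per this post.
MIN_FIXED = (4, 4, 0)

# Assumption: the bundled jar is named kubernetes-client-X.Y.Z.jar.
JAR_PATTERN = re.compile(r"^kubernetes-client-(\d+)\.(\d+)\.(\d+)\.jar$")


def client_version(jars_dir):
    """Return the bundled kubernetes-client version as a tuple, or None if absent."""
    for jar in sorted(Path(jars_dir).glob("kubernetes-client-*.jar")):
        match = JAR_PATTERN.match(jar.name)
        if match:
            return tuple(int(part) for part in match.groups())
    return None


def needs_upgrade(jars_dir):
    """True when a client jar is present and older than MIN_FIXED."""
    version = client_version(jars_dir)
    return version is not None and version < MIN_FIXED
```

For example, needs_upgrade("/opt/spark/jars") (a hypothetical install path) would return True for a distribution that still ships a client older than 4.4.0.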
I have created the following trivial pull requests against Spark to upgrade to the 4.4.2 release (latest version at time of writing):
- Patch for Spark 2.3.x (rejected because Spark 2.3 is EOL 😢)
- Patch for Spark 2.4.x
- Patch for Spark 3.0.x
If you need to fix this urgently, there are a few options:
Option 1: Replace the Kubernetes client jar(s) in the Spark distribution with the following three 4.4.2 jars, which are available to download from Maven Central, and hope for the best.
kubernetes-client-4.4.2.jar
kubernetes-model-4.4.2.jar
kubernetes-model-common-4.4.2.jar
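The jar swap in Option 1 could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes the three artifacts are published under the io.fabric8 group on Maven Central, and the function names and backup directory are hypothetical. Old jars are moved to a backup directory rather than deleted so the change can be reverted.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the client artifacts live under the io.fabric8 group on Maven Central.
MAVEN_BASE = "https://repo1.maven.org/maven2/io/fabric8"
VERSION = "4.4.2"
ARTIFACTS = ["kubernetes-client", "kubernetes-model", "kubernetes-model-common"]


def maven_url(artifact, version=VERSION):
    """Build the download URL for one artifact jar."""
    return f"{MAVEN_BASE}/{artifact}/{version}/{artifact}-{version}.jar"


def download_jars(dest_dir):
    """Download the three jars into dest_dir and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for artifact in ARTIFACTS:
        target = dest / f"{artifact}-{VERSION}.jar"
        urllib.request.urlretrieve(maven_url(artifact), str(target))
        paths.append(target)
    return paths


def swap_jars(jars_dir, new_jars, backup_dir):
    """Move old client/model jars to backup_dir, then copy new_jars into jars_dir."""
    jars = Path(jars_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    old = list(jars.glob("kubernetes-client-*.jar"))
    old += list(jars.glob("kubernetes-model-*.jar"))
    for jar in old:
        shutil.move(str(jar), str(backup / jar.name))
    for jar in new_jars:
        shutil.copy2(str(jar), str(jars / Path(jar).name))
    # Report which Kubernetes jars the distribution now contains.
    return sorted(p.name for p in jars.glob("kubernetes-*.jar"))
```

A typical call would be swap_jars(spark_home / "jars", download_jars("downloads"), "jars-backup"), where spark_home is wherever the distribution is unpacked.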
Option 2:
- Check out the tag for the specific Spark release that you need
- Update the Kubernetes client version in resource-managers/kubernetes/core/pom.xml
- Follow the instructions for building a Spark distribution
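The version bump in the second step can be scripted if you need to patch several release tags. The sketch below assumes the client version is set through a kubernetes.client.version property in that pom.xml; check the file for your tag before relying on it. The function name is hypothetical.

```python
import re

# Assumption: the pom sets the client version via this property.
PROPERTY = "kubernetes.client.version"


def bump_client_version(pom_text, new_version):
    """Return pom_text with the client version property set to new_version."""
    tag = re.escape(PROPERTY)
    pattern = re.compile(r"(<%s>)[^<]*(</%s>)" % (tag, tag))
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_version + m.group(2), pom_text
    )
    # Fail loudly rather than silently leaving the old version in place.
    if count != 1:
        raise ValueError("expected exactly one <%s> element, found %d" % (PROPERTY, count))
    return updated
```

To apply it, read resource-managers/kubernetes/core/pom.xml as text, pass it through bump_client_version(text, "4.4.2"), and write the result back before building the distribution.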
Option 3: Ask your Amazon Technical Account Manager to have the patches rolled back, assuming you are comfortable accepting the risk of a DoS attack against your EKS API if it is exposed publicly.