EKS security patches cause Apache Spark jobs to fail with permissions error

August 31, 2019

Over the past couple of days at work, we started noticing Spark 2.3 and 2.4 jobs failing with a permissions error across multiple EKS clusters. Here is an example stack trace:

java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' 
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
    at java.lang.Thread.run(Thread.java:748)

We eventually realized that this was due to Amazon rolling out security patches to their EKS clusters to address CVE-2019-9512 and CVE-2019-9514 (HTTP/2 denial-of-service vulnerabilities), which caused a regression in the fabric8 kubernetes-client library that Spark uses to talk to the Kubernetes API server. Two issues have been filed against Spark for this: SPARK-28921 and SPARK-28925.

The fix is to upgrade Spark's kubernetes-client dependency to version 4.4.0 or later, which contains the patch needed to work against the updated API servers.
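To see which client version your distribution currently ships, you can look in its jars directory. This is a minimal sketch, assuming your install lives at $SPARK_HOME and was built with Kubernetes support:

    # List the fabric8 kubernetes-client jars bundled with the distribution.
    # Anything older than 4.4.0 is affected by this regression.
    ls "$SPARK_HOME/jars" | grep kubernetes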

I have created the following trivial pull requests against Spark to upgrade to the 4.4.2 release (the latest version at the time of writing):

If you need to fix this urgently, there are a couple of options:

Option 1: Simply replace the kubernetes client jars in your Spark distribution with the three 4.4.2 jars available from Maven Central and hope for the best (a sketch of the swap follows the jar list below):

kubernetes-client-4.4.2.jar
kubernetes-model-4.4.2.jar
kubernetes-model-common-4.4.2.jar
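Here is a rough sketch of that swap, assuming your distribution lives at $SPARK_HOME and that the standard Maven Central URL layout (repo1.maven.org/maven2/io/fabric8/...) applies:

    # Remove the old fabric8 client jars from the distribution.
    cd "$SPARK_HOME/jars"
    rm -f kubernetes-client-*.jar kubernetes-model-*.jar

    # Download the patched 4.4.2 jars from Maven Central.
    for artifact in kubernetes-client kubernetes-model kubernetes-model-common; do
        curl -fLO "https://repo1.maven.org/maven2/io/fabric8/${artifact}/4.4.2/${artifact}-4.4.2.jar"
    done

This wholesale swap only works if the newer client is still compatible with the transitive dependencies (okhttp, jackson, and friends) already bundled in the distribution, which is why it is the hope-for-the-best option.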

Option 2:

  • Check out the tag for the specific Spark release that you need
  • Update the kubernetes-client version in resource-managers/kubernetes/core/pom.xml
  • Follow the instructions for building a Spark distribution (a rough sketch of these steps follows this list)
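The sketch below assumes you are targeting the v2.4.3 tag (substitute whichever release you actually run) and builds with the Kubernetes profile enabled; the exact name of the client version property varies between Spark versions, so edit the pom by hand rather than scripting it:

    # Check out the Spark release you need (v2.4.3 is just an example tag).
    git clone https://github.com/apache/spark.git
    cd spark
    git checkout v2.4.3

    # Manually bump the kubernetes-client version in
    # resource-managers/kubernetes/core/pom.xml to 4.4.2, then build a
    # distribution with Kubernetes support enabled.
    ./dev/make-distribution.sh --name custom-k8s --tgz -Pkubernetes

The --name suffix is arbitrary; add whatever other Maven profiles (Hadoop version, Hive, etc.) your existing distribution was built with, per the Building Spark documentation.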

I’ll update this post after the weekend once I’ve had time to verify how well this works.