[SPARK-23980][K8S] Resilient Spark driver on Kubernetes #21067
Conversation
|
Looks like there's a lot of conflicts from the refactor that was just merged. In general though I don't think this buys us too much. The problem is that when the driver fails, you'll lose any and all state of progress done so far. We don't have a solid story for checkpointing streaming computation right now, and even if we did, you'll certainly lose all progress from batch jobs. Also, restarting the driver might not be the right thing to do in all cases. This assumes that it's always ok to have the driver re-launch itself automatically. But whether or not the driver should be relaunchable should be determined by the application submitter, and not necessarily done all the time. Can we make this behavior configurable? |
You set the job name here but for the driver pod you really want the pod name. It also seems very difficult to pass through the pod name in the driver config since you only know the pod's name derived from the job after the job has started. But we can probably use a unique label to look up the driver pod. Is the label mapped to job-name guaranteed to be unique?
Should guarantee uniqueness by using UUID. Use labels to make it easy to group all executors tied to this specific job.
Should probably clarify re: streaming - we don't do any Kubernetes-specific actions (e.g. Persistent Volumes) to do Streaming checkpointing. But anything built-in to Spark should work, such as DFS checkpointing - barring anything that requires using the pod's local disk. |
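To make the DFS-checkpointing point concrete, here is a minimal sketch (not from this PR) of a Spark Streaming app whose driver can recover from a checkpoint after being restarted, assuming the checkpoint directory (the hdfs:// path below is a placeholder) lives on a DFS or persistent volume rather than the pod's local disk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedWordCount {
  // Placeholder checkpoint location; it must live on shared storage (DFS or a
  // persistent volume), not on the driver pod's local disk, to survive restarts.
  val checkpointDir = "hdfs://namenode:8020/checkpoints/wordcount"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("source-host", 9999) // placeholder source
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart the driver recovers from the checkpoint instead of starting fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```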
This should be easy to achieve by configuring the Job's pod backoff failure policy so that it runs the pod only once (see the sketch after this comment).
We have already done work to store checkpoints on a persistent volume, but we thought that should be a separate PR as it's not strictly linked to this change.
Agreed that a batch job would be rerun from scratch. Still, I think there is value in being able to run a batch job unattended without intervening in case of a machine failure, since the job will be rescheduled to another node. |
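For reference, a rough sketch of how such a backoff limit could be set when wrapping the driver pod in a Job, assuming the fabric8 Kubernetes client builders (the batch model package path may differ between client versions); the Job name and the driverPodTemplate parameter are placeholders, not code from this PR:

```scala
import io.fabric8.kubernetes.api.model.PodTemplateSpec
import io.fabric8.kubernetes.api.model.batch.{Job, JobBuilder}

object DriverJobSketch {
  // driverPodTemplate: the already-configured driver pod template (placeholder).
  def buildDriverJob(driverPodTemplate: PodTemplateSpec): Job =
    new JobBuilder()
      .withNewMetadata()
        .withName("spark-driver-job")    // placeholder name
      .endMetadata()
      .withNewSpec()
        .withBackoffLimit(1)             // at most one retry of the driver pod (illustrative value)
        .withTemplate(driverPodTemplate)
      .endSpec()
      .build()
}
```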
|
@mccheah Rebased to master, and added support for a configurable backoff limit. |
|
Rebased again to master. |
|
Jenkins, ok to test |
|
Test build #90868 has finished for PR 21067 at commit
|
|
@felixcheung fixed the Scala style validations, sorry. |
|
Test build #90895 has finished for PR 21067 at commit
|
|
Test build #90898 has finished for PR 21067 at commit
|
|
@foxish on the concerns about the lack of exactly-once semantics. |
This is misnamed since it can configure the pod that's used on the driver as well, if the step is to be shared between the driver and the executor. Aside from that, I think this isn't the right abstraction. The feature steps should only be used to configure the driver/executor pod spec. Then, we can have KubernetesClientApplication or KubernetesDriverBuilder wrap the driver pod spec in a Job object. Therefore I don't think configuring the Job object should be done by these steps, but by some external wrapping step after all the features have been applied.
Ping on this?
The driver Pod spec cannot be wrapped into a Job spec because it expects only a Container spec.
The job has a pod template though right? We should just inject the pod into the job as the pod template.
Ok looking a bit more closely at the API - I think what we want is for all feature steps to be configuring a SparkPod, where:
case class SparkPod(podTemplate: PodTemplateSpec, container: Container)
Then the KubernetesDriverBuilder adapts the pod template into a Job, but the KubernetesExecutorBuilder adapts the pod template into just a Pod. For the job, the adapting part is trivial (JobBuilder.withNewJobSpec().withSpec(...)). We can translate a pod template spec into a pod as well:
- inject PodTemplateSpec.getMetadata() into PodBuilder.withMetadata(),
- inject PodTemplateSpec.getSpec() into PodBuilder.withSpec(), and
- PodBuilder.editMetadata().withName(<name>).
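A sketch of the adaptation described above, assuming the fabric8 builders (package paths may differ between client versions); SparkPod follows the shape proposed in this thread, and the method and parameter names are illustrative rather than the final implementation:

```scala
import io.fabric8.kubernetes.api.model.{Container, Pod, PodBuilder, PodTemplateSpec}
import io.fabric8.kubernetes.api.model.batch.{Job, JobBuilder}

case class SparkPod(podTemplate: PodTemplateSpec, container: Container)

object SparkPodAdapters {
  // Driver side: wrap the configured pod template in a Job.
  def toDriverJob(spark: SparkPod, jobName: String): Job =
    new JobBuilder()
      .withNewMetadata().withName(jobName).endMetadata()
      .withNewSpec().withTemplate(spark.podTemplate).endSpec()
      .build()

  // Executor side: translate the same pod template into a plain Pod.
  def toExecutorPod(spark: SparkPod, podName: String): Pod =
    new PodBuilder()
      .withMetadata(spark.podTemplate.getMetadata)
      .withSpec(spark.podTemplate.getSpec)
      .editMetadata().withName(podName).endMetadata()
      .build()
}
```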
Can't you just use KubernetesClusterSchedulerBackend#applicationId? Believe that's actually newly generated every time, albeit confusingly since that app ID isn't tied to any fields on the driver Kubernetes objects itself.
I think it is only generated when a new Spark application is submitted. We need a random value that is regenerated for every new driver pod, not for every new application. It is also too long, because Kubernetes only allows pod names no longer than 64 characters.
I think it is only generated when a new Spark application is submitted. We need a random value that is regenerated for every new driver pod, not for every new application. It is also too long, because Kubernetes only allows pod names no longer than 64 characters.
The application id is generated when the JVM launches - see SchedulerBackend.scala. Note this application ID isn't populated by spark-submit itself.
If we use the applicationId as a salt, the executor pod name will exceed the 64-character limit when the application name is long.
    override def configureExecutorPod(pod: SparkExecutorPod): SparkExecutorPod = {
      val name = s"$executorPodNamePrefix-$applicationID" +
        s"-exec-${kubernetesConf.roleSpecificConf.executorId}"
For example, if the application name is networkwordcount, this results in an executor pod name like: networkwordcount-1519234651100-spark-application-1528815922371-exec-10
We should just have the executor pod name be s"$applicationId-exec-$executorId" then. I don't think the pod name prefix strictly has to be tied to the application name. The application name should be applied as a label so the executor pod can be located with it when using the dashboard, kubectl, etc.
4 digits does not have enough entropy to be a reliable salt, which is why I suggest this to avoid collisions.
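A small sketch of the naming scheme suggested above, keeping the pod name short and unique while carrying the human-readable application name in a label; the helper and label keys are hypothetical, not code from this PR:

```scala
object ExecutorPodNaming {
  // Hypothetical helpers: the pod name stays short and collision-free, while the
  // human-readable app name lives in a label for kubectl/dashboard lookups.
  def executorPodName(applicationId: String, executorId: Long): String =
    s"$applicationId-exec-$executorId"

  def executorPodLabels(appName: String, applicationId: String): Map[String, String] =
    Map(
      "spark-app-name" -> appName,      // hypothetical label keys
      "spark-app-id"   -> applicationId
    )
}

// e.g. ExecutorPodNaming.executorPodName("spark-application-1528815922371", 10)
//      == "spark-application-1528815922371-exec-10"
```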
In between restarts, you could potentially have two pods with the same job name but you need to pick the right one.
Actually would like @foxish or @liyinan926 to confirm if that is the case?
I don't think this can happen; I can think of two scenarios:
- Job fails: no one restarts the Job, the user needs to run spark-submit again (all job-related pods will be deleted because of the OwnerReference).
- Pod fails: the Job will create a new driver pod to replace the failed one. There will be only one driver pod, because the failed one will be removed by the Kubernetes garbage collector.
Can you please elaborate on what you mean by restart here?
Pod fails: the Job will create a new driver pod to replace the failed one. There will be only one driver pod, because the failed one will be removed by the Kubernetes garbage collector.
For this one I'm not sure the GC is done immediately. If that's part of the Kubernetes contract, then we're fine; if not, we can't make any assumptions.
Okay, I will update the filter with an additional check that the found pod is in the Running state.
Indentation
Indentation.
|
any update? |
|
Test build #91660 has finished for PR 21067 at commit
|
|
@felixcheung rebased to master and fixed failing unit tests |
|
@baluchicken @foxish any update on this? HA story is pretty critical for production in many cases. |
|
@skonto sorry, I have a couple of other things to do, but I am trying to update this as my time allows. |
|
@mccheah rebased to master and updated the PR; now the KubernetesDriverBuilder creates the driver Job instead of the configuration steps. |
|
Test build #92650 has finished for PR 21067 at commit
|
|
@baluchicken probably this is covered here: #21260. I kind of missed that, as I thought it was only for hostpaths but it also covers PVs. |
|
@skonto thanks, I am going to check it. |
|
Test build #92685 has finished for PR 21067 at commit
|
|
I don't think this current approach will suffice. Correctness is important here, especially for folks using Spark Streaming. I understand that you're proposing the use of backoff limits, but there is no guarantee that the job controller won't spin up 2 driver pods when we ask for 1. That is by definition how the job controller works: it is greedy and works towards the desired completions. For example, in the case of a network partition, the job controller logic in the Kubernetes master will not differentiate between a driver pod that has actually failed and one that is merely unreachable but still running.
This is the reason why in the past I've proposed using a StatefulSet. However, getting termination semantics with a StatefulSet will be more work. I don't think we should sacrifice correctness in this layer, as it would surprise the application author, who now has to reason about whether the operation they are performing is idempotent. The key issue here is that there are two sources of truth, the Job itself (the k8s API server) and the Spark driver (the JVM), which are at different levels and not talking with each other. Can we have a proposal and understand all the subtleties before trying to change this behavior? For example, if we end up with more than one driver for a single job, I'd like to ensure that only one of them is making progress (for example by using a lease in ZK, as sketched below). |
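To illustrate the "lease in ZK" idea, here is a minimal sketch using Apache Curator's LeaderLatch so that only one driver instance makes progress at a time; this is not part of the PR, and the ZooKeeper address and latch path are placeholders:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

object DriverLeaseExample {
  def main(args: Array[String]): Unit = {
    // Placeholder ZooKeeper connection string.
    val client = CuratorFrameworkFactory.newClient(
      "zk-0.zk:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Every driver replica competes for the same latch; only the holder proceeds,
    // so a duplicate driver created by the job controller cannot also make progress.
    val latch = new LeaderLatch(client, "/spark/driver-lease/my-app")
    latch.start()
    latch.await() // blocks until this process owns the lease

    // ... start the actual Spark driver work only after acquiring the lease ...
  }
}
```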
|
+1 on what @foxish said. I would also like to see a detailed discussion on the semantic differences this brings to the table before committing to this approach. |
|
@foxish I just checked on a Google Kubernetes cluster with Kubernetes version 1.10.4-gke.2. I created a two-node cluster and emulated a "network partition" with iptables rules (the node running the Spark driver became NotReady). After a short/configurable delay the driver pod's state changed to Unknown and the Job controller started a new Spark driver. After that I removed the iptables rules preventing the kubelet from speaking with the master (the node with status NotReady became Ready again). The driver pod in the Unknown state got terminated, along with all of its executors. In this case there were no Spark drivers running in parallel, so I think we are not sacrificing correctness. Am I missing something? |
|
@baluchicken, did that test involve using checkpointing in a shared location? |
|
After a short/configurable delay the driver pod's state changed to Unknown and the Job controller started a new Spark driver.
This is dangerous behavior. The old Spark driver can still be perfectly functional and running within the cluster even though its state is marked Unknown. It could also still be making progress with its own executors, and it probably was - did you check its logs on the partitioned node? A network connection with the K8s master is not a prerequisite for pods to continue running.
Thanks for performing the test. We dealt with this once before when designing StatefulSet: with stateful workloads we ran into this issue and it caused split brain. The way to solve that was to make the name of the pod act as a lock, and therefore, when a pod enters the Unknown state, not to create a replacement unless the user explicitly force-kills the old one. The Job controller, however, is not expected to do this; it is proactive about replacing the pod it was using with a new one on a new accessible node. You can run a StatefulSet through the same simulation to see its behavior. Ideally we want a "StatefulJob" here, which doesn't exist.
|
|
Test build #93201 has finished for PR 21067 at commit
|
|
I ran some more tests on this. I think we can say that this change adds resiliency to Spark batch jobs: just like on YARN, Spark will retry the job from the beginning if an error happens. It can also add resiliency to Spark Streaming apps. I fully understand your concerns, but if someone is going to submit a resilient Spark Streaming app they will use Spark's checkpointing feature. For checkpointing they should use some kind of PersistentVolume, otherwise all data saved to that directory will be lost in case of a node failure. For the PVC the accessMode should be ReadWriteOnce, because for this amount of data it is much faster than the ReadWriteMany options.
My new tests used the same approach described above with one modification: I enabled a checkpointing dir backed by a ReadWriteOnce PVC. ReadWriteOnce storage can only be attached to one node. I thought Kubernetes would detach this volume once the node became NotReady, but something else happened. Kubernetes did not detach the volume from the unknown node, so although the Job controller created a new driver pod to replace the Unknown one, it remained in Pending state because the required PVC was still attached to a different node. Once the partitioned node became available again, the old driver pod in the Unknown state got terminated, and the volume got detached and reattached to the new driver pod, whose state then changed from Pending to Running.
So I think there is no problem with correctness here. We could maybe add a warning to the documentation that correctness issues may arise if someone uses a ReadWriteMany-backed checkpoint dir, but otherwise, unless I am still missing something, I think it won't be a problem. |
What if the node never becomes available again? |
|
@baluchicken yeah I thought of that but I was hoping for more automation. |
This is well known. Using the RWO volume for fencing here would work, but this is not representative of all users. It breaks down if you include checkpointing to object storage (S3) or HDFS, or to ReadWriteMany volumes like NFS. In all of those cases there will be a problem with correctness. For folks that need it right away, the same restart feature can be realized safely using an approach like the spark-operator, without any of this hassle. So why are we trying to fit this into Spark with caveats around how volumes should be used to ensure fencing? This seems more error prone and harder to explain, and I can't see the gain from it. One way forward is proposing to the k8s community a new option for Jobs that allows us to get fencing from the k8s apiserver through deterministic names. I think that would be a good way forward. |
|
+1 on what @foxish said. If using a Job is the right way to go ultimately, it's good to open discussion with sig-apps on adding an option to the Job API & controller to use deterministic pod names as well as to offer the exactly-once semantic. Spark probably is not the only use case needing such a semantic guarantee. |
|
Thanks for the responses, I learned a lot from this:) I am going to close this PR for now, and maybe collaborate on the Kubernetes ticket raised by this PR. Thanks. |
What changes were proposed in this pull request?
The current implementation of the Spark driver on Kubernetes is not resilient to node failures, as the driver is implemented as a Pod. In case of a node failure Kubernetes terminates the pods that were running on that node and does not reschedule them to any of the other nodes of the cluster. If the driver is implemented as a Kubernetes Job, it will be rescheduled to another node.
When the driver is terminated, its executors (which may run on other nodes) are terminated by Kubernetes with some delay by Kubernetes garbage collection. This can lead to concurrency issues where the re-spawned driver tries to create new executors with the same names as executors that are still in the middle of being cleaned up by Kubernetes garbage collection. To solve this issue, the executor name must be made unique for each driver instance. For example: networkwordcount-1519301591265-usmj-exec-1
How was this patch tested?
This patch was tested manually.
Submitted a Spark application to a cluster with three nodes:
- The Spark driver networkwordcount-1519301591265-driver is a Kubernetes Job that manages the networkwordcount-1519301591265-driver-mszl2 pod.
- Shut down the node where the driver pod was running.
- The Spark driver Kubernetes Job rescheduled the driver pod as networkwordcount-1519301591265-driver-dwvkf.
Please review http://spark.apache.org/contributing.html before opening a pull request.