-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-32175][CORE] Fix the order between initialization for ExecutorPlugin and starting heartbeat thread #29002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #124964 has finished for PR 29002 at commit
|
|
retest this please. |
|
Test build #124993 has finished for PR 29002 at commit
|
|
retest this please. |
|
Test build #124994 has finished for PR 29002 at commit
|
|
retest this please. |
|
Test build #124999 has finished for PR 29002 at commit
|
|
retest this please. |
|
Test build #125110 has finished for PR 29002 at commit
|
Ngone51
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this from a real issue?
Shall we do the same thing for DriverPlugin or we already did?
|
Test build #125316 has finished for PR 29002 at commit
|
|
As soon as we heartbeat to driver, we will start getting task assignment, etc - how does the interact with a plugin which has not yet completed initialization ? |
It happens not in the production but an environment for experiment. |
Yeah that's right.
This can be a reasonable compromise. Another solution can be to perform initialization before registering executors and then it's guaranteed that plugins are initialized when tasks start running. |
Isn't that not what is implicitly handling now ? And causing issues if plugin takes a long time to initialize ? |
I wonder if this is correct. Executor only gets task assignment after it sends spark/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala Lines 156 to 158 in 3946b24
After heartbeater started, the executor can send heartbeat to the driver asynchronically and the initialization of the plugin can keep going after heartbeater starting. It doesn't matter if it takes a long time because we'd send heartbeat now but no task assignment since |
|
So If we reorder the initialization, it should be placed before the registration, not after heartbeater started right? |
|
You are right @Ngone51, I misread the proposed change. |
Ideally, that way should work as well. However, it seems plugin initialization relies on some internal instances of BTW, I think your current implementation can already work as I mentioned above. |
Yes, if we reorder the initialization with another way, the initialization would be put on executor backends like
Yeah, I misunderstood. |
|
retest this please. |
|
Test build #125591 has finished for PR 29002 at commit
|
| } | ||
|
|
||
| test("SPARK-32175: Plugin initialization should start after heartbeater started") { | ||
| val tempDir = Utils.createTempDir() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean it up at the end of the test? (though I know it will be cleaned by shutdown hook anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to use withTempDir.
| val importStatements = | ||
| """ | ||
| |import java.util.Map; | ||
| |import org.apache.spark.api.plugin.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using the qualified class name to avoid adding the new parameter preClassDefinitionBlock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's better.
| implementsClasses: Seq[String] = Seq.empty, | ||
| extraCodeBody: String = ""): File = { | ||
| val extendsText = Option(baseClass).map { c => s" extends ${c}" }.getOrElse("") | ||
| val implementsText = implementsClasses.map(", " + _).mkString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe?
| val implementsText = implementsClasses.map(", " + _).mkString | |
| val implementsText = | |
| s"implements ${implementsClasses ++ Seq("java.io.Serializable").mkString(", ")}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. To be much simpler, I'll make it (implementsClasses :+ "java.io.Serializable").mkString(", ").
|
|
||
| metricsPoller.start() | ||
|
|
||
| // Plugins need to load using a class loader that includes the executor's user classpath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add comments to explain why we need to initialize plugin after heartbeater?
|
@Ngone51 Thanks for the feedback. I'll fix them. |
Tweaked heartbeat interval for the testcase.
|
Test build #125789 has finished for PR 29002 at commit
|
|
Test build #126105 has finished for PR 29002 at commit
|
|
cc: @vanzin @tgravescs |
core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala
Outdated
Show resolved
Hide resolved
|
Yeah I think this is ok, especially since normally if your plugin is less then heartbeat time there shouldn't be any difference. I would like to test this with a couple scenarios though. When we first put this in, I brought up that the plugin could block other executor functionality by running a long time, one thought was we could put plugin into another thread so it didn't block the executor. I think the thought at the time was if you have a long running executor plugin, the user should handle that by doing it in another thread or something. Now you could make the argument you have something long running in the plugin and you want to block the executor for other initialization. The heartbeat is 10seconds by default and I thought the timeout was the network timeout by default (120s), which seems like a really long time. I guess if you change those timeouts, but even then seems like a long time. |
|
sorry had stuff come up internally, I'll look more today. |
|
ok looked at this a bit more and tested a few scenarios, I think this is fine, I would like it put right after heartbeater unless we have reason and would be nice to get test to run faster. |
|
Test build #126520 has finished for PR 29002 at commit
|
|
retest this please. |
|
Test build #126533 has finished for PR 29002 at commit
|
tgravescs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good pending Jenkins
|
it looks like arrow tests failing which shouldn't be related, kicked once more |
|
It's strange that the reason of the R test failure implies that deprecated function is used. But it's resolved in #29252 . |
|
Test build #126742 has finished for PR 29002 at commit
|
|
LGTM |
…Plugin and starting heartbeat thread ### What changes were proposed in this pull request? This PR changes the order between initialization for ExecutorPlugin and starting heartbeat thread in Executor. ### Why are the changes needed? In the current master, heartbeat thread in a executor starts after plugin initialization so if the initialization takes long time, heartbeat is not sent to driver and the executor will be removed from cluster. ### Does this PR introduce _any_ user-facing change? Yes. Plugins for executors will be allowed to take long time for initialization. ### How was this patch tested? New testcase. Closes #29002 from sarutak/fix-heartbeat-issue. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Thomas Graves <[email protected]> (cherry picked from commit 9be0883) Signed-off-by: Thomas Graves <[email protected]>
|
finally, tests pass, merged to master and branch-3.0. thanks @sarutak |
|
Would you guys mind taking a look for the flaky tests added here? It seems failing in GitHub Actions as below: https://github.com/apache/spark/runs/928145740 The failure seems pretty frequent. |
|
@HyukjinKwon I'll look into that. Should we ignore the test first? |
|
Personally I would be fine with just removing the test. I'm not sure how much benefit it really adds for the time it takes to run the test and if its being flaky I'm guessing you are going to have to increase the times. |
|
Yeah, I wonder we need to increase the times. The testcase is to detect regressions which can happen in the future but I've left the comment about the order so removing the test can be fine. |
|
Sure, removing sounds fine and simpler to me. |
|
O.K, I'll make a PR for removing it. |
### What changes were proposed in this pull request? This PR removes a test added in SPARK-32175(#29002). ### Why are the changes needed? That test is flaky. It can be mitigated by increasing the timeout but it would rather be simpler to remove the test. See also the [discussion](#29002 (comment)). ### Does this PR introduce _any_ user-facing change? No. Closes #29314 from sarutak/remove-flaky-test. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Kousuke Saruta <[email protected]>
### What changes were proposed in this pull request? This PR removes a test added in SPARK-32175(#29002). ### Why are the changes needed? That test is flaky. It can be mitigated by increasing the timeout but it would rather be simpler to remove the test. See also the [discussion](#29002 (comment)). ### Does this PR introduce _any_ user-facing change? No. Closes #29314 from sarutak/remove-flaky-test. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Kousuke Saruta <[email protected]> (cherry picked from commit 9d7b1d9) Signed-off-by: Kousuke Saruta <[email protected]>
What changes were proposed in this pull request?
This PR changes the order between initialization for ExecutorPlugin and starting heartbeat thread in Executor.
Why are the changes needed?
In the current master, heartbeat thread in a executor starts after plugin initialization so if the initialization takes long time, heartbeat is not sent to driver and the executor will be removed from cluster.
Does this PR introduce any user-facing change?
Yes. Plugins for executors will be allowed to take long time for initialization.
How was this patch tested?
New testcase.