[SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure #43706
…tiple instances of Spark within a second fails Windows' %RANDOM% is seeded with 1-second granularity. If you attempt to start multiple Spark instances within a second there is a high likelihood that this spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail. Although not a 100% fix to this bug, appending the Windows %TIME% to the end of this file name increases the granularity from 1 second to 10 ms, making the likelihood of launching two overlapping Spark instances less probable.
|
I enabled github actions on my forked repo but it's not clear to me how to re-kick the failed build now that I've done that. Perhaps a maintainer will know how to do this. I haven't used github's CI/CD before. |
HyukjinKwon
left a comment
The change itself seems fine. It would be great if a committer could test this patch on Windows before merging it, since we don't have CI for this (we test SparkR on Windows, but nothing else).
|
Wanted to chime in again and note that my team has been using this patched version of spark-class2.cmd to launch our Spark instances for a couple of weeks now and has not encountered the aforementioned bug, or any other launch issues, since. Obviously this change doesn't //fix// the underlying issue, as you could still launch 2 Spark instances within the same 10 ms period and hit this bug, but that is much less likely than launching 2 Spark instances within the same second. My team launches around 20 instances on Windows within a couple of seconds of each other each day and was seeing this issue almost daily until this patch. |
|
Let me ask the dev mailing list and see if we can have others test this patch |
|
Friendly ping @panbingkun: if convenient (for example through offline communication), please help verify this patch on Windows. Thanks ~ |
Okay, I'll verify it a little later. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
I'm very sorry, I lost it. I will verify it on a Windows machine this weekend. |
|
remove the |
|
Hello all, I'm going to do one final edit on this PR. We found that We've been using this new approach for months now without a single issue. Apologies for not updating the PR sooner. |
…ions when booting multiple Spark instances
|
Alright, that should be my final edit. Like I said, we've had no issues since applying this fix a few months ago, and we have since increased the number of concurrent Spark instances that we launch on Windows. |
bin/spark-class2.cmd (outdated)
rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call,
rem so we should make it sure to generate unique file to avoid process collision of writing into
rem the same file concurrently.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
If we use the GUID mechanism, do we still need this logic?
:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen
In addition, the current logic relies heavily on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures.
Can we eliminate this strong dependence?
> If we use the GUID mechanism, do we still need this logic?
> :gen ... if exist %LAUNCHER_OUTPUT% goto :gen
You are correct, we shouldn't need this anymore. I can remove it.
> In addition, the current logic relies heavily on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?
As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). Powershell is available in Windows 7+ and Windows Server 2008R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably instantiate multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.
Updated to remove below goto statement.
:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen
> In addition, the current logic relies heavily on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?
> As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). Powershell is available in Windows 7+ and Windows Server 2008R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably instantiate multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.
Perhaps we can first check whether Powershell has been installed. If not, use the original logic (until we find a simpler way to generate GUID), and if it is installed, use the logic based on Powershell to generate GUID?
Or prompt to install Powershell if it is not installed?
Updated to add a simple check for powershell.exe; if it does not exist, the script falls back to Windows %RANDOM%, which is the current behavior of this script.
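For readers who want to see the shape of that check outside of batch syntax, here is a minimal Python sketch of the same decision: prefer a GUID when `powershell.exe` can be found, otherwise fall back to a small random number. It is an analogue only, not the code in spark-class2.cmd. The function name `pick_suffix` and the injectable `which` parameter are hypothetical, and `uuid.uuid4()` stands in for `[guid]::NewGuid()`.

```python
import random
import shutil
import uuid
from typing import Callable, Optional


def pick_suffix(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return a suffix for the launcher output file name.

    Hypothetical helper: mirrors the "check for powershell.exe, else
    fall back to %RANDOM%" decision described in the comment above.
    """
    if which("powershell.exe") is not None:
        # Stand-in for PowerShell's [guid]::NewGuid().ToString()
        return str(uuid.uuid4())
    # Stand-in for %RANDOM%, which yields an integer in 0..32767
    return str(random.randint(0, 32767))


# Force each branch by injecting a fake lookup function.
print(pick_suffix(lambda name: None))                 # fallback branch
print(pick_suffix(lambda name: "C:\\fake\\ps.exe"))   # GUID branch
```

Passing the lookup function as a parameter keeps both branches easy to exercise on a machine that does or does not have PowerShell installed.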
… before using it and otherwise use Windows %RANDOM%
|
Bumping this PR - any other concerns? |
|
I think there is no problem with the overall logic, but I am a little worried about the Lines 21 to 59 in da92293
|
|
problem is that python is not bundled with Windows IIRC .. |
HyukjinKwon
left a comment
I am fine with this given that it was tested properly.
|
Thanks folks, sorry for the delay, got sidetracked and kept forgetting to check this issue again. As mentioned, my company has been using this exact code in our production process for some months now which starts up tens of instances of Spark in parallel to run a host of parallel jobs. As far as I'm concerned, it's been tested thoroughly given this, but I'm not sure if you all have any other standard tests that you run. Given that this is a launch script that doesn't get edited often, I imagine it doesn't have a standard testing procedure. Also I don't have permissions to merge this, so if one of the reviewers can merge it, that would be great. thanks! |
|
Merged to master. |


What changes were proposed in this pull request?
Problem
If you attempt to start multiple Spark instances within a second there is a high likelihood that this spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail. The error will look something like this:
Windows' %RANDOM% is seeded with 1-second granularity. We often start ~20 instances at the same time daily on Windows and encounter this bug on a weekly basis.
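The mechanism can be illustrated with a small Python model. This is a sketch under stated assumptions: it does not reproduce cmd.exe's actual generator, it only models "a generator seeded from the clock truncated to whole seconds", which is the behaviour described above. The helper names and the timestamps are made up for illustration.

```python
import random


def first_random_value(start_time_seconds: float) -> int:
    """Model of a generator seeded with 1-second granularity.

    Assumption: this is NOT cmd.exe's real algorithm, only a model of
    the seeding behaviour described in the PR. 0..32767 is the range
    of %RANDOM%.
    """
    rng = random.Random(int(start_time_seconds))  # sub-second part is dropped
    return rng.randint(0, 32767)


def launcher_output_name(start_time_seconds: float) -> str:
    value = first_random_value(start_time_seconds)
    return f"spark-class-launcher-output-{value}.txt"


# Two launches 300 ms apart, inside the same wall-clock second.
a = launcher_output_name(1_700_000_000.10)
b = launcher_output_name(1_700_000_000.40)
print(a == b)  # True: same seed, therefore the same file name
```

Both launches derive the same seed, so both try to write the same temp file, which is the collision the launcher trips over.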
Proposed Fix
Instead of relying on %RANDOM% which has poor granularity, use Powershell to generate a GUID and append that to the end of the temp file name. We have been using this in production for around 2-3 months and have never encountered this bug since.
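As a rough illustration of why a GUID suffix sidesteps the problem, the following Python sketch generates many file names in a tight loop and counts the distinct ones. `uuid.uuid4()` is used here as an analogue of PowerShell's `[guid]::NewGuid()`. The helper name `guid_output_name` is hypothetical, and the script is not part of the patch.

```python
import uuid


def guid_output_name() -> str:
    # A random 128-bit identifier does not depend on the wall clock,
    # so names generated within the same second still differ.
    return f"spark-class-launcher-output-{uuid.uuid4()}.txt"


# Generate names as fast as possible, far faster than real launches.
names = {guid_output_name() for _ in range(10_000)}
print(len(names))
```

A set drops duplicates, so the printed count equals the number of distinct names produced.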
Why are the changes needed?
My team runs Spark on Windows and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and took steps to mitigate it without changing the Spark source code, such as adding a random sleep of 1-300 seconds before starting Spark. Even with a random sleep, with 20+ instances it is likely that some of them sleep a similar amount of time and start at the same time. Also, relying on a random sleep before starting Spark is clunky, unreliable, and not a deterministic way to avoid this issue.
Eventually our team went ahead and edited the code in this .cmd file with this fix. I figured I should make a pull request for this as well.
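The point about random sleeps can be made concrete with a birthday-problem estimate. The sketch below assumes each instance independently picks a whole number of seconds, uniformly from 300 possibilities, and that two instances collide only if they pick exactly the same second. Both are simplifying assumptions, not measurements from the PR.

```python
from math import prod


def collision_probability(instances: int, slots: int) -> float:
    """Probability that at least two instances pick the same slot."""
    if instances > slots:
        return 1.0  # pigeonhole: a shared slot is unavoidable
    # Probability that every instance lands in a distinct slot.
    p_all_distinct = prod((slots - i) / slots for i in range(instances))
    return 1.0 - p_all_distinct


p = collision_probability(instances=20, slots=300)
print(f"{p:.2f}")
```

Under these assumptions the chance that at least two of 20 instances share a start second comes out a little under one in two, which is consistent with the sleep workaround still failing regularly.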
Does this PR introduce any user-facing change?
no
How was this patch tested?
You can pretty reliably recreate this bug by submitting 30 Spark jobs in Windows using spark-submit. Eventually the Spark launcher will overlap with another Spark launcher and fail.
You can pull my fixed spark-class2.cmd and try this again and there should be no incidence of this bug.
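One way to script that reproduction is a small harness that starts many copies of a command at nearly the same time and collects their exit codes. This is a hypothetical helper, not something shipped with the PR. The runnable demo below launches a harmless Python no-op. For a real run you would pass your own `spark-submit` command line instead, assuming it is on `PATH` and that a failed launch surfaces as a non-zero exit code.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def launch_all(command: Sequence[str], count: int = 30) -> List[int]:
    """Start `count` copies of `command` concurrently; return exit codes."""

    def run(_: int) -> int:
        # capture_output keeps the children's output off the console
        return subprocess.run(list(command), capture_output=True).returncode

    # One worker per launch so all processes start close together.
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(run, range(count)))


# Demo with a harmless command; swap in your spark-submit invocation
# (placeholder, not taken from the PR) to exercise the launcher itself.
codes = launch_all([sys.executable, "-c", "pass"], count=5)
failed = sum(code != 0 for code in codes)
print(f"{failed} of {len(codes)} launches failed")
```

Running the harness once with the original script and once with the patched one gives a simple before/after comparison of the failure count.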
Was this patch authored or co-authored using generative AI tooling?
no