
Conversation

@Rafnel (Contributor) commented Nov 7, 2023

What changes were proposed in this pull request?

Problem
If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail. The error will look something like this:

WARNING - The process cannot access the file because it is being used by another process.
WARNING - The system cannot find the file C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
WARNING - The process cannot access the file because it is being used by another process.

Windows' %RANDOM% is seeded with 1-second granularity. We often start ~20 instances at the same time daily on Windows and encounter this bug on a weekly basis.

Proposed Fix
Instead of relying on %RANDOM%, which has poor granularity, use PowerShell to generate a GUID and append it to the end of the temp file name. We have been using this in production for around 2-3 months and have not encountered this bug since.
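For reference, the core of the change is roughly the following sketch (the FOR line is taken from the diff later in this thread; the exact LAUNCHER_OUTPUT line in spark-class2.cmd may differ slightly):

rem Generate a GUID via PowerShell instead of relying on %RANDOM%.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
rem Use the GUID in the launcher output file name so concurrent launches do not collide.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%NEWGUID%.txt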

Why are the changes needed?

My team runs Spark on Windows and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and have taken steps to mitigate it without changing the Spark source code like adding a random sleep between 1-300 seconds before starting Spark. Even with a random sleep, 20+ instances have a likelihood of sleeping a similar amount of time and starting at the same time. Also, relying on a random sleep before starting Spark is clunky, unreliable, and not a deterministic way to avoid this issue.

Eventually our team went ahead and edited the code in this .cmd file with this fix. I figured I should make a pull request for this as well.

Does this PR introduce any user-facing change?

no

How was this patch tested?

You can pretty reliably recreate this bug by submitting 30 Spark jobs in Windows using spark-submit. Eventually the Spark launcher will overlap with another Spark launcher and fail.
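For example, a crude reproduction loop might look like this (hypothetical; "my_job.py" is a placeholder for any small Spark job, and spark-submit is assumed to be on the PATH):

rem Start 30 spark-submit processes at nearly the same instant; with the old
rem %RANDOM%-based naming, some launcher output files collide.
for /l %%i in (1,1,30) do start "spark-job-%%i" cmd /c spark-submit my_job.py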

You can pull my fixed spark-class2.cmd and try this again and there should be no incidence of this bug.

Was this patch authored or co-authored using generative AI tooling?

no

…tiple instances of Spark within a second fails

Windows' %RANDOM% is seeded with 1-second granularity. If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail.

Although not a 100% fix for this bug, appending the Windows %TIME% to the end of the file name increases the granularity from 1 second to 10 ms, making it much less likely that two Spark instances overlap.
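Concretely, that earlier revision named the file as later quoted in this thread:

rem Earlier revision of this PR: append %TIME% (with ":" replaced by "0") for 10 ms granularity.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt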
@Rafnel (Contributor, Author) commented Nov 7, 2023

I enabled GitHub Actions on my forked repo, but it's not clear to me how to re-kick the failed build now that I've done that. Perhaps a maintainer will know how to do this. I haven't used GitHub's CI/CD before.

@Rafnel (Contributor, Author) commented Nov 7, 2023

I also wanted to note that someone attempted to fix this issue in 2019 with PR #25076, and it did get merged, but it does not appear to fix the issue. We are using Spark 3.2.1, which does have #25076 in spark-class2.cmd, yet we encounter this bug all the same.

@HyukjinKwon (Member) left a comment:

The change itself seems fine. Would be great if a committer could test this patch on Windows before merging it, since we don't have CI for this (we do test SparkR on Windows, but not the others).

@Rafnel (Contributor, Author) commented Nov 20, 2023

Wanted to chime in again and note that my team has been using this patched version of spark-class2.cmd to launch our Spark instances for a couple of weeks now and has not encountered the aforementioned bug, or any other launch issues, since.

Obviously this change doesn't *fix* the underlying issue, as you could still launch 2 Spark instances within the same 10 ms window and encounter this bug, but that is much less likely than launching 2 Spark instances within the same second. My team launches around 20 instances on Windows within a couple of seconds of each other every day, and we were seeing this issue almost daily until this patch.

@HyukjinKwon HyukjinKwon changed the title [SPARK-23015] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure Nov 22, 2023
@HyukjinKwon (Member) commented:
Let me ask the dev mailing list and see if we can have others test this patch

@LuciferYang (Contributor) commented:
friendly ping @panbingkun , if convenient, such as through offline communication, please help to verify this patch on Windows. Thanks ~

@panbingkun (Contributor) commented:
> friendly ping @panbingkun , if convenient, such as through offline communication, please help to verify this patch on Windows. Thanks ~

Okay, I'll verify it a little later.

@github-actions (bot) commented:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 14, 2024
@panbingkun (Contributor) commented:
I'm very sorry, I lost track of it. I will verify it on a Windows machine this weekend.

@LuciferYang LuciferYang removed the Stale label Mar 14, 2024
@LuciferYang (Contributor) commented:
remove the Stale tag

@Rafnel (Contributor, Author) commented Mar 14, 2024

Hello all, I'm going to make one final edit on this PR. We found that set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt still resulted in the occasional race-condition error, as we launch tens of Spark instances at the same time each day on Windows; %RANDOM%%TIME::=0% still has a small chance of overlapping. Our team found that using a GUID at the end of the temp file name completely fixed this bug and removed the need for %RANDOM% and %TIME% entirely.

We've been using this new approach for months now without a single issue. Apologies for not updating the PR sooner.

@Rafnel (Contributor, Author) commented Mar 14, 2024

Alright, that should be my final edit. Like I said, we've had no issues since applying this fix a few months ago and we have since increased the amount of concurrent Spark instances that we launch on Windows.

@Rafnel Rafnel changed the title [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure Mar 14, 2024
rem SPARK-28302: %RANDOM% would return the same number if we call it immediately after the last
rem call, so we should make sure to generate a unique file name to avoid processes colliding by
rem writing into the same file concurrently.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
A reviewer (Contributor) commented on the diff:

If we use the GUID mechanism, do we still need this logic?

:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen

A reviewer (Contributor) commented:

In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures.
Can we eliminate this strong dependence?

@Rafnel (Contributor, Author) replied:

> If we use the GUID mechanism, do we still need this logic?
>
> :gen
> ...
> if exist %LAUNCHER_OUTPUT% goto :gen

You are correct, we shouldn't need this anymore. I can remove it.

@Rafnel (Contributor, Author) replied:

> In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?

As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). PowerShell is available on Windows 7+ and Windows Server 2008 R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably start multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.

@Rafnel (Contributor, Author) commented:

Updated to remove the goto statement below.

:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen

A reviewer (Contributor) replied:

> In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?

> As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). PowerShell is available on Windows 7+ and Windows Server 2008 R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably start multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.

Perhaps we can first check whether PowerShell is installed? If not, use the original logic (until we find a simpler way to generate a GUID); if it is installed, use the PowerShell-based logic to generate the GUID.

Or prompt the user to install PowerShell if it is not installed?

@Rafnel (Contributor, Author) replied:

Updated to add a simple check for powershell.exe; if it does not exist, fall back to using the Windows %RANDOM%, which is the current behavior of this script.
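A minimal sketch of what that check might look like (an assumed shape; the exact lines in the PR may differ):

rem Check whether powershell.exe is available on the PATH.
where powershell.exe >nul 2>&1
if %ERRORLEVEL% neq 0 goto fallback

rem PowerShell found: generate a GUID for a collision-free file name.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%NEWGUID%.txt
goto gendone

:fallback
rem PowerShell missing: fall back to the original %RANDOM%-based name.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt

:gendone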

@panbingkun (Contributor) commented Mar 16, 2024

My Windows version is Windows 10. This can indeed generate a unique output file name (I added some print logs), as shown in the screenshots:

[screenshots of the unique generated launcher-output file names omitted]

And spark-shell starts normally.

@Rafnel (Contributor, Author) commented Apr 29, 2024

Bumping this PR - any other concerns?

@panbingkun (Contributor) commented:
I think there is no problem with the overall logic, but I am a little worried about the UUID generation. I suggest using Python to generate the UUID, because our native logic already uses Python to resolve SPARK_HOME (snippet below; a sketch of the Python-based UUID idea follows it). This would eliminate the dependence on an external tool. WDYT? @HyukjinKwon @LuciferYang

set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py

rem Default to standard python3 interpreter unless told otherwise
set PYTHON_RUNNER=python3
rem If PYSPARK_DRIVER_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_DRIVER_PYTHON%"=="x" (
  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
)
rem If PYSPARK_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_PYTHON%"=="x" (
  set PYTHON_RUNNER=%PYSPARK_PYTHON%
)

rem If there is python installed, trying to use the root dir as SPARK_HOME
where %PYTHON_RUNNER% > nul 2>&1
if %ERRORLEVEL% neq 0 (
  if not exist %PYTHON_RUNNER% (
    if "x%SPARK_HOME%"=="x" (
      echo Missing Python executable '%PYTHON_RUNNER%', defaulting to '%~dp0..' for SPARK_HOME ^
        environment variable. Please install Python or specify the correct Python executable in ^
        PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
      set SPARK_HOME=%~dp0..
    )
  )
)

rem Only attempt to find SPARK_HOME if it is not set.
if "x%SPARK_HOME%"=="x" (
  if not exist "%FIND_SPARK_HOME_PYTHON_SCRIPT%" (
    rem If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
    rem need to search the different Python directories for a Spark installation.
    rem Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
    rem spark-submit in another directory we want to use that version of PySpark rather than the
    rem pip installed version of PySpark.
    set SPARK_HOME=%~dp0..
  ) else (
    rem We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
    for /f "delims=" %%i in ('%PYTHON_RUNNER% "%FIND_SPARK_HOME_PYTHON_SCRIPT%"') do set SPARK_HOME=%%i
  )
)

@HyukjinKwon (Member) commented:
Problem is that Python is not bundled with Windows, IIRC.

@HyukjinKwon (Member) left a comment:

I am fine with this given that it was tested properly.

@Rafnel (Contributor, Author) commented May 28, 2024

Thanks folks, and sorry for the delay; I got sidetracked and kept forgetting to check this issue. As mentioned, my company has been using this exact code in our production process for some months now; it starts up tens of Spark instances in parallel to run a host of parallel jobs. As far as I'm concerned it has been tested thoroughly, but I'm not sure if you all have any other standard tests that you run. Given that this is a launch script that doesn't get edited often, I imagine it doesn't have a standard testing procedure.

Also, I don't have permissions to merge this, so if one of the reviewers can merge it, that would be great. Thanks!

@HyukjinKwon (Member) commented:
Merged to master.
