
Conversation

@Rafnel (Contributor) commented Nov 7, 2023

What changes were proposed in this pull request?

Problem
If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail. The error will look something like this:

WARNING - The process cannot access the file because it is being used by another process.
WARNING - The system cannot find the file C:\Users\CURRENTUSER\AppData\Local\Temp\spark-class-launcher-output-21229.txt.
WARNING - The process cannot access the file because it is being used by another process.

Windows' %RANDOM% is seeded with 1-second granularity. We often start ~20 instances at the same time daily on Windows and encounter this bug on a weekly basis.

Proposed Fix
Instead of relying on %RANDOM%, which has poor granularity, use PowerShell to generate a GUID and append it to the end of the temp file name. We have been using this in production for around 2-3 months and have not encountered this bug since.
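For reference, the core of the change is roughly the following sketch (the FOR line is taken from the diff later in this thread; the exact LAUNCHER_OUTPUT line in spark-class2.cmd may differ slightly):

rem Generate a GUID via PowerShell instead of relying on %RANDOM%.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
rem Use the GUID in the launcher output file name so concurrent launches do not collide.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%NEWGUID%.txt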

Why are the changes needed?

My team runs Spark on Windows and we boot up 20+ instances within a few seconds on a daily basis. We encountered this bug weekly and have taken steps to mitigate it without changing the Spark source code like adding a random sleep between 1-300 seconds before starting Spark. Even with a random sleep, 20+ instances have a likelihood of sleeping a similar amount of time and starting at the same time. Also, relying on a random sleep before starting Spark is clunky, unreliable, and not a deterministic way to avoid this issue.

Eventually our team went ahead and edited the code in this .cmd file with this fix. I figured I should make a pull request for this as well.

Does this PR introduce any user-facing change?

no

How was this patch tested?

You can pretty reliably recreate this bug by submitting 30 Spark jobs in Windows using spark-submit. Eventually the Spark launcher will overlap with another Spark launcher and fail.
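For example, a crude reproduction loop might look like this (hypothetical; "my_job.py" is a placeholder for any small Spark job, and spark-submit is assumed to be on the PATH):

rem Start 30 spark-submit processes at nearly the same instant; with the old
rem %RANDOM%-based naming, some launcher output files collide.
for /l %%i in (1,1,30) do start "spark-job-%%i" cmd /c spark-submit my_job.py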

You can pull my fixed spark-class2.cmd and try this again and there should be no incidence of this bug.

Was this patch authored or co-authored using generative AI tooling?

no

…tiple instances of Spark within a second fails

Windows' %RANDOM% is seeded with 1-second granularity. If you attempt to start multiple Spark instances within the same second, there is a high likelihood that the spark-class-launcher-output file will have the same name for multiple instances, causing the Spark launcher to fail.

Although not a 100% fix for this bug, appending the Windows %TIME% to the end of the file name increases the granularity from 1 second to 10 ms, making it much less likely that two Spark instances overlap.
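Concretely, that earlier revision named the file as later quoted in this thread:

rem Earlier revision of this PR: append %TIME% (with ":" replaced by "0") for 10 ms granularity.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt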
@Rafnel (Contributor, Author) commented Nov 7, 2023

I enabled GitHub Actions on my forked repo, but it's not clear to me how to re-kick the failed build now that I've done that. Perhaps a maintainer will know how to do this. I haven't used GitHub's CI/CD before.

@Rafnel (Contributor, Author) commented Nov 7, 2023

I also wanted to note that someone attempted to fix this issue in 2019 with PR #25076, and it did get merged, but it does not appear to fix the issue. We are using Spark 3.2.1, which does have #25076 in spark-class2.cmd, yet we encounter this bug all the same.

@HyukjinKwon (Member) left a comment:

The change itself seems fine. Would be great if a committer could test this patch on Windows before merging it, since we don't have CI for this (we do test SparkR on Windows, but not the others).

@Rafnel (Contributor, Author) commented Nov 20, 2023

Wanted to chime in again and note that my team has been using this patched version of spark-class2.cmd to launch our Spark instances for a couple of weeks now and has not encountered the aforementioned bug, or any other launch issues, since.

Obviously this change doesn't *fix* the underlying issue, as you could still launch 2 Spark instances within the same 10 ms window and encounter this bug, but that is much less likely than launching 2 Spark instances within the same second. My team launches around 20 instances on Windows within a couple of seconds of each other every day, and we were seeing this issue almost daily until this patch.

@HyukjinKwon HyukjinKwon changed the title [SPARK-23015] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure Nov 22, 2023
@HyukjinKwon (Member) commented:
Let me ask the dev mailing list and see if we can have others test this patch

@LuciferYang (Contributor) commented:
friendly ping @panbingkun , if convenient, such as through offline communication, please help to verify this patch on Windows. Thanks ~

@panbingkun (Contributor) commented:
> friendly ping @panbingkun , if convenient, such as through offline communication, please help to verify this patch on Windows. Thanks ~

Okay, I'll verify it a little later.

@github-actions (bot) commented:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 14, 2024
@panbingkun (Contributor) commented:
I'm very sorry, I lost track of it. I will verify it on a Windows machine this weekend.

@LuciferYang LuciferYang removed the Stale label Mar 14, 2024
@LuciferYang (Contributor) commented:
remove the Stale tag

@Rafnel (Contributor, Author) commented Mar 14, 2024

Hello all, I'm going to make one final edit on this PR. We found that set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt still resulted in the occasional race-condition error, as we launch tens of Spark instances at the same time each day on Windows; %RANDOM%%TIME::=0% still has a small chance of overlapping. Our team found that using a GUID at the end of the temp file name completely fixed this bug and removed the need for %RANDOM% and %TIME% entirely.

We've been using this new approach for months now without a single issue. Apologies for not updating the PR sooner.

@Rafnel (Contributor, Author) commented Mar 14, 2024

Alright, that should be my final edit. Like I said, we've had no issues since applying this fix a few months ago and we have since increased the amount of concurrent Spark instances that we launch on Windows.

@Rafnel Rafnel changed the title [SPARK-23015][WINDOWS] Mitigate bug in Windows where starting multiple Spark instances within the same second causes a failure [SPARK-23015][WINDOWS] Fix bug in Windows where starting multiple Spark instances within the same second causes a failure Mar 14, 2024
rem SPARK-28302: %RANDOM% would return the same number if we call it immediately after the last
rem call, so we should make sure to generate a unique file name to avoid processes colliding by
rem writing into the same file concurrently.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
A reviewer (Contributor) commented on the diff:

If we use the GUID mechanism, do we still need this logic?

:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen

A reviewer (Contributor) commented:

In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures.
Can we eliminate this strong dependence?

@Rafnel (Contributor, Author) replied:

> If we use the GUID mechanism, do we still need this logic?
>
> :gen
> ...
> if exist %LAUNCHER_OUTPUT% goto :gen

You are correct, we shouldn't need this anymore. I can remove it.

@Rafnel (Contributor, Author) replied:

> In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?

As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). PowerShell is available on Windows 7+ and Windows Server 2008 R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably start multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.

@Rafnel (Contributor, Author) commented:

Updated to remove the goto statement below.

:gen
...
if exist %LAUNCHER_OUTPUT% goto :gen

A reviewer (Contributor) replied:

> In addition, the current logic relies strongly on PowerShell. I am worried that the lack of PowerShell on older versions of Windows will cause Spark launch failures. Can we eliminate this strong dependence?

> As for this concern, I think this is the most lightweight way to generate a GUID in a .cmd script (without installing any dependencies). PowerShell is available on Windows 7+ and Windows Server 2008 R2+. I would be surprised if there were users running Spark on Windows XP/Vista or Server versions older than 2008. In its current state, you can't reliably start multiple Spark instances on Windows, so I think it's a worthwhile tradeoff.

Perhaps we can first check whether PowerShell is installed? If not, use the original logic (until we find a simpler way to generate a GUID); if it is installed, use the PowerShell-based logic to generate the GUID.

Or prompt the user to install PowerShell if it is not installed?

@Rafnel (Contributor, Author) replied:

Updated to add a simple check for powershell.exe; if it does not exist, fall back to using the Windows %RANDOM%, which is the current behavior of this script.
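A minimal sketch of what that check might look like (an assumed shape; the exact lines in the PR may differ):

rem Check whether powershell.exe is available on the PATH.
where powershell.exe >nul 2>&1
if %ERRORLEVEL% neq 0 goto fallback

rem PowerShell found: generate a GUID for a collision-free file name.
FOR /F %%a IN ('POWERSHELL -COMMAND "$([guid]::NewGuid().ToString())"') DO (SET NEWGUID=%%a)
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%NEWGUID%.txt
goto gendone

:fallback
rem PowerShell missing: fall back to the original %RANDOM%-based name.
set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt

:gendone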

@panbingkun (Contributor) commented Mar 16, 2024

My Windows version is Windows 10. This can indeed generate a unique output file name (I added some print logs), as shown in the screenshots:

[screenshots of the unique generated launcher-output file names omitted]

And spark-shell starts normally.

@Rafnel (Contributor, Author) commented Apr 29, 2024

Bumping this PR - any other concerns?

@panbingkun (Contributor) commented:
I think there is no problem with the overall logic, but I am a little worried about the UUID generation. I suggest using Python to generate the UUID, because our native logic already uses Python to resolve SPARK_HOME (snippet below; a sketch of the Python-based UUID idea follows it). This would eliminate the dependence on an external tool. WDYT? @HyukjinKwon @LuciferYang

set FIND_SPARK_HOME_PYTHON_SCRIPT=%~dp0find_spark_home.py

rem Default to standard python3 interpreter unless told otherwise
set PYTHON_RUNNER=python3
rem If PYSPARK_DRIVER_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_DRIVER_PYTHON%"=="x" (
  set PYTHON_RUNNER=%PYSPARK_DRIVER_PYTHON%
)
rem If PYSPARK_PYTHON is set, it overwrites the python version
if not "x%PYSPARK_PYTHON%"=="x" (
  set PYTHON_RUNNER=%PYSPARK_PYTHON%
)

rem If there is python installed, trying to use the root dir as SPARK_HOME
where %PYTHON_RUNNER% > nul 2>&1
if %ERRORLEVEL% neq 0 (
  if not exist %PYTHON_RUNNER% (
    if "x%SPARK_HOME%"=="x" (
      echo Missing Python executable '%PYTHON_RUNNER%', defaulting to '%~dp0..' for SPARK_HOME ^
        environment variable. Please install Python or specify the correct Python executable in ^
        PYSPARK_DRIVER_PYTHON or PYSPARK_PYTHON environment variable to detect SPARK_HOME safely.
      set SPARK_HOME=%~dp0..
    )
  )
)

rem Only attempt to find SPARK_HOME if it is not set.
if "x%SPARK_HOME%"=="x" (
  if not exist "%FIND_SPARK_HOME_PYTHON_SCRIPT%" (
    rem If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
    rem need to search the different Python directories for a Spark installation.
    rem Note only that, if the user has pip installed PySpark but is directly calling pyspark-shell or
    rem spark-submit in another directory we want to use that version of PySpark rather than the
    rem pip installed version of PySpark.
    set SPARK_HOME=%~dp0..
  ) else (
    rem We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
    for /f "delims=" %%i in ('%PYTHON_RUNNER% "%FIND_SPARK_HOME_PYTHON_SCRIPT%"') do set SPARK_HOME=%%i
  )
)

@HyukjinKwon (Member) commented:
Problem is that Python is not bundled with Windows, IIRC.

@HyukjinKwon (Member) left a comment:

I am fine with this given that it was tested properly.

@Rafnel (Contributor, Author) commented May 28, 2024

Thanks folks, and sorry for the delay; I got sidetracked and kept forgetting to check this issue. As mentioned, my company has been using this exact code in our production process for some months now; it starts up tens of Spark instances in parallel to run a host of parallel jobs. As far as I'm concerned it has been tested thoroughly, but I'm not sure if you all have any other standard tests that you run. Given that this is a launch script that doesn't get edited often, I imagine it doesn't have a standard testing procedure.

Also, I don't have permissions to merge this, so if one of the reviewers can merge it, that would be great. Thanks!

@HyukjinKwon (Member) commented:
Merged to master.
