libpython: Avoid race condition when reading region in use_temp_region() by wenzeslaus · Pull Request #638 · OSGeo/grass

wenzeslaus · 2020-05-15T03:53:06Z

When g.region is called without -u flag, it always adjusts and writes back the
computational region. When a parallel process is reading it too, it hits an empty
(or rarely corrupted) file.

In the provided test, this happens when reading the WIND file. At this point, the results
from the test must be expected manually.

When g.region is called without -u flag, it always adjusts and writes back the computational region. When a parallel process is reading it too, it hits an empty (or rarely corrupted) file. In the provided test, this happens when reading the WIND file. At this point, the results from the test must be expected manually.

Return codes of bg procs are not propagated, so something is needed to track processes. Using maps created at the end of each Python script to do that. Parametrize the Bash script to make it reusable also for a potential nesting region correctness test. Also be more robust when the Python script is not available (zero returned on some platforms).

wenzeslaus · 2020-05-17T04:09:57Z

I have completed the test which checks if the parallel scripts are running using raster maps created which adds another complexity and region access, although the resulting maps are checked only for existence. (A correct passing of the region is not the topic of this PR, but the scripts can be used for that.)

It should not show up in the list of failing tests in CI and the number of failing tests should be at least roughly the same as on master (39 on Ubuntu 18.04). You can run it locally in one of these three ways:

# Assuming: cd lib/python/script/
# Default parameters
./testsuite/test_parallel_temp_region.sh
# Custom parameters
./testsuite/test_parallel_temp_region.sh parallel nest 5 200 150 100,200,300 3
# Using the testing framework machinery (as in CI)
python3 -m grass.gunittest.main --grassdata $HOME/grassdata/ --location nc_spm --location-type nc --min-success 80

A failing test (when I remove the -u fix from this patch) looks like this:

...
ERROR: Region file __test_parallel_temp_region//WIND is empty
Traceback (most recent call last):
  File "./data/script_using_temporary_region.py", line 78, in <module>
    main()
  File "./data/script_using_temporary_region.py", line 73, in main
    script=this_file, size=size, remaining=remaining, nesting=nesting, map_name=map_name
  File "./data/script_using_temporary_region.py", line 25, in call_use_temp_region
    gs.use_temp_region()
  File "etc/python/grass/script/core.py", line 1281, in use_temp_region
    run_command("g.region", flags="", save=name, overwrite=True)
  File "etc/python/grass/script/core.py", line 500, in run_command
    return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
  File "etc/python/grass/script/core.py", line 393, in handle_errors
    returncode=returncode)
grass.exceptions.CalledModuleError: Module run g.region g.region --o save=tmp.script_using_temporary_region.py.27369 ended with error
Process ended with non-zero return code 1. See errors in the (error) output.
ERROR: Region file __test_parallel_temp_region//WIND is empty
Traceback (most recent call last):
  File "./data/script_using_temporary_region.py", line 78, in <module>
    main()
  File "./data/script_using_temporary_region.py", line 73, in main
    script=this_file, size=size, remaining=remaining, nesting=nesting, map_name=map_name
  File "./data/script_using_temporary_region.py", line 25, in call_use_temp_region
    gs.use_temp_region()
  File "etc/python/grass/script/core.py", line 1281, in use_temp_region
    run_command("g.region", flags="", save=name, overwrite=True)
  File "etc/python/grass/script/core.py", line 500, in run_command
    return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
  File "etc/python/grass/script/core.py", line 393, in handle_errors
    returncode=returncode)
grass.exceptions.CalledModuleError: Module run g.region g.region --o save=tmp.script_using_temporary_region.py.27434 ended with error
Process ended with non-zero return code 1. See errors in the (error) output.
+ EXPECTED=100
+ g.list type=raster pattern=parallel_tmp_region_parallel_*_size_*_nesting_* mapset=.
+ wc -l
+ NUM=98
+ [ 98 -ne 100 ]
+ echo Pure parallel test: Got 98 but expected 100 maps
+ exit 1
+ cleanup
+ EXIT_CODE=1
+ g.remove type=raster pattern=parallel_tmp_region_parallel_*_size_*_nesting_* -f --quiet
+ g.remove type=raster pattern=parallel_tmp_region_nest_*_size_*_nesting_* -f --quiet
WARNING: No data base element files found
+ exit 1

HuidaeCho

It looks good to me. Just a couple typos and a comment about g.region's behavior.

HuidaeCho · 2020-05-17T12:59:41Z

    """
    name = "tmp.%s.%d" % (os.path.basename(sys.argv[0]), os.getpid())
-    run_command("g.region", save=name, overwrite=True)
+    run_command("g.region", flags="u", save=name, overwrite=True)


I think we should fix g.region so that it updates WIND only when needed. For example, in this case, it just saves the current region (no changes to the region) then why rewrite WIND? It should be a read-only operation. That's for a new PR though.

I agree, but as a heads-up, the code there looks pretty intentional.

wenzeslaus · 2020-05-18T03:25:33Z

Given what is written in Trac 2230 ('g.region -p' writes new WIND file, causes race condition for parallel jobs), this is actually pretty clear. I'm merging. We can have the discussion about g.region elsewhere.

Test introduced in OSGeo#638 uses raster maps to determine if the subprocess was successfully executed. However, in GitHub Actions this fails randomly with a random number of rasters missing. This PR makes the test wait for couple seconds before counting the rasters in case the file system needs to catch up. If there is an actual error, waiting won't fix it, so the test will still do its job. If running and rerunning the tests here is successful, it suggests that it is a good fix at least in the sense of accomodating the given environment. Alternative fix for the test would be to replace Bash by Python which would allow leaving out the raster creation check (which is replacing return code check here) and provide more control. On one hand, we want to make the test run, on the other, we want to have the actual code robust, but maybe writing 50 rasters in parallel in a small VM in GitHub Actions is not a use case we need to cover in code and thus modifying the test is enough.

wenzeslaus added 2 commits May 14, 2020 23:52

wenzeslaus marked this pull request as ready for review May 17, 2020 04:10

wenzeslaus requested review from HuidaeCho and metzm May 17, 2020 04:10

HuidaeCho approved these changes May 17, 2020

View reviewed changes

wenzeslaus added 2 commits May 17, 2020 13:29

Decrease the number of parallel processes in the automated test

a987768

Fix typos and variable name in disabled line

61035af

wenzeslaus removed the request for review from metzm May 18, 2020 03:03

wenzeslaus merged commit 71c8a1f into OSGeo:master May 18, 2020

wenzeslaus deleted the no-region-update-from-use_temp_region branch May 18, 2020 03:20

wenzeslaus mentioned this pull request May 20, 2020

r.import and v.import: Use PID and node for tmp location #653

Merged

wenzeslaus mentioned this pull request Jun 18, 2021

Simplify raster band reference management #1272

Merged

wenzeslaus mentioned this pull request Jun 19, 2021

libpython: Make parallel region test somewhat more robust #1662

Closed

neteler added this to the 8.0.0 milestone Dec 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

libpython: Avoid race condition when reading region in use_temp_region()#638

libpython: Avoid race condition when reading region in use_temp_region()#638
wenzeslaus merged 4 commits into
OSGeo:masterfrom
wenzeslaus:no-region-update-from-use_temp_region

wenzeslaus commented May 15, 2020

Uh oh!

wenzeslaus commented May 17, 2020

Uh oh!

HuidaeCho left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HuidaeCho May 17, 2020

Uh oh!

wenzeslaus May 17, 2020

Uh oh!

wenzeslaus commented May 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

Conversation

wenzeslaus commented May 15, 2020

Uh oh!

wenzeslaus commented May 17, 2020

Uh oh!

HuidaeCho left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HuidaeCho May 17, 2020

Choose a reason for hiding this comment

Uh oh!

wenzeslaus May 17, 2020

Choose a reason for hiding this comment

Uh oh!

wenzeslaus commented May 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants