Skip to content

libpython: Avoid race condition when reading region in use_temp_region()#638

Merged
wenzeslaus merged 4 commits into
OSGeo:masterfrom
wenzeslaus:no-region-update-from-use_temp_region
May 18, 2020
Merged

libpython: Avoid race condition when reading region in use_temp_region()#638
wenzeslaus merged 4 commits into
OSGeo:masterfrom
wenzeslaus:no-region-update-from-use_temp_region

Conversation

@wenzeslaus

Copy link
Copy Markdown
Member

When g.region is called without -u flag, it always adjusts and writes back the
computational region. When a parallel process is reading it too, it hits an empty
(or rarely corrupted) file.

In the provided test, this happens when reading the WIND file. At this point, the results
from the test must be expected manually.

When g.region is called without -u flag, it always adjusts and writes back the
computational region. When a parallel process is reading it too, it hits an empty
(or rarely corrupted) file.

In the provided test, this happens when reading the WIND file. At this point, the results
from the test must be expected manually.
Return codes of bg procs are not propagated, so something is needed to track processes.
Using maps created at the end of each Python script to do that.
Parametrize the Bash script to make it reusable also for a potential nesting region correctness test.
Also be more robust when the Python script is not available (zero returned on some platforms).
@wenzeslaus

Copy link
Copy Markdown
Member Author

I have completed the test which checks if the parallel scripts are running using raster maps created which adds another complexity and region access, although the resulting maps are checked only for existence. (A correct passing of the region is not the topic of this PR, but the scripts can be used for that.)

It should not show up in the list of failing tests in CI and the number of failing tests should be at least roughly the same as on master (39 on Ubuntu 18.04). You can run it locally in one of these three ways:

# Assuming: cd lib/python/script/
# Default parameters
./testsuite/test_parallel_temp_region.sh
# Custom parameters
./testsuite/test_parallel_temp_region.sh parallel nest 5 200 150 100,200,300 3
# Using the testing framework machinery (as in CI)
python3 -m grass.gunittest.main --grassdata $HOME/grassdata/ --location nc_spm --location-type nc --min-success 80

A failing test (when I remove the -u fix from this patch) looks like this:

...
ERROR: Region file __test_parallel_temp_region//WIND is empty
Traceback (most recent call last):
  File "./data/script_using_temporary_region.py", line 78, in <module>
    main()
  File "./data/script_using_temporary_region.py", line 73, in main
    script=this_file, size=size, remaining=remaining, nesting=nesting, map_name=map_name
  File "./data/script_using_temporary_region.py", line 25, in call_use_temp_region
    gs.use_temp_region()
  File "etc/python/grass/script/core.py", line 1281, in use_temp_region
    run_command("g.region", flags="", save=name, overwrite=True)
  File "etc/python/grass/script/core.py", line 500, in run_command
    return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
  File "etc/python/grass/script/core.py", line 393, in handle_errors
    returncode=returncode)
grass.exceptions.CalledModuleError: Module run g.region g.region --o save=tmp.script_using_temporary_region.py.27369 ended with error
Process ended with non-zero return code 1. See errors in the (error) output.
ERROR: Region file __test_parallel_temp_region//WIND is empty
Traceback (most recent call last):
  File "./data/script_using_temporary_region.py", line 78, in <module>
    main()
  File "./data/script_using_temporary_region.py", line 73, in main
    script=this_file, size=size, remaining=remaining, nesting=nesting, map_name=map_name
  File "./data/script_using_temporary_region.py", line 25, in call_use_temp_region
    gs.use_temp_region()
  File "etc/python/grass/script/core.py", line 1281, in use_temp_region
    run_command("g.region", flags="", save=name, overwrite=True)
  File "etc/python/grass/script/core.py", line 500, in run_command
    return handle_errors(returncode, result=None, args=args, kwargs=kwargs)
  File "etc/python/grass/script/core.py", line 393, in handle_errors
    returncode=returncode)
grass.exceptions.CalledModuleError: Module run g.region g.region --o save=tmp.script_using_temporary_region.py.27434 ended with error
Process ended with non-zero return code 1. See errors in the (error) output.
+ EXPECTED=100
+ g.list type=raster pattern=parallel_tmp_region_parallel_*_size_*_nesting_* mapset=.
+ wc -l
+ NUM=98
+ [ 98 -ne 100 ]
+ echo Pure parallel test: Got 98 but expected 100 maps
+ exit 1
+ cleanup
+ EXIT_CODE=1
+ g.remove type=raster pattern=parallel_tmp_region_parallel_*_size_*_nesting_* -f --quiet
+ g.remove type=raster pattern=parallel_tmp_region_nest_*_size_*_nesting_* -f --quiet
WARNING: No data base element files found
+ exit 1

@wenzeslaus wenzeslaus marked this pull request as ready for review May 17, 2020 04:10
@wenzeslaus wenzeslaus requested review from HuidaeCho and metzm May 17, 2020 04:10

@HuidaeCho HuidaeCho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. Just a couple typos and a comment about g.region's behavior.

Comment thread lib/python/script/testsuite/test_parallel_temp_region.sh Outdated
Comment thread lib/python/script/testsuite/data/script_using_temporary_region.py Outdated
Comment thread lib/python/script/testsuite/data/script_using_temporary_region.py Outdated
Comment thread lib/python/script/core.py
"""
name = "tmp.%s.%d" % (os.path.basename(sys.argv[0]), os.getpid())
run_command("g.region", save=name, overwrite=True)
run_command("g.region", flags="u", save=name, overwrite=True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix g.region so that it updates WIND only when needed. For example, in this case, it just saves the current region (no changes to the region) then why rewrite WIND? It should be a read-only operation. That's for a new PR though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but as a heads-up, the code there looks pretty intentional.

@wenzeslaus wenzeslaus removed the request for review from metzm May 18, 2020 03:03
@wenzeslaus wenzeslaus merged commit 71c8a1f into OSGeo:master May 18, 2020
@wenzeslaus wenzeslaus deleted the no-region-update-from-use_temp_region branch May 18, 2020 03:20
@wenzeslaus

Copy link
Copy Markdown
Member Author

Given what is written in Trac 2230 ('g.region -p' writes new WIND file, causes race condition for parallel jobs), this is actually pretty clear. I'm merging. We can have the discussion about g.region elsewhere.

wenzeslaus added a commit to wenzeslaus/grass that referenced this pull request Jun 19, 2021
Test introduced in OSGeo#638 uses raster maps to determine if the subprocess was
successfully executed. However, in GitHub Actions this fails randomly
with a random number of rasters missing.

This PR makes the test wait for couple seconds before counting the rasters
in case the file system needs to catch up. If there is an actual error,
waiting won't fix it, so the test will still do its job.

If running and rerunning the tests here is successful, it suggests
that it is a good fix at least in the sense of accomodating
the given environment.

Alternative fix for the test would be to replace Bash by Python
which would allow leaving out the raster creation check (which is
replacing return code check here) and provide more control.

On one hand, we want to make the test run, on the other,
we want to have the actual code robust, but maybe writing 50
rasters in parallel in a small VM in GitHub Actions is not
a use case we need to cover in code and thus modifying the test is enough.
@neteler neteler added this to the 8.0.0 milestone Dec 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants