Unset Python Endpoint close callback upon destruction #206
In some cases, particularly with `distributed-ucxx`, the C++ Endpoint outlives the Python object that owns a close callback registered with `set_close_callback()`. The C++ side is then left holding an invalid reference to that callback, which causes segfaults. To prevent this, the close callback must be removed when the Python object is destroyed, so that it cannot be called once it is no longer valid.
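The ordering problem described above can be illustrated with a minimal pure-Python sketch. The classes `NativeEndpoint` and `PyEndpoint`, the `destroy()` method, and the convention that passing `None` unsets the callback are all hypothetical stand-ins chosen for illustration; they are not the UCXX API, and only the name `set_close_callback()` comes from the description. Pure Python cannot reproduce the dangling reference or the segfault itself, so the sketch only shows that unsetting the callback on teardown stops the longer-lived endpoint from calling it.

```python
class NativeEndpoint:
    """Hypothetical stand-in for the long-lived C++ Endpoint (not the UCXX API)."""

    def __init__(self):
        self._close_callback = None

    def set_close_callback(self, callback):
        # Passing None unsets the callback; this convention is an
        # assumption of the sketch.
        self._close_callback = callback

    def close(self):
        # Invoke the registered callback at most once, if one is still set.
        callback, self._close_callback = self._close_callback, None
        if callback is not None:
            callback()


class PyEndpoint:
    """Hypothetical Python-side owner of the close callback."""

    def __init__(self, native, events):
        self._native = native
        self._events = events
        native.set_close_callback(self._on_close)

    def _on_close(self):
        self._events.append("closed")

    def destroy(self):
        # Unset the callback before this object goes away, so the native
        # endpoint can never call into an owner that no longer exists.
        self._native.set_close_callback(None)


# Case 1: the owner is torn down first, then the endpoint closes.
events = []
native = NativeEndpoint()
owner = PyEndpoint(native, events)
owner.destroy()
native.close()  # no callback is registered any more, so nothing is called

# Case 2: the owner is still alive when the endpoint closes.
kept_events = []
kept_native = NativeEndpoint()
kept_owner = PyEndpoint(kept_native, kept_events)
kept_native.close()  # the callback runs as usual
```

In the first case `events` stays empty because the callback was removed before the endpoint closed; in the second case the callback still fires, so normal close notification is unaffected.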
It is not entirely clear why, but when attempting to reconnect, Distributed may fail to complete async tasks, leaving UCXX references alive. For now we disable those errors, which occur only during the teardown phase of the aforementioned test.
…e-close-callback-upon-ep-destruction
- @gen_test()
+ @gen_test(timeout=60)
After some local experimentation, it looks like the teardown failures are in essence a gen_test timeout, although they do not show up as such. I tried increasing the timeout here, but one of the tests failed nevertheless. Locally, though, I see this test either completing in under 10 seconds or, with this change, taking the full 60 seconds and still ultimately passing. I'm therefore trying to understand whether this is a problem with UCXX or with the Distributed test suite (e.g., gen_test), since I would expect the test to fail every time if it were really timing out.
wence- left a comment
Minor tidying suggestions
Co-authored-by: Lawrence Mitchell <[email protected]>
This reverts commit c1cbbb4.
…n-ep-destruction' into python-remove-close-callback-upon-ep-destruction
I just reverted the commit increasing timeout for
Thanks for taking the time to review, @wence-!
/merge