Add non-blocking ucxx::Endpoint::close() request #192
Merged
rapids-bot[bot] merged 27 commits into rapidsai:branch-0.37 on Mar 15, 2024
wence- (Contributor) reviewed on Mar 14, 2024 and left a comment:
I've reviewed the general logic, but not the tests or Python changes yet...
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
wence-
approved these changes
Mar 14, 2024
    // _worker->progress();
    // return closeRequest->isCompleted();
    // };
    // loopWithTimeout(std::chrono::milliseconds(5000), f);
Contributor
Just uncommenting this and commenting out the code on line 280 didn't let me see the problem you describe...
Member
Author
You should definitely see it if you have a UCX debug build. Just tried it now with UCX 1.15 debug and I'm able to reproduce it:
[dgx13:3842411:0:3842411] ucp_worker.c:2888 Assertion `worker->inprogress++ == 0' failed
/src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
...
2885 UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
2886
2887 /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==> 2888 ucs_assert(worker->inprogress++ == 0);
2889 count = uct_worker_progress(worker->uct);
2890 ucs_async_check_miss(&worker->async);
2891
==== backtrace (tid:3842411) ====
0 0x000000000005fd79 ucp_worker_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2888
1 0x000000000005bebe ucxx::Worker::progressOnce() /src/ucxx/cpp/src/worker.cpp:254
2 0x000000000005beec ucxx::Worker::progressPending() /src/ucxx/cpp/src/worker.cpp:260
3 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:268
4 0x000000000005ca07 std::mutex::lock() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
5 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
6 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:272
7 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke() /src/ucxx/cpp/tests/listener.cpp:318
8 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
9 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
10 0x000000000004e1c1 operator()() /src/ucxx/cpp/tests/listener.cpp:319
11 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
12 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
13 0x000000000004e1c1 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
14 0x000000000003ef43 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
15 0x000000000003ef43 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
16 0x000000000003ef43 std::shared_ptr<void>::~shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
17 0x000000000003ef43 operator()() /src/ucxx/cpp/src/endpoint.cpp:162
18 0x000000000003ef43 __invoke_impl<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
19 0x000000000003ef43 __invoke_r<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:111
20 0x000000000003ef43 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
21 0x0000000000048c72 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
22 0x0000000000048c72 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
23 0x0000000000048c72 std::shared_ptr<void>::~shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
24 0x0000000000048c72 ucxx::Request::setStatus() /src/ucxx/cpp/src/request.cpp:230
25 0x0000000000049209 ucxx::Request::callback() /src/ucxx/cpp/src/request.cpp:149
26 0x0000000000040014 ucp_request_complete_send() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_request.inl:249
27 0x0000000000040014 ucp_ep_local_disconnect_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_ep.c:1625
28 0x000000000005e7f9 ucs_callbackq_slow_proxy() /src/ucx-1.15/build/src/ucs/../../../src/ucs/datastruct/callbackq.c:404
29 0x000000000005fc4a ucs_callbackq_dispatch() /src/ucx-1.15/build/../src/ucs/datastruct/callbackq.h:211
30 0x000000000005fc4a uct_worker_progress() /src/ucx-1.15/build/../src/uct/api/uct.h:2777
31 0x000000000005fc4a ucp_worker_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2889
32 0x000000000005bebe ucxx::Worker::progressOnce() /src/ucxx/cpp/src/worker.cpp:254
33 0x000000000005beec ucxx::Worker::progressPending() /src/ucxx/cpp/src/worker.cpp:260
34 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:268
35 0x000000000005ca07 std::mutex::lock() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
36 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
37 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:272
38 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke() /src/ucxx/cpp/tests/listener.cpp:318
39 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
40 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
41 0x000000000004e1c1 operator()() /src/ucxx/cpp/tests/listener.cpp:319
42 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
43 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
44 0x000000000004e1c1 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
45 0x000000000007820d std::function<bool ()>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
46 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody() /src/ucxx/cpp/tests/listener.cpp:321
47 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody() /src/ucxx/cpp/tests/listener.cpp:321
48 0x000000000004d40e testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() ???:0
49 0x000000000004d6a1 testing::Test::Run() ???:0
50 0x000000000004da2f testing::TestInfo::Run() ???:0
51 0x000000000004deff testing::TestSuite::Run() ???:0
52 0x0000000000059423 testing::internal::UnitTestImpl::RunAllTests() ???:0
53 0x000000000004dfdd testing::UnitTest::Run() ???:0
54 0x0000000000001072 main() ???:0
55 0x0000000000024083 __libc_start_main() /build/glibc-wuryBv/glibc-2.31/csu/../csu/libc-start.c:308
56 0x000000000002971e _start() ???:0
=================================
Aborted
Contributor
Ah, ok, I didn't build in debug
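To make the failure in the backtrace easier to follow: the close callback runs from inside `ucp_worker_progress()`, and the test lambda then calls `ucxx::Worker::progress()` again, which re-enters `ucp_worker_progress()` and trips the `inprogress` assertion. The following is a minimal sketch of that re-entrancy guard, not UCX or UCXX code. `ToyWorker`, `post()` and the helper functions are hypothetical names invented for illustration, and the guard throws instead of aborting so the behavior can be checked.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy stand-in for a worker. This is NOT the UCX or UCXX API; all names
// here are hypothetical and exist only to illustrate the re-entrancy check.
class ToyWorker {
 public:
  // Queue a callback to be run by the next progress() call.
  void post(std::function<void()> callback) { _pending.push_back(std::move(callback)); }

  // Run all queued callbacks. Throws if called from inside one of them,
  // in the spirit of `ucs_assert(worker->inprogress++ == 0)`.
  void progress() {
    if (_inProgress) throw std::logic_error("progress() called from within progress()");
    _inProgress = true;
    // Reset the flag on every exit path, including exceptions.
    struct Reset {
      bool& flag;
      ~Reset() { flag = false; }
    } reset{_inProgress};
    std::vector<std::function<void()>> callbacks;
    callbacks.swap(_pending);
    for (auto& callback : callbacks) callback();
  }

 private:
  bool _inProgress{false};
  std::vector<std::function<void()>> _pending;
};

// Problematic pattern: the callback itself tries to progress the worker.
// Returns true if the nested call was rejected.
bool reentrantProgressIsRejected() {
  ToyWorker worker;
  bool rejected = false;
  worker.post([&] {
    try {
      worker.progress();
    } catch (const std::logic_error&) {
      rejected = true;
    }
  });
  worker.progress();
  return rejected;
}

// Safer pattern: the callback only records completion, and the caller keeps
// driving progress() from outside the callback.
bool flagThenPollCompletes() {
  ToyWorker worker;
  bool closed = false;
  worker.post([&] { closed = true; });
  while (!closed) worker.progress();
  return closed;
}

// Usage: both checks hold for the toy worker above.
inline void runChecks() {
  assert(reentrantProgressIsRejected());
  assert(flagThenPollCompletes());
}
```

In a UCX release build the assertion is compiled out, which is consistent with the problem only showing up in a debug build.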
A race condition may occur if `ucxx::Endpoint::setCloseCallback()` is called while `ucxx::Endpoint::errorCallback()` is executing, for example during a Python `remove_close_callback()` call, causing `errorCallback` to attempt to execute a callback that is no longer available.
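One common way to avoid this kind of race is to guard the stored callback with a mutex and take ownership of it under the lock before invoking it. The sketch below illustrates that idea only; it is not necessarily how this PR addresses the problem. `ToyEndpoint` is a hypothetical class, and its method names merely echo the ones mentioned above.

```cpp
#include <functional>
#include <mutex>

// Toy endpoint illustrating a mutex-guarded close callback.
// Hypothetical class: this is NOT the ucxx::Endpoint implementation.
class ToyEndpoint {
 public:
  using CloseCallback = std::function<void()>;

  // Install a close callback, or clear it by passing nullptr.
  void setCloseCallback(CloseCallback callback) {
    std::lock_guard<std::mutex> lock(_mutex);
    _closeCallback = std::move(callback);
  }

  // Error path: take the callback out under the lock, then run it outside
  // the lock so a concurrent setCloseCallback() can never invalidate it
  // mid-call. Returns true if a callback was run.
  bool errorCallback() {
    CloseCallback callback;
    {
      std::lock_guard<std::mutex> lock(_mutex);
      callback.swap(_closeCallback);  // leaves _closeCallback empty
    }
    if (!callback) return false;
    callback();
    return true;
  }

 private:
  std::mutex _mutex;
  CloseCallback _closeCallback;
};
```

Running the callback outside the lock also lets the callback itself call `setCloseCallback()` without deadlocking on the non-recursive mutex.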
Member
Author
/merge
Because of its primary use with Python, endpoint closing was initially implemented in blocking mode only. That is not always desirable in C++, so exposing a non-blocking close option that returns a `ucxx::Request` is beneficial to C++ users.
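As a rough illustration of the calling pattern this enables, here is a self-contained toy model, not the real UCXX API. `ToyRequest`, `ToyWorker`, `ToyEndpoint` and `closeWithoutBlocking()` are hypothetical, and `loopWithTimeout` is a guessed re-implementation based only on the call that appears in the review snippet. The close call returns a request immediately, and the caller drives progress until the request completes.

```cpp
#include <chrono>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a request handle; not ucxx::Request.
struct ToyRequest {
  bool completed{false};
  bool isCompleted() const { return completed; }
};

// Hypothetical worker that runs queued tasks when progressed.
class ToyWorker {
 public:
  void post(std::function<void()> task) { _pending.push_back(std::move(task)); }
  void progress() {
    std::vector<std::function<void()>> tasks;
    tasks.swap(_pending);
    for (auto& task : tasks) task();
  }

 private:
  std::vector<std::function<void()>> _pending;
};

// Hypothetical endpoint whose close() does not block.
class ToyEndpoint {
 public:
  explicit ToyEndpoint(std::shared_ptr<ToyWorker> worker) : _worker(std::move(worker)) {}

  // Returns immediately; the request completes on a later progress() call.
  std::shared_ptr<ToyRequest> close() {
    auto request = std::make_shared<ToyRequest>();
    _worker->post([request] { request->completed = true; });
    return request;
  }

 private:
  std::shared_ptr<ToyWorker> _worker;
};

// Assumed helper: call f repeatedly until it returns true or the timeout
// expires. Returns whether f succeeded in time.
bool loopWithTimeout(std::chrono::milliseconds timeout, const std::function<bool()>& f) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (f()) return true;
  }
  return false;
}

// Usage: close without blocking, then poll from outside any callback.
bool closeWithoutBlocking() {
  auto worker = std::make_shared<ToyWorker>();
  ToyEndpoint endpoint(worker);
  auto closeRequest = endpoint.close();
  const bool pendingAfterClose = !closeRequest->isCompleted();
  const bool finished = loopWithTimeout(std::chrono::milliseconds(5000), [&] {
    worker->progress();
    return closeRequest->isCompleted();
  });
  return pendingAfterClose && finished;
}
```

The caller stays free to do other work between polls, which is the main benefit over a blocking close.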