Wait for goal result after cancelling BT action node (Galactic branch) #2436
Conversation
I agree with you in concept about this PR, I'm digging into the code a little to verify this for myself. One thing though is that this should be targeted to the `main` branch.
```cpp
}
auto start_time = node_->now();
while (rclcpp::ok()
```
I'll counter this suggestion with moving this logic up a bit. We have the `future_cancel` object to actually spin on in the `if()` clause; that would be the most proper thing to be playing with. I suggest instead you put the while loop there so that we're spinning until the `future_cancel` is completed rather than just `spin_some()` more blindly. That should simplify a lot of the code you added too!
I strongly agree with you, the while loop could be simplified. However:

- I think we need to keep the timeout in order to prevent infinite looping/spinning.
- I would rather not run a `spin_some()` blindly too, but we cannot use the `future_cancel` to do so. Indeed, the variable `future_cancel` can only be used to check on the cancel result availability. To check goal result availability, one can use the variable `goal_result_available_` (which is a boolean variable already implemented, not a future). As this variable is only updated while the `callback_group_executor_` is spinning, the `spin_some()` has been added here to check on the goal result availability.
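The pattern described here (a boolean that only changes while the executor spins, polled in a loop with a deadline) can be modelled without rclcpp. In this hypothetical sketch, `wait_for_flag` plays the role of the `spin_some()` loop and the atomic flag plays the role of `goal_result_available_`; none of these names come from the nav2 code:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Poll a flag until it becomes true or the deadline expires.
// The sleep stands in for callback_group_executor_.spin_some(), which is
// what actually lets the goal result callback run and set the flag.
bool wait_for_flag(std::atomic<bool> & flag, std::chrono::milliseconds timeout)
{
  const auto start = std::chrono::steady_clock::now();
  while (!flag && std::chrono::steady_clock::now() - start < timeout) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return flag;
}
```

If the result never arrives, the loop exits after `timeout`, which is exactly the role the `server_timeout_` bound plays in the PR.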
My assertion (which you could test and let me know if I'm wrong) is that when the future_cancel returns true after spinning, that you're guaranteed to have a result message waiting in your callback queue for a spin_some() to pick up. That might be a subtle detail I picked up from reading the rclcpp_action codebase. If that is accurate, then there is no need for a while() loop at all, therefore no requirement for a timeout from that while loop.
That still doesn't help you if the cancel_future failed, but that's not the issue we're addressing here and you're screwed either way 😄
```cpp
auto start_time = node_->now();
while (rclcpp::ok()
  && !goal_result_available_
  && node_->now() - start_time < server_timeout_) {
```
This part of the condition, `node_->now() - start_time < server_timeout_`, is basically just giving the cancel future one more set of `server_timeout_` to get a cancel. The spin-until-future-complete in L269 will block for `server_timeout_`, and this while loop will also block for another set of `server_timeout_`; is that really what you want to do? It just gives that cancel two chances of the same length, so you could still run into this issue if it exceeds that time frame.

I don't think you actually care about `goal_result_available_` as much as you care that the `cancel_future` is completed, indicating that the server actually canceled successfully.
What is your `server_timeout` set to? We did increase this to, I believe, 20 ms (up from 10 ms) a few months ago, when this logic for the server timeout was changed to enable the BT loop duration to move at a consistent rate.
You mention

> This will force the result_callback to be executed after the sub-action is cancelled, allowing to get a clean BtNavigator's action result.

and point to the `spin_some()`. So is some of this that you actually just want to spin to get a result callback more than anything else? If that's true, we could add a `spin_some()` to `halt()` after cancellation as well. However, if the return code on the `future_cancel` is SUCCESS then we know it triggered, so just one `spin_some()` should do the trick without having to do any of the rest of it. So if really all you want to do is `spin_some()`, and you're not having an issue with `future_cancel` not completely finishing, then none of the while loop stuff is required at all and some of my suggestions above are irrelevant.
As far as I can tell from this PR there are 2 distinct issues you're trying to address (or just 1 and I'm misinterpreting for the second):

- (A) `cancel_future` doesn't complete in `server_timeout_`, so you want to spin for a bit longer until it actually does successfully (or is that not an issue you're running into, with that error log?)
- (B) Because you don't spin after canceling, the callback for the result isn't triggered in that same BT run, so it's messing up the state for the next run; so you're adding in a `spin_some()` to process the cancel result callback before exiting `halt()` to clean things up for the next run.

If only (B) is your issue, no changes are required to the code above this block, and this block can be largely simplified without the while to just be `spin_some()`, since if `cancel_future` is successful you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now with the while loop with spinning the future.
> This part of the condition `node_->now() - start_time < server_timeout_` basically is just giving the cancel future only 1 more set of `server_timeout_`

Yes, that's right ...

> to get a cancel

... but no, not a cancel. This condition does not aim at getting the cancel result, but the goal result.
In my understanding, here is what happens when entering the `halt()` function:

- First step of the `halt()` function, already implemented: the cancel request is sent to the sub-action. For a time period of maximum `server_timeout_`, we wait for the cancel result. Considering we have no trouble getting the cancel future, there are two possibilities at the end of this first set of `server_timeout_`:
  - Case A: the cancel result and the goal result are available.
  - Case B: only the cancel result is available. Actually, in my opinion, this is not an exceptional case at all, since the sub-action can automatically accept cancel requests (so the cancel result will be "instantaneously" available after sending the cancel request), but this same sub-action can take some time to properly end/stop the ongoing goal after the cancel acceptance (so the goal result will not be available right after the cancel acceptance). I think it depends on the behavior you want your sub-action to have in case of a (successful) cancel.
- Second step of the `halt()` function, the object of this PR: we want to check for the goal result availability before ending the halt of the BT. This second step depends on the case obtained at step 1:
  - If Case A: we don't even enter the while loop, since the goal result is already available.
  - If Case B: we enter the while loop. For a time period of maximum `server_timeout_`, we wait for the goal result.

So I think that waiting for a maximum of two sets of `server_timeout_` to properly end the sub-action in case of a successful cancel is:

- neither only useful for isolated use cases,
- nor overly cautious / a possible cause of slowdown,

since it strongly depends on the desired behavior of the sub-action when it is being cancelled, and since it seems normal to me to wait for the sub-action to properly end before halting the BT.
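The two-step wait described in this comment can be sketched without rclcpp. In this hypothetical model (none of the names are from nav2), a `std::future` stands in for the cancel result and an atomic flag for `goal_result_available_`; each step is bounded by one `server_timeout`, so a clean stop takes at most two of them:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

struct HaltOutcome
{
  bool cancel_acked;
  bool goal_result;
};

// Step 1: wait up to server_timeout for the cancel acknowledgement.
// Step 2: wait up to another server_timeout for the goal result flag,
// which models the while loop added by this PR.
HaltOutcome halt_model(
  std::future<void> cancel_ack,
  std::atomic<bool> & goal_result_available,
  std::chrono::milliseconds server_timeout)
{
  HaltOutcome out{false, false};
  out.cancel_acked =
    cancel_ack.wait_for(server_timeout) == std::future_status::ready;
  const auto start = std::chrono::steady_clock::now();
  while (!goal_result_available &&
    std::chrono::steady_clock::now() - start < server_timeout)
  {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));  // "spin_some()"
  }
  out.goal_result = goal_result_available;
  return out;
}
```

In Case A the flag is already set when step 2 starts, so the while loop is skipped; in Case B step 2 spends up to one more `server_timeout` waiting for the goal result.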
> What is your `server_timeout` set to? We did increase this to I believe 20ms up from 10ms a few months ago when this logic for the server timeout was changed to enable the BT loop duration to move at a consistent rate

In our use case, we set the `server_timeout` to 100 ms.
> So is some of this that you actually just want to spin to get a result callback more than anything else?

From a behavior point of view, what we want more than anything else is the sub-action to be cleanly finished before halting the BT. To know if it is indeed cleanly finished, we check on the availability of the sub-action goal result, since the goal result is available only once the sub-action has ended. Once the sub-action goal result is available, the goal result callback is automatically executed. So yes, saying that what we want more than anything else is the sub-action to be cleanly finished is actually saying we want to spin to get a result callback more than anything else.

> However if the return code on the future_cancel is SUCCESS then we know it triggered so just 1 `spin_some()` should do the trick without having to do any of the rest of it.

No, I don't agree with this; there is an explanation in this previous comment.
> As far as I can tell from this PR there are 2 distinct issues you're trying to address (or just 1 and I'm misinterpreting for the second):
>
> * (A) `cancel_future` doesn't complete in `server_timeout_` so you want to spin for a bit longer until it actually does successfully (or is that not an issue you're running into, with that error log?)
> * (B) Because you don't spin after canceling, the callback for result isn't triggered in that same BT run so it's messing up the state for the next run, so you're adding in a `spin_some()` to process the cancel result callback before exiting `halt()` to clean things up for the next run.
>
> If only (B) is your issue, no changes are required to the code above this block and this block can be largely simplified without the while to just be `spin_some()` since if `cancel_future` is successful, you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now with the while loop with spinning the future.
Well, I would say only (B) is an issue, but not exactly. I would reformulate it this way to make it accurate (changes are in bold):

> Because you don't spin after canceling, the callback for **goal** result isn't triggered in that same BT run so it's messing up the state for the next run, so you're adding in a `spin_some()` to process the **goal (not cancel)** result callback before exiting `halt()` to clean things up for the next run.

So, I don't really agree with your conclusion about only (B) being an issue. If the `cancel_future` is successful, it doesn't mean the sub-action will have enough time to finish cleanly before halting the BT. It only means that the cancel request has been accepted. Which is great, but not enough, since the sub-action might not stop the current goal right after having accepted the cancel request.
Got it -- see comment in the thread above. I think we're on the same page, but I think we can simplify this if we leverage some internal knowledge about how the actions work. If that turns out not to be true, then we can discuss what to do so we can make sure we process the goal callback. What you've suggested is not unreasonable though.
Ok, sorry I didn't target the right branch. As we use Galactic at the moment, and our fork of this repo is based on the Galactic branch, I did think of doing this PR on the Galactic branch. Here is the PR #2440 targeted on `main`. Should we continue the discussion here or on the "main" PR?
Both CI runs don't seem to be completely executed; this is the output:

```
Build-agent version 1.0.76865-2132ecc2 (2021-07-05T15:58:20+0000)
System information:
 Server Version: 19.03.13
 Storage Driver: overlay2
 Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Kernel Version: 4.15.0-1103-aws
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64

Starting container rosplanning/navigation2:main.release
Warning: No authentication provided, using CircleCI credentials for pulls from Docker Hub.
  image cache not found on this host, downloading rosplanning/navigation2:main.release
Error response from daemon: manifest for rosplanning/navigation2:main.release not found: manifest unknown: manifest unknown
```

I don't know how to solve this. Could you help me with it please?
Let's continue on the `main` PR.
Basic Info
Description of the problem fixed by this PR
Hello,

I fixed a problematic behavior that happens quite often after cancelling a `bt_action_node` (approximately once every 5 cancel requests in our case).

What we do:

- We use the `BtNavigator`, which controls the action "GoToPose" for instance.
- The BT uses `bt_action_node`s, in order to call sub-actions like `"/follow_path"` for instance.
- When cancelling the `BtNavigator`'s action, the `bt_action_node`s execute the `halt()` function, hence they send a cancel request to their respective sub-actions.
What should happen:

- The `bt_action_node` should wait for the sub-action to properly end, before returning from the `halt()` function.

What actually happens:

- The `bt_action_node` only waits for the cancel request to return.
- When cancelling the `BtNavigator`'s action, the sub-actions' `result_callback`s are not always executed.

Related PR
Before Galactic syncing recent master PRs #2384 was merged, the next action call after a cancel often resulted in an instant end of the BT, because the goal result of the previously canceled sub-action was received. Indeed, cancelling the `BtNavigator`'s action halts the `bt_action_node` without waiting for the sub-action goal result.

After Galactic syncing recent master PRs #2384, if the goal result of the previously cancelled sub-action is received, it is filtered out using the `future_goal_handle_`. So, the next action call does not return instantaneously. But this might not be the cleanest way to solve the problem, since the sub-actions' `result_callback`s are still not always executed.

This PR's description
This PR allows waiting for the sub-actions' results, by making the `bt_action_node` spin until the sub-action goal result is available, after the cancel request is sent in the `halt()` function. This will force the `result_callback` to be executed after the sub-action is cancelled, allowing to get a clean `BtNavigator` action goal result.

A timeout is implemented to prevent infinite looping while halting the `bt_action_node`. If the sub-action goal result is not available within the timeout, an error is output in the logs.
Thank you for reviewing this PR.