
Conversation

@anaelle-sw
Contributor

@anaelle-sw anaelle-sw commented Jul 6, 2021

Basic Info

| Info | Please fill out this column |
| --- | --- |
| Ticket(s) this addresses | I didn't find any open issue related to this PR, but I can open one if you want me to. PR #2384 is related, as described in the Related PR section below. |
| Primary OS tested on | Ubuntu 20.04, ROS 2 Galactic from binaries |
| Robotic platform tested on | Gazebo simulation of Wyca's robot, and the real Wyca robot |

Description of the problem fixed by this PR

Hello,
I fixed a problematic behavior that happens quite often after cancelling a bt_action_node (approximately once every 5 cancel requests in our case).

What we do:

  • We start a nav2 BtNavigator, which controls the action "GoToPose" for instance.
  • The BT has several bt_action_nodes, used to call sub-actions like "/follow_path".
  • When cancelling the "GoToPose" action, all the bt_action_nodes execute their halt() function, hence they send a cancel request to their respective sub-actions.

What should happen:

  • The bt_action_node should wait for the sub-action to properly end before returning from the halt() function.

What actually happens:

See the Related PR section below.

Related PR

  • Before PR #2384 ("Galactic syncing recent master PRs") was merged, the next action call after a cancel often resulted in an instant end of the BT, because the goal result of the previously cancelled sub-action was received. Indeed, cancelling the BtNavigator's action halts the bt_action_node without waiting for the sub-action goal result.

  • After PR #2384 was merged, if the goal result of the previously cancelled sub-action is received, it is filtered out using the future_goal_handle_, so the next action call does not return instantaneously. But this might not be the cleanest way to solve the problem, since the sub-actions' result_callbacks are still not always executed.

This PR's description

  • This PR makes the bt_action_node wait for the sub-action goal result: after the cancel request is sent in the halt() function, the node spins until the goal result is available. This forces the result_callback to be executed after the sub-action is cancelled, allowing a clean BtNavigator action goal result (see the sketch after this list).

  • A timeout is implemented to prevent infinite looping while halting the bt_action_node.

  • If the sub-action goal result is not available within the timeout, an error is logged.
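A minimal sketch of the resulting halt() logic, for reviewers skimming the description. This is not the verbatim diff: the member names (node_, action_client_, goal_handle_, callback_group_executor_, goal_result_available_, server_timeout_) are the ones used in the bt_action_node discussion below, and error handling is trimmed.

```cpp
// Sketch of the halt() tail with this PR (not the verbatim diff).
// Step 1, already implemented: send the cancel request and wait up to
// server_timeout_ for the cancel result.
auto future_cancel = action_client_->async_cancel_goal(goal_handle_);
if (callback_group_executor_.spin_until_future_complete(future_cancel, server_timeout_) !=
  rclcpp::FutureReturnCode::SUCCESS)
{
  RCLCPP_ERROR(node_->get_logger(), "Failed to cancel action server");
}

// Step 2, added by this PR: keep spinning until the *goal result* arrives,
// so the result_callback runs before halt() returns. Bounded by a timeout.
auto start_time = node_->now();
while (rclcpp::ok() &&
  !goal_result_available_ &&
  node_->now() - start_time < server_timeout_)
{
  callback_group_executor_.spin_some();
}
if (!goal_result_available_) {
  RCLCPP_ERROR(node_->get_logger(), "Goal result not available after cancelling");
}
```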

Future work that may be required in bullet points

  • Maybe the timeout can be handled differently
  • The case where the goal result is not received before the timeout expires may be handled in a better way than just logging an error

Thank you for reviewing this PR.


For Maintainers:

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@SteveMacenski
Member

I agree with you in concept about this PR; I'm digging into the code a little to verify this for myself. One thing though is that this should be targeted to the main branch and we can then open a PR to the galactic branch once that is merged in. We target all new development to the main branch such that it can be reflected in all new distribution releases.

    }

    auto start_time = node_->now();
    while (rclcpp::ok()
Member


I'll counter this suggestion with moving this logic up a bit. We have the future_cancel object to actually spin on in the if()... clause; that would be the most proper thing to be playing with. I suggest instead that you put the while loop there, so that we're spinning until future_cancel is completed rather than just calling spin_some() blindly. That should simplify a lot of the code you added too!
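A sketch of one reading of this suggestion, using the same hypothetical member names as the sketch in the PR description (untested):

```cpp
// Spin on future_cancel itself until it completes, instead of looping
// over a blind spin_some().
auto future_cancel = action_client_->async_cancel_goal(goal_handle_);
while (rclcpp::ok() &&
  callback_group_executor_.spin_until_future_complete(future_cancel, server_timeout_) !=
  rclcpp::FutureReturnCode::SUCCESS)
{
  RCLCPP_WARN(node_->get_logger(), "Cancel result not yet available, retrying...");
}
```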

Contributor Author

@anaelle-sw anaelle-sw Jul 7, 2021


I strongly agree with you that the while loop could be simplified. However:

  • I think we need to keep the timeout in order to prevent infinite looping/spinning

  • I would rather not run a spin_some() blindly either, but we cannot use future_cancel to avoid it: the variable future_cancel can only be used to check on the cancel result availability.
    To check goal result availability, one can use the variable goal_result_available_ (a boolean already implemented, not a future). As this variable is only updated while the callback_group_executor_ is spinning, the spin_some() has been added here to check on the goal result availability.

Member

@SteveMacenski SteveMacenski Jul 8, 2021


My assertion (which you could test and let me know if I'm wrong) is that when the future_cancel returns true after spinning, you're guaranteed to have a result message waiting in your callback queue for a spin_some() to pick up. That might be a subtle detail I picked up from reading the rclcpp_action codebase. If that is accurate, then there is no need for a while() loop at all, and therefore no requirement for a timeout from that while loop.

That still doesn't help you if the cancel_future failed, but that's not the issue we're addressing here and you're screwed either way 😄
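If that assertion holds, the halt() tail would collapse to something like this sketch (same assumed member names as above; hypothetical):

```cpp
// If a completed future_cancel guarantees the result message is already
// queued, a single spin_some() is enough to fire the result_callback:
// no while loop, no extra timeout.
if (callback_group_executor_.spin_until_future_complete(future_cancel, server_timeout_) ==
  rclcpp::FutureReturnCode::SUCCESS)
{
  callback_group_executor_.spin_some();  // processes the pending goal result
}
```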

    auto start_time = node_->now();
    while (rclcpp::ok()
      && !goal_result_available_
      && node_->now() - start_time < server_timeout_) {
Member


This part of the condition, node_->now() - start_time < server_timeout_, basically just gives the cancel future one more set of server_timeout_ to get a cancel. The spin-until-future-complete at L269 will already block for server_timeout_, and waiting on this while loop will block for another set of server_timeout_; is that really what you want to do? It just gives it 2 chances of that same length, so you could still run into this issue if it exceeds that time frame.

I don't think you actually care about goal_result_available_ as much as you care that the cancel_future is completed, indicating that the server actually canceled successfully.

Member


What is your server_timeout set to? We increased it (to, I believe, 20 ms, up from 10 ms) a few months ago, when this logic for the server timeout was changed to let the BT loop run at a consistent rate.

Member

@SteveMacenski SteveMacenski Jul 6, 2021


You mention

This forces the result_callback to be executed after the sub-action is cancelled, allowing a clean BtNavigator action goal result.

and point to the spin_some(). So is part of this that you actually just want to spin to get a result callback, more than anything else? If that's true, we could add a spin_some() to halt() after cancellation as well. However, if the return code of future_cancel is SUCCESS, then we know it triggered, so just one spin_some() should do the trick without any of the rest of it. So if all you really want is to spin_some(), and you're not having an issue with future_cancel not completely finishing, then none of the while loop is required at all and some of my suggestions above are irrelevant.

As far as I can tell from this PR, there are 2 distinct issues you're trying to address (or just 1, and I'm misinterpreting the second):

  • (A) cancel_future doesn't complete within server_timeout_, so you want to spin for a bit longer until it actually does (or is that not an issue you're running into, given that error log?)
  • (B) Because you don't spin after canceling, the callback for result isn't triggered in that same BT run, so it's messing up the state for the next run; so you're adding in a spin_some() to process the cancel result callback before exiting halt() to clean things up for the next run.

If only (B) is your issue, no changes are required to the code above this block, and this block can be largely simplified, without the while loop, to just a spin_some(), since if cancel_future is successful you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now (the while loop) with spinning the future.

Contributor Author


This part of the condition, node_->now() - start_time < server_timeout_, basically just gives the cancel future one more set of server_timeout_

Yes, that's right ...

to get a cancel

... but no, not a cancel: this condition does not aim at getting the cancel result, but the goal result.
In my understanding, here is what happens when entering the halt() function:

  1. First step of the halt() function, already implemented: the cancel request is sent to the sub-action, and we wait up to server_timeout_ for the cancel result. Assuming we have no trouble getting the cancel future, there are two possibilities at the end of this first set of server_timeout_:
  • Case A: both the cancel result and the goal result are available.
  • Case B: only the cancel result is available. Actually, in my opinion, this is not an exceptional case at all: the sub-action can automatically accept cancel requests (so the cancel result is "instantaneously" available after the cancel request is sent), but this same sub-action can take some time to properly end/stop the on-going goal after the cancel acceptance (so the goal result is not available right after the cancel acceptance). I think it depends on the behavior you want your sub-action to have in case of a (successful) cancel.
  2. Second step of the halt() function, the object of this PR: we check for goal result availability before ending the halt of the BT. This second step depends on the case obtained at step 1:
  • Case A: we don't even enter the while loop, since the goal result is already available.
  • Case B: we enter the while loop and wait up to another server_timeout_ for the goal result.

So I think that waiting for at most two sets of server_timeout_ to properly end the sub-action in case of a successful cancel is:

  • neither useful only for isolated use cases,
  • nor overly cautious / a possible cause of slowdowns,

since it strongly depends on the desired behavior of the sub-action when it is being cancelled, and since it seems normal to me to wait for the sub-action to properly end before halting the BT. (The hypothetical server sketch below illustrates Case B.)
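To make Case B concrete, here is a hypothetical rclcpp_action server fragment: its cancel callback accepts immediately (so the cancel result is available at once), while the execute loop winds the goal down on its own time (so the goal result only arrives later). The names are illustrative, not taken from nav2.

```cpp
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"

// Hypothetical sub-action server illustrating Case B.
template<typename ActionT>
class SlowToStopServer
{
  using GoalHandle = rclcpp_action::ServerGoalHandle<ActionT>;

  rclcpp_action::CancelResponse handle_cancel(const std::shared_ptr<GoalHandle>)
  {
    // Cancel requests are always accepted: the cancel result is
    // available "instantaneously" after the request is sent.
    return rclcpp_action::CancelResponse::ACCEPT;
  }

  void execute(const std::shared_ptr<GoalHandle> goal_handle)
  {
    rclcpp::Rate rate(10.0);
    while (rclcpp::ok()) {
      if (goal_handle->is_canceling()) {
        // Winding the goal down (e.g. decelerating smoothly) takes time,
        // so the goal result is only produced here, possibly several
        // loop iterations after the cancel was accepted.
        goal_handle->canceled(std::make_shared<typename ActionT::Result>());
        return;
      }
      // ... make progress on the goal ...
      rate.sleep();
    }
  }
};
```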

Contributor Author


What is your server_timeout set to? We increased it (to, I believe, 20 ms, up from 10 ms) a few months ago, when this logic for the server timeout was changed to let the BT loop run at a consistent rate.

In our use case, we set the server_timeout to 100 ms.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is part of this that you actually just want to spin to get a result callback, more than anything else?

From a behavior point of view, what we want more than anything else is for the sub-action to be cleanly finished before halting the BT. To know whether it is indeed cleanly finished, we check the availability of the sub-action goal result, since the goal result is available only once the sub-action has ended. Once the sub-action goal result is available, the goal result callback is automatically executed. So yes, saying that what we want more than anything else is for the sub-action to be cleanly finished is indeed saying that we want to spin to get a result callback more than anything else.

However, if the return code of future_cancel is SUCCESS, then we know it triggered, so just one spin_some() should do the trick without any of the rest of it.

No, I don't agree with this; there is an explanation in this previous comment.

As far as I can tell from this PR, there are 2 distinct issues you're trying to address (or just 1, and I'm misinterpreting the second):

* (A) `cancel_future` doesn't complete within `server_timeout_`, so you want to spin for a bit longer until it actually does (or is that not an issue you're running into, given that error log?)

* (B) Because you don't spin after canceling, the callback for result isn't triggered in that same BT run, so it's messing up the state for the next run; so you're adding in a `spin_some()` to process the cancel result callback before exiting `halt()` to clean things up for the next run.

If only (B) is your issue, no changes are required to the code above this block, and this block can be largely simplified, without the while loop, to just a spin_some(), since if cancel_future is successful you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now (the while loop) with spinning the future.

Well, I would say only (B) is an issue, but not exactly. I would reformulate it this way to make it accurate (changes are in bold):
Because you don't spin after canceling, the callback for **goal** result isn't triggered in that same BT run, so it's messing up the state for the next run; so you're adding in a spin_some() to process the **goal** (not cancel) result callback before exiting halt() to clean things up for the next run.

So, I don't really agree with your conclusion about only (B) being an issue. If the cancel_future is successful, it doesn't mean the sub-action will have enough time to finish cleanly before the BT is halted. It only means that the cancel request has been accepted. Which is great, but not enough, since the sub-action might not stop the current goal right after having accepted the cancel request.

Member

@SteveMacenski SteveMacenski Jul 8, 2021


Got it -- see comment in the thread above. I think we're on the same page, but I think we can simplify this if we leverage some internal knowledge about how the actions work. If that turns out not to be true, then we can discuss what to do so we can make sure we process the goal callback. What you've suggested is not unreasonable though.

@anaelle-sw anaelle-sw changed the title from "Wait for goal result after cancelling BT action node" to "Wait for goal result after cancelling BT action node (Galactic branch)" on Jul 7, 2021
@anaelle-sw
Contributor Author

One thing though is that this should be targeted to the main branch and we can then open a PR to the galactic branch once that is merged in. We target all new development to the main branch such that it can be reflected in all new distribution releases.

Ok, sorry I didn't target the right branch. As we use Galactic at the moment, and our fork of this repo is based on the Galactic branch, I didn't think of doing this PR on the master branch, my bad.

Here is the PR #2440 targeted on main branch.

Should we continue the discussion here or on the "main" PR?

@anaelle-sw
Contributor Author

Neither CI job seems to execute completely; this is the output:

Build-agent version 1.0.76865-2132ecc2 (2021-07-05T15:58:20+0000)
System information:
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Kernel Version: 4.15.0-1103-aws
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64

Starting container rosplanning/navigation2:main.release
Warning: No authentication provided, using CircleCI credentials for pulls from Docker Hub.
  image cache not found on this host, downloading rosplanning/navigation2:main.release

Error response from daemon: manifest for rosplanning/navigation2:main.release not found: manifest unknown: manifest unknown

I don't know how to solve this. Could you help me with it please?

@SteveMacenski
Member

Let's continue on the main branch PR, but I'll respond to your comments here. It might be good to just close this PR, because I can cherry-pick it over to galactic once completed, without a new PR. CI isn't quite running on the galactic branch yet; it's nothing you did. We've been experimenting with new CI technologies and docker environments, so things are still in a state of flux.

@anaelle-sw anaelle-sw closed this Jul 9, 2021
@anaelle-sw anaelle-sw deleted the Wait_for_bt_action_result_after_cancel branch July 22, 2021 10:09
