Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -275,6 +275,22 @@ class BtActionNode : public BT::ActionNodeBase
}
}

auto start_time = node_->now();
while (rclcpp::ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll counter this suggestion with moving this logic up a bit. We have the future_cancel object to actually spin in the if()... clause that would be the most proper thing to be playing with. I suggest instead you put the while loop there so that we're spinning until the future_cancel is completed rather than just spin_some() more blindly. That should simply alot of the code you added too!

Copy link
Contributor Author

@anaelle-sw anaelle-sw Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly agree with you, the while loop could be simplified. However:

  • I think we need to keep the timeout in order to prevent infinite looping/spinning

  • I would rather not run a spin_some() blindly too, but we can not use the future_cancel to do so. Indeed the variable future_cancel can only be used to check on the cancel result availability.
    To check goal result availability, one can use the variable goal_result_available_ (which is a boolean variable already implemented, not a future). As this variable is only updated while the callback_group_executor_ is spinning, the spin_some() has been added here to check on the goal result availability.

Copy link
Member

@SteveMacenski SteveMacenski Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assertion (which you could test and let me know if I'm wrong) is that when the future_cancel returns true after spinning, that you're guaranteed to have a result message waiting in your callback queue for a spin_some() to pick up. That might be a subtle detail I picked up from reading the rclcpp_action codebase. If that is accurate, then there is no need for a while() loop at all, therefore no requirement for a timeout from that while loop.

That still doesn't help you if the cancel_future failed, but that's not the issue we're addressing here and you're screwed either way 😄

&& !goal_result_available_
&& node_->now() - start_time < server_timeout_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the condition node_->now() - start_time < server_timeout_ basically is just giving the cancel future only 1 more set of server_timeout_ to get a cancel. The spin until future complete in L269 will block for server_timeout_ waiting on this while loop will also block for another set of server_timeout_ but is that really what you want to do? Its just giving that now 2 chances of that same length to do so, so you could still run into this issue if it exceeds that time frame.

I don't think you actually really care about the goal_result_available_ as much as you care that the cancel_future is completed, indicating that the server actually canceled successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your server_timeout set to? We did increase this to I believe 20ms up from 10ms a few months ago when this logic for the server timeout was changed to enable the BT loop duration to move at a consistent rate

Copy link
Member

@SteveMacenski SteveMacenski Jul 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mention

This will force the result_callback to be executed after the sub-action is cancelled, allowing to get a clean BtNavigator's action result.

and point to the spin_some(). So is some of this that you actually just want to spin to get a result callback more than anything else? If that' true, we could add a spin_some() to halt() after cancellation as well. However if the return code on the future_cancel is SUCCESS then we know it triggered so just 1 spin_some() should do the trick without having to do any of the rest of it. So if really all you want to do is spin_some() and you're not having an issue with future_cancel not completely finishing, then none of the while loop stuff is required at all and some of my suggestions above are irrelevant.

As far as I can tell from this PR there are 2 distinct issues you're trying to address (or just 1 and I'm misinterpreting for the second):

  • cancel_future doesn't complete in server_timeout_ so you want to spin for a bit longer until it actually does successfully (or is that not an issue you're running into, with that error log?)
  • Because you don't spin after canceling, the callback for result isn't triggered in that same BT run so its messing up the state for the next run, so you're adding in a spin_some() to process the cancel result callback before exiting halt() to clean things up for the next run.

If only (B) is your issue, no changes are required to the code above this block and this block can be largely simplified without the while to just be spin_some() since if cancel_future is successful, you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now with the while loop with spinning the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the condition node_->now() - start_time < server_timeout_ basically is just giving the cancel future only 1 more set of server_timeout_

Yes, that's right ...

to get a cancel

... but no, not a cancel. This condition does not aim at getting the cancel result, but the goal result.
In my understanding, here is what happen when entering the halt() function:

  1. First step of the halt() function, already implemented: The cancel request is sent to the sub-action. For a time period of maximum server_timeout_, we wait for the cancel result. Considering we have no trouble for getting the cancel future, there are two possibilities at the end of this first set of server_timeout_:
  • Case A: The cancel result and the goal result are available
  • Case B: Only the cancel result is result is available. Actually, in my opinion, this is not a exceptional case at all, since the sub-action can automatically accept the cancel requests (so the cancel result will be "instantaneously" available after sending the cancel request), but this same sub-action can take some time to properly end/stop the on-going goal after the cancel acceptation (so the goal result will not be available right after the cancel acceptation). I think it depends on the behavior you want your sub-action to have in case of a (successful) cancel.
  1. Second step of the halt() function, object of this PR: We want to check for the goal result availability before ending the halt of the BT. This second step depends on the case obtained at step 1:
  • If Case A: we don't even enter the while loop since the goal result is already available.
  • If Case B: we enter the while loop. For a time period of maximum server_timeout_, we wait for the goal result.

So I think that waiting for maximum two set of server_timeout_ for properly ending the sub-action in case of successful cancel is:

  • neither only useful for isolated use cases,
  • neither too much cautious / a possible slow down cause,

since it strongly depends on the desired behavior of the sub-action when it is being cancelled, and since it seems normal to me to wait for the sub-action to properly ends before halting the BT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is your server_timeout set to? We did increase this to I believe 20ms up from 10ms a few months ago when this logic for the server timeout was changed to enable the BT loop duration to move at a consistent rate

In our use case, we set the server_timeout to 100 ms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is some of this that you actually just want to spin to get a result callback more than anything else?

From a behavior point of view, what we want more than anything else is the sub-action to be cleanly finished before halting the BT. To know if it is indeed cleanly finished, we check on the availability of the sub-action goal result, since the goal result is available only once the sub-action has ended. Once the sub-action goal result is available, the goal result callback is automatically executed. So yes, saying that what we want more than anything else is the sub-action to be cleanly finished, is actually saying we want to spin to get a result callback more than anything else.

However if the return code on the future_cancel is SUCCESS then we know it triggered so just 1 spin_some() should do the trick without having to do any of the rest of it.

No I don't agree with this, there is an explanation in this previous comment

As far as I can tell from this PR there are 2 distinct issues you're trying to address (or just 1 and I'm misinterpreting for the second):

* `cancel_future` doesn't complete in `server_timeout_` so you want to spin for a bit longer until it actually does successfully (or is that not an issue you're running into, with that error log?)

* Because you don't spin after canceling, the callback for result isn't triggered in that same BT run so its messing up the state for the next run, so you're adding in a `spin_some()` to process the cancel result callback before exiting `halt()` to clean things up for the next run.

If only (B) is your issue, no changes are required to the code above this block and this block can be largely simplified without the while to just be spin_some() since if cancel_future is successful, you're promised to have that waiting in your callback queue. If (A) is your issue, then we should work on combining what you have now with the while loop with spinning the future.

Well, I would said only (B) is an issue, but not exactly. I would reformulate it this way to make it accurate (changes are in bold):
Because you don't spin after canceling, the callback for goal result isn't triggered in that same BT run so its messing up the state for the next run, so you're adding in a spin_some() to process the goal (Not cancel) result callback before exiting halt() to clean things up for the next run.

So, I don't really agree with your conclusion about only (B) being an issue. If the cancel_future is successful, it doesn't mean the sub-action will have enough time to finish cleanly before halting the BT. It only means that the cancel request has been accepted. Which is great, but not enough, since the sub-action might not stop the current goal right after having accepting the cancel request.

Copy link
Member

@SteveMacenski SteveMacenski Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- see comment in the thread above. I think we're on the same page, but I think we can simplify this if we leverage some internal knowledge about how the actions work. If that turns out not to be true, then we can discuss what to do so we can make sure we process the goal callback. What you've suggested is not unreasonable though.

RCLCPP_DEBUG(
node_->get_logger(),
"Waiting for goal result after cancelling, for action %s", action_name_.c_str());
callback_group_executor_.spin_some();
rclcpp::sleep_for(std::chrono::milliseconds(1));
}
if (!goal_result_available_) {
RCLCPP_ERROR_STREAM(
node_->get_logger(),
"Failed to get result for action " << action_name_.c_str() << " after cancelling. BT might not finish cleanly. ");
}

setStatus(BT::NodeStatus::IDLE);
}

Expand Down