-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Check if goal_status_ is valid. #2305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check if goal_status_ is valid. #2305
Conversation
|
Does #2304 resolve your issue as well? This seems related |
|
I tried to reproduce SIGSEGV with this patch. The problem was still the same. |
|
I think it might be an |
|
You are correct -- that was an oversight on my part. Let me give this a fresh look |
SteveMacenski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A final question I had was in checking goal_handle_, I don't see anywhere where a reset() on the shared pointer was done in terminal cases so that the bool of the shared pointer would return false after the goal handle was no longer valid.
It seems like result_callback and halt() should be resetting the goal handle potentially in those terminal cases?
| { | ||
| goal_updated_ = false; | ||
| on_new_goal_received(); | ||
| if (goal_handle_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this condition is not true (e.g. the goal_handle_ is not valid), then we pass over this which seems reasonable since we have no valid status. However, then we just continue the looping process. Is the idea that the result callback will be triggered sometime in the near-term future to set result_ so that this BT node would then exit? That would make sense to me, but see the comment below -- it looks like you block that from occurring as well. So how would the BT node exit RUNNING?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not think about this case.
Is following snippet ok?
if (goal_handle_) {
auto goal_status = goal_handle_->get_status();
if (goal_updated_ && (goal_status == action_msgs::msg::GoalStatus::STATUS_EXECUTING ||
goal_status == action_msgs::msg::GoalStatus::STATUS_ACCEPTED))
{
goal_updated_ = false;
on_new_goal_received();
}
rclcpp::spin_some(node_);
// check if, after invoking spin_some(), we finally received the result
if (!goal_result_available_) {
// Yield this Action, returning RUNNING
return BT::NodeStatus::RUNNING;
}
}
this snippet is not tested yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this solves the problem -- if goal_handle_ == false, then the node that's being ticked will never exit, there should be some exit criteria. This is what both of these comments are pointing to, there's an issue where if not goal_handle_, then the BT node would never exit the running state. One or the other here needs a way to handle the case (probably the other makes more sense) so that it can exit.
| // if goal ids are not matched, the older goal call this callback so ignore the result | ||
| // if matched, it must be processed (including aborted) | ||
| if (this->goal_handle_->get_goal_id() == result.goal_id) { | ||
| if (this->goal_handle_ && (this->goal_handle_->get_goal_id() == result.goal_id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, if this callback is triggered, there must be a result so the goal handle has to be valid still, right? Did you end up actually having an issue in this callback? If not, I think this one can be removed.
If this one does need to stay, then we need to have some condition to check if the goal handle is invalid and if so, think closely about the value we should be setting for goal_result_available_ and result_ in that case. This would create a problem if goal_handle_ was invalid that a result was triggered by this callback but then the BT node's tick() never has a result_ to exit https://github.com/ros-planning/navigation2/blob/e95a2299be32862cd37a25ff5571e3ceed2aaca8/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L204-L212 and move onto other nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to experiment, this check is needed.
This reverts commit d77e5b9.
|
Any thoughts? |
|
The code should have been better thought out. |
nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp
Outdated
Show resolved
Hide resolved
nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp
Outdated
Show resolved
Hide resolved
nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| if (!goal_handle_) { | ||
| throw std::logic_error("BtActionNode::Tick: invalid goal handle"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.... that would make it exit in the condition that the goal_handle_ is already complete, but lets think about this more strategically -- because otherwise your navigation system every time you run into this issue is going to be throwing this error all over the place.
What situations do you have that goal_handle_ is not valid? Answer that main question first. Is it only when canceled or aborted? Where is it happening? If so, then in the result callback, you can check if its not valid and then have an else statement to set the result type to that return code https://github.com/ros-planning/navigation2/blob/7877a1225c549aa0ced78a0f263cbff31829761c/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L276-L279. Then since result_ has a populated value, you get a clean exit on the next iteration.
Rather than masking this issue, we should work to fix it. When / why is goal_handle_ being turned into a nullptr? Answer that, and we can find a real solution -- whether that's fixing an issue in this file, or in the file that's nulling out the pointer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I will start investigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
send_goal_options.result_callback =
[this](const typename rclcpp_action::ClientGoalHandle<ActionT>::WrappedResult & result) {
// TODO(#1652): a work around until rcl_action interface is updated
// if goal ids are not matched, the older goal call this callback so ignore the result
// if matched, it must be processed (including aborted)
if (!this->goal_handle_) {
RCLCPP_ERROR(node_->get_logger(), "error id: %d", result.code);
}
if (this->goal_handle_ && (this->goal_handle_->get_goal_id() == result.goal_id)) {
goal_result_available_ = true;
result_ = result;
}
};
[bt_navigator-11] [ERROR] [1618637514.271798607] [bt_navigator_rclcpp_node]: error id: 4
[bt_navigator-11] [ERROR] [1618637514.272602967] [bt_navigator_rclcpp_node]: error id: 5
[bt_navigator-11] [ERROR] [1618637647.774230873] [bt_navigator_rclcpp_node]: error id: 4
[bt_navigator-11] [ERROR] [1618637647.839067741] [bt_navigator_rclcpp_node]: error id: 4
[bt_navigator-11] [ERROR] [1618637698.121035886] [bt_navigator_rclcpp_node]: error id: 5
[bt_navigator-11] [ERROR] [1618637712.920679316] [bt_navigator_rclcpp_node]: error id: 5
According to experiment, result.code is 4 or 5(namely SUCCEEDED or CANCELED).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What situations do you have that goal_handle_ is not valid?
It is when on_new_goal_received is not called yet.
When / why is goal_handle_ being turned into a nullptr?
As far as I can see, goal_handle_ is not turned into a nullptr.
goal_handle_ is originally a nullptr when BT node is created.
But original code made assumption that goal_handle_ is always valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tzskp1 https://github.com/ros-planning/navigation2/blob/d7beb7104b952f89399fac7b78b80074a7d55ddc/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L177 on_new_goal_received is the first function called after on_tick() which populates that goal_ on the first iteration.
It is incorrect that we always assume its valid, but we need to then reset this handle properly when its terminated and also know what a nullpointer means in the context. The goal_handle_ inside of the lambda in on_new_goal_received isn't called until the result message is received (its another function, not code called on the first call of on_new_goal_received)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its another function, not code called on the first call of
on_new_goal_received
As far as I know, this spin calls the callback.
https://github.com/ros-planning/navigation2/blob/7877a1225c549aa0ced78a0f263cbff31829761c/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L284
So the assignment statement follows the callback.
My intention is that nullptr should be one of the following.
- BT node is created. (
goal_handle_is initialized to nullptr by constructor.) - BT node is halted.
|
My elementary reasoning is that the point is how As far as I can see, when the BT node is firstly ticked, There is only one the assignment of So if It means that original code has potential of null pointer access. Because result callback may be called before that the assignment statement is called. Is my understanding correct? |
| // if goal ids are not matched, the older goal call this callback so ignore the result | ||
| // if matched, it must be processed (including aborted) | ||
| if (this->goal_handle_->get_goal_id() == result.goal_id) { | ||
| if (!this->goal_handle_ || (this->goal_handle_->get_goal_id() == result.goal_id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!this->goal_handle_ is if and only if first tick.
|
I'm not sure where we're at on this PR anymore to be frank. You reset the You mention you only see it on cancel and success, I'm not sure what the best action is here. I think the cancel could only have come from the |
|
I haven't checked, but I think success will happen on the first tick. |
Could you elaborate please? |
|
I wanted to mention as part of an unrelated PR, alot of this was done to maintain state better: #2320. Can you try this PR and see if it resolves your issues? I think maybe the
The result callback can be triggered from a number of different goal handles since we pass the same lambda to the action request. Checking if the goal IDs are equal is an important step because there is no promise that they're the same. We could cache the goal handle somewhere else so it doesn't try to access there. But either way, if the goal handle isn't valid for some reason, that result callback wasn't triggered for nothing, there's something going on there that we need to process. Just ignoring it isn't a real solution. |
|
I could reproduce the problem at the PR: #2320. This call may be called before that the assign statement is called. backtrace is here: |
|
Have you found a way (or test case) to reproduce the problem? |
|
No I haven't. |
|
Oh. Did you run the navigation2 without any modifications? Such as the |
|
The original problem is checked by pure navigation2. |
|
Do you have any logs? By the way, which ARM development board are you using? Raspberry Pi 4 Model B? |
|
full launch log is here:
I am using Jetson AGX Xavier devkit. |
|
According to the logs, I probably have found the reason of this problem. |
|
Is that true? |
|
I think the reason is that it did not check if the But I am not sure what else potential problems with that PR. |
|
Would you mind using the pure Navigation2 to reproduce the problem again? If my analysis is correct, when you reproduce this problem you could see a With the pure Navigation2, this problem occurs in this case.
To solve this problem, it's probably to set BT status to @SteveMacenski Maybe this issue is related to this PR #2320. |
|
I don't think these are directly related, but the implementation of #2320 has implication for this issue. That traceback also gives us a little bit of an understanding, but without
I don't understand how you expect to
https://github.com/ros-planning/navigation2/blob/f83612d28ce0db76436cf711071129eb17099c8f/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L291 #2320 does this |
|
I understand. |
I haven't thought it carefully. I was considering about how could we avoid |
|
I reproduced SEGV on pure navigation2 (i.e #2320 without modification). I may have found a way to reproduce it.
|
|
I'm afraid you are still applying the patch(#2320), because there is no If you want to figure out the initial issue you're running into, you could checkout to branch |
|
@SteveMacenski here is a short segment of logs: here is the backtrace:
I noticed that |
That's not a bad point, but actually this PR: #2334 changes that so that each Note that PR is still in review, so there are some conflicts / changes I requested, but by in large it should work (the review requested changes arent on the action nodes, but you would have to deal with the conflicts locally when you pulled in master) |
|
To my understanding, this PR: #2334 could guarantee that each BT node would only receive its messages, please correct me if I am wrong. If so, maybe that could solve the issue. |
|
Thank you for your support. backtrace is here: |
|
@tzskp1 Can you test with a new pull of rclcpp master? @KavenYau just merged something that was occasionally not sending result callbacks in rclcpp itself that might have caused this problem. @KavenYau given your now deep familiarity with rclcpp_actions, do you think this is potentially another issue in there or something in nav2 itself? |
|
@SteveMacenski According to that backtrace, I think this is a https://github.com/ros-planning/navigation2/blob/4277d00d52e391de7c0e7c52658bba7b00447563/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L345-L359 In this case, if it sent a goal before, we could try to get To make it more clearly, I make a PR: #2356. @tzskp1 You may want to have a try? There might be some compiling errors on the latest commit of branch |
|
I will try to reproduce the problem. |
|
No upsetting 😄 . But It seems a little more complicated than I thought. |
|
I could reproduce the problem at rclcpp: 1c617515403db16b8cbabd834073eaca0f9a452d. |
|
Yes, it seems that it receives/handles a |
|
Did you test @KavenYau ‘s suggestion? They kindly made a PR to support! |
Basic Info
Description of contribution in a few bullet points
goal_handle_for avoiding race conditions.For Maintainers: