[HUDI-5280] Fix taskmanager concurrent registry #7306
…hread when the taskmanager concurrently registered metrics
KnightChess
left a comment
Can't #6968 fix this?
Related to HUDI-5041.
    return lockDuration;
    if (registry.getMetrics().get(metricName) == null) {
      synchronized (Registry.class) {
        if (registry.getMetrics().get(metricName) == null) {
Why can't REGISTRY_LOCK work correctly here? Can you elaborate?
Why can't REGISTRY_LOCK work correctly here? Can you explain in detail?
With the reentrant lock, the task threads were isolated from each other. However, a second task thread can enter the synchronized scope and wait there while the first task thread registers the value. When the second thread then proceeds, it may not see the metric that was just registered, so it registers the metric again. That is how the duplicate registration happens.
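The re-check inside the lock, as in the diff above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `DoubleCheckedRegistrationSketch`, its map, its counter, and the metric name `lock.duration` are all hypothetical stand-ins for the real registry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a metric registry; not Hudi's or Flink's actual class.
class DoubleCheckedRegistrationSketch {
  private final Map<String, Object> metrics = new ConcurrentHashMap<>();
  private final AtomicInteger registrations = new AtomicInteger();

  void registerOnce(String metricName) {
    // First check without the lock: cheap fast path once the metric exists.
    if (metrics.get(metricName) == null) {
      synchronized (DoubleCheckedRegistrationSketch.class) {
        // Second check under the lock: a thread that waited here while another
        // thread registered the metric now sees it and skips registration.
        if (metrics.get(metricName) == null) {
          metrics.put(metricName, new Object());
          registrations.incrementAndGet();
        }
      }
    }
  }

  int registrationCount() {
    return registrations.get();
  }

  // Starts `threads` tasks at the same moment and returns how many
  // registrations actually happened for one shared metric name.
  static int runConcurrently(int threads) {
    DoubleCheckedRegistrationSketch sketch = new DoubleCheckedRegistrationSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        sketch.registerOnce("lock.duration");
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sketch.registrationCount();
  }
}
```

Calling `DoubleCheckedRegistrationSketch.runConcurrently(8)` should report a single registration regardless of how the threads interleave, because only the first thread to pass the inner check performs the `put`.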
I mean that REGISTRY_LOCK can still be used instead of locking on the Registry class object.
Yes, it is not necessary to lock the whole class. I will make the correction within the next two days.
I'm sorry, I have been busy with work recently and missed your messages. With the reentrant lock, the task threads were isolated from each other. However, a second task thread can enter the synchronized scope and wait there while the first task thread registers the value. When the second thread then proceeds, it may not see the metric that was just registered, so it registers the metric again. That is how the duplicate registration happens.
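The reviewer's suggestion, keeping the existing reentrant lock and performing the existence check while holding it, might look like the sketch below. It is only an illustration: the thread mentions `REGISTRY_LOCK` as a reentrant lock, but its declaration here, the `RegistryLockSketch` class, the `METRICS` map, and `registerIfAbsent` are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; the field and method names are assumptions for illustration.
class RegistryLockSketch {
  // A dedicated lock guards only the registry, so unrelated code that
  // synchronizes on the class object is not blocked.
  private static final ReentrantLock REGISTRY_LOCK = new ReentrantLock();
  private static final Map<String, Object> METRICS = new HashMap<>();

  // Returns true if this call registered the metric, false if it already existed.
  static boolean registerIfAbsent(String metricName, Object metric) {
    REGISTRY_LOCK.lock();
    try {
      // The check and the put happen under the same lock, so a thread that
      // waited for the lock sees the earlier registration and does not repeat it.
      if (METRICS.containsKey(metricName)) {
        return false;
      }
      METRICS.put(metricName, metric);
      return true;
    } finally {
      // Always release the lock, even if registration throws.
      REGISTRY_LOCK.unlock();
    }
  }
}
```

The design point is that the lock alone does not prevent duplicates; the existence check has to run inside the locked region, whichever lock object is used.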
yihua
left a comment
@danny0405 is this fix still needed?
Should be fixed.
Change Logs
When running a MOR job on Flink, we found that the TaskManager log showed the related metrics being loaded repeatedly. We concluded that the task threads were not isolated from each other during concurrent access, so the error occurred on the next task thread's access. After the fix, the problem disappeared.
Impact
Risk level (write none, low, medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist