Skip to content

Conversation

@kaori-seasons
Copy link

@kaori-seasons kaori-seasons commented Nov 26, 2022

Change Logs

When running the MOR task on flink, we found that the taskmanager log repeatedly loaded related metrics, so we thought that the task thread did not isolate it during concurrent access, causing this error to occur in the next Task thread access. After the repair, the problem disappeared

Impact

error

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

chengxy added 2 commits November 23, 2022 15:55
…hread when the taskmanager concurrently registered metrics
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@KnightChess KnightChess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is #6968 can not fix it?

@codope codope added priority:medium Moderate impact; usability gaps area:metrics Metrics and monitoring labels Nov 29, 2022
@codope
Copy link
Member

codope commented Nov 29, 2022

related to HUDI-5041
@complone Can you please explain what was the gap in #6968 ?
cc: @hbgstc123 @nsivabalan

return lockDuration;
if (registry.getMetrics().get(metricName) == null) {
synchronized (Registry.class) {
if (registry.getMetrics().get(metricName) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the REGISTRY_LOCK can not work correctly here, can you elaborate the explanation then ?

Copy link
Author

@kaori-seasons kaori-seasons Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么REGISTRY_LOCK不能在这里正常工作,你能详细解释一下吗?

When using the reentrant lock, the Task threads are isolated from each other before, but there will be a situation where the second Task thread enters the synchronized scope and waits after the first Task thread registers the value. At this time, the metrics may not be obtained. So there is a duplicate registration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the REGISTRY_LOCK can still be used instead of the class obj lock Registry.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is not necessary to lock the whole class. I will make the correction in these two days

@kaori-seasons
Copy link
Author

kaori-seasons commented Dec 5, 2022

is #6968 can not fix it?

I'm sorry that I have been busy with work recently, so I didn't pay attention to your news.

When using the reentrant lock, the Task threads are isolated from each other before, but there will be a situation where the second Task thread enters the synchronized scope and waits after the first Task thread registers the value. At this time, the metrics may not be obtained. So there is a duplicate registration
cc @codope @danny0405 @KnightChess

@nsivabalan nsivabalan added priority:blocker Production down; release blocker priority:critical Production degraded; pipelines stalled release-0.12.2 Patches targetted for 0.12.2 and removed priority:medium Moderate impact; usability gaps priority:blocker Production down; release blocker labels Dec 5, 2022
@codope codope removed the release-0.12.2 Patches targetted for 0.12.2 label Dec 7, 2022
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2024
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 is this fix still needed?

@danny0405
Copy link
Contributor

Should be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:metrics Metrics and monitoring priority:critical Production degraded; pipelines stalled size:S PR with lines of changes in (10, 100]

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

7 participants