-
Notifications
You must be signed in to change notification settings - Fork 9.2k
YARN-11924. Add zkManager.exists(path) check to ZKConfigurationStore:… #8222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
|
💔 -1 overall
This message was automatically generated. |
Hean-Chhinling
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big thanks to you, @ferdelyi for working on this.
This is a huge help to the intermittent RM start up failures.
The code overall is really good.
I just have some improvement ideas and some questions
.../apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
| // when the /confstore/CONF_STORE path does not exist, hence the | ||
| // getZkData method returns a null value causing the RM to fail. | ||
| // To prevent this, added a re-try mechanism before giving up. | ||
| int maxRetries = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of making the Max retries configurable like the sleepBetweenRetries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've discussed with @K0K0V0K, and he suggested that we are OK to only have the retry seconds configurable. But I am OK either way. It feels a bit overkill to me. We already have so many configs.
| protected byte[] getZkData(String path) throws Exception { | ||
| // Should a 'yarn resourcemanager -format-state-store' command is issued | ||
| // while one of the RM is in a starting state, there is a time period | ||
| // when the /confstore/CONF_STORE path does not exist, hence the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue also occur when the /confstore/CONF_STORE path exists but the data is not fully written this also cause another RM to fail without the root.default queue. Thus retry and make sure the data is not empty.
I think we could add the followings:
"when the /confstore/CONF_STORE path does exists or when the data is not fully written"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we format the confstore, because of YARN-11551, my current understanding is that we will end up with an empty /confstore, hence the /confstore/CONF_STORE does not exist. Can you please verify if my understanding is correct?
.../apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Show resolved
Hide resolved
|
|
||
| /** Retry period for ZKConfigurationStore will create znodes. */ | ||
| @Private | ||
| @Unstable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just curious here.
What does the "unstable" annotation means here and why do we it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I clicked through it, it led to an explanation, but it is a bit unclear to me if we are still using it. @K0K0V0K do you know more on this topic?
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
|
💔 -1 overall
This message was automatically generated. |
…getZkData() and retry mechanism Should a "yarn resourcemanager -format-state-store" command be issued while one of the RM is starting and in the INIT state (because of YARN-11551), there is a time period when the /confstore/CONF_STORE path does not exist, hence the getZkData method returns a null value, causing the RM to fail. To prevent this, add a check and re-try mechanism before giving up.
|
💔 -1 overall
This message was automatically generated. |
…getZkData() and retry mechanism
Should a "yarn resourcemanager -format-state-store" command be issued while one of the RM is starting and in the INIT state (because of YARN-11551), there is a time period when the /confstore/CONF_STORE path does not exist, hence the getZkData method returns a null value, causing the RM to fail. To prevent this, add a check and re-try mechanism before giving up.
Description of PR
Rare race condition is addressed when "yarn resourcemanager -format-state-store" issued when an RM is in the INITING state (already initialized the confstore) right before reading it. This change avoids a null pointer exception.
How was this patch tested?
Manually with locks introduced in the RM at the confstore format step with sleep, so while one of the RM is formatting the statestore, the other RM will be at the getZkData method trying to read the confstore in the INIT state.
Also with added unit tests.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?AI Tooling
If an AI tool was used:
where is the name of the AI tool used.
https://www.apache.org/legal/generative-tooling.html