-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-29070 Balancer cost function epsilon is imprecise #6597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if (balancedCluster == null) { | ||
| return "null"; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A null list here likely means a test failure, but you're likely to find a more clear failure message somewhere other than here
| assertFullyBalancedForReplicas); | ||
| } | ||
|
|
||
| protected void testWithCluster(Map<ServerName, List<RegionInfo>> serverMap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there are any tests where it actually makes sense to run without iteration
| protected void increaseMaxRunTimeOrFail() { | ||
| long current = getCurrentMaxRunTimeMs(); | ||
| assertTrue(current < MAX_MAX_RUN_TIME_MS); | ||
| setMaxRunTime(Math.max(MAX_MAX_RUN_TIME_MS, current * 2)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal here was just to make this test framework ambitious, but forgiving. If your cluster is too complicated to be evaluated meaningfully in the initial runtime, then we should bump it up before failing
| StochasticLoadTestBalancer() { | ||
| super(new DummyMetricsStochasticBalancer()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The balancer test suite actually couldn't be run in its entirety without this addition, because otherwise there are JMX conflicts. So this is just an unrelated bug fix that I forgot to separate in my planning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we see the impact of this bug in our CI runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly we don't. I don't understand exactly how the CI env differs from my local env to make that the case, but presumably it has better isolation between the tests so that there's no conflict in the JMX setup when running many tests at once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a difference between running in IDE vs directly in maven cli?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
90b49ea to
bfac6e2
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
!!! |
ndimiduk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nits. This is good cleanup and the shorted runtime is a bonus!
|
|
||
| public static final double COST_EPSILON = 0.0001; | ||
| public static double getCostEpsilon(double cost) { | ||
| return Math.ulp(cost); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL
| conf.setClass("hbase.util.ip.to.rack.determiner", MockMapping.class, DNSToSwitchMapping.class); | ||
| conf.setFloat("hbase.master.balancer.stochastic.localityCost", 0); | ||
| conf.setBoolean("hbase.master.balancer.stochastic.runMaxSteps", true); | ||
| conf.setLong(StochasticLoadBalancer.MAX_RUNNING_TIME_KEY, 250); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why override with such a small value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many tests really only need this much, so starting small makes the test suite much faster
...lancer/src/test/java/org/apache/hadoop/hbase/master/balancer/StochasticBalancerTestBase.java
Outdated
Show resolved
Hide resolved
| StochasticLoadTestBalancer() { | ||
| super(new DummyMetricsStochasticBalancer()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we see the impact of this bug in our CI runs?
...lancer/src/test/java/org/apache/hadoop/hbase/master/balancer/StochasticBalancerTestBase.java
Outdated
Show resolved
Hide resolved
...lancer/src/test/java/org/apache/hadoop/hbase/master/balancer/StochasticBalancerTestBase.java
Outdated
Show resolved
Hide resolved
| loadBalancer.loadConf(conf); | ||
| } | ||
|
|
||
| protected void testWithClusterWithIteration(int numNodes, int numRegions, int numRegionsPerServer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not your code, but this argument list is long enough and has enough overlapping argument types that I think a build object is warranted. It would be nice if we could introduce something like http://immutables.github.io/ but that's not this PR.
bfac6e2 to
807c1c4
Compare
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]>
Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]>
Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]>
Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]>
Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]>
Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
… (#6607) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
… (#6610) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…apache#6600) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…cise (apache#6597) (apache#6600) (will be in 2.6.3) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…cise (apache#6597) (apache#6600) (will be in 2.6.3) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…cise (apache#6597) (apache#6600) (will be in 2.6.3) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…apache#6600) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…apache#6600) (apache#6610) Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
Jira
On top of actually making the balancer better, this PR reduces the runtime of the balancer test suite on my machine from about 30min to about 8min
cc @charlesconnell @hgromer @krconv @ksravista