-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-29251 Procedure gets stuck if the procedure state cannot be persisted #6910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if (updateFailForTest) { | ||
| // test for HBASE-29251 | ||
| throw new IOException("Update failed"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a good way to test his, but since MasterRegion is final class, extending it is also not possible.
Using this, we can reproduce the exact issue with the test if we don't abort master with IOE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be fine to remove final from MasterRegion class so that we can extend it for testing purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that and then make this test more clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MasterRegion has a private constructor, that's why we mark it as final.
Since UpdateMasterRegion is just a interface, I think it is very easy to verify the changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And inside MasterRegion we just use HRegion, for HRegion, we have a way to inject specify implementation class. Please see HRegion.newHRegion
public static HRegion newHRegion(Path tableDir, WAL wal, FileSystem fs, Configuration conf,
RegionInfo regionInfo, final TableDescriptor htd, RegionServerServices rsServices) {
try {
@SuppressWarnings("unchecked")
Class<? extends HRegion> regionClass =
(Class<? extends HRegion>) conf.getClass(HConstants.REGION_IMPL, HRegion.class);
Constructor<? extends HRegion> c =
regionClass.getConstructor(Path.class, WAL.class, FileSystem.class, Configuration.class,
RegionInfo.class, TableDescriptor.class, RegionServerServices.class);
return c.newInstance(tableDir, wal, fs, conf, regionInfo, htd, rsServices);
} catch (Throwable e) {
// todo: what should I throw here?
throw new IllegalStateException("Could not instantiate a region instance.", e);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any custom implementation of UpdateMasterRegion also needs custom hooks in Procedure executor classes, which make it more complicated to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to define what you want to test, if you just want to make sure that if there is an exception you will call abort, it is very easy, and even do not need to bring up a cluster.
If you want to do something like a integration tests, you can extend HRegion, and there are bunch of ways to decide whether to throw an exception in batchMutate method. You can make the specific HRegion implementation an inner class of the testcase, and set a static field in the test class to control whether to throw an exception...
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
apurtell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved but do consider a cleaner test
| if (updateFailForTest) { | ||
| // test for HBASE-29251 | ||
| throw new IOException("Update failed"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that and then make this test more clean.
| server.abort("WAL sync timeout", e); | ||
| } catch (IOException e) { | ||
| LOG.error(HBaseMarkers.FATAL, "MasterRegion mutation is not successful. Aborting server."); | ||
| server.abort("MasterRegion mutation is not successful", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aborting is a start.
The rest of my question here is addressed by the discussion on the JIRA.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| } catch (WALSyncTimeoutIOException e) { | ||
| LOG.error(HBaseMarkers.FATAL, "WAL sync timeout. Aborting server."); | ||
| server.abort("WAL sync timeout", e); | ||
| } catch (IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better add some comments here, to summary the discussion on the jira, and also give a pointer to the jira, to let later developpers know why here we will abort for any IOExceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
| public static void setUpBeforeClass() throws Exception { | ||
| TEST_UTIL.getConfiguration().setClass(HConstants.REGION_IMPL, TestRegion.class, HRegion.class); | ||
| StartTestingClusterOption.Builder builder = StartTestingClusterOption.builder(); | ||
| builder.numMasters(4).numRegionServers(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need 4 masters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 would also work (2 are aborted) but i just kept one additional. I can keep it 3 if you are not fine with this.
This comment has been minimized.
This comment has been minimized.
Reidddddd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| // RegionTooBusyException is the type of IOException for which we can retry | ||
| // for few times before aborting the active master. The master region might | ||
| // have genuine case for delayed flushes and/or some procedure bug causing | ||
| // heavy pressure on the memstore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If RegionTooBusyException is caught, here can trigger flusherAndCompactor.onUpdate();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean for tries == (maxRetriesForRegionUpdates - 1) condition? Otherwise, it will do it anyways for all retries as per the above loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean like this:
} catch (RegionTooBusyException e) {
flusherAndCompactor.onUpdate();
if (tries == (maxRetriesForRegionUpdates - 1)) {
***
}
}
The time interval hasn't been reached, but the changesAfterLastFlush threshold has been met, could happen, so when this exception caught. you need to trigger flusherAndCompactor.onUpdate();
Or otherwise, you need to switch the execution order
flusherAndCompactor.onUpdate(); // first
action.update(region); // after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
|
Will you add back-off feature in next ticket? |
| abortServer(e); | ||
| } | ||
| LOG.info("Master region {} is too busy... retry attempt: {}", region, tries); | ||
| Threads.sleep(ConnectionUtils.getPauseTime(regionUpdateRetryPauseTime, tries)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Reidddddd exponential backoff is added here. I think I should add comment here, because single line is not readable enough, let me do that.
Reidddddd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Thanks everyone for the reviews! Awaiting final build results before merging the PR, the jenkins build is still stuck scheduling the build to a VM. |
This comment has been minimized.
This comment has been minimized.
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
…sisted (#6910) Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> Signed-off-by: gvprathyusha6 <[email protected]>
…sisted (#6916) (#6910) Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> Signed-off-by: gvprathyusha6 <[email protected]>
…sisted (#6916) (#6910) Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> Signed-off-by: gvprathyusha6 <[email protected]>
…sisted (#6916) (#6910) Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> Signed-off-by: gvprathyusha6 <[email protected]>
…sisted (apache#6916) (apache#6910) Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> Signed-off-by: gvprathyusha6 <[email protected]>
Jira: HBASE-29251