HBASE-24813 ReplicationSource should clear buffer usage on Replicatio… #2546
@@ -23,6 +23,8 @@
 import java.io.IOException;
 import java.util.List;
 import java.util.concurrent.PriorityBlockingQueue;
+import java.util.concurrent.atomic.LongAccumulator;
+
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.Cell;
@@ -325,4 +327,53 @@ void stopWorker() {
   public boolean isFinished() {
     return state == WorkerState.FINISHED;
   }
+
+  /**
+   * Attempts to properly update <code>ReplicationSourceManager.totalBufferUsed</code> in case
+   * there were unprocessed entries batched by the reader to the shipper, but the shipper did not
+   * manage to ship them because the replication source is being terminated. In that case, it
+   * iterates through the batched entries and subtracts the pending entries' size from
+   * <code>ReplicationSourceManager.totalBufferUsed</code>.
+   * <p/>
+   * <b>NOTES</b>
+   * 1) This method should only be called upon replication source termination. It blocks, waiting
+   * for both the shipper and reader threads to terminate, to avoid race conditions when updating
+   * <code>ReplicationSourceManager.totalBufferUsed</code>.
+   *
+   * 2) It <b>does not</b> attempt to terminate the reader and shipper threads. Their
+   * interruption/termination <b>must</b> have been triggered prior to calling this method.
+   */
+  void clearWALEntryBatch() {
+    long timeout = System.currentTimeMillis() + this.shipEditsTimeout;
+    while (this.isAlive() || this.entryReader.isAlive()) {
+      try {
+        if (System.currentTimeMillis() >= timeout) {
+          LOG.warn("Interrupting source thread for peer {} without cleaning buffer usage "
+            + "because clearWALEntryBatch method timed out whilst waiting reader/shipper "
+            + "thread to stop.", this.source.getPeerId());
+          Thread.currentThread().interrupt();
+        } else {
+          // Wait for both the shipper and reader threads to stop
+          Thread.sleep(this.sleepForRetries);
+        }
+      } catch (InterruptedException e) {
+        LOG.warn("{} Interrupted while waiting {} to stop on clearWALEntryBatch: {}",
+          this.source.getPeerId(), this.getName(), e);
+        Thread.currentThread().interrupt();
+      }
+    }
+    LongAccumulator totalToDecrement = new LongAccumulator((a, b) -> a + b, 0);
+    entryReader.entryBatchQueue.forEach(w -> {
+      entryReader.entryBatchQueue.remove(w);
+      w.getWalEntries().forEach(e -> {
+        long entrySizeExcludeBulkLoad = entryReader.getEntrySizeExcludeBulkLoad(e);
+        totalToDecrement.accumulate(entrySizeExcludeBulkLoad);
+      });
+    });
+    LOG.trace("Decrementing totalBufferUsed by {}B while stopping Replication WAL Readers.",
+      totalToDecrement.longValue());
+    source.getSourceManager().getTotalBufferUsed().addAndGet(-totalToDecrement.longValue());
+  }
 }
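For readers unfamiliar with LongAccumulator, here is a minimal, standalone sketch of the accumulate-then-decrement idea in the patch above. The queue, batch, and counter types are made up for illustration; they are not the HBase classes (entryBatchQueue, WALEntryBatch, totalBufferUsed, etc.).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAccumulator;

// Minimal sketch with hypothetical types: each queued "batch" is just a list of
// entry sizes that were buffered but never shipped; their total is handed back
// to the shared usage counter in one update.
public class BufferQuotaCleanup {

  static void releasePendingBatches(Queue<List<Long>> pendingBatches, AtomicLong totalBufferUsed) {
    LongAccumulator toDecrement = new LongAccumulator(Long::sum, 0);
    List<Long> batch;
    while ((batch = pendingBatches.poll()) != null) { // drain every unshipped batch
      for (long entrySize : batch) {
        toDecrement.accumulate(entrySize);            // sum locally, no contention on the shared counter
      }
    }
    // Single atomic adjustment instead of one CAS per entry.
    totalBufferUsed.addAndGet(-toDecrement.longValue());
  }

  public static void main(String[] args) {
    AtomicLong totalBufferUsed = new AtomicLong(300L);
    Queue<List<Long>> pending = new ConcurrentLinkedQueue<>();
    pending.add(Arrays.asList(100L, 50L));
    pending.add(Arrays.asList(150L));
    releasePendingBatches(pending, totalBufferUsed);
    System.out.println(totalBufferUsed.get()); // prints 0
  }
}
```

The sketch drains with poll() rather than removing while iterating, but the effect is the same as in the patch: pending entry sizes are summed locally and the shared counter is adjusted with a single addAndGet rather than one update per entry.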
ankitsinghal (reviewer) commented:
If a worker is doing some async work when it is asked to stop, it can take time to finish. In that case I think we should keep the implementation as it was before: ask all workers to stop at once and then wait. Otherwise, if the number of workers gets large due to backlog and someone raises the wait time config to tens of seconds, the removePeer command/procedure has to wait a long time (number of workers * (sleep time + time for clearWALEntryBatch)) to terminate the replication source.
wchevreuil (PR author) replied:
Sorry, I'm not following your concern here. I don't see how an extra loop in the same method context, which just sets one flag in the shipper and another in the reader, helps with the contention scenario described; the terminate execution would be stuck in the second for loop anyway.
ankitsinghal replied:
Sure, let me try to explain again. I was referring to restoring this loop.
Your current flow stops the workers sequentially, so in the worst case you would have to wait for (number of workers * min(time taken by the worker to finish, timeout)).
By restoring the old loop, you parallelize the stopping of the workers.
wchevreuil replied:
Got you, thanks for explaining in more detail. Will address it in the next commit.
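To make the timing argument in the thread above concrete, here is a minimal editor's sketch (plain threads with made-up cleanup work, not the HBase shipper/reader classes): signalling every worker before waiting bounds the total termination time by the slowest worker, whereas signal-then-wait per worker adds the individual waits up.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the two shutdown orderings being discussed. The worker
// threads here are hypothetical stand-ins, not the HBase shipper/reader.
public class ShutdownOrdering {

  // Sequential: signal one worker, wait for it, then move to the next.
  // Worst case is roughly the SUM over workers of min(stop time, timeout).
  static void stopSequentially(List<Thread> workers, long timeoutMs) throws InterruptedException {
    for (Thread w : workers) {
      w.interrupt();     // ask this worker to stop
      w.join(timeoutMs); // and wait for it before touching the next one
    }
  }

  // Two-phase: signal every worker first, then wait for each of them. Since all
  // workers wind down concurrently once signalled, the total wait is dominated
  // by the slowest worker rather than by the sum over all of them.
  static void stopInParallel(List<Thread> workers, long timeoutMs) throws InterruptedException {
    for (Thread w : workers) {
      w.interrupt();     // phase 1: ask everyone to stop at once
    }
    for (Thread w : workers) {
      w.join(timeoutMs); // phase 2: then wait for each in turn
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Each worker "cleans up" for about 2 seconds after being interrupted.
    Runnable task = () -> {
      try {
        Thread.sleep(60_000);       // simulated ongoing work
      } catch (InterruptedException e) {
        try {
          Thread.sleep(2_000);      // simulated cleanup after the stop signal
        } catch (InterruptedException ignored) {
        }
      }
    };
    List<Thread> workers = Arrays.asList(new Thread(task), new Thread(task), new Thread(task));
    workers.forEach(Thread::start);

    long start = System.currentTimeMillis();
    stopInParallel(workers, 10_000); // finishes in ~2s; stopSequentially would take ~6s here
    System.out.println("stopped in " + (System.currentTimeMillis() - start) + " ms");
  }
}
```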