HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas #4410
Conversation
…mber of usable replicas
The failing unit test is: This seems to be unrelated to my change and a flaky unit test:
Thanks @KevinWikant for the PR. We have seen this bug coming up a few times in the last few months, and the PR seems to fix it. Although I am wondering whether there could be a case where the block shows up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas(). An edge case coming to my mind is when we consider the block on a decommissioning node as usable but the very next moment the node is decommissioned.
Hi @ashutoshcipher, thank you for reviewing the PR! I have tried to address your comments below:

"I am thinking could there be a case where the block can show up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas()"

I am not sure it's possible for a block replica to be reported as both a liveReplica and a decommissioningReplica at the same time. It's my understanding that a replica on a decommissioning datanode is counted as a decommissioningReplica and a non-corrupt replica on a live datanode is counted as a liveReplica. So a block replica on a decommissioning node will only be counted towards the decommissioningReplica count and not the liveReplica count. In my experience I have not seen the behavior you are mentioning before; let me know if you have a reference JIRA.

If the case you described is possible, then in theory numUsableReplica would be greater than it should be. In the typical case where "dfs.namenode.replication.min = 1" this makes no difference, because even if there is only 1 non-corrupt block replica on 1 decommissioning node, the corrupt blocks are invalidated regardless of whether numUsableReplica=1 or numUsableReplica=2 (due to double counting of the replica as both liveReplica and decommissioningReplica). In the case where "dfs.namenode.replication.min > 1" there could arguably be a difference, because the corrupt replicas would not be invalidated if numUsableReplica=1 but they would be invalidated if numUsableReplica=2 (again due to the double counting). If this scenario is possible, I think the correct fix would be to ensure that each block replica is only counted once, towards either liveReplicas or decommissioningReplicas. Let me know if there is a JIRA for this issue and I can potentially look into that bug fix separately.

"Edge case coming to my mind is when we are considering the block on decommissioning node as usable but the very next moment, the node is decommissioned."

Fair point, I had considered this and mentioned this edge case in the PR description: any replicas on decommissioning nodes will not be decommissioned until there are more liveReplicas than the replication factor for the block. It's only possible for decommissioningReplicas to be decommissioned at the same time corruptReplicas are invalidated if there are sufficient liveReplicas to satisfy the replication factor; in this case, because of the liveReplicas it is safe to eliminate both the decommissioningReplicas and the corruptReplicas. If there is not a sufficient number of liveReplicas, then the decommissioningReplicas will not be decommissioned but the corruptReplicas will be invalidated; the block will not be lost because the decommissioningReplicas will still exist. One case you could argue is that if:
By invalidating the corruptReplicas we are exposing the block to a risk of loss should the decommissioningReplica be unexpectedly terminated or fail at around the same time the corruptReplicas are invalidated. This is true, but this same risk already exists where:
Thus far, I am unable to think of any scenario where this change introduces an increased risk of data loss. In fact, this change helps improve block redundancy (reducing risk of data loss) in scenarios where a block cannot be sufficiently replicated because corruptReplicas cannot be invalidated.
aajisaka left a comment
@KevinWikant Nice catch, and thank you for your detailed explanation.
Would you remove the unused imports in the test class? I'm +1 if that is addressed.
@KevinWikant Nice catch +1. I learned a lot from it. thanks~
…mber of usable replicas, revision 2
💔 -1 overall
This message was automatically generated.
aajisaka left a comment
+1. Thank you @KevinWikant for the update. The Javadoc error in Java 11 is not related to the PR. I'll merge this in this weekend if there is no objection.
Filed HDFS-16635 to fix the javadoc error.
@aajisaka Raised a PR for the same.
Merged. Thank you @KevinWikant for your contribution and thank you @ashutoshcipher @ZanderXu for your review!
…mber of usable replicas (#4410) Co-authored-by: Kevin Wikant <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]> (cherry picked from commit cfceaeb)
…mber of usable replicas (#4410) Co-authored-by: Kevin Wikant <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]> (cherry picked from commit cfceaeb) Conflicts: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
…mber of usable replicas (apache#4410) Co-authored-by: Kevin Wikant <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]>
… based on number of usable replicas (apache#4410) Co-authored-by: Kevin Wikant <[email protected]> Signed-off-by: Akira Ajisaka <[email protected]>
HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas
Description of PR
Bug fix for a recurring HDFS bug which can result in datanodes being unable to complete decommissioning indefinitely. In short, the bug is a chicken-and-egg problem where:
In this scenario, the block(s) are minimally replicated, but the logic the Namenode uses to determine whether a block is minimally replicated is flawed. It should be checking whether usableReplicas < "dfs.namenode.replication.min", because decommissioning and entering-maintenance nodes can have valid block replicas which should be replicated to the other datanodes (the ones holding corrupt replicas) when liveReplicas < "dfs.replication".
To understand the bug further we must first establish some background information.
Background Information
Givens:
Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case, the DataStreamer client can shift the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:
The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanode(s) in the pipeline, a corrupt block replica gets left over on the datanode(s) that were shifted away from.
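For illustration only, the kind of client write that can hit this code path is a plain append; the path and class name below are hypothetical, and the replacement behaviour itself is governed by the client's dfs.client.block.write.replace-datanode-on-failure.* settings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf);
         // Hypothetical file path, used only for this sketch.
         FSDataOutputStream out = fs.append(new Path("/tmp/pipeline-demo.log"))) {
      out.writeBytes("another record\n");
      // If a datanode in the write pipeline fails around here, the client can
      // replace it with a new datanode and continue writing; the replaced
      // datanode is left holding a replica with a stale generation stamp,
      // i.e. a corrupt replica.
      out.hflush();
    }
  }
}
```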
This corrupt block replica makes the datanode ineligible as a replication target for the block because of the following exception described in [https://issues.apache.org/jira/browse/HDFS-16064]:
What typically occurs is that these corrupt block replicas will be invalidated by the Namenode, which will cause the corrupt replica to be deleted on the datanode, thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; this can take up to 6 hours with the default block report interval.
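As a reference point, that interval is controlled by dfs.blockreport.intervalMsec, which defaults to 21600000 ms (6 hours). A minimal, illustrative sketch of lowering it for a test cluster:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockReportInterval {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default is 21600000 ms (6 hours); lowering it makes corrupt replicas
    // show up in the Namenode's view sooner. For test clusters only.
    conf.setLong("dfs.blockreport.intervalMsec", 60_000L);
    System.out.println(conf.get("dfs.blockreport.intervalMsec"));
  }
}
```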
As long as there is 1 live replica of the block, all the corrupt replicas should be invalidated based on this condition:
hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java, line 1928 (commit 7bd7725)
When there are 0 live replicas, the corrupt replicas are not invalidated; I assume the reasoning behind this is that it's better to have some corrupt copies of the block than to have no copies at all.
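Putting the two rules together, a simplified, self-contained model of this decision (paraphrased for illustration, not the verbatim BlockManager source) might look like:

```java
/**
 * Simplified model of the invalidation decision described above
 * (a sketch paraphrasing the referenced BlockManager logic).
 */
public final class CorruptReplicaInvalidation {

  /** Pre-fix behaviour: only live replicas count toward dfs.namenode.replication.min. */
  static boolean shouldInvalidateCorruptReplicas(int liveReplicas, int minReplication) {
    return liveReplicas >= minReplication;
  }

  public static void main(String[] args) {
    // With the default dfs.namenode.replication.min = 1:
    System.out.println(shouldInvalidateCorruptReplicas(1, 1)); // true  -> corrupt replicas removed
    System.out.println(shouldInvalidateCorruptReplicas(0, 1)); // false -> corrupt replicas kept
  }
}
```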
Description of Problem
The problem comes into play when the aforementioned behavior is coupled with decommissioning and/or entering maintenance.
Consider the following scenario:
This scenario is essentially a replication/decommissioning deadlock because:
This does not need to be a deadlock: the corrupt replicas could be invalidated and the live replica could be transferred from A to B and C.
The same problem can exist on a larger scale; the requirements are that:
In this case the corrupt replicas will not be invalidated by the Namenode, which means that the decommissioning and entering-maintenance replicas will never be sufficiently replicated, and therefore the nodes will never finish decommissioning or entering maintenance.
The symptom of this issue in the logs is that right after a node with a corrupt replica sends its block report, rather than the corrupt replica being invalidated, it just gets counted as a corrupt replica:
Proposed Solution
Rather than checking if "dfs.namenode.replication.min" is satisfied based on liveReplicas, check based on usableReplicas, where "usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas". This will allow the corrupt replicas to be invalidated so that the live replica(s) on the decommissioning node(s) can be sufficiently replicated.
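Continuing the simplified model from above (again a sketch, not the actual patched BlockManager code), the proposed check would be:

```java
/**
 * Post-fix behaviour (sketch): decommissioning and entering-maintenance
 * replicas are counted as usable when deciding whether corrupt replicas
 * can be invalidated.
 */
public final class UsableReplicaCheck {

  static boolean shouldInvalidateCorruptReplicas(int liveReplicas,
      int decommissioningReplicas, int enteringMaintenanceReplicas,
      int minReplication) {
    int usableReplicas =
        liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas;
    return usableReplicas >= minReplication;
  }

  public static void main(String[] args) {
    // The deadlock scenario: 0 live replicas, 1 replica on a decommissioning
    // node, dfs.namenode.replication.min = 1. The corrupt replicas can now be
    // invalidated, so the datanodes holding them become valid replication
    // targets and decommissioning can make progress.
    System.out.println(shouldInvalidateCorruptReplicas(0, 1, 0, 1)); // true
  }
}
```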
The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering-maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is, however, not an actual risk, because the decommissioning and entering-maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering-maintenance nodes will be decommissioned prematurely.
If the decommissioningReplicas are in fact corrupt, then because liveReplicas=0 there are no more uncorrupted replicas and the block cannot be recovered. So the additional corrupt replicas will be invalidated, leaving only the decommissioning corrupt replicas, which will contain the corrupted block's data.
How was this patch tested?
Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
This test reproduces a scenario where:
The test then validates that: