
Conversation

@KevinWikant
Contributor

@KevinWikant KevinWikant commented Jun 6, 2022

HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Description of PR

Bug fix for a recurring HDFS bug which can result in datanodes being indefinitely unable to complete decommissioning. In short, the bug is a chicken & egg problem where:

  • in order for a datanode to be decommissioned, its blocks must be sufficiently replicated (which is not true if liveReplicas < "dfs.replication")
  • the datanode cannot sufficiently replicate some block(s) because of corrupt block replicas on target datanodes (i.e. https://issues.apache.org/jira/browse/HDFS-16064)
  • the corrupt block replicas will never be invalidated by the Namenode because the block(s) are not minimally replicated (because liveReplicas < "dfs.namenode.replication.min")

In this scenario, the block(s) actually are minimally replicated, but the logic the Namenode uses to determine whether a block is minimally replicated is flawed. It should instead check whether usableReplicas < "dfs.namenode.replication.min", because decommissioning & entering maintenance nodes can hold valid block replicas which should be replicated to the other datanodes (those with corrupt replicas) when liveReplicas < "dfs.replication".

To understand the bug further we must first establish some background information.

Background Information

Givens:

  • FSDataOutputStream is being used to write the HDFS file; under the hood this uses the DataStreamer class
  • for the sake of example we will say the HDFS file has a replication factor of 2, though this is not a requirement to reproduce the issue
  • the file is being appended to intermittently over an extended period of time (in general, this issue takes minutes/hours to reproduce because the corrupt replicas need to be generated over time)
  • HDFS is configured with typical default configurations

Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case the DataStreamer client shifts the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:

  • DataStreamer client creates a block pipeline on datanodes A & B; each has a block replica with generation stamp 1
  • DataStreamer client tries to append to the block by sending a transfer (with generation stamp 2) to datanode A
  • datanode A succeeds in writing the transfer & then attempts to forward it to datanode B
  • datanode B fails the transfer for some reason and responds with a PipelineAck failure code
  • datanode A sends a PipelineAck to the DataStreamer indicating that datanode A succeeded in the append & datanode B failed. The DataStreamer marks datanode B as a bad link, which will be replaced before the next append operation
  • at this point datanode A has a live replica with generation stamp 2 & datanode B has a corrupt replica with generation stamp 1
  • the next time the DataStreamer tries to append to the block, it calls the Namenode "getAdditionalDatanode" API, which returns some other datanode C
  • the DataStreamer sends a data transfer (with generation stamp 3) to the new block pipeline containing datanodes A & C; the append succeeds on both datanodes
  • the end state is that:
    • datanodes A & C have live replicas with the latest generation stamp 3
    • datanode B has a corrupt replica because it is lagging behind with generation stamp 1

The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanode(s) in the pipeline, a corrupt block replica gets left over on the datanode(s) that were shifted away from.
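
To make the corruption mechanism concrete, here is a minimal sketch in plain Java (an illustration only, not the actual Namenode code; the class and field names are assumptions) showing why a replica whose generation stamp lags behind the block's latest generation stamp is treated as corrupt:

```java
// Illustrative sketch only (hypothetical classes, not Hadoop code): a replica
// reporting an older generation stamp than the Namenode's latest generation
// stamp for the block is considered corrupt.
public class GenerationStampSketch {

  static class StoredBlock {          // the Namenode's view of the block
    final long blockId;
    final long generationStamp;
    StoredBlock(long blockId, long generationStamp) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
    }
  }

  static class ReportedReplica {      // a replica reported by a datanode
    final long blockId;
    final long generationStamp;
    ReportedReplica(long blockId, long generationStamp) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
    }
  }

  static boolean isCorrupt(StoredBlock stored, ReportedReplica reported) {
    return reported.blockId == stored.blockId
        && reported.generationStamp < stored.generationStamp;
  }

  public static void main(String[] args) {
    StoredBlock blk = new StoredBlock(1073741825L, 3);  // latest genstamp after two pipeline shifts
    System.out.println(isCorrupt(blk, new ReportedReplica(1073741825L, 3)));  // datanodes A & C: false
    System.out.println(isCorrupt(blk, new ReportedReplica(1073741825L, 1)));  // datanode B: true
  }
}
```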

This corrupt block replica makes the datanode ineligible as a replication target for the block because of the following exception, described in https://issues.apache.org/jira/browse/HDFS-16064:

2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.

What typically occurs is that these corrupt block replicas are invalidated by the Namenode, which causes the corrupt replica to be deleted on the datanode, thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; this can take up to 6 hours with the default block report interval.

As long as there is at least 1 live replica of the block, all the corrupt replicas should be invalidated based on this condition:
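
Roughly, the existing check can be sketched as follows (an illustrative paraphrase, not the exact BlockManager source; the parameter names are assumptions):

```java
// Illustrative paraphrase of the pre-patch behavior, not the exact BlockManager code:
// a corrupt replica is only scheduled for invalidation if the block still has enough
// *live* replicas to satisfy dfs.namenode.replication.min (1 by default).
static boolean shouldInvalidateCorruptReplica(int liveReplicas, int minReplication) {
  return liveReplicas >= minReplication;
}
```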

When there are 0 live replicas, the corrupt replicas are not invalidated; I assume the reasoning behind this is that it's better to have some corrupt copies of the block than to have no copies at all.

Description of Problem

The problem comes into play when the aforementioned behavior is coupled with decommissioning and/or entering maintenance.

Consider the following scenario:

  • block has replication factor of 2
  • there are 3 datanodes A, B, & C
  • datanode A has a decommissioning replica
  • datanodes B & C have corrupt replicas

This scenario is essentially a replication/decommissioning deadlock because:

  • corrupt replicas on B & C will not be invalidated because there are 0 live replicas (as per Namenode logic)
  • datanode A cannot finish decommissioning until the block is replicated to datanodes B & C
  • the block cannot be replicated to datanodes B & C until their corrupt replicas are invalidated

This does not need to be a deadlock: the corrupt replicas could be invalidated & the live replica could be transferred from A to B & C.

The same problem can exist on a larger scale; the requirements are that:

  • liveReplicas < "dfs.namenode.replication.min" (which is 1 by default)
  • decommissioningAndEnteringMaintenanceReplicas > 0
  • liveReplicas + decommissioningAndEnteringMaintenanceReplicas + corruptReplicas = numberOfDatanodes

In this case the corrupt replicas will not be invalidated by the Namenode, which means that the decommissioning and entering maintenance replicas will never be sufficiently replicated and therefore will never finish decommissioning or entering maintenance.
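
Expressed as a sketch (illustrative only; the method and parameter names are assumptions, not actual Namenode fields), the stuck state is:

```java
// Illustrative predicate for the stuck state described above (not actual Namenode code).
static boolean decommissionIsStuck(int liveReplicas,
    int decommissioningAndEnteringMaintenanceReplicas, int corruptReplicas,
    int numberOfDatanodes, int minReplication) {
  boolean corruptReplicasNeverInvalidated = liveReplicas < minReplication;  // pre-patch check fails
  boolean noHealthyTargetsLeft = liveReplicas
      + decommissioningAndEnteringMaintenanceReplicas
      + corruptReplicas == numberOfDatanodes;                               // every datanode is occupied
  return corruptReplicasNeverInvalidated
      && decommissioningAndEnteringMaintenanceReplicas > 0
      && noHealthyTargetsLeft;
}
```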

The symptom of this issue in the logs is that, right after a node with a corrupt replica sends its block report, the block replica is not invalidated; it just gets counted as a corrupt replica:

2021-12-04 17:20:17,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2021-12-04 17:20:43,533 INFO BlockStateChange (Block report processor): BLOCK* processReport 0x456: from storage DS-XYZ node DatanodeRegistration(10.0.0.2:15076, ...), blocks: 57, hasStaleStorage: false, processing time: 3 msecs, invalidatedBlocks: 0
2021-12-04 17:20:47,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 1, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2021-12-04 17:44:47,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 1, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2021-12-04 17:44:47,293 INFO BlockStateChange (Block report processor): BLOCK* processReport 0x789: from storage DS-XYZ node DatanodeRegistration(10.0.0.3:15076, ...), blocks: 2014, hasStaleStorage: false, processing time: 4 msecs, invalidatedBlocks: 0
2021-12-04 17:45:17,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 10.0.0.3:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...

Proposed Solution

Rather than checking whether "dfs.namenode.replication.min" is satisfied based on liveReplicas, check based on usableReplicas, where "usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas". This allows the corrupt replicas to be invalidated so that the live replica(s) on the decommissioning node(s) can be sufficiently replicated.
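
As a minimal sketch of the idea (an illustration, not the literal patch; the parameter names are assumptions):

```java
// Illustrative sketch of the proposed check, not the literal patch: corrupt replicas
// may be invalidated once the *usable* replicas satisfy dfs.namenode.replication.min.
static boolean shouldInvalidateCorruptReplica(int liveReplicas, int decommissioningReplicas,
    int enteringMaintenanceReplicas, int minReplication) {
  int usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas;
  return usableReplicas >= minReplication;
}
```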

The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.

If the decommissioningReplicas are in fact corrupt, then because liveReplicas = 0 there are no uncorrupted replicas left & the block cannot be recovered. In that case the additional corrupt replicas will be invalidated, leaving only the decommissioning corrupt replicas, which still contain the corrupted block data.

How was this patch tested?

Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"

This test reproduces a scenario where:

  • there are 3 datanodes in the cluster
  • there is a blk_X with replication factor 2
  • blk_X has:
    • 2 corrupt replicas
    • 1 decommissioning replica

The test then validates that:

  • with the BlockManager changes the 2 corrupt replicas are invalidated which allows the decommissioning replica to be sufficiently replicated & the datanode to be decommissioned
    • note that in a MiniDFSCluster when restartDatanode is used, the datanode comes up with a different port number
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.TestDecommission
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.006 s - in org.apache.hadoop.hdfs.TestDecommission
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
2022-06-06 16:59:40,925 [Listener at localhost/64061] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2078)) - Block now has 2 corrupt replicas on [127.0.0.1:64049 , 127.0.0.1:64053] and 1 decommissioning replica on 127.0.0.1:64058
...
2022-06-06 16:59:40,925 [Listener at localhost/64061] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2082)) - Restarting stopped nodes 127.0.0.1:64049 , 127.0.0.1:64053
...
2022-06-06 16:59:42,466 [DataXceiver for client  at /127.0.0.1:64091 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:run(308)) - 127.0.0.1:64087:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:64091 dst: /127.0.0.1:64087; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 already exists in state FINALIZED and thus cannot be created.
...
2022-06-06 16:59:43,173 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64058 , Current Datanode: 127.0.0.1:64058, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 16:59:54,490 [DataXceiver for client  at /127.0.0.1:64093 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:writeBlock(935)) - Received BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 src: /127.0.0.1:64093 dest: /127.0.0.1:64083 volume: /Users/wikak/apache/hadoop/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1 of size 1536
2022-06-06 16:59:54,492 [DataXceiver for client  at /127.0.0.1:64094 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:writeBlock(935)) - Received BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 src: /127.0.0.1:64094 dest: /127.0.0.1:64087 volume: /Users/wikak/apache/hadoop/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3 of size 1536
...
2022-06-06 16:59:58,188 [DatanodeAdminMonitor-0] INFO  blockmanagement.DatanodeAdminManager (DatanodeAdminManager.java:setDecommissioned(301)) - Decommissioning complete for node 127.0.0.1:64058
...
2022-06-06 17:00:01,098 [Listener at localhost/64090] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2147)) - Block now has 2 live replicas on [127.0.0.1:64083 , 127.0.0.1:64087] and 1 decommissioned replica on 127.0.0.1:64058
  • without the BlockManager changes the 2 corrupt replicas are never invalidated, which prevents the decommissioning replica from being sufficiently replicated & prevents the datanode from ever being decommissioned
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.TestDecommission
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.916 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission
[ERROR] testDeleteCorruptReplicaForUnderReplicatedBlock(org.apache.hadoop.hdfs.TestDecommission)  Time elapsed: 66.771 s  <<< FAILURE!
java.lang.AssertionError: Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]
    at org.junit.Assert.fail(Assert.java:89)
    at org.apache.hadoop.hdfs.TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock(TestDecommission.java:2132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:750)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2132 Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]
[INFO] 
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
2022-06-06 17:03:05,174 [Listener at localhost/64165] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2078)) - Block now has 2 corrupt replicas on [127.0.0.1:64153 , 127.0.0.1:64162] and 1 decommissioning replica on 127.0.0.1:64157
...
2022-06-06 17:03:05,174 [Listener at localhost/64165] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2082)) - Restarting stopped nodes 127.0.0.1:64153 , 127.0.0.1:64162
...
2022-06-06 17:03:05,971 [RedundancyMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(489)) - Failed to place enough replicas, still in need of 2 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2022-06-06 17:03:06,977 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64157 127.0.0.1:64183 127.0.0.1:64187 , Current Datanode: 127.0.0.1:64157, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 17:03:39,026 [RedundancyMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(489)) - Failed to place enough replicas, still in need of 2 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2022-06-06 17:03:42,007 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64157 127.0.0.1:64183 127.0.0.1:64187 , Current Datanode: 127.0.0.1:64157, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 17:03:45,336 [Listener at localhost/64190] ERROR hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2131)) - Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [n/a] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?


@KevinWikant
Contributor Author

The failing unit test is:

Failed junit tests  |  hadoop.hdfs.tools.TestDFSAdmin

This seems to be unrelated to my change & just a flaky unit test:

[ERROR] testAllDatanodesReconfig(org.apache.hadoop.hdfs.tools.TestDFSAdmin)  Time elapsed: 5.489 s  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[SUCCESS: Changed property dfs.datanode.peer.stats.enabled]> but was:<[  From: "false"]>
    at org.junit.Assert.assertEquals(Assert.java:117)
    at org.junit.Assert.assertEquals(Assert.java:146)
    at org.apache.hadoop.hdfs.tools.TestDFSAdmin.testAllDatanodesReconfig(TestDFSAdmin.java:1208)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

@hotcodemacha
Contributor

Thanks @KevinWikant for the PR.

We have seen this bug come up a few times in the last few months, and the PR seems to fix it.

Although I am wondering: could there be a case where the block shows up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas()?

An edge case that comes to my mind is when we are considering the block on a decommissioning node as usable but, the very next moment, the node is decommissioned.

@KevinWikant
Contributor Author

KevinWikant commented Jun 14, 2022

Hi @ashutoshcipher thank you for reviewing the PR! I have tried to address your comments below:

I am thinking could there be a case where the block can show up in both liveReplicas and decommissioning, which could lead to any unnecessarily call to invalidateCorruptReplicas()

I am not sure it's possible for a block replica to be reported as both a liveReplica & a decommissioningReplica at the same time. My understanding is that a replica on a decommissioning datanode is counted as a decommissioningReplica & a non-corrupt replica on a live datanode is counted as a liveReplica. So a block replica on a decommissioning node will only be counted towards the decommissioningReplica count & not the liveReplica count. In my experience I have not seen the behavior you are mentioning before; let me know if you have a reference JIRA.

If the case you described is possible, then in theory numUsableReplicas would be greater than it should be. In the typical case where "dfs.namenode.replication.min" = 1 this makes no difference, because even if there is only 1 non-corrupt block replica on 1 decommissioning node the corrupt blocks are invalidated regardless of whether numUsableReplicas = 1 or numUsableReplicas = 2 (due to double counting of the replica as both a liveReplica & a decommissioningReplica). In the case where "dfs.namenode.replication.min" > 1 there could arguably be a difference, because the corrupt replicas would not be invalidated if numUsableReplicas = 1 but would be invalidated if numUsableReplicas = 2 (due to the double counting).

If this scenario is possible, I think the correct fix would be to ensure that each block replica is only counted once, towards either liveReplicas or decommissioningReplicas. Let me know if there is a JIRA for this issue & I can potentially look into that bug fix separately.
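
For illustration, counting each replica into exactly one bucket could be modeled like the sketch below (a hypothetical model, not Hadoop's actual NumberReplicas implementation):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hadoop code: each replica is classified into exactly one
// state, so a single replica can never be counted as both live and decommissioning.
public class ReplicaCountingSketch {

  enum ReplicaState { LIVE, DECOMMISSIONING, ENTERING_MAINTENANCE, CORRUPT }

  static Map<ReplicaState, Integer> countReplicas(List<ReplicaState> replicas) {
    Map<ReplicaState, Integer> counts = new EnumMap<>(ReplicaState.class);
    for (ReplicaState state : ReplicaState.values()) {
      counts.put(state, 0);                 // start every bucket at zero
    }
    for (ReplicaState state : replicas) {
      counts.merge(state, 1, Integer::sum); // one bucket per replica, no double counting
    }
    return counts;
  }
}
```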

Edge case coming to my mind is when the we are considering the block on decommissioning node as useable but the very next moment, the node is decommissioned.

Fair point, I had considered this & mentioned this edge case in the PR description:

The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.

Any replicas on decommissioning nodes will not be decommissioned until there are enough liveReplicas to satisfy the replication factor for the block. It's only possible for decommissioningReplicas to be decommissioned at the same time corruptReplicas are invalidated if there are sufficient liveReplicas to satisfy the replication factor; in that case, because of the liveReplicas, it's safe to eliminate both the decommissioningReplicas & the corruptReplicas. If there is not a sufficient number of liveReplicas, then the decommissioningReplicas will not be decommissioned but the corruptReplicas will be invalidated; the block will not be lost because the decommissioningReplicas will still exist.

One case you could argue is that if:

  • corruptReplicas > 1
  • decommissioningReplicas = 1

By invalidating the corruptReplicas we are exposing the block to a risk of loss should the decommissioningReplica be unexpectedly terminated or fail at around the same time the corruptReplicas are invalidated. This is true, but the same risk already exists where:

  • corruptReplicas > 1
  • liveReplicas = 1
By invalidating the corruptReplicas we are exposing the block to a risk of loss should the liveReplica be unexpectedly terminated or fail at around the same time the corruptReplicas are invalidated.

Thus far, I am unable to think of any scenario where this change introduces an increased risk of data loss. In fact, this change helps improve block redundancy (reducing risk of data loss) in scenarios where a block cannot be sufficiently replicated because corruptReplicas cannot be invalidated.

Member

@aajisaka aajisaka left a comment


@KevinWikant Nice catch, and thank you for your detailed explanation.
Would you remove the unused imports in the test class? I'm +1 if that is addressed.

@ZanderXu
Contributor

@KevinWikant Nice catch +1. I learned a lot from it. thanks~

@KevinWikant
Contributor Author

@ashutoshcipher @aajisaka @ZanderXu really appreciate the reviews on this PR, thank you!

@aajisaka I have removed the unused imports, please let me know if you have any other comments/concerns

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 56s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 25s trunk passed
+1 💚 compile 1m 39s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 1m 31s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 21s trunk passed
+1 💚 mvnsite 1m 40s trunk passed
-1 ❌ javadoc 1m 20s /branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt hadoop-hdfs in trunk failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.
+1 💚 javadoc 1m 44s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 25m 58s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 24s the patch passed
+1 💚 compile 1m 30s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 1m 30s the patch passed
+1 💚 compile 1m 19s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 19s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 2s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 100 unchanged - 0 fixed = 101 total (was 100)
+1 💚 mvnsite 1m 28s the patch passed
-1 ❌ javadoc 1m 0s /patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt hadoop-hdfs in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.
+1 💚 javadoc 1m 30s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 35s the patch passed
+1 💚 shadedclient 26m 0s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 381m 44s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 1s The patch does not generate ASF License warnings.
498m 26s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/Dockerfile
GITHUB PR #4410
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux efcbee072994 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 9dd2660
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/testReport/
Max. process+thread count 1965 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Member

@aajisaka aajisaka left a comment


+1. Thank you @KevinWikant for the update. The Javadoc error in Java 11 is not related to the PR. I'll merge this in this weekend if there is no objection.

@aajisaka
Member

Filed HDFS-16635 to fix javadoc error.

@hotcodemacha
Contributor

Filed HDFS-16635 to fix javadoc error.

@aajisaka Raised PR for the same.

@aajisaka aajisaka merged commit cfceaeb into apache:trunk Jun 20, 2022
@aajisaka
Member

Merged. Thank you @KevinWikant for your contribution and thank you @ashutoshcipher @ZanderXu for your review!

aajisaka pushed a commit that referenced this pull request Jun 20, 2022
…mber of usable replicas (#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit cfceaeb)
aajisaka pushed a commit that referenced this pull request Jun 20, 2022
…mber of usable replicas (#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit cfceaeb)

 Conflicts:
	hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…mber of usable replicas (apache#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
zz12341 pushed a commit to zz12341/hadoop that referenced this pull request Jan 10, 2024
… based on number of usable replicas (apache#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>