
Conversation

@KevinWikant
Contributor

@KevinWikant KevinWikant commented Jun 6, 2022

HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Description of PR

Bug fix for a recurring HDFS bug which can result in datanodes being indefinitely unable to complete decommissioning. In short, the bug is a chicken & egg problem where:

  • in order for a datanode to be decommissioned, its blocks must be sufficiently replicated (which is not true if liveReplicas < "dfs.replication")
  • the datanode cannot sufficiently replicate some block(s) because of corrupt block replicas on target datanodes (i.e. https://issues.apache.org/jira/browse/HDFS-16064)
  • the corrupt block replicas will never be invalidated by the Namenode because the block(s) are not minimally replicated (because liveReplicas < "dfs.namenode.replication.min")

In this scenario, the block(s) actually are minimally replicated, but the logic the Namenode uses to determine whether a block is minimally replicated is flawed. It should instead check whether usableReplicas < "dfs.namenode.replication.min", because decommissioning & entering maintenance nodes can hold valid block replicas which should be replicated to the other datanodes (those with corrupt replicas) when liveReplicas < "dfs.replication".

To understand the bug further we must first establish some background information.

Background Information

Givens:

  • FSDataOutputStream is being used to write the HDFS file; under the hood this uses the DataStreamer class
  • for the sake of example we will say the HDFS file has a replication factor of 2, though this is not a requirement to reproduce the issue
  • the file is being appended to intermittently over an extended period of time (in general, this issue takes minutes/hours to reproduce because the corrupt replicas need to be generated over time)
  • HDFS is configured with typical default configurations

Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case the DataStreamer client shifts the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:

  • DataStreamer client creates a block pipeline on datanodes A & B; each has a block replica with generation stamp 1
  • DataStreamer client tries to append to the block by sending a transfer (with generation stamp 2) to datanode A
  • datanode A succeeds in writing the transfer & then attempts to forward it to datanode B
  • datanode B fails the transfer for some reason and responds with a PipelineAck failure code
  • datanode A sends a PipelineAck to the DataStreamer indicating that datanode A succeeded in the append & datanode B failed. The DataStreamer marks datanode B as a bad link, which will be replaced before the next append operation
  • at this point datanode A has a live replica with generation stamp 2 & datanode B has a corrupt replica with generation stamp 1
  • the next time the DataStreamer tries to append to the block, it calls the Namenode "getAdditionalDatanode" API, which returns some other datanode C
  • the DataStreamer sends a data transfer (with generation stamp 3) to the new block pipeline containing datanodes A & C; the append succeeds on both datanodes
  • the end state is that:
    • datanodes A & C have live replicas with the latest generation stamp 3
    • datanode B has a corrupt replica because it is lagging behind with generation stamp 1

The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanode(s) in the pipeline, a corrupt block replica gets left over on the datanode(s) that were shifted away from.
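
To make the corruption mechanism concrete, here is a minimal sketch in plain Java (an illustration only, not the actual Namenode code; the class and field names are assumptions) showing why a replica whose generation stamp lags behind the block's latest generation stamp is treated as corrupt:

```java
// Illustrative sketch only (hypothetical classes, not Hadoop code): a replica
// reporting an older generation stamp than the Namenode's latest generation
// stamp for the block is considered corrupt.
public class GenerationStampSketch {

  static class StoredBlock {          // the Namenode's view of the block
    final long blockId;
    final long generationStamp;
    StoredBlock(long blockId, long generationStamp) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
    }
  }

  static class ReportedReplica {      // a replica reported by a datanode
    final long blockId;
    final long generationStamp;
    ReportedReplica(long blockId, long generationStamp) {
      this.blockId = blockId;
      this.generationStamp = generationStamp;
    }
  }

  static boolean isCorrupt(StoredBlock stored, ReportedReplica reported) {
    return reported.blockId == stored.blockId
        && reported.generationStamp < stored.generationStamp;
  }

  public static void main(String[] args) {
    StoredBlock blk = new StoredBlock(1073741825L, 3);  // latest genstamp after two pipeline shifts
    System.out.println(isCorrupt(blk, new ReportedReplica(1073741825L, 3)));  // datanodes A & C: false
    System.out.println(isCorrupt(blk, new ReportedReplica(1073741825L, 1)));  // datanode B: true
  }
}
```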

This corrupt block replica makes the datanode ineligible as a replication target for the block because of the following exception, described in https://issues.apache.org/jira/browse/HDFS-16064:

2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.

What typically occurs is that these corrupt block replicas are invalidated by the Namenode, which causes the corrupt replica to be deleted on the datanode, thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; this can take up to 6 hours with the default block report interval.

As long as there is at least 1 live replica of the block, all the corrupt replicas should be invalidated based on this condition:
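
Roughly, the existing check can be sketched as follows (an illustrative paraphrase, not the exact BlockManager source; the parameter names are assumptions):

```java
// Illustrative paraphrase of the pre-patch behavior, not the exact BlockManager code:
// a corrupt replica is only scheduled for invalidation if the block still has enough
// *live* replicas to satisfy dfs.namenode.replication.min (1 by default).
static boolean shouldInvalidateCorruptReplica(int liveReplicas, int minReplication) {
  return liveReplicas >= minReplication;
}
```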

When there are 0 live replicas, the corrupt replicas are not invalidated; I assume the reasoning behind this is that it's better to have some corrupt copies of the block than to have no copies at all.

Description of Problem

The problem comes into play when the aforementioned behavior is coupled with decommissioning and/or entering maintenance.

Consider the following scenario:

  • block has replication factor of 2
  • there are 3 datanodes A, B, & C
  • datanode A has a decommissioning replica
  • datanodes B & C have corrupt replicas

This scenario is essentially a replication/decommissioning deadlock because:

  • corrupt replicas on B & C will not be invalidated because there are 0 live replicas (as per Namenode logic)
  • datanode A cannot finish decommissioning until the block is replicated to datanodes B & C
  • the block cannot be replicated to datanodes B & C until their corrupt replicas are invalidated

This does not need to be a deadlock: the corrupt replicas could be invalidated & the live replica could be transferred from A to B & C.

The same problem can exist on a larger scale; the requirements are that:

  • liveReplicas < "dfs.namenode.replication.min" (which is 1 by default)
  • decommissioningAndEnteringMaintenanceReplicas > 0
  • liveReplicas + decommissioningAndEnteringMaintenanceReplicas + corruptReplicas = numberOfDatanodes

In this case the corrupt replicas will not be invalidated by the Namenode, which means that the decommissioning and entering maintenance replicas will never be sufficiently replicated and therefore will never finish decommissioning or entering maintenance.
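
Expressed as a sketch (illustrative only; the method and parameter names are assumptions, not actual Namenode fields), the stuck state is:

```java
// Illustrative predicate for the stuck state described above (not actual Namenode code).
static boolean decommissionIsStuck(int liveReplicas,
    int decommissioningAndEnteringMaintenanceReplicas, int corruptReplicas,
    int numberOfDatanodes, int minReplication) {
  boolean corruptReplicasNeverInvalidated = liveReplicas < minReplication;  // pre-patch check fails
  boolean noHealthyTargetsLeft = liveReplicas
      + decommissioningAndEnteringMaintenanceReplicas
      + corruptReplicas == numberOfDatanodes;                               // every datanode is occupied
  return corruptReplicasNeverInvalidated
      && decommissioningAndEnteringMaintenanceReplicas > 0
      && noHealthyTargetsLeft;
}
```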

The symptom of this issue in the logs is that, right after a node with a corrupt replica sends its block report, the block replica is not invalidated; it just gets counted as a corrupt replica:

2021-12-04 17:20:17,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2021-12-04 17:20:43,533 INFO BlockStateChange (Block report processor): BLOCK* processReport 0x456: from storage DS-XYZ node DatanodeRegistration(10.0.0.2:15076, ...), blocks: 57, hasStaleStorage: false, processing time: 3 msecs, invalidatedBlocks: 0
2021-12-04 17:20:47,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 1, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2021-12-04 17:44:47,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 1, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2021-12-04 17:44:47,293 INFO BlockStateChange (Block report processor): BLOCK* processReport 0x789: from storage DS-XYZ node DatanodeRegistration(10.0.0.3:15076, ...), blocks: 2014, hasStaleStorage: false, processing time: 4 msecs, invalidatedBlocks: 0
2021-12-04 17:45:17,059 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XYZ_123, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 10.0.0.1:15076 10.0.0.2:15076 10.0.0.3:15076 , Current Datanode: 10.0.0.1:15076, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...

Proposed Solution

Rather than checking whether "dfs.namenode.replication.min" is satisfied based on liveReplicas, check based on usableReplicas, where "usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas". This allows the corrupt replicas to be invalidated so that the live replica(s) on the decommissioning node(s) can be sufficiently replicated.
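
As a minimal sketch of the idea (an illustration, not the literal patch; the parameter names are assumptions):

```java
// Illustrative sketch of the proposed check, not the literal patch: corrupt replicas
// may be invalidated once the *usable* replicas satisfy dfs.namenode.replication.min.
static boolean shouldInvalidateCorruptReplica(int liveReplicas, int decommissioningReplicas,
    int enteringMaintenanceReplicas, int minReplication) {
  int usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas;
  return usableReplicas >= minReplication;
}
```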

The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.

If the decommissioningReplicas are in fact corrupt, then because liveReplicas = 0 there are no uncorrupted replicas left & the block cannot be recovered. In that case the additional corrupt replicas will be invalidated, leaving only the decommissioning corrupt replicas, which still contain the corrupted block data.

How was this patch tested?

Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"

This test reproduces a scenario where:

  • there are 3 datanodes in the cluster
  • there is a blk_X with replication factor 2
  • blk_X has:
    • 2 corrupt replicas
    • 1 decommissioning replica

The test then validates that:

  • with the BlockManager changes the 2 corrupt replicas are invalidated which allows the decommissioning replica to be sufficiently replicated & the datanode to be decommissioned
    • note that in a MiniDFSCluster when restartDatanode is used, the datanode comes up with a different port number
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.TestDecommission
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.006 s - in org.apache.hadoop.hdfs.TestDecommission
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
2022-06-06 16:59:40,925 [Listener at localhost/64061] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2078)) - Block now has 2 corrupt replicas on [127.0.0.1:64049 , 127.0.0.1:64053] and 1 decommissioning replica on 127.0.0.1:64058
...
2022-06-06 16:59:40,925 [Listener at localhost/64061] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2082)) - Restarting stopped nodes 127.0.0.1:64049 , 127.0.0.1:64053
...
2022-06-06 16:59:42,466 [DataXceiver for client  at /127.0.0.1:64091 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:run(308)) - 127.0.0.1:64087:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:64091 dst: /127.0.0.1:64087; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 already exists in state FINALIZED and thus cannot be created.
...
2022-06-06 16:59:43,173 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64058 , Current Datanode: 127.0.0.1:64058, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 16:59:54,490 [DataXceiver for client  at /127.0.0.1:64093 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:writeBlock(935)) - Received BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 src: /127.0.0.1:64093 dest: /127.0.0.1:64083 volume: /Users/wikak/apache/hadoop/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1 of size 1536
2022-06-06 16:59:54,492 [DataXceiver for client  at /127.0.0.1:64094 [Receiving block BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005]] INFO  datanode.DataNode (DataXceiver.java:writeBlock(935)) - Received BP-73610852-192.168.2.16-1654549156791:blk_1073741825_1005 src: /127.0.0.1:64094 dest: /127.0.0.1:64087 volume: /Users/wikak/apache/hadoop/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3 of size 1536
...
2022-06-06 16:59:58,188 [DatanodeAdminMonitor-0] INFO  blockmanagement.DatanodeAdminManager (DatanodeAdminManager.java:setDecommissioned(301)) - Decommissioning complete for node 127.0.0.1:64058
...
2022-06-06 17:00:01,098 [Listener at localhost/64090] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2147)) - Block now has 2 live replicas on [127.0.0.1:64083 , 127.0.0.1:64087] and 1 decommissioned replica on 127.0.0.1:64058
  • without the BlockManager changes the 2 corrupt replicas are never invalidated, which prevents the decommissioning replica from being sufficiently replicated & prevents the datanode from ever being decommissioned
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.TestDecommission
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.916 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission
[ERROR] testDeleteCorruptReplicaForUnderReplicatedBlock(org.apache.hadoop.hdfs.TestDecommission)  Time elapsed: 66.771 s  <<< FAILURE!
java.lang.AssertionError: Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]
    at org.junit.Assert.fail(Assert.java:89)
    at org.apache.hadoop.hdfs.TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock(TestDecommission.java:2132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:750)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2132 Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]
[INFO] 
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
2022-06-06 17:03:05,174 [Listener at localhost/64165] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2078)) - Block now has 2 corrupt replicas on [127.0.0.1:64153 , 127.0.0.1:64162] and 1 decommissioning replica on 127.0.0.1:64157
...
2022-06-06 17:03:05,174 [Listener at localhost/64165] INFO  hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2082)) - Restarting stopped nodes 127.0.0.1:64153 , 127.0.0.1:64162
...
2022-06-06 17:03:05,971 [RedundancyMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(489)) - Failed to place enough replicas, still in need of 2 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2022-06-06 17:03:06,977 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64157 127.0.0.1:64183 127.0.0.1:64187 , Current Datanode: 127.0.0.1:64157, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 17:03:39,026 [RedundancyMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(489)) - Failed to place enough replicas, still in need of 2 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2022-06-06 17:03:42,007 [DatanodeAdminMonitor-0] INFO  BlockStateChange (DatanodeAdminManager.java:logBlockReplicationInfo(375)) - Block: blk_1073741825_1005, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.0.1:64157 127.0.0.1:64183 127.0.0.1:64187 , Current Datanode: 127.0.0.1:64157, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
...
2022-06-06 17:03:45,336 [Listener at localhost/64190] ERROR hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(2131)) - Node 127.0.0.1:64157 failed to complete decommissioning. numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:64157]

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [n/a] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?


@KevinWikant
Contributor Author

The failing unit test is:

Failed junit tests  |  hadoop.hdfs.tools.TestDFSAdmin

This seems to be unrelated to my change & just a flaky unit test:

[ERROR] testAllDatanodesReconfig(org.apache.hadoop.hdfs.tools.TestDFSAdmin)  Time elapsed: 5.489 s  <<< FAILURE!
org.junit.ComparisonFailure: expected:<[SUCCESS: Changed property dfs.datanode.peer.stats.enabled]> but was:<[  From: "false"]>
    at org.junit.Assert.assertEquals(Assert.java:117)
    at org.junit.Assert.assertEquals(Assert.java:146)
    at org.apache.hadoop.hdfs.tools.TestDFSAdmin.testAllDatanodesReconfig(TestDFSAdmin.java:1208)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

@hotcodemacha
Contributor

Thanks @KevinWikant for the PR.

We have seen this bug come up a few times in the last few months, and the PR seems to fix it.

Although I am wondering: could there be a case where the block shows up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas()?

An edge case that comes to my mind is when we are considering the block on a decommissioning node as usable but, the very next moment, the node is decommissioned.

@KevinWikant
Contributor Author

KevinWikant commented Jun 14, 2022

Hi @ashutoshcipher thank you for reviewing the PR! I have tried to address your comments below:

I am thinking could there be a case where the block can show up in both liveReplicas and decommissioning, which could lead to any unnecessarily call to invalidateCorruptReplicas()

I am not sure it's possible for a block replica to be reported as both a liveReplica & a decommissioningReplica at the same time. My understanding is that a replica on a decommissioning datanode is counted as a decommissioningReplica & a non-corrupt replica on a live datanode is counted as a liveReplica. So a block replica on a decommissioning node will only be counted towards the decommissioningReplica count & not the liveReplica count. In my experience I have not seen the behavior you are mentioning before; let me know if you have a reference JIRA.

If the case you described is possible, then in theory numUsableReplicas would be greater than it should be. In the typical case where "dfs.namenode.replication.min" = 1 this makes no difference, because even if there is only 1 non-corrupt block replica on 1 decommissioning node the corrupt blocks are invalidated regardless of whether numUsableReplicas = 1 or numUsableReplicas = 2 (due to double counting of the replica as both a liveReplica & a decommissioningReplica). In the case where "dfs.namenode.replication.min" > 1 there could arguably be a difference, because the corrupt replicas would not be invalidated if numUsableReplicas = 1 but would be invalidated if numUsableReplicas = 2 (due to the double counting).

If this scenario is possible, I think the correct fix would be to ensure that each block replica is only counted once, towards either liveReplicas or decommissioningReplicas. Let me know if there is a JIRA for this issue & I can potentially look into that bug fix separately.
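
For illustration, counting each replica into exactly one bucket could be modeled like the sketch below (a hypothetical model, not Hadoop's actual NumberReplicas implementation):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Hadoop code: each replica is classified into exactly one
// state, so a single replica can never be counted as both live and decommissioning.
public class ReplicaCountingSketch {

  enum ReplicaState { LIVE, DECOMMISSIONING, ENTERING_MAINTENANCE, CORRUPT }

  static Map<ReplicaState, Integer> countReplicas(List<ReplicaState> replicas) {
    Map<ReplicaState, Integer> counts = new EnumMap<>(ReplicaState.class);
    for (ReplicaState state : ReplicaState.values()) {
      counts.put(state, 0);                 // start every bucket at zero
    }
    for (ReplicaState state : replicas) {
      counts.merge(state, 1, Integer::sum); // one bucket per replica, no double counting
    }
    return counts;
  }
}
```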

Edge case coming to my mind is when the we are considering the block on decommissioning node as useable but the very next moment, the node is decommissioned.

Fair point, I had considered this & mentioned this edge case in the PR description:

The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.

Any replicas on decommissioning nodes will not be decommissioned until there are enough liveReplicas to satisfy the replication factor for the block. It's only possible for decommissioningReplicas to be decommissioned at the same time corruptReplicas are invalidated if there are sufficient liveReplicas to satisfy the replication factor; in that case, because of the liveReplicas, it's safe to eliminate both the decommissioningReplicas & the corruptReplicas. If there is not a sufficient number of liveReplicas, then the decommissioningReplicas will not be decommissioned but the corruptReplicas will be invalidated; the block will not be lost because the decommissioningReplicas will still exist.

One case you could argue is that if:

  • corruptReplicas > 1
  • decommissioningReplicas = 1

By invalidating the corruptReplicas we are exposing the block to a risk of loss should the decommissioningReplica be unexpectedly terminated or fail at around the same time the corruptReplicas are invalidated. This is true, but the same risk already exists where:

  • corruptReplicas > 1
  • liveReplicas = 1
By invalidating the corruptReplicas we are exposing the block to a risk of loss should the liveReplica be unexpectedly terminated or fail at around the same time the corruptReplicas are invalidated.

Thus far, I am unable to think of any scenario where this change introduces an increased risk of data loss. In fact, this change helps improve block redundancy (reducing risk of data loss) in scenarios where a block cannot be sufficiently replicated because corruptReplicas cannot be invalidated.

Member

@aajisaka aajisaka left a comment


@KevinWikant Nice catch, and thank you for your detailed explanation.
Would you remove the unused imports in the test class? I'm +1 if that is addressed.

@ZanderXu
Contributor

@KevinWikant Nice catch +1. I learned a lot from it. thanks~

@KevinWikant
Contributor Author

@ashutoshcipher @aajisaka @ZanderXu really appreciate the reviews on this PR, thank you!

@aajisaka I have removed the unused imports, please let me know if you have any other comments/concerns

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 56s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 25s trunk passed
+1 💚 compile 1m 39s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 1m 31s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 21s trunk passed
+1 💚 mvnsite 1m 40s trunk passed
-1 ❌ javadoc 1m 20s /branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt hadoop-hdfs in trunk failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.
+1 💚 javadoc 1m 44s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 25m 58s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 24s the patch passed
+1 💚 compile 1m 30s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 1m 30s the patch passed
+1 💚 compile 1m 19s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 19s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 2s /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 100 unchanged - 0 fixed = 101 total (was 100)
+1 💚 mvnsite 1m 28s the patch passed
-1 ❌ javadoc 1m 0s /patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt hadoop-hdfs in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.
+1 💚 javadoc 1m 30s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 35s the patch passed
+1 💚 shadedclient 26m 0s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 381m 44s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 1s The patch does not generate ASF License warnings.
498m 26s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/Dockerfile
GITHUB PR #4410
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux efcbee072994 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 9dd2660
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/testReport/
Max. process+thread count 1965 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Member

@aajisaka aajisaka left a comment


+1. Thank you @KevinWikant for the update. The Javadoc error in Java 11 is not related to the PR. I'll merge this in this weekend if there is no objection.

@aajisaka
Member

Filed HDFS-16635 to fix javadoc error.

@hotcodemacha
Contributor

Filed HDFS-16635 to fix javadoc error.

@aajisaka Raised PR for the same.

@aajisaka aajisaka merged commit cfceaeb into apache:trunk Jun 20, 2022
@aajisaka
Member

Merged. Thank you @KevinWikant for your contribution and thank you @ashutoshcipher @ZanderXu for your review!

aajisaka pushed a commit that referenced this pull request Jun 20, 2022
…mber of usable replicas (#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit cfceaeb)
aajisaka pushed a commit that referenced this pull request Jun 20, 2022
…mber of usable replicas (#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
(cherry picked from commit cfceaeb)

 Conflicts:
	hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…mber of usable replicas (apache#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>
zz12341 pushed a commit to zz12341/hadoop that referenced this pull request Jan 10, 2024
… based on number of usable replicas (apache#4410)

Co-authored-by: Kevin Wikant <[email protected]>
Signed-off-by: Akira Ajisaka <[email protected]>