HDFS-17358. EC: infinite lease recovery caused by the length of RWR equals to zero or datanode does not have the replica.#6509
|
💔 -1 overall
This message was automatically generated. |
475304b to
8ef1574
Compare
|
@Hexiaoqiao @zhangshuyan0 @tomscut Sir, could you please take a look at this PR when you have free time? Thanks a lot. |
zhangshuyan0
left a comment
Great catch! I also reproduced this bug in my testing environment. If the client crashes when the amount of data written is less than 6 cells (using RS-6-3 as an example), the file can never be closed.
If you could add a UT, that would be even better.
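To make the "less than 6 cells" condition concrete, here is a minimal sketch of the stripe arithmetic. The class and method names are hypothetical, and the 1 MiB cell size is an assumption (it matches the RS-6-3-1024k policy); this is not code from the patch. It counts how many of the 6 data blocks hold any bytes after a given amount of data has been written.

```java
public class StripeSketch {
  /**
   * Number of data blocks in a block group that hold at least one byte,
   * assuming cells are filled round-robin across the data blocks.
   */
  static int nonEmptyDataBlocks(long bytesWritten, int cellSize, int dataBlocks) {
    if (bytesWritten <= 0) {
      return 0;
    }
    // Number of cells touched so far (ceiling division).
    long cells = (bytesWritten + cellSize - 1) / cellSize;
    return (int) Math.min(cells, dataBlocks);
  }

  public static void main(String[] args) {
    int cell = 1024 * 1024; // assumed cell size
    // Slightly more than 2 cells written: only 3 of the 6 data blocks are non-empty.
    System.out.println(nonEmptyDataBlocks(2L * cell + 1, cell, 6));
  }
}
```

When fewer than 6 cells have been written, at least one data block in the group is still empty, which is the situation described in the comment above.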
|
Thanks a lot for responding, sir. I will add a UT soon.
|
If info.getNumBytes==0, then safeLength may be 0 as well? |
Hi, sir. Thanks for your review. Yes, if info.getNumBytes==0, then safeLength is 0. I agree with using safeLength as the condition because it is more readable. |
Yeah, I mean: is it possible to check whether safeLength is 0 and, if so, directly call the logic that deletes this block? |
8ef1574 to
5f436ea
Compare
96ecedc to
6c1081a
Compare
6c1081a to
64d91bc
Compare
4619de2 to
37b289f
Compare
36eae19 to
3a54b3d
Compare
|
@zhangshuyan0 @tasanuma @Hexiaoqiao Sir, code is ready for reviewing. Could you please help me review this PR when you have free time? Thanks a lot. |
3a54b3d to
3eeacd6
Compare
    safeLength = getSafeLength(syncBlocks);
  } else {
    safeLength = 0;
    LOG.warn("Block recovery: More than {} datanodes do not have the replica of block {}." +
Suggest printing out the value of zeroLenReplicaCnt as well.
Fixed it. Thanks sir.
What does this "More than" mean?
It seems unnecessary here, so I have removed it. Originally, it was more than (9 - 6) datanodes ...
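The branch discussed in this thread can be modeled roughly as follows. This is a hedged sketch, not the committed patch: the class, the `chooseSafeLength` method, the use of a plain `long[]` of replica lengths, and the exact threshold comparison are all assumptions made for illustration, and `safeLength` here is a simplified stand-in for the real `getSafeLength` (it takes the last stripe for which at least 6 internal blocks have a full cell). The idea is that when too many replicas are empty or missing, recovery falls back to a safe length of 0 instead of failing.

```java
import java.util.Arrays;

public class SafeLengthSketch {
  static final int TOTAL_BLK_NUM = 9; // RS-6-3: 6 data + 3 parity (assumed policy)
  static final int DATA_BLK_NUM = 6;

  /** Simplified stand-in for getSafeLength: bytes covered by full stripes. */
  static long safeLength(long[] replicaLengths, int cellSize) {
    long[] sorted = Arrays.copyOf(replicaLengths, replicaLengths.length);
    Arrays.sort(sorted);
    // The DATA_BLK_NUM-th largest length bounds the last recoverable full stripe.
    long lastFullStripe = sorted[sorted.length - DATA_BLK_NUM] / cellSize;
    return lastFullStripe * DATA_BLK_NUM * cellSize;
  }

  /** Hypothetical decision: fall back to 0 when too many replicas are unusable. */
  static long chooseSafeLength(long[] replicaLengths, int cellSize) {
    int zeroLenReplicaCnt = 0;
    for (long len : replicaLengths) {
      if (len <= 0) { // zero-length or missing replica (modeling assumption)
        zeroLenReplicaCnt++;
      }
    }
    if (zeroLenReplicaCnt <= TOTAL_BLK_NUM - DATA_BLK_NUM) {
      return safeLength(replicaLengths, cellSize);
    }
    // More than (9 - 6) unusable replicas: no full stripe can be recovered.
    return 0;
  }
}
```

With 4 of 9 replicas empty, `chooseSafeLength` returns 0 in this model, so recovery can finish and the file can be closed at length 0 rather than retrying forever.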
zhangshuyan0
left a comment
+1. LGTM. Let's wait for yetus.
|
LGTM +1. |
|
The result of the CI is here. https://ci-hadoop.apache.org/blue/organizations/jenkins/hadoop-multibranch/detail/PR-6509/26/tests |
|
@zhangshuyan0 @haiyang1987 @tasanuma @tomscut Sir, I have updated the unit test. Please check it again. |
|
Good catch! The changes look good to me. Let's wait for Jenkins.
5c79b13 to
b0b1ef7
Compare
|
The failed UTs all passed in my local environment. |
|
Committed to trunk. Thanks for your contributions! @hfutatzhanghb @haiyang1987 @tomscut |
|
@zhangshuyan0 Hi, I closed the issue HDFS-17358. |
…quals to zero or datanode does not have the replica. (apache#6509). Contributed by farmmamba. Reviewed-by: Tao Li <tomscut@apache.org> Reviewed-by: Haiyang Hu <haiyang.hu@shopee.com> Signed-off-by: Shuyan Zhang <zhangshuyan@apache.org>
Description of PR
Refer to HDFS-17358.
Recently, a strange case occurred on our EC production cluster.
The symptom is as follows: the NameNode performs lease recovery indefinitely on some EC files (~80K+), and those files can never be closed.
After digging into the logs and related code, we found that the root cause is the code below in method
BlockRecoveryWorker$RecoveryTaskStriped#recover:
The related logs are as below:
Because the length of the RWR replica is zero, the length of the returned object in the code below is zero, so we can't put it into syncBlocks.
As a result, the checkLocations method throws an exception.
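The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class and the `collectSyncBlocks` and `recoverOnce` helpers are hypothetical, replicas are reduced to their lengths, and `checkLocations` is simplified to a count check against 6 data blocks (RS-6-3). It is not the actual Hadoop code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RecoveryFailureSketch {
  static final int DATA_BLK_NUM = 6; // RS-6-3 (assumed policy)

  /** Zero-length replicas are skipped, so they never reach syncBlocks. */
  static List<Long> collectSyncBlocks(long[] replicaLengths) {
    List<Long> syncBlocks = new ArrayList<>();
    for (long len : replicaLengths) {
      if (len > 0) {
        syncBlocks.add(len);
      }
    }
    return syncBlocks;
  }

  /** Simplified check: recovery needs at least DATA_BLK_NUM usable replicas. */
  static void checkLocations(int locationCount) throws IOException {
    if (locationCount < DATA_BLK_NUM) {
      throw new IOException(
          "not enough internal blocks (current: " + locationCount + ")");
    }
  }

  /** Returns true if one recovery attempt passes the location check. */
  static boolean recoverOnce(long[] replicaLengths) {
    try {
      checkLocations(collectSyncBlocks(replicaLengths).size());
      return true;
    } catch (IOException e) {
      // The attempt fails the same way every time, so retries never converge.
      return false;
    }
  }
}
```

For example, if only 2 data blocks and the 3 parity blocks have data, `collectSyncBlocks` yields 5 entries, the check fails, and every retry fails identically, which corresponds to the endless lease recovery reported here.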