Skip to content

Conversation

@slfan1989
Copy link
Contributor

@slfan1989 slfan1989 commented Jun 7, 2022

JIRA:HDFS-16624. Fix org.apache.hadoop.hdfs.tools.TestDFSAdmin#testAllDatanodesReconfig ERROR.

(#4406) found an error message during Junit unit testing, as follows:

expected:<[SUCCESS: Changed property dfs.datanode.peer.stats.enabled]> but was:<[ From: "false"]>

After code debugging, it was found that there was an error in the selection outs.get(2) of the assertion(1208), index should be equal to 1.

Please see jira for details.

Introduce code in #4264

@slfan1989
Copy link
Contributor Author

@virajjasani @tomscut please help to review the code.

@tomscut tomscut self-requested a review June 7, 2022 01:27
@virajjasani
Copy link
Contributor

Thanks for the PR @slfan1989
Is this test consistently failing or is it more of a flaky one?

@slfan1989
Copy link
Contributor Author

slfan1989 commented Jun 7, 2022

Thanks for the PR @slfan1989 Is this test consistently failing or is it more of a flaky one?

Hi @virajjasani ,
This error looks like a consistent failing, I can reproduce it by running multiple times locally, and 2 tests report as failure in #4406.

The link to the test report is as follows:
report1
reprot2

@slfan1989
Copy link
Contributor Author

slfan1989 commented Jun 7, 2022

That's surprising indeed. One of the recent PRs, I see this test passing:

  1. https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4402/1/testReport/org.apache.hadoop.hdfs.tools/TestDFSAdmin/testAllDatanodesReconfig/
  2. https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4402/2/testReport/org.apache.hadoop.hdfs.tools/TestDFSAdmin/testAllDatanodesReconfig/

But you can look at the screenshot of my local debugger(in jira HDFS-16624), the index is indeed inaccurate.
I suggest to wait for the pr's junit test to finish running and take a look at the result report.

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 12m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 38m 44s trunk passed
+1 💚 compile 1m 45s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 1m 34s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 19s trunk passed
+1 💚 mvnsite 1m 36s trunk passed
+1 💚 javadoc 1m 20s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 1m 51s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 44s trunk passed
+1 💚 shadedclient 23m 18s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 26s the patch passed
+1 💚 compile 1m 28s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 1m 28s the patch passed
+1 💚 compile 1m 18s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 18s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 2s the patch passed
+1 💚 mvnsite 1m 28s the patch passed
+1 💚 javadoc 0m 57s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 1m 31s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 27s the patch passed
+1 💚 shadedclient 23m 1s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 252m 18s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 13s The patch does not generate ASF License warnings.
375m 23s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/1/artifact/out/Dockerfile
GITHUB PR #4412
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 25c503b8420f 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 13422ba
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/1/testReport/
Max. process+thread count 3270 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@slfan1989
Copy link
Contributor Author

slfan1989 commented Jun 7, 2022

I already know the cause of the problem, because the log is out of order due to multi-threading.

I compared the logs of several execution failures, as follows, after comparison, it is found that the log order cannot be
guaranteed:

PR-4406:Report1

2022-06-06 06:15:37,324 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Reconfiguring status for node [127.0.0.1:44161]: started at Mon Jun 06 06:15:34 UTC 2022Reconfiguring status for node [127.0.0.1:38331]:  and finished at Mon Jun 06 06:15:34 UTC 2022.
2022-06-06 06:15:37,324 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - started at Mon Jun 06 06:15:34 UTC 2022SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 06:15:37,324 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) -  and finished at Mon Jun 06 06:15:34 UTC 2022.
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 06:15:37,325 [Listener at localhost/38331] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Retrieval of reconfiguration status successful on 2 nodes, failed on 0 nodes.

PR-4406:Report2

2022-06-06 21:29:20,263 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:testAllDatanodesReconfig(1205)) - dfsadmin -status -livenodes output:
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Reconfiguring status for node [127.0.0.1:38959]: started at Mon Jun 06 21:29:17 UTC 2022 and finished at Mon Jun 06 21:29:17 UTC 2022.
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Reconfiguring status for node [127.0.0.1:44079]: SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - started at Mon Jun 06 21:29:17 UTC 2022 and finished at Mon Jun 06 21:29:17 UTC 2022.
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 21:29:20,264 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 21:29:20,265 [Listener at localhost/38959] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Retrieval of reconfiguration status successful on 2 nodes, failed on 0 nodes.

PR-4410:Report1

2022-06-06 17:00:03,792 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:testAllDatanodesReconfig(1205)) - dfsadmin -status -livenodes output:
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Reconfiguring status for node [127.0.0.1:41421]: started at Mon Jun 06 17:00:00 UTC 2022Reconfiguring status for node [127.0.0.1:32857]:  and finished at Mon Jun 06 17:00:00 UTC 2022.
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - started at Mon Jun 06 17:00:00 UTC 2022SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) -  and finished at Mon Jun 06 17:00:00 UTC 2022.
2022-06-06 17:00:03,794 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - SUCCESS: Changed property dfs.datanode.peer.stats.enabled
2022-06-06 17:00:03,795 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	From: "false"
2022-06-06 17:00:03,795 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - 	To: "true"
2022-06-06 17:00:03,795 [Listener at localhost/41421] INFO  tools.TestDFSAdmin (TestDFSAdmin.java:lambda$testAllDatanodesReconfig$0(1206)) - Retrieval of reconfiguration status successful on 2 nodes, failed on 0 nodes.

It is the disorder caused by the following code.

...
for (DatanodeInfo node : nodes) {
        executorService.submit(() -> {
          int status = startReconfiguration(nodeType, node.getIpcAddr(false), out, err);
          if (status == 0) {
            successCount.incrementAndGet();
          } else {
            failCount.incrementAndGet();
          }
        });
      }
... 

Therefore, using a fixed-order assertion judgment will occasionally succeed and occasionally fail.

improve proposals:

int getReconfigurationStatus(final String nodeType, final String address, final PrintStream out,
      final PrintStream err) {
    String outMsg = null;
    String errMsg = null;
    ReconfigurationTaskStatus status = null;

    try {
      status = getReconfigurationStatusDispatch(nodeType, address, out, err);
      outMsg = String.format("Reconfiguring status for node [%s]: ", address);
    } catch (IOException e) {
      errMsg = String.format("Node [%s] reloading configuration: %s.", address,
          e.toString());
    }
    .// log printing is performed synchronously to ensure that it can be output in order.
    synchronized(this) {
        if (errMsg != null) {
          err.println(errMsg);
          return 1;
        } else {
          out.print(outMsg);
        }
    
       ....
    }
    return 0;
  }

@virajjasani @tomscut Is it ok for me to do this? Or can you give me some advice?

if (!status.stopped()) {
out.println(" and is still running.");
return 0;
synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slfan1989 This is good for concurrency control but we should avoid this for performance issues. Rather, we can update test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your suggestion, I will modify the junit test.

Comment on lines 1208 to 1210
assertEquals("SUCCESS: Changed property dfs.datanode.peer.stats.enabled", outs.get(1));
assertEquals("\tFrom: \"false\"", outs.get(2));
assertEquals("\tTo: \"true\"", outs.get(3));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that concurrency is at play here, we can do something like this:

    assertTrue("SUCCESS: Changed property dfs.datanode.peer.stats.enabled".equals(outs.get(2))
        || "SUCCESS: Changed property dfs.datanode.peer.stats.enabled".equals(outs.get(1)));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will modify the code.

@virajjasani
Copy link
Contributor

Good find btw @slfan1989

@slfan1989
Copy link
Contributor Author

@virajjasani @tomscut please help me review the code again.

@slfan1989
Copy link
Contributor Author

Good find btw @slfan1989

The code you contributed is very useful, I have learned a lot from you, thank you again!

Copy link
Contributor

@tomscut tomscut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. @virajjasani Please also take a look. Thanks.

Copy link
Contributor

@virajjasani virajjasani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (non-binding)

@virajjasani
Copy link
Contributor

Thank you for the nice fix @slfan1989 and for the review @tomscut !!

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 17m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 36m 50s trunk passed
+1 💚 compile 1m 45s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 1m 35s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 26s trunk passed
+1 💚 mvnsite 1m 46s trunk passed
+1 💚 javadoc 1m 24s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 1m 46s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 41s trunk passed
+1 💚 shadedclient 23m 10s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 23s the patch passed
+1 💚 compile 1m 27s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 1m 27s the patch passed
+1 💚 compile 1m 20s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 1m 20s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 8s the patch passed
+1 💚 mvnsite 1m 33s the patch passed
+1 💚 javadoc 1m 4s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 1m 42s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 52s the patch passed
+1 💚 shadedclient 25m 32s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 386m 30s hadoop-hdfs in the patch passed.
+1 💚 asflicense 1m 14s The patch does not generate ASF License warnings.
515m 24s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/2/artifact/out/Dockerfile
GITHUB PR #4412
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 22af71d9fc96 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c0b129a
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/2/testReport/
Max. process+thread count 2891 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4412/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@tomscut tomscut changed the title HDFS-16624. Fix org.apache.hadoop.hdfs.tools.TestDFSAdmin#testAllDatanodesReconfig ERROR. HDFS-16624. Fix flaky unit test TestDFSAdmin#testAllDatanodesReconfig Jun 9, 2022
@tomscut
Copy link
Contributor

tomscut commented Jun 9, 2022

Thanks @slfan1989 for your contribution! Thanks @virajjasani for your review.

@tomscut tomscut merged commit 2ac01b2 into apache:trunk Jun 9, 2022
tomscut pushed a commit to tomscut/hadoop that referenced this pull request Jun 9, 2022
tomscut pushed a commit that referenced this pull request Jun 9, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants