@steveloughran
Contributor

The fix for HADOOP-18456/IMPALA-11592 is, if our hypothesis is correct,
in WeakReferenceMap.create() where a strong reference to the new value
is kept in a local variable and referred to later so that the JVM will not GC it.

Description of PR

WeakReferenceMap.create() is now resilient to a GC taking place during
its creation process.

Local variables were renamed to show when refs are strong vs. weak.
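
To make the idea above concrete, here is a minimal sketch of a create() method that holds a strong reference in a local variable and refers to it again after the map lookup. It is an illustration only, not the Hadoop implementation: the class WeakRefMapSketch, its field names, the log text, and the fallback to the locally held instance when the entry cannot be resolved are all assumptions.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

import static java.util.Objects.requireNonNull;

/** Hypothetical, simplified stand-in for a weak-reference map; not the Hadoop class. */
class WeakRefMapSketch<K, V> {
  private final Map<K, WeakReference<V>> map = new ConcurrentHashMap<>();
  private final Function<K, V> factory;

  WeakRefMapSketch(Function<K, V> factory) {
    this.factory = requireNonNull(factory, "factory");
  }

  /** Create and register a value, holding a strong reference throughout. */
  V create(K key) {
    // Strong reference: because this local is used again below,
    // the new value stays reachable while the map only holds a weak reference.
    final V strongRef = requireNonNull(factory.apply(key), "factory returned null");
    map.put(key, new WeakReference<>(strongRef));

    // Read the entry back; another thread may have replaced it in the meantime.
    final WeakReference<V> retrievedWeakRef = map.get(key);
    V resolvedStrongRef = retrievedWeakRef == null ? null : retrievedWeakRef.get();
    if (resolvedStrongRef == null) {
      // Assumption for this sketch: fall back to the instance we still hold.
      resolvedStrongRef = strongRef;
    }

    // This comparison is the later use that keeps strongRef live up to this point.
    if (strongRef != resolvedStrongRef) {
      System.out.println("Created instance for key " + key + " was overwritten");
    }
    return resolvedStrongRef;
  }
}
```

In a single-threaded call the method returns the instance the factory just produced; under contention it may return another thread's instance for the same key.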

How was this patch tested?

There's a new unit test; otherwise the change was verified by code review.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Change-Id: I29929965c31c8e6b2ab9a491fbadb40871f10c3d
@steveloughran
Contributor Author

s3a testing in progress

Contributor

@mehakmeet mehakmeet left a comment

Looks good, really like the descriptive comments to make it clear.

In terms of testing, I was wondering if we could call Runtime.getRuntime().gc() between two calls to threadMap.getForCurrentThread(), without storing the returned ref, and then assert that the count of entries created is 2.

Something like this maybe? I'm not sure what implications this Runtime.getRuntime().gc() call would have on the rest of the tests though; it should be fine, I think.

threadMap.getForCurrentThread();
Runtime.getRuntime().gc();
threadMap.getForCurrentThread();

Assertions.assertThat(threadMap.getEntriesCreatedCount()).isEqualTo(2);
Assertions.assertThat(threadMap.getReferenceLostCount()).isEqualTo(1);

.describedAs("current thread map value on second set")
.isEqualTo("hello");

// it is forbidden to explictly set to null via the set() call.
Contributor

nit: typo "explictly"

.isNull();

// second attempt returns itself
Assertions.assertThat(threadMap.setForCurrentThread("hello"))
Contributor

little doubt here: what happens if we set it to "hello2" this time, does the set return "hello" or "hello2"?

Can you add the next assert to be of a different value than "hello", just to confirm if set actually returns the previous set value?

Contributor Author

ok

Contributor Author

added it at the end, to show the dynamic value is returned on the overwrite...easiest place to add
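
The question in this thread (does set return the old or the new value?) can be illustrated with a small sketch. It assumes, in line with the "@return any old value, possibly null" javadoc quoted in this review, that the setter returns whatever was stored before the overwrite. The class ThreadMapSketch and its internals are hypothetical and are not the class under review.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-thread map sketch; not the Hadoop class. */
class ThreadMapSketch<V> {
  private final Map<Long, WeakReference<V>> map = new ConcurrentHashMap<>();

  /** Set the value for the current thread; returns the previously set value, or null. */
  V setForCurrentThread(V newVal) {
    Objects.requireNonNull(newVal, "null values are not allowed");
    // Map.put() hands back the entry that was replaced, if there was one.
    WeakReference<V> old =
        map.put(Thread.currentThread().getId(), new WeakReference<>(newVal));
    // The old referent may already have been collected, so this can be null.
    return old == null ? null : old.get();
  }
}
```

With this sketch, setting "hello" and then "hello2" on the same thread returns null for the first call and "hello" for the second, which is the kind of assertion the reviewer asked for.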

}

/**
/**y
Contributor

nit: typo

// note if there was any change in the reference.
// as this forces strongRef to be kept in scope
if (strongRef != resolvedStrongRef) {
LOG.debug("Created instance for key {}: {} overwritten by {}",
Contributor

if this is the case, shouldn't we raise an exception? Are we not returning the wrong value then?

Contributor Author

no, it's just addressing a race condition...the reason we do the lookup is to ensure that all threads share the same instance. the exact choice of which one is not considered relevant

/**
* Set the new value for the current thread.
* @param newVal new reference to set for the active thread.
* @return any old value, possibly null
Contributor

nit: "any old value", or should this be "previously set value"?

computation from any live thread."
*/

final V strongRef = requireNonNull(factory.apply(key));
Contributor

Add a message to requireNonNull() for the case where the factory returns a null instance.
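
A sketch of what that suggestion could look like, using the two-argument java.util.Objects.requireNonNull overload. The helper class, method name, and message text are assumptions made for illustration.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical helper showing a null check with a descriptive message. */
class NullFactoryCheckSketch {
  static <K, V> V createChecked(Function<K, V> factory, K key) {
    // The message ends up in the NullPointerException if the factory returns null.
    return Objects.requireNonNull(factory.apply(key),
        "factory returned a null instance for key " + key);
  }
}
```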

@steveloughran
Contributor Author

problem with the gc stuff is what it took to trigger it in the earlier test for that...you need to make a lot of attempts and put memory load on the system before that gc() call does anything. it's more of a request than a command, and the JVM is free to ignore it
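
A small sketch of the deterministic half of that point: System.gc() is only a hint, so a test cannot rely on it clearing a weak reference, but a referent that is still strongly referenced and used afterwards cannot be collected. The class and method names are hypothetical.

```java
import java.lang.ref.WeakReference;

/** Hypothetical demo: a weak reference survives gc() while a strong reference is in use. */
class GcHintSketch {
  /** @return true if the weak reference still resolves to the strongly held object. */
  static boolean survivesWhileStronglyHeld() {
    Object strongRef = new Object();
    WeakReference<Object> weakRef = new WeakReference<>(strongRef);
    // A request, not a command: the JVM may or may not collect anything here.
    System.gc();
    // strongRef is used after the gc() call, so the referent is still reachable.
    return weakRef.get() == strongRef;
  }
}
```

The reverse case, dropping the strong reference and expecting weakRef.get() to return null after gc(), is the part that is not guaranteed, which is why it is brittle in a test.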

@steveloughran
Contributor Author

test failure is already fixed in #4900

Change-Id: I5aa94d2c4dc6640ab08e5d531ae9b650575d0cc7
Change-Id: I4b434ffc495ce0a4f5c1dea0047983fa6348b642
@steveloughran
Contributor Author

merged with trunk to fix the failure.

one test failure against s3 london, -Dparallel-tests -DtestsThreadCount=10

[INFO] Running org.apache.hadoop.fs.s3a.commit.ITestCommitOperationCost
[ERROR] Tests run: 44, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 79.159 s <<< FAILURE! - in org.apache.hadoop.fs.contract.s3a.ITestS3AContractVectoredRead
[ERROR] testStopVectoredIoOperationsUnbuffer[Buffer type : direct](org.apache.hadoop.fs.contract.s3a.ITestS3AContractVectoredRead)  Time elapsed: 1.26 s  <<< FAILURE!
java.lang.AssertionError: Expected an exception of type class java.io.InterruptedIOException
        at org.apache.hadoop.test.LambdaTestUtils.intercept(LambdaTestUtils.java:409)
        at org.apache.hadoop.fs.contract.s3a.ITestS3AContractVectoredRead.testStopVectoredIoOperationsUnbuffer(ITestS3AContractVectoredRead.java:143)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
        at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:750)

anyone else seen this? race condition maybe?

Tried to create a real GC, but System.gc() wouldn't do it, and
it is too brittle to safely use in a test case anyway

Change-Id: I9f3eadaa75125bb481f193ae29752483a434bf8a
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 59s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 25s trunk passed
+1 💚 compile 25m 27s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 23m 39s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 41s trunk passed
+1 💚 mvnsite 2m 11s trunk passed
+1 💚 javadoc 1m 48s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 17s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 26s trunk passed
+1 💚 shadedclient 27m 4s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 6s the patch passed
+1 💚 compile 24m 44s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 24m 44s the patch passed
+1 💚 compile 22m 8s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 22m 8s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 16s /results-checkstyle-hadoop-common-project_hadoop-common.txt hadoop-common-project/hadoop-common: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
+1 💚 mvnsite 1m 51s the patch passed
+1 💚 javadoc 1m 17s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 57s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 2m 56s the patch passed
+1 💚 shadedclient 26m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 37s hadoop-common in the patch passed.
+1 💚 asflicense 1m 10s The patch does not generate ASF License warnings.
231m 8s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/4/artifact/out/Dockerfile
GITHUB PR #4909
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 114c5bcce017 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 7579bef
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/4/testReport/
Max. process+thread count 3134 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Sep 20, 2022
@apache apache deleted a comment from hadoop-yetus Sep 20, 2022
@apache apache deleted a comment from hadoop-yetus Sep 20, 2022
Change-Id: Ibc7e2f2134c8bf43412cdbf9821649ed53ef81d5
@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 44m 8s trunk passed
+1 💚 compile 27m 26s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 24m 23s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 1m 27s trunk passed
+1 💚 mvnsite 2m 6s trunk passed
+1 💚 javadoc 1m 34s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 0s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 15s trunk passed
+1 💚 shadedclient 28m 34s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 11s the patch passed
+1 💚 compile 28m 3s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 28m 3s the patch passed
+1 💚 compile 25m 51s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 25m 51s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 34s the patch passed
+1 💚 mvnsite 2m 3s the patch passed
+1 💚 javadoc 1m 25s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 5s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 3m 24s the patch passed
+1 💚 shadedclient 29m 18s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 39s hadoop-common in the patch passed.
+1 💚 asflicense 1m 17s The patch does not generate ASF License warnings.
249m 52s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/5/artifact/out/Dockerfile
GITHUB PR #4909
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 19c7685402c7 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6670c04
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/5/testReport/
Max. process+thread count 1284 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4909/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@mukund-thakur mukund-thakur left a comment

LGTM +1.

@steveloughran steveloughran merged commit 0676495 into apache:trunk Sep 23, 2022
asfgit pushed a commit that referenced this pull request Sep 23, 2022
This problem surfaced in impala integration tests
   IMPALA-11592. TestLocalCatalogRetries.test_fetch_metadata_retry fails in S3 build
after the change
  HADOOP-17461. Add thread-level IOStatistics Context
The actual GC race condition came with
 HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

The fix for this is, if our hypothesis is correct, in WeakReferenceMap.create()
where a strong reference to the new value is kept in a local variable
*and referred to later* so that the JVM will not GC it.

Along with the fix, extra assertions ensure that if the problem is not fixed,
applications will fail faster/more meaningfully.

Contributed by Steve Loughran.
Contributor

@dannycjones dannycjones left a comment

Already merged but...

+1 (non-binding), lgtm. I had one question but no concern.

Comment on lines +218 to +220
// resolve that reference, handling the situation where somehow it was removed from the map
// between the put() and the get()
resolvedStrongRef = resolve(retrievedWeakRef);
Contributor

If we have a strong reference on L207, why do we expect to lose it?

I could see it happening in the old create(K key) implementation, but less so here.

HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
…4909)


5 participants