Skip to content

Conversation

@SteNicholas
Copy link
Member

@SteNicholas SteNicholas commented Mar 21, 2024

What changes were proposed in this pull request?

Introduce JVM profiling JVMProfier in Celeborn Worker using async-profiler to capture CPU and memory profiles.

Why are the changes needed?

async-profiler is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from Safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the AsyncGetCallTrace API provided by HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profier plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support to turn profiling on/off, includes the jar/binaries needed for profiling.

Backport [SPARK-46094] Support Executor JVM Profiling.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Worker cluster test.

@SteNicholas SteNicholas force-pushed the CELEBORN-1299 branch 3 times, most recently from f0a0e42 to 708b323 Compare March 21, 2024 11:36
@codecov
Copy link

codecov bot commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 49.03%. Comparing base (a371f93) to head (dc629d2).
Report is 2 commits behind head on main.

Files Patch % Lines
...cala/org/apache/celeborn/common/CelebornConf.scala 85.72% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2409      +/-   ##
==========================================
+ Coverage   48.85%   49.03%   +0.18%     
==========================================
  Files         209      209              
  Lines       13102    13123      +21     
  Branches     1134     1134              
==========================================
+ Hits         6400     6433      +33     
+ Misses       6282     6272      -10     
+ Partials      420      418       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SteNicholas
Copy link
Member Author

SteNicholas commented Mar 21, 2024

cc @mridulm, @RexXiong. PTAL.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks good to me, thanks for working on this !
Why are we targeting only worker ?

@SteNicholas
Copy link
Member Author

SteNicholas commented Mar 22, 2024

@mridulm, thanks for review. IMO, most scenarios only need coding profiling of Celeborn Worker in production practice, therefore worker coding profiling is given priority to support. If there is strong need to target master, I would like to support in future.
cc @RexXiong.

@SteNicholas SteNicholas force-pushed the CELEBORN-1299 branch 3 times, most recently from cc5edb1 to 6933a70 Compare March 22, 2024 08:17
@mridulm
Copy link
Contributor

mridulm commented Mar 22, 2024

As the fleet size increases (both compute cluster (20k+ nodes) and Celeborn cluster (2k+ workers) ), master scaling is also a challenge.
We dont necessarily need to add support for master in this PR, as long as adding that support is not a challenge

@SteNicholas
Copy link
Member Author

@mridulm, I agree with the above point of view. Thanks for your suggestion.
@RexXiong, do you have other comments?

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! merge to main(v0.5.0)

@RexXiong RexXiong closed this in 73cf156 Mar 25, 2024
* Linux (x64)
* Linux (arm 64)
* Linux (musl, x64)
* MacOS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"macOS" - the case is matter, let's respect the official product name https://support.apple.com/macos


HikariCP/4.0.3//HikariCP-4.0.3.jar
RoaringBitmap/0.9.32//RoaringBitmap-0.9.32.jar
ap-loader-all/3.0-8//ap-loader-all-3.0-8.jar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LICENSE/NOTICE are missed

Code profiling is currently only supported for

* Linux (x64)
* Linux (arm 64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arm64, don't split it out.

.doc("Local file system path on worker where profiler output is saved. "
+ "Defaults to the working directory of the worker process.")
.stringConf
.createWithDefault(".")
Copy link
Member

@pan3793 pan3793 Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if possible, please choose a default value independent of the working dir, the user likely uses the absolute path of the script to start the process under an arbitrary dir, which would cause indeterminate results. one solution is use ${CELEBORN_HOME}/profiling and resolve the ${CELEBORN_HOME} in def workerJvmProfilerLocalDir: String

| celeborn.worker.internal.port | 0 | false | Internal server port on the Worker where the master nodes connect. | 0.5.0 | |
| celeborn.worker.jvmProfiler.enabled | false | false | Turn on code profiling via async_profiler in workers. | 0.5.0 | |
| celeborn.worker.jvmProfiler.localDir | . | false | Local file system path on worker where profiler output is saved. Defaults to the working directory of the worker process. | 0.5.0 | |
| celeborn.worker.jvmProfiler.options | event=wall,interval=10ms,alloc=2m,lock=10ms,chunktime=300s | false | Options to pass on to the async profiler. | 0.5.0 | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where can I find all the option candidates and valid combinations? A detailed description or a hyperlink is required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants