[CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler #2409
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2409      +/-   ##
==========================================
+ Coverage   48.85%   49.03%   +0.18%
==========================================
  Files         209      209
  Lines       13102    13123      +21
  Branches     1134     1134
==========================================
+ Hits         6400     6433      +33
+ Misses       6282     6272      -10
+ Partials      420      418       -2
mridulm left a comment
The change itself looks good to me, thanks for working on this!
Why are we targeting only the worker?
As the fleet size increases, for both the compute cluster (20k+ nodes) and the Celeborn cluster (2k+ workers), master scaling is also a challenge.
LGTM, thanks! Merge to main (v0.5.0).
* Linux (x64)
* Linux (arm 64)
* Linux (musl, x64)
* MacOS
"macOS": case matters, so let's respect the official product name: https://support.apple.com/macos
HikariCP/4.0.3//HikariCP-4.0.3.jar
RoaringBitmap/0.9.32//RoaringBitmap-0.9.32.jar
ap-loader-all/3.0-8//ap-loader-all-3.0-8.jar
The LICENSE/NOTICE entries are missing.
Code profiling is currently only supported for

* Linux (x64)
* Linux (arm 64)
Use "arm64"; don't split it.
.doc("Local file system path on worker where profiler output is saved. "
  + "Defaults to the working directory of the worker process.")
.stringConf
.createWithDefault(".")
If possible, please choose a default value that is independent of the working directory. Users often launch the process from an arbitrary directory using the absolute path of the start script, so a relative default would make the output location unpredictable. One solution is to use ${CELEBORN_HOME}/profiling and resolve ${CELEBORN_HOME} in def workerJvmProfilerLocalDir: String.
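A minimal sketch of the suggestion above, written in Java for illustration only. The class `ProfilerDirSketch` and the method `resolveLocalDir` are hypothetical names, not part of the Celeborn code base, and the fallback to the working directory when `CELEBORN_HOME` is unset is an assumption rather than something this PR specifies.

```java
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical helper, not Celeborn code: picks a profiler output directory
// that does not depend on where the worker process happened to be started.
final class ProfilerDirSketch {

    static String resolveLocalDir(String configured, Map<String, String> env) {
        // An explicitly configured directory always wins.
        if (configured != null && !configured.isEmpty()) {
            return configured;
        }
        String home = env.get("CELEBORN_HOME");
        if (home == null || home.isEmpty()) {
            // Assumed fallback: an absolute path under the current working directory.
            return Paths.get("profiling").toAbsolutePath().normalize().toString();
        }
        // Suggested default: ${CELEBORN_HOME}/profiling
        return Paths.get(home, "profiling").toString();
    }

    public static void main(String[] args) {
        // Resolve against the real process environment with no explicit setting.
        System.out.println(resolveLocalDir(null, System.getenv()));
    }
}
```

Passing the environment in as a map keeps the resolution logic easy to unit test without touching the real process environment.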
| celeborn.worker.internal.port | 0 | false | Internal server port on the Worker where the master nodes connect. | 0.5.0 |  |
| celeborn.worker.jvmProfiler.enabled | false | false | Turn on code profiling via async_profiler in workers. | 0.5.0 |  |
| celeborn.worker.jvmProfiler.localDir | . | false | Local file system path on worker where profiler output is saved. Defaults to the working directory of the worker process. | 0.5.0 |  |
| celeborn.worker.jvmProfiler.options | event=wall,interval=10ms,alloc=2m,lock=10ms,chunktime=300s | false | Options to pass on to the async profiler. | 0.5.0 |  |
Where can I find all the available options and their valid combinations? A detailed description or a hyperlink is required.
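For orientation, the default value of celeborn.worker.jvmProfiler.options shown in the table is a comma-separated list of `key=value` tokens. The sketch below only splits such a string so that individual settings can be inspected or logged. It does not validate anything against async-profiler, whose own documentation remains the authority on which options exist and how they combine. `ProfilerOptionsSketch` is a hypothetical name used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Celeborn code: splits a comma-separated
// profiler options string into ordered key/value pairs.
final class ProfilerOptionsSketch {

    static Map<String, String> parse(String options) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String token : options.split(",")) {
            String t = token.trim();
            if (t.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = t.indexOf('=');
            if (eq < 0) {
                parsed.put(t, ""); // bare flag without a value
            } else {
                parsed.put(t.substring(0, eq), t.substring(eq + 1));
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        String defaults = "event=wall,interval=10ms,alloc=2m,lock=10ms,chunktime=300s";
        parse(defaults).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```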
What changes were proposed in this pull request?
Introduce JVM profiling (JVMProfiler) in Celeborn Worker using async-profiler to capture CPU and memory profiles.
Why are the changes needed?
async-profiler is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the AsyncGetCallTrace API provided by the HotSpot JVM to profile Java code paths, and Linux’s perf_events to profile native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that adds no overhead unless enabled and accepts profiler arguments as a configuration parameter. It supports turning profiling on and off and includes the jar/binaries needed for profiling.
Backport [SPARK-46094] Support Executor JVM Profiling.
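To illustrate the "no overhead unless enabled" behavior described above, here is a rough Java sketch of how the three settings could be combined into async-profiler command strings. It assumes async-profiler's comma-separated command format (`start,...` and `stop,...`). The class name `ProfilerCommandSketch`, the output file name `profile.jfr`, and the overall layout are assumptions for illustration and may differ from what this PR actually implements.

```java
import java.util.Optional;

// Hypothetical sketch, not Celeborn code: builds profiler command strings
// from the enabled flag, the options string, and the local output directory.
final class ProfilerCommandSketch {

    // Returns the start command, or empty when profiling is disabled, so a
    // caller never loads or touches the profiler in that case.
    static Optional<String> startCommand(boolean enabled, String options, String localDir) {
        if (!enabled) {
            return Optional.empty();
        }
        // "profile.jfr" is an assumed file name.
        return Optional.of("start," + options + ",file=" + localDir + "/profile.jfr");
    }

    static String stopCommand(String options, String localDir) {
        return "stop," + options + ",file=" + localDir + "/profile.jfr";
    }

    public static void main(String[] args) {
        String options = "event=wall,interval=10ms,alloc=2m,lock=10ms,chunktime=300s";
        startCommand(true, options, "/tmp/profiling").ifPresent(System.out::println);
        System.out.println(stopCommand(options, "/tmp/profiling"));
    }
}
```

In a real worker these strings would be handed to async-profiler, for example through the `execute(String)` method of its Java API. That step is left out here so the sketch runs without the native library.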
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Worker cluster test.