[SPARK-45209][CORE][UI] Flame Graph Support For Executor Thread Dump Page #42988
Conversation
Looks cool. cc @mridulm FYI

awesome!
The UI looks nice! Thanks for working on this @yaooqinn :-) My main concern is around effectively capturing stack frames without safepoint bias, correlating them to the specific task, and, for the executor flame graph, which threads to capture. @HyukjinKwon might remember what we had built for Safari, but we did that outside the scope of Spark due to limitations (which might not be relevant now, to be honest). This has been called out of scope - but details around what we are planning to support (in the first cut) would be great to understand.

This PR mainly focuses on the UI, independent of the profiling steps. The "Pending features" list in the PR description below covers what we might have in the future.
One important aspect of flame graphs is the semantics of the "width" of the bars. It can be defined to mean anything, e.g. aggregated profiling ticks (i.e. number of samples), wall clock duration, etc. What's the intended semantics of "width" here for the thread dump snapshot?
Currently, the Java stack traces are just a single snapshot, not samples collected at a certain frequency over a period.
SPARK-45151 added the ability to fetch a stack dump, not flame graph support - and more importantly, from the point of view of flame graphs and performance analysis, the stack dump returned suffers from safepoint bias. If this is an initial implementation that we plan to refine over the 4.0 release, that sounds fine to me - at a minimum, users/deployments should have the ability to override how the stack dump - for flame graphs, not SPARK-45151 - is generated.
We can iterate more on this when we have a PR for it.
+CC @thejdeep as well.

In the current implementation in this PR, the width essentially represents the "number of threads" that share the same bottom portion of the stack, from a single snapshot of the thread dump, right?
@rednaxelafx Yes
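To make that "width = number of threads" semantics concrete, the sketch below (in Scala, purely illustrative; FlameNode and FlameTreeBuilder are hypothetical names, not code from this PR) folds one snapshot of per-thread stacks into the kind of nested tree a flame graph renderer consumes, where each node's value counts the threads whose stacks share that prefix:

```scala
import scala.collection.mutable

// Hypothetical node model: "value" counts the threads whose stacks pass
// through this frame; "children" are the frames invoked from it.
case class FlameNode(
    name: String,
    var value: Int = 0,
    children: mutable.Map[String, FlameNode] = mutable.Map.empty)

object FlameTreeBuilder {
  // Each stack is ordered from the outermost frame (e.g. Thread.run)
  // to the innermost one, matching the bottom-up layout of a flame graph.
  def build(stacks: Seq[Seq[String]]): FlameNode = {
    val root = FlameNode("root")
    stacks.foreach { stack =>
      root.value += 1
      var node = root
      stack.foreach { frame =>
        node = node.children.getOrElseUpdate(frame, FlameNode(frame))
        node.value += 1
      }
    }
    root
  }
}
```

Rendered from such a tree, a bar's width is node.value relative to root.value, i.e. the share of threads in that single snapshot whose stacks contain that frame sequence.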
@yaooqinn Looks beautiful!

Thank you @beliefer
LuciferYang
left a comment
+1, LGTM
Very cool feature; it should be very helpful to frontline maintenance engineers.
Merged into master for Spark 4.0.
This is a great feature! Can we add a config to turn it on/off? I'm just worried that a possible bug might stop the UI from working.
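For illustration only, such a switch could follow Spark's usual internal config pattern. ConfigBuilder is Spark-internal, so an entry like this would live in Spark's own config objects; the key name, default, and version below are assumptions, not a setting introduced by this PR:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical feature flag (key, default, and version are illustrative only).
// The thread dump pages would render the flame graph only when this is true.
object FlameGraphConf {
  val UI_FLAMEGRAPH_ENABLED =
    ConfigBuilder("spark.ui.threadDump.flamegraphEnabled")
      .doc("Whether to render a flame graph on the executor thread dump page.")
      .version("4.0.0")
      .booleanConf
      .createWithDefault(true)
}
```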
What changes were proposed in this pull request?
This PR draws a CPU Flame Graph from Java stack traces for executors and drivers. Currently, the Java stack traces are just a single snapshot, not samples collected at a certain frequency over a period. Sampling might be considered an upcoming feature outside the scope of this PR.
If you are new to flame graphs, there are some references you can consult to learn about the basic concepts and details.
[1] Flame Graphs: https://www.brendangregg.com/flamegraphs.html
[2] FLIP-165: Operator's Flame Graphs: https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs
[3] Java in Flames, by Netflix Technology Blog: https://netflixtechblog.com/java-in-flames-e763b3d32166
[4] HProf: https://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html
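As background on how such a snapshot can be obtained and prepared for a flame graph, the sketch below (purely illustrative, not the code added by this PR; ThreadDumpFolder is a made-up name) takes one thread dump via the JVM's ThreadMXBean and folds it into collapsed-stack lines of the form "frame;frame;frame count", the input format popularized by Brendan Gregg's flamegraph.pl:

```scala
import java.lang.management.ManagementFactory

// Hypothetical helper: take one snapshot of all thread stacks and fold it into
// collapsed-stack lines, where the trailing count is the number of threads
// sharing that exact stack in this single snapshot.
object ThreadDumpFolder {
  def foldedStacks(): Seq[String] = {
    val infos = ManagementFactory.getThreadMXBean.dumpAllThreads(false, false)
    infos.toSeq
      .filter(_.getStackTrace.nonEmpty)
      .map { info =>
        // Reverse so the outermost frame comes first; flame graphs grow bottom-up.
        info.getStackTrace.reverse
          .map(f => s"${f.getClassName}.${f.getMethodName}")
          .mkString(";")
      }
      .groupBy(identity)
      .map { case (stack, occurrences) => s"$stack ${occurrences.size}" }
      .toSeq
  }
}
```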
Pending features
This PR mainly focuses on the UI, independent of the profiling steps. What we might have in the future are:
- Flame graph support for the task thread dump page that SPARK-45151 added
- Add a ProfilingExecutor(max, interval) message to profile a whole executor
- Add a ProfileTask(taskId, max, interval) message to profile a certain task (a rough sketch of these messages follows this list)
- Different views for on/off/full CPUs
- Mixed-mode profiling, which might rely upon some external libs at runtime
- And so on.
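Since ProfilingExecutor and ProfileTask are only named here as pending ideas, the following is just a guess at their shape and at the sampling loop they could trigger; only the two message names come from this description, and the fields and the ProfilingSampler helper are hypothetical:

```scala
import java.util.concurrent.TimeUnit
import scala.collection.mutable.ArrayBuffer

// Hypothetical message shapes; field names are illustrative guesses.
case class ProfilingExecutor(maxSamples: Int, intervalMs: Long)
case class ProfileTask(taskId: Long, maxSamples: Int, intervalMs: Long)

// Sketch of a sampling loop such a message could trigger on an executor:
// take maxSamples snapshots of all thread stacks, intervalMs apart, and
// accumulate them so the flame graph aggregates samples rather than one dump.
object ProfilingSampler {
  def run(request: ProfilingExecutor)(takeSnapshot: () => Seq[Seq[String]]): Seq[Seq[String]] = {
    val samples = ArrayBuffer.empty[Seq[String]]
    (1 to request.maxSamples).foreach { _ =>
      samples ++= takeSnapshot()
      TimeUnit.MILLISECONDS.sleep(request.intervalMs)
    }
    samples.toSeq
  }
}
```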
Why are the changes needed?

Performance is always an important design factor in Spark. It is desirable to provide better visibility into the distribution of CPU resources while executing user code alongside the Spark kernel. One of the most visually effective means to do that is Flame Graphs, which visually present the data gathered by performance profiling tools used by developers to tune the performance of their applications.
Does this PR introduce any user-facing change?
yes
How was this patch tested?
locally
Was this patch authored or co-authored using generative AI tooling?
no