[SPARK-45209][CORE][UI] Flame Graph Support For Executor Thread Dump Page #42988

yaooqinn · 2023-09-19T03:02:50Z

What changes were proposed in this pull request?

This PR draws a CPU Flame Graph from Java stack traces for executors and drivers. Currently, the Java stack trace data is just a single SNAPSHOT, not a series of samples taken at a certain frequency over a period. Sampling might be considered an upcoming feature and is out of the scope of this PR.
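To make the snapshot versus sampling distinction concrete, here is a minimal sketch using only the JDK. It assumes `Thread.getAllStackTraces()` as the source of stacks; the class `StackSampler` and its methods are hypothetical names for illustration and are not taken from this PR or from Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration, not code from this PR. */
final class StackSampler {

    /** One snapshot: the current stack of every live thread, as in a thread dump. */
    static Map<Thread, StackTraceElement[]> snapshot() {
        return Thread.getAllStackTraces();
    }

    /** Sampling: take `samples` snapshots, pausing `intervalMillis` between them. */
    static List<Map<Thread, StackTraceElement[]>> sample(int samples, long intervalMillis) {
        List<Map<Thread, StackTraceElement[]>> result = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            result.add(snapshot());
            if (i < samples - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    // Keep the interrupt flag set and stop sampling early.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("threads in one snapshot: " + snapshot().size());
        System.out.println("snapshots collected: " + sample(5, 100).size());
    }
}
```

With a single snapshot, every thread contributes exactly one stack; a sampler like the one sketched here would let frequently observed frames accumulate weight over time.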

If you are new to flame graphs, the following references cover the basic concepts and details.

[1] Flame Graphs
[2] FLIP-165: Operator's Flame Graphs
[3] Java in Flames. mixed-mode flame graphs provide a… | by Netflix Technology Blog
[4] HProf

Pending features

This PR mainly focuses on the UI, independent of the profiling steps. What we might have in the future are:

- Flame Graph Support For Task Thread Page, which SPARK-45151 added
- Add `ProfilingExecutor(max, interval)` message to profile the whole executor
- Add `ProfileTask(taskId, max, interval)` message to profile a certain task
- Different views for on/off/full CPUs
- Mixed mode profiling, which might rely upon some ext libs at runtime
- And so on.

Why are the changes needed?

Performance is always an important design factor in Spark. It is desirable to provide better visibility into the distribution of CPU resources while executing user code alongside the Spark kernel. One of the most visually effective means of doing that is Flame Graphs, which visualize the data gathered by the profiling tools that developers use to tune the performance of their applications.

Does this PR introduce any user-facing change?

yes

How was this patch tested?

locally

Was this patch authored or co-authored using generative AI tooling?

no

HyukjinKwon · 2023-09-19T03:32:56Z

Looks cool. cc @mridulm FYI

zhengruifeng · 2023-09-19T03:33:13Z

awesome!

mridulm · 2023-09-19T04:03:29Z

The UI looks nice ! Thanks for working on this @yaooqinn :-)

My main concern is around effectively capturing stack frames without safepoint bias, correlating them with the specific task, and deciding which threads to capture for the executor flamegraph. @HyukjinKwon might remember what we had built for Safari, but we did that outside of the scope of Spark due to limitations (which might not be relevant now, to be honest).

This has been called out of scope, but details around what we are planning to support (in the first cut) would be great to understand.

yaooqinn · 2023-09-19T04:22:19Z

This PR mainly focuses on the UI, independent of the profiling steps. What we might have in the future are:

- Flame Graph Support For Task Thread Page, which SPARK-45151 added
- Add `ProfilingExecutor(max, interval)` message to profile the whole executor, which returns the same data structure as the `TriggerThreadDump` message
- Add `ProfileTask(taskId, max, interval)` message to profile a specific task, which returns the same data structure as the `TaskThreadDump` message
- Different views for on/off/full CPUs
- Mixed mode profiling, which might rely upon some ext libs at runtime
- And so on.

rednaxelafx · 2023-09-19T05:07:22Z

One important aspect of flame graphs is the semantics of the "width" of the bars. It can be defined to mean anything, e.g. aggregated profiling ticks (i.e. number of samples) or wall clock duration etc.

What's the intended semantics of "width" here for the thread dump snapshot?

yaooqinn · 2023-09-19T07:13:41Z

Currently, the width represents the number of samples. The CPU time is not gathered yet; see also SPARK-44896.

mridulm · 2023-09-19T15:48:57Z

> This PR mainly focuses on the UI, independent of the profiling steps. What we might have in the future are:
>
> - Flame Graph Support For Task Thread Page, which SPARK-45151 added

SPARK-45151 added the ability to fetch a stack dump, not flame graph support. More importantly, from the point of view of flamegraphs and performance analysis, the stack dump it returns suffers from safepoint bias.

If this is an initial implementation, which we plan to refine over the 4.0 release, that sounds fine to me. At a minimum, users/deployments should have the ability to override how the stack dump (for flamegraphs, not SPARK-45151) is generated.

> - Add `ProfilingExecutor(max, interval)` message to profile the whole executor, which returns the same data structure as the `TriggerThreadDump` message

We can iterate more on this when we have a PR for it.
Thanks for the details.

> - Add `ProfileTask(taskId, max, interval)` message to profile a specific task, which returns the same data structure as the `TaskThreadDump` message
> - Different views for on/off/full CPUs
> - Mixed mode profiling, which might rely upon some ext libs at runtime
> - And so on.

mridulm · 2023-09-19T15:50:06Z

+CC @thejdeep as well.

rednaxelafx · 2023-09-20T00:28:36Z

> Currently, the width represents the number of samples.

In the current implementation in this PR, the width essentially represents the number of threads that share the same bottom portion of the stack, taken from a single snapshot of the thread dump, right?

yaooqinn · 2023-09-20T01:47:01Z

@rednaxelafx Yes
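The width semantics confirmed above can be sketched as follows. This is a minimal illustration assuming stacks in the JDK's `StackTraceElement[]` form (top of stack first); the class `SnapshotFlameGraph`, its `Node` type, and the `fold` method are hypothetical names, not the data structures used by this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not code from this PR: folds one thread-dump snapshot into a tree. */
final class SnapshotFlameGraph {

    /** One bar of the flame graph: a frame name, its width, and its callees. */
    static final class Node {
        final String frame;
        int width; // number of threads whose stack passes through this frame
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String frame) {
            this.frame = frame;
        }
    }

    /**
     * Builds the tree from one snapshot. StackTraceElement arrays list the top of
     * the stack first, so each one is walked in reverse to start from the bottom.
     */
    static Node fold(Iterable<StackTraceElement[]> stacks) {
        Node root = new Node("root");
        for (StackTraceElement[] stack : stacks) {
            root.width++;
            Node current = root;
            for (int i = stack.length - 1; i >= 0; i--) {
                String name = stack[i].getClassName() + "." + stack[i].getMethodName();
                current = current.children.computeIfAbsent(name, Node::new);
                current.width++; // one more thread shares this bottom portion
            }
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = fold(Thread.getAllStackTraces().values());
        System.out.println("threads in snapshot: " + root.width);
        for (Node child : root.children.values()) {
            System.out.println(child.frame + " " + child.width);
        }
    }
}
```

Because each thread appears once in a snapshot, a bar's width here counts threads, not time; two threads parked in the same pool method yield a bar of width 2 regardless of how long they have been there.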

beliefer · 2023-09-22T06:13:03Z

@yaooqinn Looks beautiful!

yaooqinn · 2023-09-22T07:59:11Z

Thank you @beliefer

LuciferYang

+1, LGTM
Very cool feature; it should be very helpful to frontline maintenance engineers.

LuciferYang · 2023-10-25T10:31:21Z

Merged into master for Spark 4.0.
Thanks @yaooqinn
Thanks for your review, @HyukjinKwon @zhengruifeng @rednaxelafx @mridulm @beliefer

cloud-fan · 2023-11-06T06:47:10Z

This is a great feature! Can we add a config to turn it on/off? I'm just worried that a possible bug could stop the UI from working.

…Page

### What changes were proposed in this pull request?

This PR draws a CPU Flame Graph from Java stack traces for executors and drivers. Currently, the Java stack trace data is just a single SNAPSHOT, not a series of samples taken at a certain frequency over a period. Sampling might be considered an upcoming feature and is out of the scope of this PR.

![fg git](https://github.com/apache/spark/assets/8326978/c3f99a1a-78ee-4adb-be1f-e4afd5f307b7)

If you are new to flame graphs, the following references cover the basic concepts and details.

[1] [Flame Graphs](https://www.brendangregg.com/flamegraphs.html)
[2] [FLIP-165: Operator's Flame Graphs](https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs)
[3] [Java in Flames. mixed-mode flame graphs provide a… | by Netflix Technology Blog](https://netflixtechblog.com/java-in-flames-e763b3d32166)
[4] [HProf](https://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html)

#### Pending features

This PR mainly focuses on the UI, independent of the profiling steps. What we might have in the future are:

- Flame Graph Support For Task Thread Page, which SPARK-45151 added
- Add `ProfilingExecutor(max, interval)` message to profile the whole executor
- Add `ProfileTask(taskId, max, interval)` message to profile a certain task
- Different views for on/off/full CPUs
- Mixed mode profiling, which might rely upon some ext libs at runtime
- And so on.

### Why are the changes needed?

Performance is always an important design factor in Spark. It is desirable to provide better visibility into the distribution of CPU resources while executing user code alongside the Spark kernel. One of the most visually effective means of doing that is [Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html), which visualize the data gathered by the profiling tools that developers use to tune the performance of their applications.

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

locally

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#42988 from yaooqinn/SPARK-45209.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
(cherry picked from commit a073bf3)

github-actions bot added WEB UI CORE labels Sep 19, 2023

yaooqinn changed the title ~~[WIP][SPARK-45209][CORE][UI] Flame Graph Support For Executor Thread Dump Page~~ [SPARK-45209][CORE][UI] Flame Graph Support For Executor Thread Dump Page Sep 20, 2023

yaooqinn added 4 commits October 13, 2023 10:28

[WIP][SPARK-45209] Flame Graph Support For Executor Thread Dump Page

d6abd99

nit

520809c

nit

8221f3b

nit

3df1524

yaooqinn force-pushed the SPARK-45209 branch from 9bedda1 to 3df1524 Compare October 13, 2023 02:28

Merge branch 'master' into SPARK-45209

11b8a03

LuciferYang approved these changes Oct 18, 2023

yaooqinn added 2 commits October 23, 2023 17:07

Merge branch 'master' into SPARK-45209

dda99e6

Merge branch 'master' into SPARK-45209

b24d075

LuciferYang closed this in a073bf3 Oct 25, 2023

yaooqinn deleted the SPARK-45209 branch November 6, 2023 08:40

yaooqinn mentioned this pull request Nov 6, 2023

[SPARK-45804][UI] Add spark.ui.threadDump.flamegraphEnabled config to switch flame graph on/off #43674

Closed

cloud-fan mentioned this pull request May 6, 2025

[SPARK-52010] Do not generate API docs for internal classes #50797

Closed
