[SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing #23258
@@ -18,6 +18,7 @@
package org.apache.spark.sql.execution.metric

import java.io.File
import java.util.regex.Pattern

import scala.collection.mutable.HashMap
@@ -40,6 +41,26 @@ trait SQLMetricsTestUtils extends SQLTestUtils {

  protected def statusStore: SQLAppStatusStore = spark.sharedState.statusStore

  protected val bytes = "([0-9]+(\\.[0-9]+)?) (EiB|PiB|TiB|GiB|MiB|KiB|B)"

  protected val duration = "([0-9]+(\\.[0-9]+)?) (ms|s|m|h)"

  // "\n96.2 MiB (32.1 MiB, 32.1 MiB, 32.1 MiB)"
  protected val sizeMetricPattern = Pattern.compile(s"\\n$bytes \\($bytes, $bytes, $bytes\\)")

  // "\n2.0 ms (1.0 ms, 1.0 ms, 1.0 ms)"
  protected val timingMetricPattern =
    Pattern.compile(s"\\n$duration \\($duration, $duration, $duration\\)")

  /**
   * Generate a function to check the specified pattern.
   *
   * @param pattern a pattern
   * @return a function to check the specified pattern
   */
  protected def checkPattern(pattern: Pattern): (Any => Boolean) = {
    (in: Any) => pattern.matcher(in.toString).matches()
  }

  /**
   * Get execution metrics for the SQL execution and verify metrics values.
   *
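For context, a minimal standalone sketch of how these patterns and checkPattern behave on a rendered metric string (the demo object and the sample values are invented for illustration):

import java.util.regex.Pattern

object MetricPatternDemo extends App {
  // Same regex fragments as in the diff above.
  val duration = "([0-9]+(\\.[0-9]+)?) (ms|s|m|h)"
  val timingMetricPattern =
    Pattern.compile(s"\\n$duration \\($duration, $duration, $duration\\)")

  // A rendered timing metric value: a leading newline, then "total (min, med, max)" stats.
  val rendered = "\n2.0 ms (1.0 ms, 1.0 ms, 1.0 ms)"

  // Equivalent of checkPattern(timingMetricPattern).
  val check: Any => Boolean = (in: Any) => timingMetricPattern.matcher(in.toString).matches()

  assert(check(rendered))   // a well-formed stats string matches
  assert(!check("2.0 ms"))  // a bare value without the stats does not
  println("pattern checks passed")
}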
@@ -198,6 +219,32 @@ trait SQLMetricsTestUtils extends SQLTestUtils {
      }
    }
  }

  /**
   * Call `df.collect()` and verify if the collected metrics satisfy the specified predicates.
   *
   * @param df `DataFrame` to run
   * @param expectedNumOfJobs number of jobs that will run
   * @param expectedMetricsPredicates the expected metrics predicates. The format is
Contributor
nit: go to 100 chars, and the next line has bad indentation.
Contributor (Author)
Usually metric values are numbers, so for metric values predicates can be more natural than regular expressions, which are better suited to text matching. For simple metric values, helper functions are not needed. However, timing and size metric values are a little more complex: with helper functions, we extract the stats and can check the actual values. BTW, maybe timing and size metric values should be stored in a more structured way rather than in pure text format (even with "\n" in the values).
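For context, a sketch of the kind of stats-extracting predicate helper under discussion (the name follows the timingMetricAllStatsShould mentioned below; the group indices and the decision to ignore units are assumptions made for illustration):

  // Hypothetical helper: parse "\n2.0 ms (1.0 ms, 1.0 ms, 1.0 ms)" and apply a
  // predicate to each of the four numeric stats (total, min, med, max).
  // Units are ignored here purely for illustration.
  protected def timingMetricAllStatsShould(p: Double => Boolean): Any => Boolean = {
    (in: Any) => {
      val m = timingMetricPattern.matcher(in.toString)
      // Groups 1, 4, 7 and 10 are the numeric parts of the four durations.
      m.matches() && Seq(1, 4, 7, 10).map(g => m.group(g).toDouble).forall(p)
    }
  }

  // e.g. "every stat is non-negative":
  // "sort time" -> timingMetricAllStatsShould(_ >= 0.0)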
Contributor (Author)
Yes, the indentation is not right. I have fixed it in the new commit.
Contributor
My point is: as of now, pattern matching is enough for what we need to check, and we do not have a use case where we actually need to parse the exact values. Doing that, we can simplify this PR and considerably reduce the size of this change, so I think we should go this way. If in the future we need something like what you proposed here because we want to check the actual values, we can introduce the methods you are suggesting then. But as of now this can be skipped, IMO.
Member
This does look like a load of additional code that I think duplicates some existing code in Utils. Is it really necessary just to make some basic assertions about metric values?
Contributor (Author)
@mgaido91 I agree, and thanks for your detailed and clear explanation. Checking metric values does make things unnecessarily complex. @srowen As @mgaido91 said, it is currently not necessary to check metric values; pattern matching is enough, and we can eliminate these methods. As for code duplication, the methods here do not duplicate the code in Utils.
Contributor (Author)
Hi, I have switched to pattern matching and also removed the unnecessary helper methods in the new commit.
   *   `nodeId -> (operatorName, metric name -> metric value predicate)`.
   */
  protected def testSparkPlanMetricsWithPredicates(
      df: DataFrame,
      expectedNumOfJobs: Int,
      expectedMetricsPredicates: Map[Long, (String, Map[String, Any => Boolean])]): Unit = {
    val optActualMetrics =
      getSparkPlanMetrics(df, expectedNumOfJobs, expectedMetricsPredicates.keySet)
    optActualMetrics.foreach { actualMetrics =>
      assert(expectedMetricsPredicates.keySet === actualMetrics.keySet)
      for (nodeId <- expectedMetricsPredicates.keySet) {
        val (expectedNodeName, expectedMetricsPredicatesMap) = expectedMetricsPredicates(nodeId)
        val (actualNodeName, actualMetricsMap) = actualMetrics(nodeId)
        assert(expectedNodeName === actualNodeName)
        for (metricName <- expectedMetricsPredicatesMap.keySet) {
          assert(expectedMetricsPredicatesMap(metricName)(actualMetricsMap(metricName)))
        }
      }
    }
  }
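To illustrate how this method is meant to be called, a hypothetical test spec (the node id, job count, and metric names are invented for the example; the actual test in this PR exercises Sort's metrics):

  // Sketch: verify that a Sort node reports well-formed timing and size metrics,
  // without asserting exact values. Node id 0 and the metric names are illustrative.
  val df = spark.range(100).sort("id").toDF()
  testSparkPlanMetricsWithPredicates(df, expectedNumOfJobs = 2, Map(
    0L -> ("Sort", Map(
      "sort time total (min, med, max)" -> checkPattern(timingMetricPattern),
      "peak memory total (min, med, max)" -> checkPattern(sizeMetricPattern)))))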
  def createSizeMetric(sc: SparkContext, name: String): SQLMetric = {
    // The final result of this metric in physical operator UI may look like:
    // data size total (min, med, max):
    // 100GB (100MB, 1GB, 10GB)
    val acc = new SQLMetric(SIZE_METRIC, -1)
    acc.register(sc, name = Some(s"$name total (min, med, max)"), countFailedValues = false)
    acc
  }

  def createTimingMetric(sc: SparkContext, name: String): SQLMetric = {
    // The final result of this metric in physical operator UI may look like:
    // duration (min, med, max):
    // 5s (800ms, 1s, 2s)
    val acc = new SQLMetric(TIMING_METRIC, -1)
    acc.register(sc, name = Some(s"$name total (min, med, max)"), countFailedValues = false)
    acc
  }
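As a usage sketch, this is roughly how a physical operator would create and update such a timing metric (the surrounding operator context and the sparkContext handle are assumed):

  import java.util.concurrent.TimeUnit.NANOSECONDS

  // Create the metric once at plan construction, then accumulate elapsed
  // time into it on the executors while the operator runs.
  val sortTime = SQLMetrics.createTimingMetric(sparkContext, "sort time")

  val start = System.nanoTime()
  // ... perform the work being timed ...
  sortTime += NANOSECONDS.toMillis(System.nanoTime() - start)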
Contributor (Author)
In a new commit, I have added SQLMetricsTestUtils#testSparkPlanMetricsWithPredicates. This way, we simply provide a test spec in test("Sort metrics"), making the test case declarative rather than procedural.
To simplify timing and size metric testing, I added two common predicates, timingMetricAllStatsShould and sizeMetricAllStatsShould. These can be used for other metrics as well, as long as they are timing or size metrics.
I also modified the original testSparkPlanMetrics to make it a special case of testSparkPlanMetricsWithPredicates, where each expected metric value is converted to an equality predicate. This eliminates duplicate code, since testSparkPlanMetrics and testSparkPlanMetricsWithPredicates were almost identical.
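A minimal sketch of that last conversion, assuming the signatures shown in the diff above:

  // Each exact expected value becomes an equality predicate, so the original
  // method can simply delegate to the predicate-based variant.
  protected def testSparkPlanMetrics(
      df: DataFrame,
      expectedNumOfJobs: Int,
      expectedMetrics: Map[Long, (String, Map[String, Any])]): Unit = {
    val expectedMetricsPredicates = expectedMetrics.mapValues {
      case (nodeName, metrics) =>
        (nodeName, metrics.mapValues(expected => (actual: Any) => actual == expected))
    }
    testSparkPlanMetricsWithPredicates(df, expectedNumOfJobs, expectedMetricsPredicates)
  }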
unneeded change
Removed in the new commit.