[SPARK-24634][SS] Add a new metric regarding number of inputs later than watermark plus allowed delay #28607
Conversation
I just revived #21617 (instead of pushing forward on #24936) because this approach is straightforward and much simpler than #24936, though #24936 provides the exact number. #21617 got good traction from reviewers, except for the fact that the representation of "input rows" is not exactly correct. I'm trying to address that in this patch.
Test build #122967 has finished for PR 28607 at commit
Test build #122971 has finished for PR 28607 at commit
retest this, please
Test build #122973 has finished for PR 28607 at commit
Force-pushed from 10d8d56 to f09a623
Test build #122981 has finished for PR 28607 at commit
retest this, please
Test build #123140 has finished for PR 28607 at commit
```scala
    iter: Iterator[InternalRow],
    predicate: BasePredicate): Iterator[InternalRow] = {
  iter.filter { row =>
    val filteredIn = !predicate.eval(row)
```
nit: the variable naming here is a little confusing; maybe use a name related to the watermark? inWatermark?
Maybe it's clearer if we apply filterNot; then the variable can be renamed to lateInput, which matches the metric name. The variable predicate is probably also unclear about which rows evaluate to true. Let me apply it.
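For illustration, a minimal sketch of what the suggested filterNot variant might look like (the helper name and the metric parameter are assumptions, not the merged code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.BasePredicate
import org.apache.spark.sql.execution.metric.SQLMetric

// Hypothetical helper: `predicate` evaluates to true for rows later than
// watermark plus allowed delay, so filterNot keeps only the on-time rows.
def filterLateRows(
    iter: Iterator[InternalRow],
    predicate: BasePredicate,
    numLateInputs: SQLMetric): Iterator[InternalRow] = {
  iter.filterNot { row =>
    val lateInput = predicate.eval(row) // name matches the metric, per the review
    if (lateInput) numLateInputs += 1   // count rows as they are dropped
    lateInput
  }
}
```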
Thanks for reviving this; I think it's an important metric. How about also attaching a screenshot of the SQL UI that contains this metric?
Test build #123306 has finished for PR 28607 at commit
Thanks for reviewing the PR. I attached a SQL UI screenshot that shows late inputs.
xuanyuanking left a comment
Test build #123315 has finished for PR 28607 at commit
Test build #123345 has finished for PR 28607 at commit
Force-pushed from 03dc461 to 4216405
Test build #123357 has finished for PR 28607 at commit
retest this, please
Test build #123984 has finished for PR 28607 at commit
Merged to master.
Thanks all for reviewing and merging!
```scala
override lazy val metrics = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
  "numLateInputs" -> SQLMetrics.createMetric(sparkContext,
```
nit: maybe name it numDroppedRowsByWatermark? Technically, we don't guarantee that all late inputs are dropped, so this metric doesn't report the accurate number of late inputs.
Thanks for the suggestion. Yeah, I tried to explain the behavior, but the name still seems confusing to others. I agree the suggested name is clearer.
Btw, would it be better to also have the accurate number of late inputs? (Just asking, because #24936 covers this, and it could be applied orthogonally, i.e. number of late inputs vs. number of dropped inputs.)
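For reference, a hedged sketch of the metrics map with the rename applied, using the name that the follow-up PR eventually settled on (the description string is an assumption, not the exact merged wording):

```scala
import org.apache.spark.sql.execution.metric.SQLMetrics

// Sketch only: inside a stateful SparkPlan node, where `sparkContext` is in scope.
override lazy val metrics = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
  "numRowsDroppedByWatermark" -> SQLMetrics.createMetric(sparkContext,
    "number of rows which are dropped by watermark"))
```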
…to "numRowsDroppedByWatermark" ### What changes were proposed in this pull request? This PR renames the variable from "numLateInputs" to "numRowsDroppedByWatermark" so that it becomes self-explanation. ### Why are the changes needed? This is originated from post-review, see #28607 (comment) ### Does this PR introduce _any_ user-facing change? No, as SPARK-24634 is not introduced in any release yet. ### How was this patch tested? Existing UTs. Closes #28828 from HeartSaVioR/SPARK-24634-v3-followup. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
Please refer to https://issues.apache.org/jira/browse/SPARK-24634 for the rationale behind the issue.
This patch adds a new metric to count the number of inputs that arrive later than watermark plus allowed delay. To keep the changes simple, this patch doesn't count the exact number of input rows that are later than watermark plus allowed delay. Instead, it counts the inputs that are dropped in the operator's logic. The difference between the two shows up in streaming aggregation: to optimize the calculation, streaming aggregation "pre-aggregates" the input rows and later checks the lateness against the "pre-aggregated" inputs, hence the number might be reduced.
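To make the pre-aggregation effect concrete, here is a minimal sketch of the kind of watermarked aggregation where it applies (the rate source and the app name are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("watermark-demo").getOrCreate()
import spark.implicits._

val agg = spark.readStream
  .format("rate").load()                     // built-in test source with a `timestamp` column
  .withWatermark("timestamp", "10 minutes")  // allowed delay
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
// If three late rows fall into the same 5-minute window, partial aggregation
// collapses them into one row before the lateness check, so the new metric
// would report one dropped (pre-aggregated) input rather than three raw rows.
```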
The new metric will be provided in two places:
Why are the changes needed?
Dropping late inputs means end users might not get the expected outputs. Even if end users anticipate this and tolerate the result (as that's what allowed lateness is for), they should be able to observe whether the current value of allowed lateness drops inputs or not, so that they can adjust the value.
Also, whenever there are multiple stateful operators in a single query, if Spark drops late inputs "between" these operators, it becomes a "correctness" issue. Spark should disallow such a possibility, but given we already provide the flexibility, at the very least we should provide a way to observe the correctness issue so that users can decide whether to correct their query or not.
Does this PR introduce any user-facing change?
Yes. End users will be able to retrieve the information of late inputs in two ways:
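For example, once the metric is exposed through the streaming progress, it might be read like this (a hedged sketch using the field name from the follow-up rename; `query` is an assumed running StreamingQuery):

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Sum the dropped-row counts across all stateful operators in the last progress.
def rowsDroppedByWatermark(query: StreamingQuery): Long =
  Option(query.lastProgress)
    .map(_.stateOperators.map(_.numRowsDroppedByWatermark).sum)
    .getOrElse(0L)
```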
How was this patch tested?
New UTs were added, and existing UTs were modified to reflect the change.
I also ran a manual test reproducing SPARK-28094, picking the specific case of "B outer C outer D", which is enough to represent the "intermediate late row" issue due to the global watermark.
https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
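For context, a hedged sketch of the shape of that reproduction (stream names, columns, and intervals are assumptions; the gist above has the real code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("chained-outer-joins").getOrCreate()

// Assumed helper: build a watermarked stream with prefixed column names.
def src(p: String) = spark.readStream.format("rate").load()
  .selectExpr(s"value AS ${p}_id", s"timestamp AS ${p}_ts")
  .withWatermark(s"${p}_ts", "10 seconds")

val (b, c, d) = (src("b"), src("c"), src("d"))

// "B outer C outer D": a row produced by the first join can already be behind
// the global watermark when it reaches the second join, where it is silently
// dropped; that is the "intermediate late row" issue the new metric surfaces.
val joined = b
  .join(c, expr("b_id = c_id AND c_ts BETWEEN b_ts AND b_ts + interval 5 seconds"), "leftOuter")
  .join(d, expr("b_id = d_id AND d_ts BETWEEN b_ts AND b_ts + interval 5 seconds"), "leftOuter")
```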
Spark logs a warning message for the query, which means SPARK-28074 is working correctly,
and we can find the late inputs from batch 4 as follows:
which shows that intermediate inputs are being lost, ending up as a correctness issue.