Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

@jiangxb1987 jiangxb1987 commented Jul 5, 2017

What changes were proposed in this pull request?

Long values can be passed to rangeBetween as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.

Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this been merged, we can close #16818 .

How was this patch tested?

Add new tests in DataFrameWindowFunctionsSuite and TypeCoercionSuite.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this over. I left a few comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: typo Cannot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made lower and upper AnyRef, this was to allow the use of both of (foldable) Expressions and SpecialFrameBoundary. This works rather well with things like constant folding. The reason for not making SpecialFrameBoundary an Expression is that this cannot have a type (unless you make it a case class I suppose) and that it showed some weird behavior during analysis/optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both DateType and TimestampType expressions are going to need a time zone. I was wondering if we can use a GMT because these are just offset calculation? cc @ueshin

If we can't then we either need to thread through the session local timezone, or it might be easier to put the entire offset calculation in the frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also create API's that allow for other types of literals, at least one for CalendarIntervals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree that's what we definitely should do, but I'd suggest we address this in a follow-up work, and focus on resolving the overflow issue on Long frame boundaries in rangeBetween in this PR.

One major concern is the WindowSpec API is marked Stable, so I'm wondering what's the proper procedure to make a change to this interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be the easiest to make it take a Column, so rangeBetween(begin: Column, end: Column), only downside to this is that we need some way to express the special boundaries (current row, unbounded). Also cc @rxin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use Literal(0) to represent CurrentRow? And a sufficient large number(like Literal(Long.MaxValue)) for Unbounded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was trying to avoid introducing a special value, but maybe you can do that.

How important is it to fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rule it out of the scope of this PR and address this in a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add tests for dates/doubles/timestamps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do this tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only add the test cases after we have finalized the API change.

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79210 has finished for PR 18540 at commit 52c5289.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait FrameType
  • sealed trait WindowFrame extends Expression with Unevaluable
  • case class SpecifiedWindowFrame(

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79231 has finished for PR 18540 at commit 1135053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this? can we just require strict types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm that would be kind of weird. So a user will get type coercion in its select but not in the range clause.

@SparkQA
Copy link

SparkQA commented Jul 11, 2017

Test build #79531 has finished for PR 18540 at commit a761f63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

}
/**
* A specified Window Frame. The val lower/uppper can be either a foldable [[Expression]] or a
* [[SpecialFrameBoundary]].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make SpecialFrameBoundary an expression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. The problem is that you will need to make them have a proper data type. I tried to make them case object .. {} with data type null, but I ended with these replaced with a null literal.

All I am saying that this will require a little bit more coding. Since you need to resolve the data type of the boundary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this in WindowFrameCoercion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


private def isValidFrameType(ft: DataType): Boolean = (orderSpec.head.dataType, ft) match {
case (DateType, IntegerType) => true
case (TimestampType, CalendarIntervalType) => true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we support DateType, TimestampType in follow-up PR? Let's focus on refactor/cleanup in this PR.

@SparkQA
Copy link

SparkQA commented Jul 21, 2017

Test build #79835 has finished for PR 18540 at commit 427753d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait SpecialFrameBoundary extends Expression with Unevaluable

@SparkQA
Copy link

SparkQA commented Jul 21, 2017

Test build #79841 has finished for PR 18540 at commit 43b2399.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def children: Seq[Expression] = partitionSpec ++ orderSpec
override def children: Seq[Expression] = partitionSpec ++ orderSpec ++ Seq(frameSpecification)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: partitionSpec ++ orderSpec :+ frameSpecification

* Represents a window frame.
*/
sealed trait WindowFrame
def isValueBound: Boolean = valueBoundary.nonEmpty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: def isValueBound: Boolean = !isUnbounded

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, is it possible that both lower and upper are current row?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we can have ROWS CURRENT ROW

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having rows between current row and current row is kinda dumb, the aggregate should contain only 1 value. However range between current row and current row can be very useful because you can aggregate over all the observations with the same ordering value.

}

private def checkBoundary(b: Expression, location: String): TypeCheckResult = b match {
case e: Expression if !e.foldable && !e.isInstanceOf[SpecialFrameBoundary] =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can mark SpecialFrameBoundary as foldable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should, with what are you going to replace the boundary during optimization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I was wrong.

Seq(UnresolvedAttribute("a")),
Seq(SortOrder(Literal.default(DateType), Ascending)),
SpecifiedWindowFrame(RangeFrame, Literal(10.0), Literal(2147483648L)))
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some more test cases with special window frame boundary?

("between unbounded preceding and current row", UnboundedPreceding, CurrentRow),
("10 preceding", -Literal(10), CurrentRow),
("2147483648 preceding", -Literal(2147483648L), CurrentRow),
("3 + 1 following", Add(Literal(3), Literal(1)), CurrentRow), // Will fail during analysis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this will fail during analysis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower boundary would be higher than the upper boundary, previously it would fail, but we have removed this check, should add it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that will improve the UX?

("2147483648 preceding", -Literal(2147483648L), CurrentRow),
("3 + 1 following", Add(Literal(3), Literal(1)), CurrentRow), // Will fail during analysis
("unbounded preceding", Unbounded, CurrentRow),
("unbounded following", Unbounded, CurrentRow), // Will fail during analysis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact this is problematic, we would generate the same result for both unbounded preceding and unbounded following. @hvanhovell any idea on resolving this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the idea was that the unboundedness was tied to the location in which it was used, so for example unbounded in the first position would mean unbounded preceding. However this is completely opposite to how we interpret literal bounds, it might be better to reintroduce special boundaries for unbounded preceding and unbounded following.

"num-children" -> 2,
"frameType" -> JObject("object" -> JString(RowFrame.getClass.getName)),
"lower" -> 0,
"upper" -> 1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why lower and upper is 0 and 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR, SpecialFrameBoundary and WindowFrame are made Expressions, thus they are TreeNodes, so the field values are made value index in the TreeNode.children.

@cloud-fan
Copy link
Contributor

LGTM except some minor comments

@SparkQA
Copy link

SparkQA commented Jul 24, 2017

Test build #79905 has finished for PR 18540 at commit 5c9a992.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

TypeCheckFailure(
"Cannot use an UnspecifiedFrame. This should have been converted during analysis. " +
"Please file a bug report.")
case f: SpecifiedWindowFrame if f.frameType == RangeFrame && !f.isUnbounded &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be !f.isValueBound? basically current row and current row is not unbound but should be allowed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was wrong

@SparkQA
Copy link

SparkQA commented Jul 27, 2017

Test build #80006 has finished for PR 18540 at commit a1f91cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"Frame bound value must be a constant integer.",
ctx)
e.eval().asInstanceOf[Int]
validate(e.resolved && e.foldable, "Frame bound value must be a literal.", ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary? I think analyzer can detect and report this failure too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about keep this so we can fail earlier?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@cloud-fan
Copy link
Contributor

LGTM, pending tests

@SparkQA
Copy link

SparkQA commented Jul 28, 2017

Test build #80020 has finished for PR 18540 at commit 9b8a19b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 28, 2017

Test build #80021 has finished for PR 18540 at commit fbcea1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

You still missed resolving this comment

FrameBoundary(l),
FrameBoundary(h))))
if order.isEmpty || frame != RowFrame || l != h =>
frame: SpecifiedWindowFrame))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Cutting in the middle looks weird. Please keep them in the same line.
WindowSpecDefinition(_, order, frame: SpecifiedWindowFrame))

failAnalysis(s"Expression '$e' not supported within a window function.")
}
// Make sure the window specification is valid.
s.validate match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification is moved to checkInputDataTypes, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea

private def createBoundaryCast(boundary: Expression, dt: DataType): Expression = {
boundary match {
case e: Expression if e.dataType != dt && Cast.canCast(e.dataType, dt) &&
!e.isInstanceOf[SpecialFrameBoundary] =>
Copy link
Member

@gatorsmile gatorsmile Jul 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting if in the middle looks weird. How about?

        case e: SpecialFrameBoundary => e
        case e: Expression if e.dataType != dt && Cast.canCast(e.dataType, dt) => Cast(e, dt)
        case o => o

object WindowFrameCoercion extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
case s @ WindowSpecDefinition(_, Seq(order), SpecifiedWindowFrame(RangeFrame, lower, upper))
if order.resolved =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add two more spaces before if

s"A range window frame with value boundaries cannot be used in a window specification " +
s"with multiple order by expressions: ${orderSpec.mkString(",")}")
case f: SpecifiedWindowFrame if f.frameType == RangeFrame && f.isValueBound &&
!isValidFrameType(f.valueBoundary.head.dataType) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I do not like many long if conditions. Let us add extra two spaces in line 71, 66, and 62.

TypeCheckFailure(
s"The data type '${orderSpec.head.dataType}' used in the order specification does " +
s"not match the data type '${f.valueBoundary.head.dataType}' which is used in the " +
"range frame.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to confirm whether we have at least four negatives test cases to respectively cover these cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first case is for defensive guard, I'll add test sql for the two negative cases related to orderBy in RangeFrame.

sealed trait FrameBoundary {
def notFollows(other: FrameBoundary): Boolean
case object RangeFrame extends FrameType {
override def inputType: AbstractDataType = TypeCollection.NumericAndInterval
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh, we also support CalendarInterval. Do we have a test case to verify it works on CalendarInterval?

s"'${l.dataType.catalogString}' <> '${u.dataType.catalogString}'")
case (l: Expression, u: Expression) if isGreaterThan(l, u) =>
TypeCheckFailure(
"The lower bound of a window frame must less than or equal to the upper bound")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question. Do we have test cases for the above three negative cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second is defensive check, added test sql for the rest.

if (check.isFailure) {
check
} else if (!offset.foldable) {
TypeCheckFailure(s"Offset expression '$offset' must be a literal.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it's only used by lead()/lag() functions, that both checked the input types, so we're not able to test this from sql.

"Frame bound value must be a constant integer.",
ctx)
e.eval().asInstanceOf[Int]
validate(e.resolved && e.foldable, "Frame bound value must be a literal.", ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

UnboundedFollowing
case SqlBaseParser.FOLLOWING =>
ValueFollowing(value)
value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be an unsigned-integer based on ANSI SQL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask how should we parse it into an unsigned-integer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we already allowed it in the previous release. Thus, we need to follow what we have now.

UnboundedPreceding
case SqlBaseParser.PRECEDING =>
ValuePreceding(value)
UnaryMinus(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same here. Do we allow users assign a negative value?

* @since 1.4.0
*/
// Note: when updating the doc for this method, also update Window.rangeBetween.
def rangeBetween(start: Long, end: Long): WindowSpec = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if start is larger than end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll get and compute empty frames.

@gatorsmile
Copy link
Member

It looks pretty solid. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 29, 2017

Test build #80041 has finished for PR 18540 at commit 38b04df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

gatorsmile commented Jul 29, 2017

Thanks! Merging to master

asfgit pushed a commit that referenced this pull request Jul 29, 2017
…undary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.

Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this been merged, we can close #16818 .

## How was this patch tested?

Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <[email protected]>

Closes #18540 from jiangxb1987/rangeFrame.

(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in 92d8563 Jul 29, 2017
@jiangxb1987 jiangxb1987 deleted the rangeFrame branch August 9, 2017 08:38
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…undary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.

Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this been merged, we can close apache#16818 .

## How was this patch tested?

Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <[email protected]>

Closes apache#18540 from jiangxb1987/rangeFrame.

(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants