
Conversation

@sunchao (Member) commented Feb 1, 2022

What changes were proposed in this pull request?

This fixes a binary compatibility issue caused by SPARK-37957, which introduced an additional isDeterministic parameter that defaults to true.

Why are the changes needed?

Adding method parameters with default values breaks binary compatibility (see here). Even though Spark doesn't strictly guarantee binary compatibility for these APIs, it is still better to avoid breaking it. In this case, frameless breaks when it wants to work with multiple Spark versions (e.g., 3.2.0 and 3.2.1).
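
To illustrate why (a minimal sketch with hypothetical names, not Spark's actual signatures):

// v1 -- what a downstream library was compiled against:
//   case class Foo(name: String)
// v2 -- the same class after adding a parameter with a default value:
case class Foo(name: String, flag: Boolean = true)

// Default arguments exist only at compile time: in bytecode the v2 constructor
// takes two parameters, so a call site compiled against v1 fails at link time with
//   java.lang.NoSuchMethodError: Foo.<init>(Ljava/lang/String;)V
val f = Foo("x") // fine when recompiled; a library compiled against v1 fails here at runtime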

Does this PR introduce any user-facing change?

Yes. Users now need to call setDeterministic after constructing Invoke or StaticInvoke if they want to mark the invoked method as non-deterministic.
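
For illustration, the usage change might look like this (a sketch of this PR's initial mutable-flag approach; the class and argument values are hypothetical):

// Before (3.2.1): pass the flag at construction time.
//   StaticInvoke(classOf[SomeUtil], BooleanType, "check", args, isDeterministic = false)
// After this PR: construct first, then mark the method as non-deterministic.
val expr = StaticInvoke(classOf[SomeUtil], BooleanType, "check", args)
expr.setDeterministic(false)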

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label Feb 1, 2022

Note that having a var here breaks the convention that case classes are immutable; it also breaks copy and equality checks, which would ignore this value.

Member Author

Hmm, I'm not sure whether there's a better way. I tried making the parameter non-default, but it seems that will break API compatibility.


I think it should be possible using a secondary constructor and a companion object. Let me take a look.

@thesamet thesamet Feb 1, 2022

My suggestion is to keep isDeterministic: Boolean = true in the constructor, and add a def this and an apply in a companion object:

case class StaticInvoke(
    staticObject: Class[_],
    dataType: DataType,
    functionName: String,
    arguments: Seq[Expression] = Nil,
    inputTypes: Seq[AbstractDataType] = Nil,
    propagateNull: Boolean = true,
    returnNullable: Boolean = true,
    isDeterministic: Boolean = true) {

  def this(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean) = {
    this(staticObject, dataType, functionName, arguments, inputTypes,
      propagateNull, returnNullable, true)
  }
}

object StaticInvoke {
  def apply(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean): StaticInvoke =
    StaticInvoke(staticObject, dataType, functionName, arguments, inputTypes,
      propagateNull, returnNullable, true)
}

The two methods added without default parameters match what existing compiled libraries are linked against. I tested this suggestion in a test project where I added empty traits for DataType, Expression, etc. The above takes care of constructor compatibility, but leaves a few unavoidable binary compatibility issues that should be inconsequential (pertaining to the tupled, curried, and copy methods):

[error] compat: Failed binary compatibility check against default:compat_2.13:0.1.0! Found 4 potential problems (filtered 2)
[error]  * static method tupled()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled")
[error]  * static method curried()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried")
[error]  * method copy(java.lang.Class,x.DataType,java.lang.String,scala.collection.immutable.Seq,scala.collection.immutable.Seq,Boolean,Boolean)x.StaticInvoke in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy")
[error]  * the type hierarchy of object x.StaticInvoke is different in current version. Missing types {scala.runtime.AbstractFunction7}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
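
If a downstream project checks itself with MiMa, these residual issues could be suppressed with the filters the tool suggests; a minimal build.sbt sketch (assuming sbt-mima-plugin, and using the x.* names of the test project above rather than Spark's real packages):

import com.typesafe.tools.mima.core._

mimaBinaryIssueFilters ++= Seq(
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy"),
  ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
)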

Contributor

+1 to @thesamet's suggestion.

We can also consider cleaning up the legacy/compatibility versions of this/apply in the next minor release and keeping them around only for patch releases?


It would make sense to add a comment noting that apply and this should be removed for Spark 3.3.0.

Member Author

Ah cool, this is neat. Let me switch to using this.

Member Author

Hmm, what's the benefit of removing these before the Spark 3.3.0 release? Won't that cause the compatibility issue again, say, if frameless wants to work with both Spark 3.2.x and 3.3.x?


The intent of the suggestion was to do this on or after the 3.3.0 release. It's OK to defer, but binary incompatibilities between 3.2.x and 3.3.0 are more expected. This is just to balance how long you want to carry the noise caused by the additional constructors in the code.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38077] Fix binary compatibility issue with isDeterministic flag [SPARK-38077][SQL] Fix binary compatibility issue with isDeterministic flag Feb 1, 2022
@dongjoon-hyun (Member)

cc @huaxingao

@viirya (Member) commented Feb 2, 2022

cc @cloud-fan

returnNullable: Boolean = true,
isDeterministic: Boolean = true) extends InvokeLike {

def this(
@viirya viirya (Member) Feb 2, 2022

Add a comment here explaining why we need this? Without any context, this looks a bit redundant.

Member Author

Sure, added comments.

}

object StaticInvoke {
def apply(
Member

As we have the secondary constructor, do we need apply?


The this constructor ends up as a constructor on the StaticInvoke class, while apply is a method on StaticInvoke$.class, which serves as the companion object of StaticInvoke. Both are needed for binary compatibility: the Scala compiler generates both forms for the primary constructor.
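
As a rough sketch of what scalac emits here (illustrative, not the actual signatures):

// class StaticInvoke {
//   def this(...8 params...)   // primary constructor, <init> in bytecode
//   def this(...7 params...)   // compat constructor added by this PR
// }
// object StaticInvoke {        // compiled as StaticInvoke$
//   def apply(...8 params...)  // synthetic case-class factory
//   def apply(...7 params...)  // compat factory added by this PR
// }
// Old bytecode may link against either the 7-arg <init> or the 7-arg apply,
// so both must be present to stay binary compatible.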

returnNullable: Boolean = true,
isDeterministic: Boolean = true) extends InvokeLike {

// This additional constructor is added to keep binary compatibility after the addition of the
Member

Hm, technically all expressions under catalyst are private, and we don't maintain binary compatibility there. For the same reason, we don't run MiMa on it either. I believe downstream projects can work around this via reflection.
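
Such a reflection workaround might look roughly like this (a sketch that assumes only a trailing Boolean flag was appended; names and error handling are illustrative, not a recommendation):

import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke

def newStaticInvoke(args: Seq[AnyRef]): StaticInvoke = {
  val ctors = classOf[StaticInvoke].getConstructors
  // Prefer a constructor whose arity matches the caller's argument list (older Spark).
  ctors.find(_.getParameterCount == args.length) match {
    case Some(c) => c.newInstance(args: _*).asInstanceOf[StaticInvoke]
    case None =>
      // Newer Spark: supply an explicit value for the appended isDeterministic flag.
      val c = ctors.find(_.getParameterCount == args.length + 1).get
      c.newInstance((args :+ java.lang.Boolean.TRUE): _*).asInstanceOf[StaticInvoke]
  }
}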

Member

We also made this kind of argument change in 3.2.0 (7d8181b) without keeping binary compatibility. I would go -1 on this change - it makes little sense to keep binary compatibility for this argument specifically, in a private package that we documented as such and for which we intentionally skip the binary compatibility check.


The warning says "between minor releases" ;)

@HyukjinKwon (Member) Feb 3, 2022

This is an internal API, and I think it makes little sense to change it just to keep binary compatibility. We should probably mention maintenance releases too - note that these were all explicitly private[sql] before (we removed that in SPARK-16813 to make the code easier to debug). Such compatibility has never been guaranteed in our history.

One option might be to revert #35243 from branch-3.2, since it is trivial to my knowledge, V2 expressions are still unstable, and it virtually doesn't affect anything by default in Spark 3.2.1.


I'll leave it up to the maintainers to decide whether to revert, keep this change, or break binary compatibility. I'll add the library-maintainer context here, though (I maintain scalapb and sparksql-scalapb). We currently don't have a way to give users the ability to use custom types with Datasets (such as sealed trait hierarchies). To remedy that, Spark provides Encoder and Decoder, which I believe are public (?); however, implementing them requires ExpressionEncoder, which quickly takes you into catalyst expressions to do anything useful (instantiating objects, querying them, etc.). Spark currently doesn't provide a general solution in this space, so library maintainers (myself included) have dipped into the internals, and end users depend on us for this.

Maintaining compatibility in the Spark/Scala ecosystem is really time-consuming for maintainers - see this and this. The need for those versions came from users noticing problems, resulting in debugging by maintainers, and so on. I'd like to ask that binary breakages between maintenance releases be avoided or minimized. Breaking binary compatibility in feature releases makes it hard enough. Thank you!

@HyukjinKwon (Member) Feb 4, 2022

I do sympathize with that. To address all such problems, expressions for the public API (V2 expressions) are under heavy development as a long-term goal. I also agree it's probably best to avoid changes that unnecessarily break the compatibility of private/internal APIs, e.g., when avoiding them does not bring significant dev overhead.

For this PR, it would look awkward and confusing (see the comments in the code): should developers keep binary compatibility only for the StaticInvoke and Invoke expressions, or for all expressions? In addition, we would have to keep adding overloaded constructors, which is not ideal for a private/internal API.

Encoder and Decoder are indeed public, but ExpressionEncoder currently is not (it lives under the internal catalyst package). We guarantee and maintain binary and backward compatibility, enforced by a binary compatibility check, for the public API as documented, but not for internal APIs.

Contributor

AFAIK this is always case by case. Yes, we don't expect people to rely on private classes such as Expression, but the fact is that many Spark libraries already use these private classes.

The ecosystem is very important to Spark, and I think we should try our best to fix binary compatibility when a change breaks downstream libraries. I'm +1 on this PR.

@HyukjinKwon (Member) Feb 7, 2022

Once we keep this compatibility, we will have to make such exceptions every time downstream projects break from using our internal or private code. If the impact is significant and a large user group is affected, maybe we should consider making an exception here - but note that it is an exception to the norm.

