
Conversation

@sunchao (Member) commented Feb 1, 2022

What changes were proposed in this pull request?

This fixes a binary compatibility issue caused by SPARK-37957, which introduced an additional isDeterministic parameter that defaults to true.

Why are the changes needed?

Adding method parameters with default values breaks binary compatibility (see here). Even though Spark doesn't strictly guarantee binary compatibility for these APIs, it is still better to avoid breaking it. In this case, frameless breaks when it wants to work with multiple Spark versions (e.g., 3.2.0 and 3.2.1).
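
To illustrate why (a minimal sketch with hypothetical names, not Spark's actual signatures):

// v1 -- what a downstream library was compiled against:
//   case class Foo(name: String)
// v2 -- the same class after adding a parameter with a default value:
case class Foo(name: String, flag: Boolean = true)

// Default arguments exist only at compile time: in bytecode the v2 constructor
// takes two parameters, so a call site compiled against v1 fails at link time with
//   java.lang.NoSuchMethodError: Foo.<init>(Ljava/lang/String;)V
val f = Foo("x") // fine when recompiled; a library compiled against v1 fails here at runtime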

Does this PR introduce any user-facing change?

Yes. Users now need to call setDeterministic after constructing Invoke or StaticInvoke if they want to mark the invoked method as non-deterministic.
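
For illustration, the usage change might look like this (a sketch of this PR's initial mutable-flag approach; the class and argument values are hypothetical):

// Before (3.2.1): pass the flag at construction time.
//   StaticInvoke(classOf[SomeUtil], BooleanType, "check", args, isDeterministic = false)
// After this PR: construct first, then mark the method as non-deterministic.
val expr = StaticInvoke(classOf[SomeUtil], BooleanType, "check", args)
expr.setDeterministic(false)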

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label Feb 1, 2022

Note that having a var here breaks the convention that case classes are immutable; it also breaks copy and equality checks, which would ignore this value.

Member Author

Hmm, I'm not sure whether there's a better way. I tried making the parameter non-default, but it seems that will break API compatibility.


I think it should be possible using a secondary constructor and a companion object. Let me take a look.

@thesamet thesamet Feb 1, 2022

My suggestion is to keep isDeterministic: Boolean = true in the constructor, and add a def this and an apply in a companion object:

case class StaticInvoke(
    staticObject: Class[_],
    dataType: DataType,
    functionName: String,
    arguments: Seq[Expression] = Nil,
    inputTypes: Seq[AbstractDataType] = Nil,
    propagateNull: Boolean = true,
    returnNullable: Boolean = true,
    isDeterministic: Boolean = true) {

  def this(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean) = {
    this(staticObject, dataType, functionName, arguments, inputTypes,
      propagateNull, returnNullable, true)
  }
}

object StaticInvoke {
  def apply(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean): StaticInvoke =
    StaticInvoke(staticObject, dataType, functionName, arguments, inputTypes,
      propagateNull, returnNullable, true)
}

The two methods added without default parameters match what existing compiled libraries are linked against. I tested this suggestion in a test project where I added empty traits for DataType, Expression, etc. The above takes care of constructor compatibility, but leaves a few unavoidable binary compatibility issues that should be inconsequential (pertaining to the tupled, curried, and copy methods):

[error] compat: Failed binary compatibility check against default:compat_2.13:0.1.0! Found 4 potential problems (filtered 2)
[error]  * static method tupled()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled")
[error]  * static method curried()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried")
[error]  * method copy(java.lang.Class,x.DataType,java.lang.String,scala.collection.immutable.Seq,scala.collection.immutable.Seq,Boolean,Boolean)x.StaticInvoke in class x.StaticInvoke does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy")
[error]  * the type hierarchy of object x.StaticInvoke is different in current version. Missing types {scala.runtime.AbstractFunction7}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
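
If a downstream project checks itself with MiMa, these residual issues could be suppressed with the filters the tool suggests; a minimal build.sbt sketch (assuming sbt-mima-plugin, and using the x.* names of the test project above rather than Spark's real packages):

import com.typesafe.tools.mima.core._

mimaBinaryIssueFilters ++= Seq(
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy"),
  ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
)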

Contributor

+1 to @thesamet's suggestion.

We can also consider cleaning up the legacy/compatibility versions of this/apply in the next minor release and keeping them around only for patch releases?


It would make sense to add a comment noting that apply and this should be removed for Spark 3.3.0.

Member Author

Ah cool, this is neat. Let me switch to using this.

Member Author

Hmm, what's the benefit of removing these before the Spark 3.3.0 release? Won't that cause the compatibility issue again, say, if frameless wants to work with both Spark 3.2.x and 3.3.x?


The intent of the suggestion was to do this on or after the 3.3.0 release. It's OK to defer, but binary incompatibilities between 3.2.x and 3.3.0 are more expected. This is just to balance how long you want to carry the noise caused by the additional constructors in the code.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38077] Fix binary compatibility issue with isDeterministic flag [SPARK-38077][SQL] Fix binary compatibility issue with isDeterministic flag Feb 1, 2022
@dongjoon-hyun (Member)

cc @huaxingao

@viirya (Member) commented Feb 2, 2022

cc @cloud-fan

returnNullable: Boolean = true,
isDeterministic: Boolean = true) extends InvokeLike {

def this(
@viirya viirya (Member) Feb 2, 2022

Add a comment here explaining why we need this? Without any context, this looks a bit redundant.

Member Author

Sure, added comments.

}

object StaticInvoke {
def apply(
Member

As we have the secondary constructor, do we need apply?


The this constructor ends up as a constructor on the StaticInvoke class, while apply is a method on StaticInvoke$.class, which serves as the companion object of StaticInvoke. Both are needed for binary compatibility: the Scala compiler generates both forms for the primary constructor.
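
As a rough sketch of what scalac emits here (illustrative, not the actual signatures):

// class StaticInvoke {
//   def this(...8 params...)   // primary constructor, <init> in bytecode
//   def this(...7 params...)   // compat constructor added by this PR
// }
// object StaticInvoke {        // compiled as StaticInvoke$
//   def apply(...8 params...)  // synthetic case-class factory
//   def apply(...7 params...)  // compat factory added by this PR
// }
// Old bytecode may link against either the 7-arg <init> or the 7-arg apply,
// so both must be present to stay binary compatible.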

returnNullable: Boolean = true,
isDeterministic: Boolean = true) extends InvokeLike {

// This additional constructor is added to keep binary compatibility after the addition of the
Member

Hm, technically all expressions under catalyst are private, and we don't maintain binary compatibility there. For the same reason, we don't run MiMa on it either. I believe downstream projects can work around this via reflection.
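
Such a reflection workaround might look roughly like this (a sketch that assumes only a trailing Boolean flag was appended; names and error handling are illustrative, not a recommendation):

import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke

def newStaticInvoke(args: Seq[AnyRef]): StaticInvoke = {
  val ctors = classOf[StaticInvoke].getConstructors
  // Prefer a constructor whose arity matches the caller's argument list (older Spark).
  ctors.find(_.getParameterCount == args.length) match {
    case Some(c) => c.newInstance(args: _*).asInstanceOf[StaticInvoke]
    case None =>
      // Newer Spark: supply an explicit value for the appended isDeterministic flag.
      val c = ctors.find(_.getParameterCount == args.length + 1).get
      c.newInstance((args :+ java.lang.Boolean.TRUE): _*).asInstanceOf[StaticInvoke]
  }
}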

Member

We also made this kind of argument change in 3.2.0 (7d8181b) without keeping binary compatibility. I would go -1 on this change - it makes little sense to keep binary compatibility for this argument specifically, in a private package that we documented as such and for which we intentionally skip the binary compatibility check.


The warning says "between minor releases" ;)

@HyukjinKwon (Member) Feb 3, 2022

This is an internal API, and I think it makes little sense to change it just to keep binary compatibility. We should probably mention maintenance releases too - note that these were all explicitly private[sql] before (we removed that in SPARK-16813 to make the code easier to debug). Such compatibility has never been guaranteed in our history.

One option might be to revert #35243 from branch-3.2, since it is trivial to my knowledge, V2 expressions are still unstable, and it virtually doesn't affect anything by default in Spark 3.2.1.


I'll leave it up to the maintainers to decide whether to revert, keep this change, or break binary compatibility. I'll add the library-maintainer context here, though (I maintain scalapb and sparksql-scalapb). We currently don't have a way to give users the ability to use custom types with Datasets (such as sealed trait hierarchies). To remedy that, Spark provides Encoder and Decoder, which I believe are public (?); however, implementing them requires ExpressionEncoder, which quickly takes you into catalyst expressions to do anything useful (instantiating objects, querying them, etc.). Spark currently doesn't provide a general solution in this space, so library maintainers (myself included) have dipped into the internals, and end users depend on us for this.

Maintaining compatibility in the Spark/Scala ecosystem is really time-consuming for maintainers - see this and this. The need for those versions came from users noticing problems, resulting in debugging by maintainers, and so on. I'd like to ask that binary breakages between maintenance releases be avoided or minimized. Breaking binary compatibility in feature releases makes it hard enough. Thank you!

@HyukjinKwon (Member) Feb 4, 2022

I do sympathize with that. To address all such problems, expressions for the public API (V2 expressions) are under heavy development as a long-term goal. I also agree it's probably best to avoid changes that unnecessarily break the compatibility of private/internal APIs, e.g., when avoiding them does not bring significant dev overhead.

For this PR, it would look awkward and confusing (see the comments in the code): should developers keep binary compatibility only for the StaticInvoke and Invoke expressions, or for all expressions? In addition, we would have to keep adding overloaded constructors, which is not ideal for a private/internal API.

Encoder and Decoder are indeed public, but ExpressionEncoder currently is not (it lives under the internal catalyst package). We guarantee and maintain binary and backward compatibility, enforced by a binary compatibility check, for the public API as documented, but not for internal APIs.

Contributor

AFAIK this is always case by case. Yes, we don't expect people to rely on private classes such as Expression, but the fact is that many Spark libraries already use these private classes.

The ecosystem is very important to Spark, and I think we should try our best to fix binary compatibility when a change breaks downstream libraries. I'm +1 on this PR.

@HyukjinKwon (Member) Feb 7, 2022

Once we keep this compatibility, we will have to make such exceptions every time downstream projects break from using our internal or private code. If the impact is significant and a large user group is affected, maybe we should consider making an exception here - but note that it is an exception to the norm.

