[SPARK-38077][SQL] Fix binary compatibility issue with isDeterministic flag #35378
Conversation
Note that having a var here breaks the convention that case classes are immutable, and it also breaks copy and equality checks, which would ignore this value.
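(A minimal illustration of this pitfall, using a hypothetical Point class rather than Spark code: copy and equals only consider constructor parameters, so state held in a body-level var is silently dropped.)

case class Point(x: Int) {
  var deterministic: Boolean = true // body-level var: invisible to copy/equals/hashCode
}

val a = Point(1)
a.deterministic = false
val b = a.copy()        // copy() rebuilds from constructor params only; the var resets to true
assert(b.deterministic) // the mutation was silently dropped
assert(a == Point(1))   // equality also ignores the var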
Hmm, I'm not sure whether there's a better way. I tried making the parameter non-default, but it seems that would break API compatibility.
I think it should be possible using a secondary constructor plus an apply in the companion object. Let me take a look.
My suggestion is to keep isDeterministic: Boolean = true in the constructor, and add a def this in the class and an apply in a companion object:
case class StaticInvoke(
    staticObject: Class[_],
    dataType: DataType,
    functionName: String,
    arguments: Seq[Expression] = Nil,
    inputTypes: Seq[AbstractDataType] = Nil,
    propagateNull: Boolean = true,
    returnNullable: Boolean = true,
    isDeterministic: Boolean = true) {

  def this(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean) = {
    this(staticObject, dataType, functionName, arguments, inputTypes, propagateNull,
      returnNullable, true)
  }
}

object StaticInvoke {
  def apply(
      staticObject: Class[_],
      dataType: DataType,
      functionName: String,
      arguments: Seq[Expression],
      inputTypes: Seq[AbstractDataType],
      propagateNull: Boolean,
      returnNullable: Boolean): StaticInvoke =
    StaticInvoke(staticObject, dataType, functionName, arguments, inputTypes, propagateNull,
      returnNullable, true)
}
The two methods added without default parameters represent what existing compiled libraries are linked against. I tested this suggestion in a test project where I added empty traits for DataType, Expression, etc. The above suggestion takes care of the constructor compatibility, but leaves a few unavoidable binary compatibility issues that should be inconsequential (pertaining to the tupled, curried and copy methods):
[error] compat: Failed binary compatibility check against default:compat_2.13:0.1.0! Found 4 potential problems (filtered 2)
[error] * static method tupled()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled")
[error] * static method curried()scala.Function1 in class x.StaticInvoke does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried")
[error] * method copy(java.lang.Class,x.DataType,java.lang.String,scala.collection.immutable.Seq,scala.collection.immutable.Seq,Boolean,Boolean)x.StaticInvoke in class x.StaticInvoke does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy")
[error] * the type hierarchy of object x.StaticInvoke is different in current version. Missing types {scala.runtime.AbstractFunction7}
[error] filter with: ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
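For reference, here is a sketch of how such a check and its filters can be wired up with sbt-mima-plugin in a downstream build.sbt (the artifact coordinates are placeholders matching the test project above):

import com.typesafe.tools.mima.core._

// Compare the current build against the previously published artifact.
mimaPreviousArtifacts := Set("default" %% "compat" % "0.1.0")

// Suppress the four inconsequential problems reported above.
mimaBinaryIssueFilters ++= Seq(
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.tupled"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.curried"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("x.StaticInvoke.copy"),
  ProblemFilters.exclude[MissingTypesProblem]("x.StaticInvoke$")
)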
+1 to @thesamet's suggestion.
We can also consider cleaning up the legacy/compatibility versions of this/apply in the next minor release and keeping them around only for patch releases?
It would make sense to add a comment noting that apply and this should be removed for Spark 3.3.0.
Ah cool, this is neat. Let me switch to using this.
Hmm, what's the benefit of removing these before the Spark 3.3.0 release? Won't that cause the compatibility issue again? Say frameless wants to work with both Spark 3.2.x and 3.3.x.
The intent of the suggestion was to do this on or after the 3.3.0 release. It's OK to defer, but binary incompatibilities between 3.2.x and 3.3.0 are more expected. This is just to balance how long you want to carry the noise in the code caused by the additional constructors.
cc @huaxingao
cc @cloud-fan
    returnNullable: Boolean = true,
    isDeterministic: Boolean = true) extends InvokeLike {

  def this(
Add a comment here explaining why we need this? Without any context, this looks a bit redundant.
Sure, added comments.
  }

object StaticInvoke {
  def apply(
As we have the second constructor, do we still need apply?
The this constructor ends up as a constructor overload on StaticInvoke.class, and the apply as a method on StaticInvoke$.class, which serves as the companion object of StaticInvoke. They are both needed for binary compatibility: the Scala compiler generates both of them for the primary constructor.
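Concretely, code compiled against the 7-parameter version may reference either member, depending on how it constructed the expression (hypothetical call sites, reusing the types from the sketch above):

// Hypothetical callers compiled against the old 7-parameter shape.
// Both must remain linkable after isDeterministic is added.
def oldCallers(
    staticObject: Class[_],
    dataType: DataType,
    functionName: String,
    arguments: Seq[Expression],
    inputTypes: Seq[AbstractDataType]): Unit = {
  // Resolves to the 7-arg constructor (<init>) in StaticInvoke.class.
  val viaNew = new StaticInvoke(
    staticObject, dataType, functionName, arguments, inputTypes, true, true)
  // Resolves to the 7-arg apply on the companion, StaticInvoke$.class.
  val viaApply = StaticInvoke(
    staticObject, dataType, functionName, arguments, inputTypes, true, true)
}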
    returnNullable: Boolean = true,
    isDeterministic: Boolean = true) extends InvokeLike {

  // This additional constructor is added to keep binary compatibility after the addition of the
Hm, technically all expressions under catalyst are private, and we don't maintain binary compatibility here. For the same reason, we don't run MiMa on it either. I believe downstream projects can work around this with reflection.
We also made this kind of argument change in 3.2.0 (7d8181b) without keeping binary compatibility. I would go -1 on this change: it makes little sense to keep binary compatibility for this argument specifically, in a private package that we have documented as such and for which we intentionally skip the binary compatibility check.
The warning says "between minor releases" ;)
This is an internal API, and I think it makes less sense to make changes to keep binary compatibility here. We should probably mention maintenance releases too - note that these were all explicitly private[sql] before (which we removed in SPARK-16813 to make the code easier to debug). Such compatibility has never been guaranteed in history.
One option might be to revert #35243 from branch-3.2, since it is trivial to my knowledge, V2 expressions are still unstable, and it virtually doesn't affect anything by default in Spark 3.2.1.
I'll leave this up to the maintainers to decide whether to revert, keep this change, or break binary compatibility. I'll add the library-maintainer context here, though (I maintain scalapb and sparksql-scalapb). We currently don't have a way to give users the ability to use custom types with Datasets (such as sealed trait hierarchies). To remedy that, Spark provides Encoder and Decoder, which I believe are public (?); however, implementing them requires ExpressionEncoder, which quickly takes you to catalyst expressions to do anything useful (instantiating objects, querying them, etc.). Spark currently doesn't provide a general solution in this space, so apparently library maintainers (myself included) have dipped into the internals, and end users depend on us for this.
Maintaining compatibility in the Spark/Scala ecosystem is really time-consuming for maintainers - see this and this. The need for those versions came from users noticing problems, resulting in debugging by maintainers, and so on. I'd like to ask to avoid/minimize binary breakages between maintenance releases. Breaking binary compatibility in feature releases already makes it hard enough. Thank you!
I do sympathize with that. To address all such problems, expressions for the API (V2 expressions) are under heavy development as a long-run goal. I also agree that it's probably best to avoid changes that unnecessarily break the compatibility of private/internal APIs, e.g., when that does not bring significant dev overhead.
For this PR, it would look awkward and confusing (see the comments in the code) whether developers should keep binary compatibility only in StaticInvoke and Invoke or in all the expressions. In addition, we would have to keep adding overloaded constructors, which is not ideal for a private/internal API.
Encoder and Decoder are indeed public, but ExpressionEncoder is currently not (it is under the internal catalyst package). We guarantee and maintain binary and backward compatibility, with a binary compatibility check, for the public API as documented, but not for internal APIs.
AFAIK this is always case by case. Yes, we don't expect people to rely on private classes such as Expression, but the fact is that many Spark libraries are already using these private classes.
The ecosystem is very important to Spark, and I think we should try our best to fix binary compatibility if a change does break downstream libraries. I'm +1 on this PR.
Once we keep this compatibility, we will have to make such exceptions every time a downstream project breaks from using our internal or private code. If this is very significant and a large user group is affected, maybe we should think about making it an exception, but note that it is an exception to the norm.
What changes were proposed in this pull request?
This fixes a binary compatibility issue caused by SPARK-37957 with the introduction of the additional isDeterministic parameter, which defaults to true.
Why are the changes needed?
Adding method parameters with default values breaks binary compatibility (see here). Even though Spark doesn't strictly guarantee binary compatibility, it is still better to avoid breaking it. In this case, frameless breaks when it wants to work with multiple Spark versions (e.g., 3.2.0 and 3.2.1).
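To illustrate the failure mode with a minimal sketch (Foo is a hypothetical stand-in, not Spark code):

// v1.0 of a library ships:
//   case class Foo(a: Int, b: Int)
// and a client compiled against v1.0 emits references to the two-argument
// shapes, roughly Foo$.apply(II)LFoo; and Foo.<init>(II).

// v1.1 changes the definition to:
case class Foo(a: Int, b: Int, c: Boolean = true)

// Source compatibility holds: Foo(1, 2) still compiles, because the default is
// filled in at the call site via the synthetic Foo$.apply$default$3 method.
// Binary compatibility does not: the two-argument apply and constructor no
// longer exist, so the client compiled against v1.0 fails at runtime with
// java.lang.NoSuchMethodError.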
Does this PR introduce any user-facing change?
Now it requires users to call setDeterministic after initializing Invoke and StaticInvoke if they want to mark the methods as non-deterministic.
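For example (a sketch only; the exact setDeterministic signature is assumed from the description above, not taken from the final code):

import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.types.DoubleType

// Construct the expression as before, then flag it as non-deterministic.
val expr = StaticInvoke(classOf[java.lang.Math], DoubleType, "random")
expr.setDeterministic(false) // assumed signature: setDeterministic(Boolean)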
How was this patch tested?
Existing tests.