
Conversation

@drewrobb
Contributor

What changes were proposed in this pull request?

This change fixes a particular scenario where Spark SQL's default encoders can't encode (Thrift) types generated by Twitter Scrooge. Each such type is a trait that extends `scala.ProductX`, with a constructor defined only in a companion object, rather than an actual case class. The actual case class used is a child class, but that type is almost never referred to in code. The trait has no corresponding constructor symbol, which causes an exception. For all other purposes these classes act just like case classes, so it is unfortunate that Spark SQL can't serialize them as nicely as it does actual case classes. For a full example of a Scrooge codegen class, see https://gist.github.com/anonymous/ba13d4b612396ca72725eaa989900314.
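To make that shape concrete, here is a trimmed-down, hypothetical sketch of what such a Scrooge-generated type looks like (names and members are illustrative, not the real codegen output; see the gist above for the real thing):

```scala
// Illustrative sketch of the Scrooge pattern: the public type is a trait
// extending Product1, the only way to construct it is the companion's apply,
// and the concrete implementation class is effectively hidden.
trait ScroogeLike extends Product1[Int] with Serializable {
  def x: Int
  def _1: Int = x
  def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLike]
  override def equals(other: Any): Boolean =
    canEqual(other) && other.asInstanceOf[ScroogeLike].x == x
  override def hashCode: Int = x
}

object ScroogeLike {
  // The "constructor" lives here, not on the trait itself.
  def apply(x: Int): ScroogeLike = new Immutable(x)
  private class Immutable(val x: Int) extends ScroogeLike
}
```

Reflecting on `classOf[ScroogeLike]` finds no constructor symbol, which is what used to throw; the fix teaches Spark to look for the companion `apply` instead.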

This change catches the case where the type has no constructor but does have an `apply` method on the type's companion object. This allows thrift types to be serialized/deserialized with implicit encoders the same way as normal case classes. The fix had to be made in three places where the constructor is assumed to be an actual constructor:

  1. In serializing, determining the schema for the DataFrame relies on inspecting the type's constructor (`ScalaReflection.constructParams`). Here we fall back to using the companion constructor's arguments.
  2. In deserializing or evaluating, in the Java codegen (`NewInstance.doGenCode`), the type couldn't be constructed with the `new` keyword. If there is no constructor, we change the constructor call to try the companion constructor.
  3. In deserializing or evaluating without codegen, the constructor is directly invoked (`NewInstance.constructor`). This was fixed with Scala reflection to get the actual companion `apply` method.

The return type of `findConstructor` was changed because the companion `apply` method can't be represented as a `java.lang.reflect.Constructor`.
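The new shape can be sketched in isolation like this (a simplified stand-in, not Spark's actual `findConstructor`; the name `findBuilder` and the `MODULE$` lookup are illustrative assumptions, and the real fix resolves the companion `apply` via Scala reflection):

```scala
import scala.util.Try

// Sketch of the new return type: Option[Seq[AnyRef] => T] instead of
// Option[java.lang.reflect.Constructor[T]], so a real constructor and a
// companion-object apply can be wrapped in the same function type.
def findBuilder[T](cls: Class[T], paramTypes: Seq[Class[_]]): Option[Seq[AnyRef] => T] = {
  // First try a real public constructor with matching parameter types...
  cls.getConstructors.find(_.getParameterTypes.toSeq == paramTypes) match {
    case Some(c) =>
      Some(args => c.newInstance(args: _*).asInstanceOf[T])
    case None =>
      // ...then fall back to an apply method on the companion object, if any.
      for {
        companionCls <- Try(Class.forName(cls.getName + "$", true, cls.getClassLoader)).toOption
        companion = companionCls.getField("MODULE$").get(null)
        m <- companionCls.getMethods.find(m =>
          m.getName == "apply" && m.getParameterTypes.toSeq == paramTypes)
      } yield (args: Seq[AnyRef]) => m.invoke(companion, args: _*).asInstanceOf[T]
  }
}

// BigInt stands in for a Scrooge type here: its public constructor takes a
// java.math.BigInteger, so for an Int argument only the companion's
// apply(Int) matches and the fallback path is exercised.
val build = findBuilder(classOf[BigInt], Seq(classOf[Int])).get
```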

There might be situations in which this approach would also fail in a new way, but it does at a minimum work for the specific Scrooge example and will not impact cases that were already succeeding prior to this change.

Note: this fix does not enable using Scrooge thrift enums; additional work is necessary for that. With this patch, it seems like you could patch `com.twitter.scrooge.ThriftEnum` to extend `_root_.scala.Product1[Int]` with `def _1 = value` to get Spark's implicit encoders to handle enums, but I've yet to try this method myself.

Note: I previously opened a PR for this issue, but only was able to fix case 1) there: #18766

How was this patch tested?

I've fixed all 3 cases and added two tests that use a class similar to a Scrooge-generated one. The test in ScalaReflectionSuite checks 1), and the additional assertions in ObjectExpressionsSuite check 2) and 3).

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98940 has finished for PR 23062 at commit e27f933.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


Probably OK. My concern is whether this still works in both Scala 2.11 and 2.12, and there are some style issues.

@drewrobb
Contributor Author

Thanks much for the prompt feedback. PTAL at the changes. I had to modify the ScroogeLikeExample class to make the additional DatasetSuite test work; my prior example had removed methods from the Scrooge code that were actually necessary. I've kept the changes as separate commits for review, but can squash later?

Re: Scala 2.12, are there any specific things you're worried about that I could look at? I'm not super familiar with the changes in 2.12, and I think 2.11 should be fine, as this was developed using 2.11. And I assume the QA build is testing on 2.12 as well?

Also, this might not be the place to ask, but it is rather ungreppable: have you encountered compilation-generated Java files in ./org/ (that directory is also not in .gitignore!) breaking compilation in sbt? In my development environment I frequently have to delete that entire directory for compilation to work again.

Member

@srowen srowen left a comment


I don't have specific concerns about Scala 2.11 or 2.12; I just know that this is the kind of thing that can break across versions. The PR builder will use 2.12, and we do have other tests for 2.11.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99033 has finished for PR 23062 at commit 0074693.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
override def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLikeExample]

private def _equals(x: ScroogeLikeExample, y: ScroogeLikeExample): Boolean =
  x.productArity == y.productArity &&
```
Member


Sure, but you've already established it's a ScroogeLikeExample here. Why must it be Product1 just to check whether it's also Product1? It seems like it's not adding anything. In fact, why not just compare the one element that this trait knows about? To the extent it can implement equals() meaningfully, that's all it is doing already.

Contributor Author


My previous answer was not complete. Product1 is also necessary so that the implicit `Encoders.product[T <: Product : TypeTag]` will work with this class; if it is omitted, the DatasetSuite test will not compile:

```
[error] /home/drew/spark/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:1577: value toDS is not a member of Seq[org.apache.spark.sql.catalyst.ScroogeLikeExample]
[error]     val ds = data.toDS
```

I could add a new encoder, but I think that would be worse, as the goal of this PR is for Scrooge classes to work with the provided implicit encoders.
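A minimal self-contained illustration of the constraint being discussed (the `productBound` method and `WithProduct` trait are hypothetical stand-ins for `Encoders.product` and the test class; the real implicit also requires a `TypeTag`):

```scala
// Stand-in for the Product bound on Spark's implicit product encoder:
// only types that are subtypes of Product satisfy it.
def productBound[T <: Product](data: Seq[T]): Seq[Any] =
  data.map(_.productElement(0))

// A Scrooge-like trait satisfies the bound only because it extends Product1.
trait WithProduct extends Product1[Int] {
  def x: Int
  def _1: Int = x
  def canEqual(other: Any): Boolean = other.isInstanceOf[WithProduct]
}
object WithProduct {
  def apply(v: Int): WithProduct = new WithProduct { val x = v }
}
```

If `WithProduct` did not extend `Product1[Int]`, the call `productBound(Seq(WithProduct(1)))` would not compile, which mirrors the `toDS` error above.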

Member


Just use SparkSession.createDataset?

Member


Hm, actually that probably won't work any better. OK, it's because there is an Encoder for Product. You can still simplify the equals() and so on, I think, but that looks easier than a new Encoder. Or is it sufficient to test a Seq of a concrete subtype of ScroogeLikeExample?

Contributor Author


I'm worried about changing the tests to use a concrete subtype, because the reflection calls might behave differently in that case, either now or in the future. I simplified a little more. canEqual is necessary to implement Product. equals is necessary or the tests will not pass (the default would check reference equality), and hashCode is needed for scalastyle to pass since equals is defined.

Member


That's OK; leave Product in, if it's actually testing the case you have in mind. Yes, I know equals() is needed. The new implementation looks good.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99039 has finished for PR 23062 at commit 83a1987.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99029 has finished for PR 23062 at commit 39979e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99031 has finished for PR 23062 at commit 4cba3cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99043 has finished for PR 23062 at commit 8c25fd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99044 has finished for PR 23062 at commit 04a34c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
def findConstructor[T](cls: Class[T], paramTypes: Seq[Class[_]]): Option[Seq[AnyRef] => T] = {
  Option(ConstructorUtils.getMatchingAccessibleConstructor(cls, paramTypes: _*)) match {
    case Some(c) => Some(x => c.newInstance(x: _*).asInstanceOf[T])
```
Member


I think this cast isn't needed because `Constructor[T].newInstance` returns `T` already, but it's fine to leave.

@srowen
Member

srowen commented Nov 21, 2018

Merged to master

@asfgit asfgit closed this in 6bbdf34 Nov 21, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23062 from drewrobb/SPARK-8288.

Authored-by: Drew Robb <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Sep 22, 2022
### What changes were proposed in this pull request?
Fixes encoding of classes that use companion object constructors in the interpreted path. Without this change, the spec that is added in this change would fail with:
```
...
  Cause: java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: Couldn't find a valid constructor on interface org.apache.spark.sql.catalyst.ScroogeLikeExample
newInstance(interface org.apache.spark.sql.catalyst.ScroogeLikeExample)
  at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1199)
...
```

As far as I can tell, this bug has existed since the initial implementation in SPARK-8288 (#23062).

The existing spec that tested this part of the code incorrectly provided an outerPointer, which hid the bug from that test.

### Why are the changes needed?
Fixes a bug; the new spec in ExpressionEncoderSuite shows that this is in fact a bug.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug.

### How was this patch tested?
New and existing specs in ExpressionEncoderSuite and ObjectExpressionsSuite.

Closes #37837 from eejbyfeldt/spark-40385.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Sep 22, 2022
(cherry picked from commit 73e3c36)
Signed-off-by: Hyukjin Kwon <[email protected]>