-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-3549] Removing dependency on "spark-avro" #4955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@hudi-bot run azure |
|
LGTM. lets test this patch w/ diff spark runtime versions (minor versions) to ensure we are good wrt diff runtime versions against hudi-spark3 bundles. And I assume, with this patch, we should also be good to rename our spark3 bundles from hudi-spark3.2.1-bundle to hudi-spark3.2-bundles as we discussed. |
|
@hudi-bot run azure |
|
@nsivabalan correct |
ec42271 to
088c5bb
Compare
| * @param context instance of {@link HoodieEngineContext} | ||
| * @param instantTime instant of the carried operation triggering the update | ||
| */ | ||
| public abstract void updateMetadataIndexes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may I know how is this change is related to this patch ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So some of the tests that we have in "spark-client" module actually require SparkAdapter to be loaded, which lives in "hudi-spark" module, entailing that it couldn't be loaded there.
So i had to either move the tests to "hudi-spark" or remove this method (which uses AvroConversionUtil, in turn referencing SparkAdapter) which i'm removing regardless in another PR.
| import org.apache.spark.HoodieSparkTypeUtils.isCastPreservingOrdering | ||
| import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, BitwiseOr, Cast, DateAdd, DateDiff, DateFormatClass, DateSub, Divide, Exp, Expm1, Expression, FromUTCTimestamp, FromUnixTime, Log, Log10, Log1p, Log2, Lower, Multiply, ParseToDate, ParseToTimestamp, ShiftLeft, ShiftRight, ToUTCTimestamp, ToUnixTimestamp, Upper} | ||
|
|
||
| object HoodieSpark3_1CatalystExpressionUtils extends HoodieCatalystExpressionUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any diff between this file and HoodieSpark3_2CatalystExpressionUtils ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: This PR is stacked on 4996, so this change is actually from there
@xushiyan as well. let's avoid renaming bundles . It does cause some busy work and thrashing for users, when they just want to pick up a new version. e.g if they had a Is the change to bundle names in this PR or a separate one? If so, can we just retain spark2, spark3, spark3.1 as bundle names? whats the plan |
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a first pass. Can you comment on any custom changes we have made in the different AvroSerializer classes?
| sparkProfile: "spark3" | ||
| sparkVersion: "3.2.0" | ||
|
|
||
| - scalaProfile: "scala-2.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be okay with gh action minutes? @xushiyan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be okay with gh action minutes? @xushiyan
Ci is performed for about 5-6 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar gh actions only run mvn install atm; in #5082 we adding some basic testcases covering quickstart for different spark versions. CI limit-wise we're good.
README.md
Outdated
| The default hudi-jar bundles spark-avro module. To build without spark-avro module, build using `spark-shade-unbundle-avro` profile | ||
| Previously, Hudi bundles were packaging (and shading) "spark-avro" module internally. However, due to multiple occasion | ||
| of it being broken b/w patch versions (most recent was, b/w 3.2.0 and 3.2.1) of Spark after substantial deliberation | ||
| we took a decision to let go such dependency and instead simply clone the structures we're relying on to better control |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can shorten this a bit and just make README have the actual steps to do here?
### What about "spark-avro" module?
Hudi versions 0.11 or later, no longer required `spark-avro` to be specified using `--packages`
| def createAvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType): HoodieAvroDeserializer | ||
|
|
||
| /** | ||
| * TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs?
cf2994c to
5ac30f7
Compare
| * | ||
| * PLEASE REFRAIN MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY | ||
| * | ||
| * NOTE: This is a version of [[AvroDeserializer]] impl from Spark 2.4.4 w/ the fix for SPARK-30267 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar this is the diff against Spark
|
|
||
| object AvroSerializer { | ||
|
|
||
| // NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroSerializer]] implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar this is the diff against Spark
|
|
||
| object AvroDeserializer { | ||
|
|
||
| // NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroDeserializer]] implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar this is the diff against Spark
93afdff to
d0ac9f3
Compare
.github/workflows/bot.yml
Outdated
|
|
||
| # Spark 3.2.x | ||
| - scalaProfile: "scala-2.12" | ||
| sparkProfile: "spark3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark3 -> spark3.2.0
| sparkProfile: "spark3" | ||
| sparkVersion: "3.2.0" | ||
|
|
||
| - scalaProfile: "scala-2.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be okay with gh action minutes? @xushiyan
Ci is performed for about 5-6 minutes.
d0ac9f3 to
46d6a25
Compare
565ef8a to
02c1f87
Compare
|
@hudi-bot run azure |
02c1f87 to
4a2280b
Compare
|
@hudi-bot run azure |
1 similar comment
|
@hudi-bot run azure |
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Good job on the patch!
Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.1, 3.2.0, etc) To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR). Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches. Following Hudi modules to Spark minor branches is currently maintained: "hudi-spark3" -> 3.2.x "hudi-spark3.1.x" -> 3.1.x "hudi-spark2" -> 2.4.x Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches: AvroSerializer AvroDeserializer AvroUtils Each of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules. SchemaConverters class in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1). All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them. Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI. Brief change log: - Removing spark-avro bundling from Hudi by default - Scaffolded Spark 3.2.x hierarchy - Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy - Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy - Moved ExpressionCodeGen,ExpressionPayload into hudi-spark module - Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0 - Modified bot.yml to build full matrix of support Spark versions - Removed "spark-avro" dependency from all modules - Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
Tips
What is the purpose of the pull request
After some back-and-forth in our discussions regarding "spark-avro", we've finally settled on the following approach:
Hudi will be taking on promise for it bundles to stay compatible with Spark minor versions (for ex 2.4, 3.1, 3.2): meaning that single build of Hudi (for ex "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in
that case 3.2.1, 3.2.0, etc)
To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root-cause of incompatibility b/w consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).
Instead of bundling "spark-avro" as dependency, we will be copying over some of the classes Hudi depends on and maintain them along the Hudi code-base to make sure we're able to provide for the aforementioned guarantee. To workaround arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles w/in the Spark minor version branches.
Following Hudi modules to Spark minor branches is currently maintained:
Following classes hierarchies (borrowed from "spark-avro") are maintained w/in these Spark-specific modules to guarantee compatibility with respective minor version branches:
AvroSerializerAvroDeserializerAvroUtilsEach of these classes has been correspondingly copied from Spark 3.2.1 (for 3.2.x branch), 3.1.2 (for 3.1.x branch), 2.4.4 (for 2.4.x branch) into their respective modules.
SchemaConvertersclass in turn is shared across all those modules given its relative stability (there're only cosmetical changes from 2.4.4 to 3.2.1).All of the aforementioned classes have their corresponding scope of visibility limited to corresponding packages (
org.apache.spark.sql.avro,org.apache.spark.sql) to make sure broader code-base does not become dependent on them and instead relies on facades abstracting them.Additionally, given that Hudi plans on supporting all the patch versions of Spark w/in aforementioned minor versions branches of Spark, additional build steps were added to validate that Hudi could be properly compiled against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.