
Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Mar 5, 2022


What is the purpose of the pull request

After some back-and-forth in our discussions regarding "spark-avro", we've finally settled on the following approach:

Hudi will take on the promise that its bundles stay compatible with Spark minor versions (for example 2.4, 3.1, 3.2): meaning that a single build of Hudi (for example "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.0, 3.2.1, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility between consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will copy over some of the classes Hudi depends on and maintain them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will apply local patches, guaranteeing compatibility of Hudi bundles within the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

  • "hudi-spark3" -> 3.2.x
  • "hudi-spark3.1.x" -> 3.1.x
  • "hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches:

  • AvroSerializer
  • AvroDeserializer
  • AvroUtils

Each of these classes has been copied from Spark 3.2.1 (for the 3.2.x branch), 3.1.2 (for the 3.1.x branch) and 2.4.4 (for the 2.4.x branch) into the respective module.

  • The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).

All of the aforementioned classes have their visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them.
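
To illustrate that layering, here is a minimal sketch in Scala; apart from HoodieAvroDeserializer and the createAvroDeserializer signature (which appear in the review diff below), the trait and method names are hypothetical stand-ins rather than the exact Hudi interfaces:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types.DataType

// Facades the broader code-base programs against; the serialize/deserialize
// signatures here are assumptions for illustration only.
trait HoodieAvroSerializer {
  def serialize(catalystData: Any): Any        // Catalyst value -> Avro value
}

trait HoodieAvroDeserializer {
  def deserialize(avroData: Any): Option[Any]  // Avro value -> Catalyst value
}

// Hypothetical factory implemented once per Spark-specific module (hudi-spark2,
// hudi-spark3.1.x, hudi-spark3), each delegating to the AvroSerializer /
// AvroDeserializer copies taken from the matching Spark minor branch.
trait HoodieAvroSerDeFactory {
  def createAvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable: Boolean): HoodieAvroSerializer
  def createAvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType): HoodieAvroDeserializer
}
```

With this arrangement only the Spark-specific modules ever touch the copied spark-avro classes, and a patch-version incompatibility can be absorbed by patching a single module without changing any caller.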

Additionally, given that Hudi plans on supporting all patch versions of Spark within the aforementioned minor version branches, additional build steps were added to validate that Hudi compiles properly against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log

  • Removing spark-avro bundling from Hudi by default
  • Scaffolded Spark 3.2.x hierarchy
  • Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
  • Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
  • Moved ExpressionCodeGen, ExpressionPayload into hudi-spark module
  • Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
  • Modified bot.yml to build the full matrix of supported Spark versions
  • Removed "spark-avro" dependency from all modules

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@nsivabalan
Contributor

LGTM. Let's test this patch with different Spark runtime versions (minor versions) to ensure we are good with respect to different runtime versions against the hudi-spark3 bundles.

And I assume that, with this patch, we should also be good to rename our spark3 bundles from hudi-spark3.2.1-bundle to hudi-spark3.2-bundle, as we discussed.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Mar 7, 2022
@alexeykudinkin
Contributor Author

@nsivabalan correct

@nsivabalan nsivabalan changed the title [HUDI-3549] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][WIP][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default Mar 8, 2022
@nsivabalan nsivabalan added the status:in-progress Work in progress label Mar 8, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][WIP][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default Mar 17, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996] Removing "spark-avro" bundling from Hudi by default Mar 18, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996] Removing dependency on "spark-avro" Mar 18, 2022
* @param context instance of {@link HoodieEngineContext}
* @param instantTime instant of the carried operation triggering the update
*/
public abstract void updateMetadataIndexes(
Contributor

May I know how this change is related to this patch?

Contributor Author

So some of the tests we have in the "spark-client" module actually require SparkAdapter to be loaded, but SparkAdapter lives in the "hudi-spark" module, meaning it can't be loaded there.

So I had to either move the tests to "hudi-spark" or remove this method (which uses AvroConversionUtil, in turn referencing SparkAdapter), which I'm removing regardless in another PR.

import org.apache.spark.HoodieSparkTypeUtils.isCastPreservingOrdering
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, BitwiseOr, Cast, DateAdd, DateDiff, DateFormatClass, DateSub, Divide, Exp, Expm1, Expression, FromUTCTimestamp, FromUnixTime, Log, Log10, Log1p, Log2, Lower, Multiply, ParseToDate, ParseToTimestamp, ShiftLeft, ShiftRight, ToUTCTimestamp, ToUnixTimestamp, Upper}

object HoodieSpark3_1CatalystExpressionUtils extends HoodieCatalystExpressionUtils {
Contributor

Is there any diff between this file and HoodieSpark3_2CatalystExpressionUtils?

Contributor Author

FYI: This PR is stacked on 4996, so this change is actually from there

@vinothchandar
Member

we should also be good to rename our spark3 bundles from hudi-spark3.2.1-bundle to hudi-spark3.2-bundles as we discussed.

@xushiyan as well. Let's avoid renaming bundles. It does cause some busy work and thrashing for users when they just want to pick up a new version, e.g. if they had a HUDI_VERSION in their build/deploy scripts, they would now need to adjust them all per the new naming.

Is the change to bundle names in this PR or a separate one? If so, can we just retain spark2, spark3, spark3.1 as bundle names? What's the plan?

Member

@vinothchandar vinothchandar left a comment

Made a first pass. Can you comment on any custom changes we have made in the different AvroSerializer classes?

sparkProfile: "spark3"
sparkVersion: "3.2.0"

- scalaProfile: "scala-2.12"
Member

Would this be okay with gh action minutes? @xushiyan

Contributor

Would this be okay with gh action minutes? @xushiyan

CI runs for about 5-6 minutes.

Member

@vinothchandar gh actions only run mvn install atm; in #5082 we're adding some basic test cases covering the quickstart for different Spark versions. CI limit-wise we're good.

README.md Outdated
The default hudi-jar bundles spark-avro module. To build without spark-avro module, build using `spark-shade-unbundle-avro` profile
Previously, Hudi bundles were packaging (and shading) "spark-avro" module internally. However, due to multiple occasion
of it being broken b/w patch versions (most recent was, b/w 3.2.0 and 3.2.1) of Spark after substantial deliberation
we took a decision to let go such dependency and instead simply clone the structures we're relying on to better control
Member

we can shorten this a bit and just make README have the actual steps to do here?

### What about "spark-avro" module?

Hudi versions 0.11 and later no longer require `spark-avro` to be specified using `--packages`.
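
For example, a quickstart snippet in that README section might look like the following sketch, run from a spark-shell launched with only the Hudi bundle on the classpath (i.e. without passing spark-avro via `--packages`); the table name and path are placeholders, and the option keys are the standard Hudi write configs:

```scala
// Assumes a spark-shell session (`spark` is the implicit SparkSession) with the
// Hudi bundle on the classpath and no external spark-avro package added.
import org.apache.spark.sql.SaveMode

val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as string) as name", "current_timestamp() as ts")

df.write.format("hudi")
  .option("hoodie.table.name", "quickstart_tbl")              // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi/quickstart_tbl")                           // placeholder path
```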

def createAvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType): HoodieAvroDeserializer

/**
* TODO
Member

docs?

*
* PLEASE REFRAIN FROM MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY
*
* NOTE: This is a version of [[AvroDeserializer]] impl from Spark 2.4.4 w/ the fix for SPARK-30267
Contributor Author

@vinothchandar this is the diff against Spark


object AvroSerializer {

// NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroSerializer]] implementation
Contributor Author

@vinothchandar this is the diff against Spark


object AvroDeserializer {

// NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroDeserializer]] implementation
Contributor Author

@vinothchandar this is the diff against Spark
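
For context, one common way to absorb such a patch-level rename without shipping separate bundles is a small reflective lookup that resolves whichever method name is present at runtime. The sketch below is purely illustrative: the shim, helper and method names are hypothetical, and this is not the actual Hudi fix or the actual Spark methods that were renamed.

```scala
import java.lang.reflect.Method

// Illustrative shim: resolve whichever variant of a renamed method exists at runtime,
// so a single bundle can invoke it against either Spark patch version.
object RenamedMethodShim {

  /** Returns the first of `candidateNames` that `clazz` declares with `paramTypes`. */
  def resolve(clazz: Class[_], candidateNames: Seq[String], paramTypes: Class[_]*): Method =
    candidateNames
      .flatMap { name =>
        try Some(clazz.getMethod(name, paramTypes: _*))
        catch { case _: NoSuchMethodException => None }
      }
      .headOption
      .getOrElse(throw new IllegalStateException(
        s"None of ${candidateNames.mkString("/")} found on ${clazz.getName}"))
}

// Hypothetical usage against a helper whose method was renamed between patch releases:
//   val m = RenamedMethodShim.resolve(helper.getClass, Seq("newName", "oldName"), classOf[String])
//   val result = m.invoke(helper, "LEGACY")
```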

@xushiyan xushiyan added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Mar 22, 2022

# Spark 3.2.x
- scalaProfile: "scala-2.12"
sparkProfile: "spark3"
Contributor

spark3 -> spark3.2.0


@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996] Removing dependency on "spark-avro" [HUDI-3549] Removing dependency on "spark-avro" Mar 25, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/spk-avr-shd-fix branch 4 times, most recently from 565ef8a to 02c1f87 Compare March 25, 2022 22:41
@alexeykudinkin
Contributor Author

@hudi-bot run azure

@alexeykudinkin
Contributor Author

@hudi-bot run azure

1 similar comment
@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@nsivabalan nsivabalan removed the status:in-progress Work in progress label Mar 29, 2022
Contributor

@nsivabalan nsivabalan left a comment

LGTM. Good job on the patch!

@nsivabalan nsivabalan merged commit e5a2bae into apache:master Mar 29, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
The commit message mirrors the PR description above, with one additional change-log item: fixed relocation of spark-avro classes in bundles to assist in running integ-tests.

Labels

priority:blocker Production down; release blocker
