
Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Mar 5, 2022


What is the purpose of the pull request

After some back-and-forth in our discussions regarding "spark-avro", we've finally settled on the following approach:

Hudi will take on the promise that its bundles stay compatible with Spark minor versions (for example 2.4, 3.1, 3.2): meaning that a single build of Hudi (for example "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in that case 3.2.0, 3.2.1, etc.).

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility between consecutive Spark patch versions (most recently 3.2.1 and 3.2.0, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will copy over some of the classes Hudi depends on and maintain them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will apply local patches, guaranteeing compatibility of Hudi bundles within the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

  • "hudi-spark3" -> 3.2.x
  • "hudi-spark3.1.x" -> 3.1.x
  • "hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches:

  • AvroSerializer
  • AvroDeserializer
  • AvroUtils

Each of these classes has been copied from Spark 3.2.1 (for the 3.2.x branch), 3.1.2 (for the 3.1.x branch) and 2.4.4 (for the 2.4.x branch) into the respective module.

  • The SchemaConverters class, in turn, is shared across all those modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).

All of the aforementioned classes have their visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them.
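
To illustrate that layering, here is a minimal sketch in Scala; apart from HoodieAvroDeserializer and the createAvroDeserializer signature (which appear in the review diff below), the trait and method names are hypothetical stand-ins rather than the exact Hudi interfaces:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.types.DataType

// Facades the broader code-base programs against; the serialize/deserialize
// signatures here are assumptions for illustration only.
trait HoodieAvroSerializer {
  def serialize(catalystData: Any): Any        // Catalyst value -> Avro value
}

trait HoodieAvroDeserializer {
  def deserialize(avroData: Any): Option[Any]  // Avro value -> Catalyst value
}

// Hypothetical factory implemented once per Spark-specific module (hudi-spark2,
// hudi-spark3.1.x, hudi-spark3), each delegating to the AvroSerializer /
// AvroDeserializer copies taken from the matching Spark minor branch.
trait HoodieAvroSerDeFactory {
  def createAvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable: Boolean): HoodieAvroSerializer
  def createAvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType): HoodieAvroDeserializer
}
```

With this arrangement only the Spark-specific modules ever touch the copied spark-avro classes, and a patch-version incompatibility can be absorbed by patching a single module without changing any caller.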

Additionally, given that Hudi plans on supporting all patch versions of Spark within the aforementioned minor version branches, additional build steps were added to validate that Hudi compiles properly against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.

Brief change log

  • Removing spark-avro bundling from Hudi by default
  • Scaffolded Spark 3.2.x hierarchy
  • Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
  • Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
  • Moved ExpressionCodeGen, ExpressionPayload into hudi-spark module
  • Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
  • Modified bot.yml to build the full matrix of supported Spark versions
  • Removed "spark-avro" dependency from all modules

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@nsivabalan
Contributor

LGTM. Let's test this patch with different Spark runtime versions (minor versions) to ensure we are good with respect to different runtime versions against the hudi-spark3 bundles.

And I assume that, with this patch, we should also be good to rename our spark3 bundles from hudi-spark3.2.1-bundle to hudi-spark3.2-bundle, as we discussed.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Mar 7, 2022
@alexeykudinkin
Contributor Author

@nsivabalan correct

@nsivabalan nsivabalan changed the title [HUDI-3549] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][WIP][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default Mar 8, 2022
@nsivabalan nsivabalan added the status:in-progress Work in progress label Mar 8, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][WIP][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default Mar 17, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996][DO_NOT_MERGE] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996] Removing "spark-avro" bundling from Hudi by default Mar 18, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996] Removing "spark-avro" bundling from Hudi by default [HUDI-3549][Stacked on 4996] Removing dependency on "spark-avro" Mar 18, 2022
* @param context instance of {@link HoodieEngineContext}
* @param instantTime instant of the carried operation triggering the update
*/
public abstract void updateMetadataIndexes(
Contributor

May I know how this change is related to this patch?

Contributor Author

So some of the tests we have in the "spark-client" module actually require SparkAdapter to be loaded, but SparkAdapter lives in the "hudi-spark" module, meaning it can't be loaded there.

So I had to either move the tests to "hudi-spark" or remove this method (which uses AvroConversionUtil, in turn referencing SparkAdapter), which I'm removing regardless in another PR.

import org.apache.spark.HoodieSparkTypeUtils.isCastPreservingOrdering
import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, BitwiseOr, Cast, DateAdd, DateDiff, DateFormatClass, DateSub, Divide, Exp, Expm1, Expression, FromUTCTimestamp, FromUnixTime, Log, Log10, Log1p, Log2, Lower, Multiply, ParseToDate, ParseToTimestamp, ShiftLeft, ShiftRight, ToUTCTimestamp, ToUnixTimestamp, Upper}

object HoodieSpark3_1CatalystExpressionUtils extends HoodieCatalystExpressionUtils {
Contributor

Is there any diff between this file and HoodieSpark3_2CatalystExpressionUtils?

Contributor Author

FYI: This PR is stacked on 4996, so this change is actually from there

@vinothchandar
Member

we should also be good to rename our spark3 bundles from hudi-spark3.2.1-bundle to hudi-spark3.2-bundles as we discussed.

@xushiyan as well. Let's avoid renaming bundles. It does cause some busy work and thrashing for users when they just want to pick up a new version, e.g. if they had a HUDI_VERSION in their build/deploy scripts, they would now need to adjust them all per the new naming.

Is the change to bundle names in this PR or a separate one? If so, can we just retain spark2, spark3, spark3.1 as bundle names? What's the plan?

Member

@vinothchandar vinothchandar left a comment

Made a first pass. Can you comment on any custom changes we have made in the different AvroSerializer classes?

sparkProfile: "spark3"
sparkVersion: "3.2.0"

- scalaProfile: "scala-2.12"
Member

Would this be okay with gh action minutes? @xushiyan

Contributor

Would this be okay with gh action minutes? @xushiyan

CI runs for about 5-6 minutes.

Member

@vinothchandar gh actions only run mvn install atm; in #5082 we're adding some basic test cases covering the quickstart for different Spark versions. CI limit-wise we're good.

README.md Outdated
The default hudi-jar bundles spark-avro module. To build without spark-avro module, build using `spark-shade-unbundle-avro` profile
Previously, Hudi bundles were packaging (and shading) "spark-avro" module internally. However, due to multiple occasion
of it being broken b/w patch versions (most recent was, b/w 3.2.0 and 3.2.1) of Spark after substantial deliberation
we took a decision to let go such dependency and instead simply clone the structures we're relying on to better control
Member

we can shorten this a bit and just make README have the actual steps to do here?

### What about "spark-avro" module?

Hudi versions 0.11 and later no longer require `spark-avro` to be specified using `--packages`.
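
For example, a quickstart snippet in that README section might look like the following sketch, run from a spark-shell launched with only the Hudi bundle on the classpath (i.e. without passing spark-avro via `--packages`); the table name and path are placeholders, and the option keys are the standard Hudi write configs:

```scala
// Assumes a spark-shell session (`spark` is the implicit SparkSession) with the
// Hudi bundle on the classpath and no external spark-avro package added.
import org.apache.spark.sql.SaveMode

val df = spark.range(0, 10)
  .selectExpr("id", "cast(id as string) as name", "current_timestamp() as ts")

df.write.format("hudi")
  .option("hoodie.table.name", "quickstart_tbl")              // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi/quickstart_tbl")                           // placeholder path
```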

def createAvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType): HoodieAvroDeserializer

/**
* TODO
Member

docs?

*
* PLEASE REFRAIN FROM MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY
*
* NOTE: This is a version of [[AvroDeserializer]] impl from Spark 2.4.4 w/ the fix for SPARK-30267
Contributor Author

@vinothchandar this is the diff against Spark


object AvroSerializer {

// NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroSerializer]] implementation
Contributor Author

@vinothchandar this is the diff against Spark


object AvroDeserializer {

// NOTE: Following methods have been renamed in Spark 3.2.1 [1] making [[AvroDeserializer]] implementation
Contributor Author

@vinothchandar this is the diff against Spark
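
For context, one common way to absorb such a patch-level rename without shipping separate bundles is a small reflective lookup that resolves whichever method name is present at runtime. The sketch below is purely illustrative: the shim, helper and method names are hypothetical, and this is not the actual Hudi fix or the actual Spark methods that were renamed.

```scala
import java.lang.reflect.Method

// Illustrative shim: resolve whichever variant of a renamed method exists at runtime,
// so a single bundle can invoke it against either Spark patch version.
object RenamedMethodShim {

  /** Returns the first of `candidateNames` that `clazz` declares with `paramTypes`. */
  def resolve(clazz: Class[_], candidateNames: Seq[String], paramTypes: Class[_]*): Method =
    candidateNames
      .flatMap { name =>
        try Some(clazz.getMethod(name, paramTypes: _*))
        catch { case _: NoSuchMethodException => None }
      }
      .headOption
      .getOrElse(throw new IllegalStateException(
        s"None of ${candidateNames.mkString("/")} found on ${clazz.getName}"))
}

// Hypothetical usage against a helper whose method was renamed between patch releases:
//   val m = RenamedMethodShim.resolve(helper.getClass, Seq("newName", "oldName"), classOf[String])
//   val result = m.invoke(helper, "LEGACY")
```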

@xushiyan xushiyan added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Mar 22, 2022

# Spark 3.2.x
- scalaProfile: "scala-2.12"
sparkProfile: "spark3"
Contributor

spark3 -> spark3.2.0


@alexeykudinkin alexeykudinkin changed the title [HUDI-3549][Stacked on 4996] Removing dependency on "spark-avro" [HUDI-3549] Removing dependency on "spark-avro" Mar 25, 2022
@alexeykudinkin alexeykudinkin force-pushed the ak/spk-avr-shd-fix branch 4 times, most recently from 565ef8a to 02c1f87 Compare March 25, 2022 22:41
@alexeykudinkin
Contributor Author

@hudi-bot run azure

@alexeykudinkin
Contributor Author

@hudi-bot run azure

1 similar comment
@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@nsivabalan nsivabalan removed the status:in-progress Work in progress label Mar 29, 2022
Contributor

@nsivabalan nsivabalan left a comment

LGTM. Good job on the patch!

@nsivabalan nsivabalan merged commit e5a2bae into apache:master Mar 29, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
The commit message mirrors the PR description above, with one additional change-log item: fixed relocation of spark-avro classes in bundles to assist in running integ-tests.

Labels

priority:blocker Production down; release blocker
