Conversation

@CTTY (Contributor) commented Jun 22, 2022

What is the purpose of the pull request

Support Spark 3.3.0

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CTTY force-pushed the ctty/hudi-spark33 branch from a69b13f to 33e0f12 on June 27, 2022 19:32
@melin commented Jun 28, 2022

Upgrade the Parquet version to 1.12.3.
PARQUET-1968 can simplify the current In predicate filter pushdown.
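
For illustration, a rough sketch of what that simplification could look like (hypothetical column name, and assuming the FilterApi.in overload added by PARQUET-1968 in Parquet 1.12):

import java.util.{HashSet => JHashSet}

import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.intColumn

// Before PARQUET-1968: an IN filter had to be expanded into a chain of eq/or predicates.
val expanded = FilterApi.or(
  FilterApi.eq(intColumn("id"), Int.box(1)),
  FilterApi.eq(intColumn("id"), Int.box(2)))

// With PARQUET-1968: a single native In predicate.
val values = new JHashSet[Integer]()
values.add(1)
values.add(2)
val native = FilterApi.in(intColumn("id"), values)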

@CTTY force-pushed the ctty/hudi-spark33 branch 2 times, most recently from c6f80a4 to 95c9cde on July 1, 2022 21:35
@CTTY force-pushed the ctty/hudi-spark33 branch 4 times, most recently from b99eb59 to e523893 on July 11, 2022 17:40
@codope added the labels dependencies (Dependency updates), priority:blocker (Production down; release blocker), and engine:spark (Spark integration) on Jul 12, 2022
@codope (Member) commented Jul 12, 2022

@CTTY is this still WIP?

@yihua (Contributor) commented Jul 12, 2022

@alexeykudinkin @XuQianJin-Stars @YannByron This PR is going to add support for Spark 3.3. In the long term, how should we maintain the support matrix for Spark in Hudi? Do we deprecate the support for older Spark versions as we add new versions? cc @xushiyan @vinothchandar

"Partition column types may not be specified in Create Table As Select (CTAS)",
ctx)

// CreateTable / CreateTableAsSelect was migrated to v2 in Spark 3.3.0
@CTTY (Contributor Author) commented on the diff:

Also see SPARK-36902

@CTTY (Contributor Author) commented Jul 12, 2022

@CTTY is this still WIP?

@codope Yes, but can we enable Azure CI for this PR? It could expose more potential issues, and I can work on them.

)

new HoodieFileScanRDD(sparkSession, baseFileReader, fileSplits)
// TODO: which schema to use here?
@CTTY (Contributor Author) commented on the diff:

MessageType parquetSchema = new ParquetUtils().readSchema(context.getHadoopConf().get(), filePath);
Configuration hadoopConf = context.getHadoopConf().get();
MessageType parquetSchema = new ParquetUtils().readSchema(hadoopConf, filePath);

@CTTY (Contributor Author) commented:

Change made according to SPARK-36935 (ParquetSchemaConverter change).

case IdentityTransform(FieldReference(Seq(col))) =>
identityCols += col

case BucketTransform(numBuckets, Seq(FieldReference(Seq(col))), _) =>
@CTTY (Contributor Author) commented:

SPARK-37627 Separate SortedBucketTransform from BucketTransform
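
For context, a sketch of the matching under this change (identityCols follows the diff above; the bucketSpec assignment is an assumed illustration):

// After SPARK-37627, sorted buckets moved to a separate SortedBucketTransform,
// and the BucketTransform extractor gained a third (sorted-columns) field,
// ignored here with `_`.
transform match {
  case IdentityTransform(FieldReference(Seq(col))) =>
    identityCols += col
  case BucketTransform(numBuckets, Seq(FieldReference(Seq(col))), _) =>
    bucketSpec = Some(BucketSpec(numBuckets, col :: Nil, Nil))
  case _ => // other transforms handled elsewhere
}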

// It's critical for these rules to be applied in this order, so that the DataSource V2 to V1 fallback
// is performed prior to other rules being evaluated
rules ++= Seq(dataSourceV2ToV1Fallback, spark3Analysis, spark3ResolveReferences, spark32ResolveAlterTableCommands)
rules ++= Seq(dataSourceV2ToV1Fallback, spark3Analysis, spark3ResolveReferences, resolveAlterTableCommands)
@CTTY (Contributor Author) commented:

SPARK-38939 DropColumns syntax change


override def parseDataType(sqlText: String): DataType = delegate.parseDataType(sqlText)

// SPARK-37266 Added parseQuery to ParserInterface in Spark 3.3.0
@CTTY (Contributor Author) commented:

SPARK-37266 ParserInterface change

@CTTY changed the title from "[HUDI-4186] [WIP] Support Hudi with Spark 3.3.0" to "[HUDI-4186] Support Hudi with Spark 3.3.0" on Jul 13, 2022
@CTTY (Contributor Author) commented Jul 13, 2022

Removed the [WIP] tag to unblock Azure CI. This PR is still a work in progress.

@alexeykudinkin (Contributor) commented:
@yihua this is a good question. IMO we should avoid breaking unless we absolutely have to and make sure we maintain compatibility as long as it makes sense from the standpoint of investing resources to maintain it.

In this case, I'd say that we should not break any existing compatibility (Spark 2.4, 3.1, 3.2, 3.3) but instead, say, declare that 3.1 is in maintenance (EOL) mode and that new features are not guaranteed to work there in future releases. Thoughts?

@CTTY force-pushed the ctty/hudi-spark33 branch from 9e8f799 to 925cca4 on July 14, 2022 19:57
@CTTY (Contributor Author) commented Jul 14, 2022

Most Hive sync CI tests are failing. I saw another PR working on this: #6110

@CTTY force-pushed the ctty/hudi-spark33 branch 3 times, most recently from af12dbd to 4eff788 on July 17, 2022 20:45
@CTTY (Contributor Author) commented Jul 19, 2022

@codope @yihua This is ready for review.

@CTTY force-pushed the ctty/hudi-spark33 branch from fb4b473 to c1626fb on July 26, 2022 06:14
@yihua (Contributor) left a review comment:

@CTTY I see a lot of classes for Spark 3.3 support, e.g., Spark33DataSourceUtils, are just copied from existing Spark 3.2 support classes in Hudi. Are they safe? Should we update them based on corresponding Spark 3.3 classes?

<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
</dependency>

Member commented:

Good find. So can we now re-enable the Spark 3.2 quickstart test in GH Actions? Check out bot.yml.

@CTTY force-pushed the ctty/hudi-spark33 branch from ca8a1ba to 851ac3c on July 27, 2022 02:40
@YannByron (Contributor) commented:
@alexeykudinkin @XuQianJin-Stars @YannByron This PR is going to add support for Spark 3.3. In the long term, how should we maintain the support matrix for Spark in Hudi? Do we deprecate the support for older Spark versions as we add new versions? cc @xushiyan @vinothchandar

It's not easy to decide.
Take Delta Lake: its latest version supports only the latest Spark version, and it never back-ports new features to older Spark versions within the same release. Users who need that have to port the new features to the version they maintain themselves. After all, I think most production users will not follow the Spark release iteration in a timely manner.
Maybe we can support the two or three latest Spark versions to provide some convenience to our users.

if (deleteTable.condition.isDefined) {
df = df.filter(Column(deleteTable.condition.get))
}
// SPARK-38626 DeleteFromTable.condition is changed from Option[Expression] to Expression in Spark 3.3
Contributor commented:

nit: the comment can go into the Spark adapter implementation and is not necessary here.

Contributor commented:

This can be addressed in a separate PR.
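
For reference, a sketch of the type change this thread is about (SPARK-38626; illustrative only, with df and deleteTable as in the diff above):

// Spark 3.2 and earlier: DeleteFromTable.condition is Option[Expression]
if (deleteTable.condition.isDefined) {
  df = df.filter(Column(deleteTable.condition.get))
}

// Spark 3.3: condition is a plain Expression (assumed to default to a
// true literal when no WHERE clause is given), so no Option handling is needed
df = df.filter(Column(deleteTable.condition))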

@hudi-bot (Collaborator) commented:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin (Contributor) left a review comment:

Left a few minor comments; we can take those in a follow-up PR.

val resolvedCondition = condition.map(resolveExpressionFrom(table)(_))
// Return the resolved DeleteTable
DeleteFromTable(table, resolvedCondition)
val resolveExpression = resolveExpressionFrom(table, None)_
Contributor commented:

I'd suggest we keep the syntax as it was (with parentheses).

new Spark2HoodieFileScanRDD(sparkSession, readFunction, filePartitions)
}

override def resolveDeleteFromTable(deleteFromTable: Command,
Contributor commented:

Given that we have extractCondition, we can get rid of resolveDeleteFromTable.

@CTTY (Contributor Author) commented:

We can't simply reuse the method; resolveDeleteFromTable has different logic that involves resolveExpressionFrom.

Contributor commented:

I don't see why we can't:

  • We get rid of the method completely
  • We use extractCondition to extract the condition and then do everything else (resolution, etc.) in the caller

/* SPARK-37266 Added parseQuery to ParserInterface in Spark 3.3.0. This is a patch to prevent
hackers from tampering text with persistent view, it won't be called in older Spark
Don't mark this as override for backward compatibility
Can't use sparkExtendedParser directly here due to the same reason */
Contributor commented:

Sorry, but I can't understand the Javadoc: can you please elaborate on why this is here?

  • What exactly are we trying to prevent from happening?
  • What backward compatibility (BWC) are we referring to?

@CTTY (Contributor Author) commented:

parseQuery is a new method on the Spark trait ParserInterface. There would be a compile issue if we called this method from any class that's shared across different Spark versions, because the older ParserInterface doesn't have this method.

For the same reason, we can't mark this method with override: in older Spark there is no parseQuery.
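
A minimal sketch of the resulting pattern, with the shape assumed from the adapter method shown below:

// Not marked `override`: parseQuery only exists on ParserInterface in Spark 3.3+.
// The call is routed through the version-specific SparkAdapter instead of the
// delegate parser, which may predate the method.
def parseQuery(sqlText: String): LogicalPlan =
  sparkAdapter.getQueryParserFromExtendedSqlParser(session, delegate, sqlText)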

Contributor commented:

As discussed on Slack, instead of doing the parsing in SparkAdapter, let's create an ExtendedParserInterface where we can place this new parseQuery method; that can then be used in Hudi's code-base (this is similar to how HoodieCatalystExpressionUtils is set up).
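
A hypothetical sketch of that suggestion (trait name assumed):

// A Hudi-owned parser interface that always declares parseQuery, so shared code
// can depend on it regardless of which Spark version's ParserInterface is on
// the classpath; each Spark-version module supplies the implementation.
trait HoodieExtendedParserInterface extends ParserInterface {
  def parseQuery(sqlText: String): LogicalPlan
}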


override def getQueryParserFromExtendedSqlParser(session: SparkSession, delegate: ParserInterface,
sqlText: String): LogicalPlan = {
new HoodieSpark3_3ExtendedSqlParser(session, delegate).parseQuery(sqlText)
Contributor commented:

This is not a query parser -- this is an already parsed query.

hackers from tampering text with persistent view, it won't be called in older Spark
Don't mark this as override for backward compatibility
Can't use sparkExtendedParser directly here due to the same reason */
def parseQuery(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
Contributor commented:

Why are we doing double-parsing?

@CTTY (Contributor Author) commented:

I reused the code flow from the parsePlan method in the same class here. Calling parse might not be needed here; good point.

DeleteFromTable(deleteFromTableCommand.table, resolvedCondition)
}

override def extractCondition(deleteFromTable: Command): Expression = {
Contributor commented:

Let's also return Option instead of null
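
A rough sketch of the suggested shape (assumed; Spark 3.3 adapter):

// Wrap the condition in an Option at the adapter boundary instead of returning null.
override def extractCondition(deleteFromTable: Command): Option[Expression] =
  Option(deleteFromTable.asInstanceOf[DeleteFromTable].condition)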

@alexeykudinkin (Contributor) commented:
@CTTY please add the JIRAs in the description so that they're more easily discoverable.

@CTTY (Contributor Author) commented Jul 27, 2022

Created the JIRAs below to follow up on improving the code quality:

@alexeykudinkin @XuQianJin-Stars @yihua

Updated the follow-up JIRAs.

* Extract condition in [[DeleteFromTable]]
* SPARK-38626 condition is no longer Option in Spark 3.3
*/
def extractCondition(deleteFromTable: Command): Expression
Contributor commented:

nit: rename to extractDeleteCondition?

@yihua (Contributor) left a review comment:

LGTM. @CTTY you should address the minor comments in a separate PR.

@yihua merged commit cdaec5a into apache:master on Jul 27, 2022
@CTTY deleted the ctty/hudi-spark33 branch on July 27, 2022 21:48
@xushiyan (Member) commented:
@yihua @CTTY the last commit disabled testHoodieFlinkQuickstart. I don't know why this change affects Flink tests. Please follow up to re-enable it.

@CTTY (Contributor Author) commented Jul 27, 2022

@yihua @CTTY the last commit disabled testHoodieFlinkQuickstart. I don't know why this change affects Flink tests. Please follow up to re-enable it.

Created this jira to track it: https://issues.apache.org/jira/browse/HUDI-4491

@CTTY (Contributor Author) commented Jul 28, 2022

Created this umbrella JIRA and linked the existing follow-up JIRAs to it: https://issues.apache.org/jira/browse/HUDI-4492

@codope (Member) left a review comment:

@yihua @CTTY Can you please check my comments below? The standard spark3 profile is pointing to Spark 3.3.0, but the bundle name has also changed. I think users expect the name to be hudi-spark3-bundle-*.

<spark3.version>3.3.0</spark3.version>
<spark.version>${spark3.version}</spark.version>
<sparkbundle.version>3</sparkbundle.version>
<sparkbundle.version>3.3</sparkbundle.version>
Member commented:

Shouldn't sparkbundle.version still be 3 for this profile? After building the code with the -Dspark3 option, I see the bundle named hudi-spark3.3-bundle-* instead of hudi-spark3-bundle-*. Is that expected?

<scala.version>${scala12.version}</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<hudi.spark.module>hudi-spark3</hudi.spark.module>
<hudi.spark.module>hudi-spark3.3.x</hudi.spark.module>
Member commented:

Same here. Should we keep it hudi-spark3?
