Conversation

@CTTY (Contributor) commented Jun 22, 2022

What is the purpose of the pull request

Support Spark 3.3.0

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CTTY force-pushed the ctty/hudi-spark33 branch from a69b13f to 33e0f12 on June 27, 2022 19:32
@melin commented Jun 28, 2022

Upgrade the Parquet version to 1.12.3.
PARQUET-1968 can simplify the current In predicate filter pushdown.
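
For illustration, a rough sketch of what that simplification could look like (hypothetical column name, and assuming the FilterApi.in overload added by PARQUET-1968 in Parquet 1.12):

import java.util.{HashSet => JHashSet}

import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.intColumn

// Before PARQUET-1968: an IN filter had to be expanded into a chain of eq/or predicates.
val expanded = FilterApi.or(
  FilterApi.eq(intColumn("id"), Int.box(1)),
  FilterApi.eq(intColumn("id"), Int.box(2)))

// With PARQUET-1968: a single native In predicate.
val values = new JHashSet[Integer]()
values.add(1)
values.add(2)
val native = FilterApi.in(intColumn("id"), values)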

@CTTY force-pushed the ctty/hudi-spark33 branch 2 times, most recently from c6f80a4 to 95c9cde on July 1, 2022 21:35
@CTTY force-pushed the ctty/hudi-spark33 branch 4 times, most recently from b99eb59 to e523893 on July 11, 2022 17:40
@codope added the labels dependencies (Dependency updates), priority:blocker (Production down; release blocker), and engine:spark (Spark integration) on Jul 12, 2022
@codope (Member) commented Jul 12, 2022

@CTTY is this still WIP?

@yihua (Contributor) commented Jul 12, 2022

@alexeykudinkin @XuQianJin-Stars @YannByron This PR is going to add support for Spark 3.3. In the long term, how should we maintain the support matrix for Spark in Hudi? Do we deprecate the support for older Spark versions as we add new versions? cc @xushiyan @vinothchandar

"Partition column types may not be specified in Create Table As Select (CTAS)",
ctx)

// CreateTable / CreateTableAsSelect was migrated to v2 in Spark 3.3.0
@CTTY (Contributor Author) commented on the diff:

Also see SPARK-36902

@CTTY (Contributor Author) commented Jul 12, 2022

@CTTY is this still WIP?

@codope Yes, but can we enable Azure CI for this PR? It could expose more potential issues, and I can work on them.

)

new HoodieFileScanRDD(sparkSession, baseFileReader, fileSplits)
// TODO: which schema to use here?
@CTTY (Contributor Author) commented on the diff:

MessageType parquetSchema = new ParquetUtils().readSchema(context.getHadoopConf().get(), filePath);
Configuration hadoopConf = context.getHadoopConf().get();
MessageType parquetSchema = new ParquetUtils().readSchema(hadoopConf, filePath);

@CTTY (Contributor Author) commented:

Change made according to SPARK-36935 (ParquetSchemaConverter change).

case IdentityTransform(FieldReference(Seq(col))) =>
identityCols += col

case BucketTransform(numBuckets, Seq(FieldReference(Seq(col))), _) =>
@CTTY (Contributor Author) commented:

SPARK-37627 Separate SortedBucketTransform from BucketTransform
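
For context, a sketch of the matching under this change (identityCols follows the diff above; the bucketSpec assignment is an assumed illustration):

// After SPARK-37627, sorted buckets moved to a separate SortedBucketTransform,
// and the BucketTransform extractor gained a third (sorted-columns) field,
// ignored here with `_`.
transform match {
  case IdentityTransform(FieldReference(Seq(col))) =>
    identityCols += col
  case BucketTransform(numBuckets, Seq(FieldReference(Seq(col))), _) =>
    bucketSpec = Some(BucketSpec(numBuckets, col :: Nil, Nil))
  case _ => // other transforms handled elsewhere
}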

// It's critical for these rules to be applied in this order, so that the DataSource V2 to V1 fallback
// is performed prior to other rules being evaluated
rules ++= Seq(dataSourceV2ToV1Fallback, spark3Analysis, spark3ResolveReferences, spark32ResolveAlterTableCommands)
rules ++= Seq(dataSourceV2ToV1Fallback, spark3Analysis, spark3ResolveReferences, resolveAlterTableCommands)
@CTTY (Contributor Author) commented:

SPARK-38939 DropColumns syntax change


override def parseDataType(sqlText: String): DataType = delegate.parseDataType(sqlText)

// SPARK-37266 Added parseQuery to ParserInterface in Spark 3.3.0
@CTTY (Contributor Author) commented:

SPARK-37266 ParserInterface change

@CTTY changed the title from "[HUDI-4186] [WIP] Support Hudi with Spark 3.3.0" to "[HUDI-4186] Support Hudi with Spark 3.3.0" on Jul 13, 2022
@CTTY (Contributor Author) commented Jul 13, 2022

Removed the [WIP] tag to unblock Azure CI. This PR is still a work in progress.

@alexeykudinkin (Contributor) commented:
@yihua this is a good question. IMO we should avoid breaking unless we absolutely have to and make sure we maintain compatibility as long as it makes sense from the standpoint of investing resources to maintain it.

In this case, I'd say that we should not break any existing compatibility (Spark 2.4, 3.1, 3.2, 3.3) but instead, say, declare that 3.1 is in maintenance (EOL) mode and that new features are not guaranteed to work there in future releases. Thoughts?

@CTTY force-pushed the ctty/hudi-spark33 branch from 9e8f799 to 925cca4 on July 14, 2022 19:57
@CTTY (Contributor Author) commented Jul 14, 2022

Most Hive sync CI tests are failing. I saw another PR working on this: #6110

@CTTY force-pushed the ctty/hudi-spark33 branch 3 times, most recently from af12dbd to 4eff788 on July 17, 2022 20:45
@CTTY (Contributor Author) commented Jul 19, 2022

@codope @yihua This is ready for review.

@CTTY force-pushed the ctty/hudi-spark33 branch from fb4b473 to c1626fb on July 26, 2022 06:14
@yihua (Contributor) left a review comment:

@CTTY I see a lot of classes for Spark 3.3 support, e.g., Spark33DataSourceUtils, are just copied from existing Spark 3.2 support classes in Hudi. Are they safe? Should we update them based on corresponding Spark 3.3 classes?

<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
</dependency>

Member commented:

Good find. So can we now re-enable the Spark 3.2 quickstart test in GH Actions? Check out bot.yml.

@CTTY force-pushed the ctty/hudi-spark33 branch from ca8a1ba to 851ac3c on July 27, 2022 02:40
@YannByron (Contributor) commented:
@alexeykudinkin @XuQianJin-Stars @YannByron This PR is going to add support for Spark 3.3. In the long term, how should we maintain the support matrix for Spark in Hudi? Do we deprecate the support for older Spark versions as we add new versions? cc @xushiyan @vinothchandar

It's not easy to decide.
Take Delta Lake: its latest version supports only the latest Spark version, and it never back-ports new features to older Spark versions within the same release. Users who need that have to port the new features to the version they maintain themselves. After all, I think most production users will not follow the Spark release iteration in a timely manner.
Maybe we can support the two or three latest Spark versions to provide some convenience to our users.

if (deleteTable.condition.isDefined) {
df = df.filter(Column(deleteTable.condition.get))
}
// SPARK-38626 DeleteFromTable.condition is changed from Option[Expression] to Expression in Spark 3.3
Contributor commented:

nit: the comment can go into the Spark adapter implementation and is not necessary here.

Contributor commented:

This can be addressed in a separate PR.
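
For reference, a sketch of the type change this thread is about (SPARK-38626; illustrative only, with df and deleteTable as in the diff above):

// Spark 3.2 and earlier: DeleteFromTable.condition is Option[Expression]
if (deleteTable.condition.isDefined) {
  df = df.filter(Column(deleteTable.condition.get))
}

// Spark 3.3: condition is a plain Expression (assumed to default to a
// true literal when no WHERE clause is given), so no Option handling is needed
df = df.filter(Column(deleteTable.condition))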

@hudi-bot (Collaborator) commented:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@alexeykudinkin (Contributor) left a review comment:

Left a few minor comments; we can take those in a follow-up PR.

val resolvedCondition = condition.map(resolveExpressionFrom(table)(_))
// Return the resolved DeleteTable
DeleteFromTable(table, resolvedCondition)
val resolveExpression = resolveExpressionFrom(table, None)_
Contributor commented:

I'd suggest we keep the syntax as it was (with parentheses).

new Spark2HoodieFileScanRDD(sparkSession, readFunction, filePartitions)
}

override def resolveDeleteFromTable(deleteFromTable: Command,
Contributor commented:

Given that we have extractCondition, we can get rid of resolveDeleteFromTable.

@CTTY (Contributor Author) commented:

We can't simply reuse the method; resolveDeleteFromTable has different logic that involves resolveExpressionFrom.

Contributor commented:

I don't see why we can't:

  • We get rid of the method completely
  • We use extractCondition to extract the condition and then do everything else (resolution, etc.) in the caller

/* SPARK-37266 Added parseQuery to ParserInterface in Spark 3.3.0. This is a patch to prevent
hackers from tampering text with persistent view, it won't be called in older Spark
Don't mark this as override for backward compatibility
Can't use sparkExtendedParser directly here due to the same reason */
Contributor commented:

Sorry, but I can't understand the Javadoc: can you please elaborate on why this is here?

  • What exactly are we trying to prevent from happening?
  • What backward compatibility (BWC) are we referring to?

@CTTY (Contributor Author) commented:

parseQuery is a new method on the Spark trait ParserInterface. There would be a compile issue if we called this method from any class that's shared across different Spark versions, because the older ParserInterface doesn't have this method.

For the same reason, we can't mark this method with override: in older Spark there is no parseQuery.
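
A minimal sketch of the resulting pattern, with the shape assumed from the adapter method shown below:

// Not marked `override`: parseQuery only exists on ParserInterface in Spark 3.3+.
// The call is routed through the version-specific SparkAdapter instead of the
// delegate parser, which may predate the method.
def parseQuery(sqlText: String): LogicalPlan =
  sparkAdapter.getQueryParserFromExtendedSqlParser(session, delegate, sqlText)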

Contributor commented:

As discussed on Slack, instead of doing the parsing in SparkAdapter, let's create an ExtendedParserInterface where we can place this new parseQuery method; that can then be used in Hudi's code-base (this is similar to how HoodieCatalystExpressionUtils is set up).
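
A hypothetical sketch of that suggestion (trait name assumed):

// A Hudi-owned parser interface that always declares parseQuery, so shared code
// can depend on it regardless of which Spark version's ParserInterface is on
// the classpath; each Spark-version module supplies the implementation.
trait HoodieExtendedParserInterface extends ParserInterface {
  def parseQuery(sqlText: String): LogicalPlan
}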


override def getQueryParserFromExtendedSqlParser(session: SparkSession, delegate: ParserInterface,
sqlText: String): LogicalPlan = {
new HoodieSpark3_3ExtendedSqlParser(session, delegate).parseQuery(sqlText)
Contributor commented:

This is not a query parser -- this is an already parsed query.

hackers from tampering text with persistent view, it won't be called in older Spark
Don't mark this as override for backward compatibility
Can't use sparkExtendedParser directly here due to the same reason */
def parseQuery(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
Contributor commented:

Why are we doing double-parsing?

@CTTY (Contributor Author) commented:

I reused the code flow from the parsePlan method in the same class here. Calling parse might not be needed here; good point.

DeleteFromTable(deleteFromTableCommand.table, resolvedCondition)
}

override def extractCondition(deleteFromTable: Command): Expression = {
Contributor commented:

Let's also return Option instead of null
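
A rough sketch of the suggested shape (assumed; Spark 3.3 adapter):

// Wrap the condition in an Option at the adapter boundary instead of returning null.
override def extractCondition(deleteFromTable: Command): Option[Expression] =
  Option(deleteFromTable.asInstanceOf[DeleteFromTable].condition)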

@alexeykudinkin (Contributor) commented:
@CTTY please add the JIRAs in the description so that they're more easily discoverable.

@CTTY (Contributor Author) commented Jul 27, 2022

Created the JIRAs below to follow up on improving the code quality:

@alexeykudinkin @XuQianJin-Stars @yihua

Updated the follow-up JIRAs.

* Extract condition in [[DeleteFromTable]]
* SPARK-38626 condition is no longer Option in Spark 3.3
*/
def extractCondition(deleteFromTable: Command): Expression
Contributor commented:

nit: rename to extractDeleteCondition?

@yihua (Contributor) left a review comment:

LGTM. @CTTY you should address the minor comments in a separate PR.

@yihua merged commit cdaec5a into apache:master on Jul 27, 2022
@CTTY deleted the ctty/hudi-spark33 branch on July 27, 2022 21:48
@xushiyan (Member) commented:
@yihua @CTTY the last commit disabled testHoodieFlinkQuickstart. I don't know why this change affects Flink tests. Please follow up to re-enable it.

@CTTY (Contributor Author) commented Jul 27, 2022

@yihua @CTTY the last commit disabled testHoodieFlinkQuickstart. I don't know why this change affects Flink tests. Please follow up to re-enable it.

Created this jira to track it: https://issues.apache.org/jira/browse/HUDI-4491

@CTTY (Contributor Author) commented Jul 28, 2022

Created this umbrella JIRA and linked the existing follow-up JIRAs to it: https://issues.apache.org/jira/browse/HUDI-4492

@codope (Member) left a review comment:

@yihua @CTTY Can you please check my comments below? The standard spark3 profile is pointing to Spark 3.3.0, but the bundle name has also changed. I think users expect the name to be hudi-spark3-bundle-*.

<spark3.version>3.3.0</spark3.version>
<spark.version>${spark3.version}</spark.version>
<sparkbundle.version>3</sparkbundle.version>
<sparkbundle.version>3.3</sparkbundle.version>
Member commented:

Shouldn't sparkbundle.version still be 3 for this profile? After building the code with the -Dspark3 option, I see the bundle named hudi-spark3.3-bundle-* instead of hudi-spark3-bundle-*. Is that expected?

<scala.version>${scala12.version}</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<hudi.spark.module>hudi-spark3</hudi.spark.module>
<hudi.spark.module>hudi-spark3.3.x</hudi.spark.module>
Member commented:

Same here. Should we keep it hudi-spark3?
