
Conversation

@jonvex (Contributor) commented Jul 24, 2023

Change Logs

Merge log and skeleton files in the file reader

Failing tests with new file format:

testDataTypePromotions
testArrayOfStructsChangeColumnType
testArrayOfMapsChangeValueType
testArrayOfMapsStructChangeFieldType
testComplexOperationsOnTable
testPartitionFiltersPushDown - not a bug
testSchemaEvolutionForTableType
Test Call show_logfile_metadata Procedure
Test Call show_logfile_records Procedure
Test NestedSchemaPruning optimization unsuccessful
Test Call run_bootstrap Procedure with no-partitioned
Test nested field as primaryKey and preCombineField
Test Call run_clustering Procedure By Table
Test Call run_clustering Procedure By Path
Test Call run_clustering Procedure With Partition Pruning
Test Two Table's Union Join with time travel
Test Query Merge_On_Read Read_Optimized table - these use partition path as the precombine field
testBulkInsertsAndUpsertsWithBootstrap

Unimplemented features (incomplete list):

  • SkipMerge MOR
  • Glob paths
  • Schema Evolution

Impact

Improve read performance and simplify integration

Risk level (write none, low, medium or high below)

High
Lots of testing has been done, and lots more remains to do

Documentation Update

Release notes?

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@jonvex jonvex changed the title from "Mor perf spark33" to "[HUDI-6568] Hudi Spark Integration Redesign" on Jul 28, 2023
@jonvex (Contributor Author) commented Jul 30, 2023

@hudi-bot run azure

lazy val newHudiFileFormatUtils = if (!parameters.getOrElse(LEGACY_HUDI_PARQUET_FILE_FORMAT.key,
LEGACY_HUDI_PARQUET_FILE_FORMAT.defaultValue).toBoolean && (globPaths == null || globPaths.isEmpty)
&& parameters.getOrElse(REALTIME_MERGE.key(), REALTIME_MERGE.defaultValue())
.equalsIgnoreCase(REALTIME_PAYLOAD_COMBINE_OPT_VAL)) {
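For readability, the gate above boils down to three conditions. A paraphrase (function and parameter names invented here, not from the PR):

// Sketch: the new file format path is taken only when the legacy format is
// not requested, no glob paths are in play, and the merge type is
// payload-combine (skip-merge is not yet supported; see the reply below).
def useNewFileFormat(legacyFormatRequested: Boolean,
                     globPaths: Seq[String],
                     mergeType: String): Boolean =
  !legacyFormatRequested &&
    (globPaths == null || globPaths.isEmpty) &&
    mergeType.equalsIgnoreCase("payload_combine") // the REALTIME_PAYLOAD_COMBINE_OPT_VAL value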
Contributor:

Is there any issue with REALTIME_SKIP_MERGE_OPT_VAL merge type?

Contributor Author:

Yes, I wasn't able to get it to work correctly before the code freeze.


org.apache.hudi.DefaultSource
org.apache.spark.sql.execution.datasources.parquet.HoodieParquetFileFormat
org.apache.spark.sql.execution.datasources.parquet.LegacyHoodieParquetFileFormat
Contributor:

When switching to the new file format with the config, should the NewHoodieParquetFileFormat be registered too?

Contributor Author:

Maybe? I'm not sure. What benefit does it give us?

Contributor:

Just curious; I don't have a clear answer. Since createRelation is overridden, it's OK functionality-wise.

Comment on lines 149 to 183
val prunedPartitions = if (shouldBroadcast) {
  listMatchingPartitionPaths(convertFilterForTimestampKeyGenerator(metaClient, partitionFilters))
} else {
  listMatchingPartitionPaths(partitionFilters)
}
val listedPartitions = getInputFileSlices(prunedPartitions: _*).asScala.toSeq.map {
  case (partition, fileSlices) =>
    // Collect base file statuses for all file slices that have a base file
    var baseFileStatuses: Seq[FileStatus] = getBaseFileStatus(fileSlices
      .asScala
      .map(fs => fs.getBaseFile.orElse(null))
      .filter(_ != null))
    if (shouldBroadcast) {
      // Log-file-only slices have no base file; use a log file's status instead
      baseFileStatuses = baseFileStatuses ++ fileSlices.asScala
        .filter(f => f.getLogFiles.findAny().isPresent && !f.getBaseFile.isPresent)
        .map(f => f.getLogFiles.findAny().get().getFileStatus)
    }
    // Filter in candidate files based on the col-stats index lookup
    val candidateFiles = baseFileStatuses.filter(fs =>
      // NOTE: This predicate is true when {@code Option} is empty
      candidateFilesNamesOpt.forall(_.contains(fs.getPath.getName)))

    totalFileSize += baseFileStatuses.size
    candidateFileSize += candidateFiles.size
    if (shouldBroadcast) {
      // Broadcast the file slices that need merging (log files, or a bootstrap base file)
      val c = fileSlices.asScala.filter(f => f.getLogFiles.findAny().isPresent
          || (f.getBaseFile.isPresent && f.getBaseFile.get().getBootstrapBaseFile.isPresent))
        .foldLeft(Map[String, FileSlice]()) { (m, f) => m + (f.getFileId -> f) }
      if (c.nonEmpty) {
        PartitionDirectory(new PartitionFileSliceMapping(InternalRow.fromSeq(partition.values),
          spark.sparkContext.broadcast(c)), candidateFiles)
      } else {
        PartitionDirectory(InternalRow.fromSeq(partition.values), candidateFiles)
      }
    } else {
      PartitionDirectory(InternalRow.fromSeq(partition.values), candidateFiles)
    }
Contributor:

Could you move the logic to a single if branch when broadcast is enabled, so it's easier to read?

Contributor Author:

I tried that before but didn't think it looked better. I've done it again, so you can take a look.

@jonvex jonvex requested a review from yihua August 3, 2023 21:16

protected lazy val basePath: Path = metaClient.getBasePathV2

protected lazy val (tableAvroSchema: Schema, internalSchemaOpt: Option[InternalSchema]) = {
Contributor:

Should some of the util methods also used by HoodieBaseRelation be extracted to an independent Util class for code reuse?

Contributor Author:

Every single method here is from the base relation. You said not to use the relation, so I just copied over what I needed. It was much simpler before.

Contributor:

Got it. What I meant is, the new file format should not extend existing file format classes or use the relation directly inside. Util methods can still be extracted to a common util class so that both NewHoodieParquetFileFormatUtils and HoodieBaseRelation can use it. If that takes time, we can punt on it.

val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
(file: PartitionedFile) => {
file.partitionValues match {
case broadcast: PartitionFileSliceMapping =>
Contributor:

I feel like the branching here can be further simplified based on the split or file group type, without having to check isMOR or isBootstrap explicitly: (1) base file only, (2) base file + log files, (3) log files only, (4) bootstrap skeleton file + original file, (5) bootstrap skeleton file + original file + log files. Then we may apply optimizations like predicate pushdown per split type. We can improve this part in a follow-up, along with aligning logic across query types (e.g., schema handling, partition path handling, etc.).
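For instance (a hypothetical sketch, not code from this PR), the five split types could be modeled as a small ADT so each reader path is chosen by one exhaustive match:

// Names invented for illustration; file paths are simplified to strings.
sealed trait ReadSplit
case class BaseFileOnly(base: String) extends ReadSplit
case class BaseFileWithLogs(base: String, logs: Seq[String]) extends ReadSplit
case class LogFilesOnly(logs: Seq[String]) extends ReadSplit
case class BootstrapSplit(skeleton: String, original: String) extends ReadSplit
case class BootstrapSplitWithLogs(skeleton: String, original: String, logs: Seq[String]) extends ReadSplit

def chooseReader(split: ReadSplit): String = split match {
  case BaseFileOnly(_)                 => "plain parquet reader, full predicate pushdown"
  case BaseFileWithLogs(_, _)          => "base file reader + log merge iterator"
  case LogFilesOnly(_)                 => "log-file-only iterator"
  case BootstrapSplit(_, _)            => "skeleton + original file combined iterator"
  case BootstrapSplitWithLogs(_, _, _) => "bootstrap iterator + log merge"
}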

Contributor Author:

So there are two things going on here.

The first is that you are seeing isBootstrap && bootstrapFileOpt.isPresent and isMOR && logFiles.nonEmpty. The isBootstrap and isMOR checks are not strictly necessary, because whenever the second condition in each pair is true, the first must be true as well.

The second thing, which I don't think should be changed, is:

(isMOR, logFiles.nonEmpty) match {
  case (true, true) => buildMergeOnReadIterator(bootstrapIterator, logFiles, filePath.getParent,
    bootstrapReaderOutput, requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues,
    broadcastedHadoopConf.value.value)
  case (true, false) => appendPartitionAndProject(bootstrapIterator, bootstrapReaderOutput,
    partitionSchema, outputSchema, partitionValues)
  case (false, false) => bootstrapIterator
  case (false, true) => throw new IllegalStateException("should not be log files if not mor table")
}

The reasoning for this is:

If it's MOR and has log files, we need to merge the log files, then append the partition path, then project away any fields that are mandatory for merging (record key and precombine field) but aren't in the required schema.

If it's MOR and doesn't have log files, we only need to append the partition path.

If it's not MOR, the partition path has already been appended by the reader itself, so we just return the iterator.

The final edge case to bring up is if (requiredSchemaWithMandatory.isEmpty). That means it's a df.count(), so we just use the base file reader.
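To make the ordering concrete, a self-contained sketch (all names invented; the real iterators operate on InternalRow):

case class Rec(fields: Map[String, Any])

// Stand-ins for the real merge/append/project steps described above.
def mergeLogFiles(rows: Iterator[Rec]): Iterator[Rec] = rows
def appendPartitionPath(rows: Iterator[Rec], path: String): Iterator[Rec] =
  rows.map(r => Rec(r.fields + ("partition_path" -> path)))
def project(rows: Iterator[Rec], required: Set[String]): Iterator[Rec] =
  rows.map(r => Rec(r.fields.filter { case (k, _) => required(k) }))

def buildIterator(base: Iterator[Rec], isMOR: Boolean, hasLogFiles: Boolean,
                  path: String, required: Set[String]): Iterator[Rec] =
  (isMOR, hasLogFiles) match {
    case (true, true)   => project(appendPartitionPath(mergeLogFiles(base), path), required)
    case (true, false)  => appendPartitionPath(base, path)
    case (false, false) => base // the reader already appended the partition path
    case (false, true)  => throw new IllegalStateException("log files on a non-MOR table")
  }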

Contributor:

Makes sense. What I was referring to is that we need to revisit all the complexity and see if there are opportunities to unify the differences between COW and MOR (which may require changes to iterators, file readers, etc.).

Contributor Author:

COW I think would work right now, because we would never wrap the partition columns with PartitionFileSliceMapping in the FileIndex, so it would just hit the iterator that reads with the base file reader. I would probably just add an if where the readers are created to return the base file reader directly.
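A sketch of that early-out (the real match is over PartitionedFile.partitionValues; the stub below stands in for the class added in this PR):

final class PartitionFileSliceMapping // stub for the real broadcast-backed mapping

def pickReader(partitionValues: Any): String = partitionValues match {
  case _: PartitionFileSliceMapping => "merge reader using the broadcast file slices"
  case _                            => "plain base file reader (COW / base-file-only)"
}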

Comment on lines +149 to +150
override def createLegacyHoodieParquetFileFormat(appendPartitionValues: Boolean): Option[ParquetFileFormat] = {
Some(new Spark24LegacyHoodieParquetFileFormat(appendPartitionValues))
Contributor:

So this is used by BaseFileOnlyRelation for COW and MOR read-optimized queries. I assume the new file format can be applied to COW and MOR read-optimized queries too. We should follow up here in a separate PR.

The goal is to get rid of Spark-version-specific file format classes and make the Hudi Spark integration easier to maintain.

Contributor Author:

Yeah. We're going to have to figure out if there is a way to do schema evolution without porting code.

@jonvex jonvex requested a review from yihua August 4, 2023 21:35


class HoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport {
class LegacyHoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport {
Contributor:

Add docs here linking to the new file format implementation, so that any changes to this format implementation are also reflected in the new file format class?

Contributor Author:

I think changes to the relation and especially rdd classes are more likely to require changes in the new file format than changes to LegacyHoodieParquetFileFormat.

Contributor:

Yeah, my point is that developers changing this class should be aware of the new file format, so they don't miss making the corresponding change there and cause an inconsistency.
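Something like the following Scaladoc on the legacy classes would make that expectation explicit (a sketch; wording assumed, not from this PR):

/**
 * NOTE: Retained as the legacy read path. Any behavioral change made here
 * should be mirrored in the new file format implementation
 * (NewHoodieParquetFileFormat) so the two paths don't drift apart.
 */
class LegacyHoodieParquetFileFormat extends ParquetFileFormat with SparkAdapterSupport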

* </ol>
*/
class Spark24HoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
class Spark24LegacyHoodieParquetFileFormat(private val shouldAppendPartitionValues: Boolean) extends ParquetFileFormat {
Contributor:

Same here for the version-specific file format classes: add docs linking to the new file format implementation, so that any changes to this format implementation are also reflected in the new file format class?


Comment on lines 248 to 249
val useMORBootstrapFF = parameters.getOrElse(MOR_BOOTSTRAP_FILE_READER.key,
MOR_BOOTSTRAP_FILE_READER.defaultValue).toBoolean && (globPaths == null || globPaths.isEmpty)
Contributor:

Got it. Then let's leave this as a follow-up. The new file format should support this too for feature completeness.

@yihua (Contributor) left a comment:

Overall LGTM. This is a great leap towards making the Hudi Spark integration more performant and simpler!

@yihua (Contributor) commented Aug 4, 2023

@jonvex let's track the follow-ups in JIRA.

@yihua (Contributor) commented Aug 4, 2023

@jonvex Could you also update the PR description with details of the approach? Before merging this PR, let's create a new PR based on this patch with hoodie.datasource.read.use.legacy.parquet.file.format=false and make sure tests pass on MOR and bootstrap queries.
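For instance, such a validation read could look like this (a sketch, assuming an active SparkSession spark and a Hudi table at basePath; the config key is quoted from the comment above):

// Read with the legacy Parquet file format disabled, exercising the new path.
val df = spark.read.format("hudi")
  .option("hoodie.datasource.read.use.legacy.parquet.file.format", "false")
  .load(basePath)
df.count()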

@jonvex jonvex requested a review from yihua August 5, 2023 00:23
@yihua (Contributor) left a comment:

LGTM

@apache apache deleted a comment from hudi-bot Aug 7, 2023
@yihua yihua merged commit 23f657d into apache:master Aug 7, 2023
@yihua (Contributor) commented Aug 7, 2023

CI is green.
[Screenshot: green CI run, 2023-08-06 21:05]


Labels

engine:spark (Spark integration) · priority:blocker (Production down; release blocker) · release-0.14.0

Projects

Status: ✅ Done


2 participants