[HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index #4948
Conversation
@alexeykudinkin great work
Force-pushed from 2779d95 to 4421752
...atasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala
case t: Throwable =>
  logError("Failed to read col-stats index; skipping", t)
  None

if (!isDataSkippingEnabled() || !fs.exists(new Path(metadataTablePath)) || queryFilters.isEmpty) {
fs.exists is going to be called on every invocation if data skipping is enabled. This will hurt performance, as we observed in Presto. We should try to avoid it. I think we should just assume that the metadata table exists and error out if it doesn't.
Fair enough, we can replace it with a config check on whether MT is enabled.
@codope on second thought -- there could still be a case where MT is enabled but not yet bootstrapped, so we can't equate MT being enabled in the config with its presence on the FS. Frankly, I don't see a way around fs.exists in some shape or form -- if it doesn't happen here, it would happen within Spark's Data Source.
We can use the table config to determine which MT partitions are available for reading. Can you please track this in a JIRA?
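For illustration, a minimal sketch of what a table-config based guard could look like in place of the fs.exists() probe; the `completedMetadataPartitions` parameter stands in for whatever the table config ends up exposing and is an assumption, not the final API:

import org.apache.hudi.metadata.MetadataPartitionType
import org.apache.spark.sql.catalyst.expressions.Expression

// Sketch only: gate data skipping on table-config state rather than probing the FS on every call.
// `completedMetadataPartitions` is assumed to be read from the table config during initialization.
def shouldSkipDataSkipping(dataSkippingEnabled: Boolean,
                           completedMetadataPartitions: Set[String],
                           queryFilters: Seq[Expression]): Boolean = {
  val colStatsAvailable =
    completedMetadataPartitions.contains(MetadataPartitionType.COLUMN_STATS.getPartitionPath)
  !dataSkippingEnabled || !colStatsAvailable || queryFilters.isEmpty
}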
val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)

// Persist DF to avoid re-computing column statistics unraveling
withPersistence(colStatsDF) {
+1 for persistence
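For reference, a minimal sketch of what a `withPersistence` helper like the one used above could look like (illustrative only; the PR's actual signature may differ):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Illustrative: cache the DataFrame for the duration of the block and always unpersist
// afterwards, so the column-stats DataFrame isn't recomputed on every traversal.
def withPersistence[T](df: DataFrame, level: StorageLevel = StorageLevel.MEMORY_AND_DISK)(block: => T): T = {
  df.persist(level)
  try {
    block
  } finally {
    df.unpersist()
  }
}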
} else {
  logError("Failed to lookup candidate files in Z-index", e)
}
Maybe change "Z-index" to "column stats index" in this error message as well.
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala
]
},
{
  "doc": "Column name for which this column statistics applies",
We'll have to revisit this change when we tackle schema evolution. Can you please track it in a JIRA?
Adding to @codope's point, does this break the read/write of metadata records in an existing metadata table if users enabled it in older releases, e.g., 0.10.0 and 0.10.1?
Good call. Data Skipping won't be functional w/o this column, so we will have to call out that folks will need to flush and rebuild their MT if they want to use it with Data Skipping.
In this case, we need an upgrade step if the column stats index is enabled in the metadata table, and this should be somewhat automatic. @nsivabalan @vinothchandar wdyt?
Why would writes break? Adding a field is a valid schema evolution that we support, right?
For reads, maybe we just handle this gracefully: if this field is not present in the metadata table, fall back to the usual query path (w/o data skipping).
Not saying it would break, and based on the context provided it should be supported by schema evolution, so I'm good with it.
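A hedged sketch of the graceful read-path fallback being discussed, at the call site of `lookupCandidateFilesInMetadataTable` (it assumes the surrounding HoodieFileIndex context, i.e. `queryFilters` and `logError`; this is not necessarily the PR's exact code):

import scala.util.{Failure, Success}

// Sketch: if the column-stats lookup fails (e.g. older metadata records lack the new
// column-name field), fall back to the usual query path without data skipping.
val candidateFilesOpt: Option[Set[String]] =
  lookupCandidateFilesInMetadataTable(queryFilters) match {
    case Success(filesOpt) => filesOpt
    case Failure(e) =>
      logError("Failed to look up candidate files in the column stats index; falling back", e)
      None
  }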
yihua left a comment
The general approach looks OK. I have a couple of doubts about how existing column stats are treated and the new DAG using metadata table column stats for data skipping.
// Simply traverse directory structure until found .hoodie folder
Path current = partitionPath;
while (current != null) {
  if (hasTableMetadataFolder(fs, current)) {
One caveat is that this may incur more than one fs.exists() call. Is this only used for initialization (which is fine), e.g., getting the table path from config, and not for core read/write logic per data file?
Correct. This is only used in the discovery phase.
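For context, a Scala sketch of the kind of upward traversal being discussed (the PR's code is Java; this is only illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hudi.common.table.HoodieTableMetaClient

// Walk up from a partition path until a directory containing ".hoodie" is found.
// Each level costs one fs.exists() call, which is acceptable for one-off discovery
// but should not sit on a per-file read/write path.
def findTableBasePath(fs: FileSystem, partitionPath: Path): Option[Path] = {
  var current: Path = partitionPath
  while (current != null) {
    if (fs.exists(new Path(current, HoodieTableMetaClient.METAFOLDER_NAME))) {
      return Some(current)
    }
    current = current.getParent
  }
  None
}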
val indexPath = metaClient.getColumnStatsIndexPath
private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
  val fs = metaClient.getFs
  val metadataTablePath = HoodieTableMetadata.getMetadataTableBasePath(basePath)
Is the plan to deprecate existing column stats under .hoodie/.colstatsindex and remove all usage of it in 0.11? If not, should we have two modes where metadata col stats index is used when metadata table is enabled, and .hoodie/.colstatsindex is used if metadata table is disabled?
I also need clarification on how .hoodie/.colstatsindex is generated. Does that come from clustering, or is it also updated per write?
Yeah, the current bespoke implementation will be removed in a follow-up. It's currently updated only after clustering completes.
When you say "current bespoke implementation will be removed in a follow-up", is this before or after the 0.11.0 release? I think we still need to keep .hoodie/.colstatsindex and the data skipping logic based on it, and there should be a flag to choose how data skipping is done between that vs MT col stats. Because if a user doesn't enable MT col stats in 0.11.0 and there is no data skipping logic based on .hoodie/.colstatsindex, data skipping cannot be done unless the user goes back to 0.10.x. The old logic can be removed one release after 0.11.0.
Discussed offline: the bespoke implementation of the Col Stats Index will be removed in 0.11.
// Read Metadata Table's Column Stats Index into Spark's [[DataFrame]]
val metadataTableDF = spark.read.format("org.apache.hudi")
  .load(s"$metadataTablePath/${MetadataPartitionType.COLUMN_STATS.getPartitionPath}")

// TODO filter on (column, partition) prefix
val colStatsDF = metadataTableDF.where(col(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS).isNotNull)
  .select(requiredMetadataIndexColumns.map(col): _*)
Should the logic of fetching column stats into a DF be incorporated into BaseTableMetadata, as there is already another API, getColumnStats()? In this way, it may also be possible to make the logic here metadata-table agnostic, and instead rely on BaseTableMetadata/HoodieTableMetadata to decide which source (.hoodie/.colstatsindex on the FS vs the metadata table) to use.
I'm gonna start with the latter: I don't think we're planning to support both of these, since the bespoke ColStats index was purely a stop-gap solution until we got the primary MT index.
Having said that, I don't really see this being used commonly enough for us to promote it into the HoodieTableMetadata API: keep in mind that this table format is very Data Skipping specific, and I don't think it is very useful outside of that.
Got it, this is fine for now. I'm thinking from the perspective of whether this can be reused for indexing on the write path.
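As an aside, independent of the key-prefix pushdown that the `// TODO filter on (column, partition) prefix` above refers to, the loaded stats frame can at least be narrowed to the query-referenced columns before any reshaping; a hedged sketch, reusing the HoodieMetadataPayload constants and `queryReferencedColumns` from the snippets in this PR:

import org.apache.hudi.metadata.HoodieMetadataPayload
import org.apache.spark.sql.functions.col

// Sketch: keep only the rows for columns actually referenced by the query filters,
// so the later transposition/joins operate on a much smaller frame.
val referencedColStatsDF = colStatsDF
  .where(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).isin(queryReferencedColumns: _*))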
// query at hand might only be referencing a handful of those. As such, we collect all the
// column references from the filtering expressions, and only transpose records corresponding to the
// columns referenced in those
val transposedColStatsDF =
I guess the purpose of the transposing here is to adapt to the expected input of the existing data skipping APIs?
Correct. It also makes the code much simpler (otherwise you would need to do a row intersection for every column, which makes the code more involved).
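To make the transposition concrete, a schematic of the before/after layout (column names follow HoodieMetadataPayload; the values are made up for illustration):

// Raw column stats index: one row per (fileName, columnName) pair
//   fileName    | columnName | minValue | maxValue | nullCount
//   f1.parquet  | a          | 1        | 10       | 0
//   f1.parquet  | b          | "x"      | "z"      | 2
//
// Transposed frame: one row per file, with per-column min/max/nullCount columns,
// which is the shape the existing data skipping filter expressions are built against
//   fileName    | a_minValue | a_maxValue | a_nullCount | b_minValue | b_maxValue | b_nullCount
//   f1.parquet  | 1          | 10         | 0           | "x"        | "z"        | 2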
.reduceLeft((left, right) =>
  left.join(right, usingColumn = HoodieMetadataPayload.COLUMN_STATS_FIELD_FILE_NAME))
This may not scale well with a large number of columns from predicates, as DF joining is expensive, even considering caching. I'm wondering if a different DAG should be written for metadata table col stats, i.e., one row of col stats per file + column. Conceptually, I think such joining can be avoided when pruning the files.
Keep in mind that here we only join the columns that the table is clustered by, so this is likely bounded at around 10. So, frankly, I don't think this will be a bottleneck, unless we're talking about gargantuan tables (with 10s of millions of files).
Conceptually, we actually can't avoid joining, for the following reason: ultimately, to validate whether a file will be accepted or not, we have to AND the rows of all the individual columns (i.e., all of the columns have to satisfy their respective filters), which implicitly requires a join by the file name (one way or the other).
Based on the code, I thought all columns from the predicates are going to trigger joining (i.e., n columns -> n-1 joins), not just the clustering columns, since the column stats index in the metadata table can contain all columns from the schema.
There are cases where the table is fat (1k to 10k+ columns, see this blog) and the queries can have more than 10 predicates at Uber and ByteDance. ByteDance has PB-level tables which can easily have 10s of millions of files in a few partitions. I worry that the joining can take a hit at this kind of scale.
I understand that some kind of "joining" is needed here, but the Spark table join in the current scheme expands the table after each join and adds additional col stats columns. If, for each of the per-column DataFrames produced by the code below, we applied the filter first and generated a boolean per file, the next step would just AND those booleans, which does not require expanding columns or an additional cached table, reducing memory pressure and possible shuffling. That would be much less costly than a Spark table/DF join.
queryReferencedColumns.map(colName =>
colStatsDF.filter(col(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME).equalTo(colName))
.select(targetColStatsIndexColumns.map(col): _*)
.withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT, getNumNullsColumnNameFor(colName))
.withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MIN_VALUE, getMinColumnNameFor(colName))
.withColumnRenamed(HoodieMetadataPayload.COLUMN_STATS_FIELD_MAX_VALUE, getMaxColumnNameFor(colName))
)
Based on the code, I thought all columns from the predicates are going to trigger joining (i.e., n columns -> n-1 joins), not just the clustering columns, since the column stats index in the metadata table can contain all columns from the schema.
Correct, but even in the current setup we only join the M columns that are directly referenced in the predicates. So even for fat tables with 1000s of columns, this is unlikely to be a problem, since M << N practically at all times.
I understand that some kind of "joining" is needed here, but the Spark table join in the current scheme expands the table after each join and adds additional col stats columns. If, for each of the per-column DataFrames produced by the code above, we applied the filter first and generated a boolean per file, the next step would just AND those booleans, which does not require expanding columns or an additional cached table, reducing memory pressure and possible shuffling. That would be much less costly than a Spark table/DF join.
I understand your point. Such slicing, however, would a) require essentially revisiting the whole flow, and b) blend index reshaping with the actual querying, and I think we would be optimizing prematurely at this point. We can certainly fine-tune this flow, but I would much rather focus on its correctness right now and then follow up on performance tuning after proper testing/profiling is done. WDYT?
That makes sense to me. As synced offline, the optimization of the joining flow will be a follow-up, not part of this PR. We still need a good understanding of the percentage of time spent in the joining stage vs the overall query planning/execution time at different table sizes (small and medium to start with), to check whether this is really the bottleneck, before actually optimizing it.
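For the record, a hedged sketch of the "filter first, then AND by file" alternative discussed above (the `fileName`/`columnName` field names and the shape of `columnPredicates` are assumptions for illustration; this is not the PR's implementation):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Sketch of the alternative DAG: per referenced column, keep only the files whose
// [min, max] range can satisfy that column's predicate, project down to the file name,
// then intersect the per-column candidate sets. This enforces AND semantics without
// widening the frame with per-column stats columns; the intersections still shuffle,
// but only over a single narrow column.
def candidateFiles(colStatsDF: DataFrame,
                   columnPredicates: Map[String, Column],
                   fileNameCol: String = "fileName",
                   columnNameCol: String = "columnName"): DataFrame = {
  columnPredicates
    .map { case (colName, rangePredicate) =>
      colStatsDF
        .where(col(columnNameCol) === colName)
        .where(rangePredicate)
        .select(fileNameCol)
    }
    .reduceLeft((left, right) => left.intersect(right))
}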
Created HUDI-3611
Force-pushed from 14366ca to 97c07be
@hudi-bot run azure
yihua left a comment
LGTM
@hudi-bot run azure
…stics leveraging MT ColStats index
Cleaned up `HoodieFileIndex`
Force-pushed from 97c07be to fff47ce
What is the purpose of the pull request
This PR rebases the Data Skipping flow from relying on the bespoke Column Stats Index implementation to instead leveraging the MT Column Stats Index.
Brief change log
HoodieDatasetUitls
HoodieFileIndex to use MT instead of bespoke CS Index
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.