
Conversation


@xiarixiaoyao xiarixiaoyao commented Jul 23, 2021


What is the purpose of the pull request

PR for RFC-28: support the Z-order curve. Z-order is a technique that maps multidimensional data onto a single dimension. We can use this feature to improve query performance.

Query performance test:
We used the Databricks test case:
https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html?spm=a2c4g.11186623.2.11.2a8f1693QNwUSH&_ga=2.29295480.552083878.1584501563-968665100.1584501563
https://help.aliyun.com/document_detail/168137.html?spm=a2c4g.11186623.6.563.10d758ccclYtVb
Prepared data:
10 billion rows in 50k parquet files
Test environment:
spark-3.1.0
hadoop-3.1.0
hudi-0.8.0
Test steps:

  • step 1: prepare a Hudi COW table with 10 billion rows in 50k files
  • step 2: optimize with Z-order, sorting on src_ip and dst_ip, producing 50k files: df.optimizeByZOrder(Seq("src_ip", "dst_ip"), options, outputFileNum = 50000)
  • step 3: optimize with Z-order, sorting on src_ip, src_port, dst_ip and dst_port, producing 50k files: df.optimizeByZOrder(Seq("src_ip", "src_port", "dst_ip", "dst_port"), options, outputFileNum = 50000)
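The Z-order key behind steps 2 and 3 can be sketched as bit interleaving. This is a simplified illustration of the idea, not Hudi's actual implementation (which handles arbitrary column types and more than two dimensions):

```java
// Simplified sketch of a Z-order (Morton) value: interleave the bits of two
// unsigned 32-bit dimension values (e.g. src_ip and dst_ip) into one 64-bit
// key. Sorting records by this key keeps rows with nearby values in BOTH
// columns close together, so min/max file pruning can skip most files.
public class ZOrderSketch {
    public static long interleave(int a, int b) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) ((a >>> i) & 1)) << (2 * i + 1); // bits of a at odd positions
            z |= ((long) ((b >>> i) & 1)) << (2 * i);     // bits of b at even positions
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(interleave(1, 0)); // 2
        System.out.println(interleave(0, 1)); // 1
        System.out.println(interleave(3, 3)); // 15
    }
}
```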

Test results:

| Table | File count | Data size | First query | Second query | Query resources | Data skipping |
|---|---|---|---|---|---|---|
| Hudi table | 50k | 10 billion rows | 76,980 ms | 29,228 ms | 330 cores, 1 TB memory | Scans 50k files |
| Z-order, two-column sort | 50k | 10 billion rows | 4,662 ms | 3,855 ms | 16 cores, 96 GB memory | Scans 4 files |
| Z-order, four-column sort | 50k | 10 billion rows | 6,168 ms | 3,344 ms | 16 cores, 96 GB memory | Scans 6 files |
| Hilbert sort | 50k | 10 billion rows | 5,694 ms | 4,029 ms | 16 cores, 96 GB memory | Scans 5 files |

Brief change log


Verify this pull request

Unit tests added.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


hudi-bot commented Jul 23, 2021

CI report:

  • 133379deca564ca42f10a1f3e59bb4aa17d80964 UNKNOWN
  • 7b975e5 Azure: SUCCESS
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@xiarixiaoyao (Contributor, Author)

RFC-27 is not implemented yet; once RFC-27 is merged, I will update the code to adapt to it.
In this PR, even without data skipping we still achieve a good result, since the data is sorted by Z-order.

The Hilbert implementation will come soon.

@xiarixiaoyao xiarixiaoyao force-pushed the zorder branch 2 times, most recently from 93897c1 to b474498 Compare July 23, 2021 04:36
@xiarixiaoyao xiarixiaoyao changed the title [HUDI-2101][WIP]support z-order for hudi [HUDI-2101]support z-order for hudi Jul 23, 2021
@xiarixiaoyao (Contributor, Author)

@leesf could you please help review this PR?

@xiarixiaoyao xiarixiaoyao changed the title [HUDI-2101]support z-order for hudi [HUDI-2101][RFC-28]support z-order for hudi Jul 23, 2021


Do we need another write type for z-order? My first thought is that z-order is a layout optimization for the table, much like order-by. Since we already have clustering to optimize the layout via order-by, can we implement z-order under clustering? That way we can support two kinds of data clustering:

Clustering table order by col1;
Clustering table zorder by col2;

Contributor Author (@xiarixiaoyao)

We support two operation types:

  1. cluster support
  2. Z-order optimization by overwriting the table.

@pengzhiwei2018 Jul 23, 2021

It makes sense to me now. But one thing I want to understand is: what is the difference between the two operation types you list, if cluster support cannot cover the case of Z-order optimization by overwriting the table?

@xiarixiaoyao (Contributor, Author) Jul 23, 2021

@pengzhiwei2018 You mean SQL? Good suggestion; currently we only support the API. I will add SQL support in another PR. Thanks for your suggestion.


No, I mean that if 1. cluster support can cover the function of 2. Z-order optimization by overwriting the table, we can implement z-order only under clustering. There is no need to introduce a new write type for z-order. As you know, Hudi currently has too many write types, which adds to the cost of use for our users.

Contributor Author (@xiarixiaoyao)

@pengzhiwei2018 Due to the implementation of Hudi itself, optimizing by overwriting the table is more convenient to use, since the cluster operation needs many properties to be set.
Another consideration: as we know, Delta Lake supports Z-order; we want to keep the same shape as the Delta Lake API.

Member

I'm +1 to keep this as a clustering strategy. If needed, we can make the top-level API different. But underneath, the WriteClient API and ActionExecutor will benefit from sharing the same code (I do see a lot of copy-pasted code there right now). Any change we make in one ActionExecutor would have to be copied to the other, which is going to be a challenge.


xiarixiaoyao commented Jul 23, 2021

@allwefantasy could you please help review the Z-order parts of the code? Thanks.

@xiarixiaoyao (Contributor, Author)

@leesf thanks for your review. I will try to resolve all comments.

@xiarixiaoyao (Contributor, Author)

@leesf addressed all comments. Made the code more general, to adapt to both Z-order and Hilbert optimization.

@xiarixiaoyao xiarixiaoyao force-pushed the zorder branch 5 times, most recently from 533d02e to cbc214a Compare July 27, 2021 08:43

xiarixiaoyao commented Jul 27, 2021

@leesf @garyli1019 @nsivabalan @pengzhiwei2018 please help review this PR again, thanks.
This PR implements the following functions:

  1. A new write mode, optimize, which supports optimizing the data layout with Z-order and Hilbert curves.
  2. Clustering mode supports Z-order/Hilbert.
  3. Simple data skipping is implemented (of course, even without data skipping, Z-order/Hilbert alone can accelerate queries). Once RFC-27 is implemented, I will adapt to it.
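The data skipping mentioned above comes down to min/max range pruning per file. A minimal sketch of the idea, with illustrative names rather than Hudi's actual API:

```java
// Minimal sketch of min/max data skipping (names are illustrative, not
// Hudi's actual API): a file can be skipped when the query's value range
// does not overlap the [min, max] statistics collected for that file's
// sort column at write time. Z-order/Hilbert sorting makes these per-file
// ranges narrow, which is what makes the pruning effective.
public class DataSkippingSketch {
    public static boolean canSkipFile(long fileMin, long fileMax, long queryLo, long queryHi) {
        // skip iff [fileMin, fileMax] and [queryLo, queryHi] do not overlap
        return queryHi < fileMin || queryLo > fileMax;
    }

    public static void main(String[] args) {
        System.out.println(canSkipFile(100, 200, 300, 400)); // true: query range above file range
        System.out.println(canSkipFile(100, 200, 150, 160)); // false: ranges overlap
    }
}
```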

@xiarixiaoyao xiarixiaoyao force-pushed the zorder branch 2 times, most recently from 78a02f4 to 6912152 Compare July 28, 2021 01:29

vinothchandar commented Oct 27, 2021

(WIP) Pending items (for my own tracking)

  • Zoptimize.scala to java
  • Using LAYOUT_OPTIMIZE* prefix for all configs, property names instead of DATA_OPTIMIZE_*
  • Should sort columns and optimize columns be separate configs or not?
  • Check out the HoodieTable API changes
  • Protect the query side changes with a flag.

@vinothchandar (Member) left a comment

@xiarixiaoyao I pushed some small changes to config names and such. The only issue to resolve is to move the data skipping config.

  • Add it to the Spark DataSourceReadOptions and remove from HoodieWriteConfig? Then we can rework the code in HoodieFileIndex to be protected by this config, also defensively protect the existing code paths.
  • For updating statistics during writing, we can just check whether layout optimization is enabled? (no need to check data skipping enabled there, since its a query side config).
  • Also, can we make sure this works for Spark SQL as well (not just the datasource)?

What do you think?

Member

We probably need to guard these changes with a config as well.

Contributor Author (@xiarixiaoyao)

Yes, agreed.

@vinothchandar (Member)

Side note: at least from IDE builds, I get the following error when running tests. Did you face this? Have you tried running this on a cluster using a Spark bundle?

(screenshot: IDE build error when running tests)


xiarixiaoyao commented Oct 29, 2021

@vinothchandar thanks for your suggestion.
The question: "Side note: at least from IDE builds, I get the following error when running tests. Did you face this? Have you tried running this on a cluster using a Spark bundle?"

Answer: I used Java 8 to run this code, and of course I tested it on a cluster. But I found that if we run this code in the IDE with Java 11, the IDE throws "package sun.misc does not exist", since the com.sun.* and sun.* packages hold internal JDK classes and should not generally be used by third-party applications. I have now removed all unsafe functions, so it should be OK for Java 11.

Updated the code:
1. Rewrote most of the Scala functions (ZOptimize.scala) in Java in ZCurveOptimizeHelper.java, and removed ZOptimize.scala.
2. Removed the unsafe lexicographical comparison in favor of a plain Java lexicographical comparison; no unsafe classes remain.
3. Data skipping now works with Spark SQL as well, not just the datasource.
4. Introduced a new config ENABLE_DATA_SKIPPING in DataSourceReadOptions to protect the query-side changes.

I think we may not need to move the data skipping config to DataSourceReadOptions; instead we can introduce a new config in DataSourceReadOptions. WDYT?
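The Unsafe-free comparator in item 2 can be sketched as a plain unsigned lexicographic byte[] comparison. A simplified illustration of the idea, not the exact Hudi code:

```java
// Sketch of a lexicographic byte[] comparator without sun.misc.Unsafe:
// compare byte by byte as UNSIGNED values; on a common prefix, the
// shorter array sorts first. Serialized Z-values sorted with this
// comparator keep their space-filling-curve order.
public class LexCompareSketch {
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF); // mask to treat each byte as unsigned
            if (cmp != 0) return cmp;
        }
        return a.length - b.length; // shared prefix: shorter array first
    }

    public static void main(String[] args) {
        System.out.println(compareUnsigned(new byte[]{1}, new byte[]{2}) < 0);           // true
        System.out.println(compareUnsigned(new byte[]{(byte) 0xFF}, new byte[]{1}) > 0); // true: unsigned
        System.out.println(compareUnsigned(new byte[]{1}, new byte[]{1, 0}) < 0);        // true: prefix
    }
}
```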

@xiarixiaoyao (Contributor, Author)

@hudi-bot run azure

@vinothchandar (Member)

we can introduce a config in DataSourceReadOptions. WDYT?

Yes, we can introduce a new one there. But I'm wondering whether we still need the data skipping flag in HoodieWriteConfig? What purpose do you think it will serve? I am fine either way.

Let me know when/if you have tested the latest code on your cluster and verified that the performance gains are still the same (after removing the unsafe code, etc.).

Thanks for working through this with me

@xiarixiaoyao (Contributor, Author)

@vinothchandar
The data skipping flag in HoodieWriteConfig is a write-side flag; the config in DataSourceReadOptions is a read-side flag. I think it's better to keep read and write separate.
I will verify the performance gains today.

I hope we can merge this feature before the 0.10 release, and I will try my best to resolve all the comments.


xiarixiaoyao commented Nov 1, 2021

@vinothchandar I tested the performance of Z-order clustering using the unsafe sort and the Java sort.
Test environment: spark-3.1.1 --master yarn-client --driver-memory 20g --executor-cores 3 --num-executors 70 --executor-memory 8g
Test case: https://help.aliyun.com/document_detail/168137.html?spm=a2c4g.11186623.6.563.10d758ccclYtVb
Test data: 53 GB (100 million rows)
Test results:

| | Unsafe sort (sort time only) | Java sort (sort time only) |
|---|---|---|
| First run | 3.1 min | 2.8 min |
| Second run | 2.9 min | 2.7 min |

The results show no performance degradation.

@xiarixiaoyao (Contributor, Author)

@hudi-bot run azure

@vinothchandar (Member)

Sounds good. I'll take a final pass and land this, this week.

@vinothchandar (Member) left a comment

Only a few minor things. LGTM overall; will push out a few tweaks and land.

@vinothchandar (Member)

@xiarixiaoyao do you plan to add hilbert curves in the 0.10.0 release? (3rd week of nov)

right now, this config is sort of unused?

      .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy")
      .defaultValue("z-order")
      .sinceVersion("0.10.0")
      .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves.");

@vinothchandar (Member)

@hudi-bot run azure

@xiarixiaoyao (Contributor, Author)

@vinothchandar yes, the Hilbert curve is ready; I think there is enough time to add it.
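For reference, a 2-D Hilbert index can be computed with the classic bit-twiddling conversion below. This is a generic sketch of the curve itself, not necessarily the code that landed in this PR:

```java
// Classic (x, y) -> Hilbert distance conversion for an n x n grid, where n
// is a power of two. Like the Z-value, sorting by this distance keeps points
// that are close in 2-D space close in the 1-D sort order; unlike Z-order,
// the Hilbert curve avoids long "jumps" between adjacent quadrants.
public class HilbertSketch {
    public static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = (x & s) > 0 ? 1 : 0;
            long ry = (y & s) > 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            // rotate the quadrant so the curve stays contiguous
            if (ry == 0) {
                if (rx == 1) {
                    x = s - 1 - x;
                    y = s - 1 - y;
                }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // on a 2x2 grid this orientation visits (0,0), (0,1), (1,1), (1,0)
        System.out.println(xy2d(2, 0, 0)); // 0
        System.out.println(xy2d(2, 0, 1)); // 1
        System.out.println(xy2d(2, 1, 1)); // 2
        System.out.println(xy2d(2, 1, 0)); // 3
    }
}
```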

@vinothchandar (Member)

Sounds good. Will land once CI passes; right now it's queued up here:

https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=3067&view=results


xiarixiaoyao commented Nov 9, 2021

@xushiyan
Test results for Z-order.
Query resources: --executor-memory 12G --executor-cores 4 --num-executors 4

| Table | File count | Data size | 1st query | 2nd query | 3rd query (data skipping on) | 4th query (data skipping on) | Data skipping |
|---|---|---|---|---|---|---|---|
| Hudi origin table | 20k | 10 billion | 446,806 ms | 397,131 ms | not supported | not supported | |
| Z-order, two-column sort | 20k | 10 billion | 94,104 ms | 80,778 ms | 5,913 ms | 2,769 ms | 20,000 total files, 5 after skipping (99.975% skipped) |
| Z-order, four-column sort | 20k | 10 billion | 111,393 ms | 91,664 ms | 5,868 ms | 3,186 ms | 20,000 total files, 5 after skipping (99.975% skipped) |

@vinothchandar (Member)

@xiarixiaoyao let's write a blog around Z-ordering? If you are up for it, I can help you with an initial draft and you can add more to it. What do you think?

@xiarixiaoyao (Contributor, Author)

@vinothchandar Of course, let's finish this together.

@xiarixiaoyao xiarixiaoyao deleted the zorder branch December 3, 2021 02:51