[SPARK-30427][SQL] Add config item for limiting partition number when calculating statistics through File System #27129
Conversation
cc: @cloud-fan @wangyum
Test build #116283 has finished for PR 27129 at commit
retest this please
Test build #116294 has finished for PR 27129 at commit
Test build #116362 has finished for PR 27129 at commit
retest this please
Test build #116375 has finished for PR 27129 at commit
retest this please
Test build #116394 has finished for PR 27129 at commit
Test build #116459 has finished for PR 27129 at commit
Test build #116471 has finished for PR 27129 at commit
Test build #116534 has finished for PR 27129 at commit
Test build #116712 has finished for PR 27129 at commit
Test build #116842 has finished for PR 27129 at commit
Test build #117242 has finished for PR 27129 at commit
cc @cloud-fan
Test build #117458 has finished for PR 27129 at commit
@fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistics calculation is very expensive?
4 spaces indent
"@fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistics calculation is very expensive?"
Yea, you are right. The refresh0 method calls listLeafFiles to get all file statuses, and it already implements parallel listing when the number of paths exceeds a threshold.
So the statistics calculation here is actually just a sum of the lengths of all leaf files. I will remove the conf check here.
But for Hive tables, there is currently no place to get the file/directory status directly without accessing the file system, so I think it is still necessary to limit the partition number for the statistics calculation. WDYT?
As for parallel statistics calculation, I am not sure whether it is worthwhile to start a distributed job or multiple threads to do it; the benefit of the statistics may not cover the cost of computing them. WDYT?
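For illustration, a minimal sketch of what the cheap path amounts to once the listing is cached; the helper name is hypothetical, not code from this PR:

```scala
import org.apache.hadoop.fs.FileStatus

// Once InMemoryFileIndex has listed and cached the leaf files, the statistics
// calculation reduces to summing the cached lengths; no further FS calls.
def sizeInBytes(leafFiles: Seq[FileStatus]): BigInt =
  leafFiles.map(f => BigInt(f.getLen)).sum
```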
I don't think it is a good idea to limit the number of partitions.
@gengliangwang So you prefer to do the statistics calculation in parallel in case the partition number exceeds the threshold?
@cloud-fan WDYT?
The file listing is parallel, and the statistics calculation is just summing the file sizes. So there is no need to make the statistics calculation parallel.
I updated my comments minutes after I left them.
"the statistics calculation is just summing the file sizes"
This is true in PruneFileSourcePartitions, because InMemoryFileIndex already does the file listing when the object is constructed in the CatalogFileIndex.filterPartitions method, so I already removed the partition-number check in PruneFileSourcePartitions.
But this is not true in PruneHiveTablePartitions: the size of each Hive table partition needs to be calculated via HDFS if it is not available in the metastore.
Please correct me if I am wrong, thanks.
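A hedged sketch of the expensive path under discussion: deriving a Hive partition's size from the file system when the metastore carries no size. The helper and parameter names are illustrative, not code from this PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def partitionSizeViaFs(partitionLocation: String, hadoopConf: Configuration): Long = {
  val path = new Path(partitionLocation)
  val fs = path.getFileSystem(hadoopConf)
  // getContentSummary walks the directory tree on the file system (e.g. the
  // HDFS NameNode), which is what becomes costly across many partitions.
  fs.getContentSummary(path).getLength
}
```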
…han SQLConf.maxPartNumForStatsCalculateViaFS in rule PruneHiveTablePartitions.
…n PruneFileSourcePartitions.
Test build #117616 has started for PR 27129 at commit
```scala
if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
```
Why do we need to do the calculation here? I think it introduces extra cost.
You mean just leave the tableMeta unchanged (which is table-level metadata without partition pruning) if there is at least one partition whose size is not available in the metastore?
yes
Some discussion about this was in #26805 (comment).
retest this please
Test build #117626 has finished for PR 27129 at commit
retest this please
Test build #117648 has finished for PR 27129 at commit
Test build #117665 has finished for PR 27129 at commit
```scala
if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
```
@fuwhu, you're proposing a configuration to automatically calculate the size? Why don't you just manually run the ANALYZE command to calculate the stats? It seems odd to do this based on the number of partitions.
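For context, the manual route the reviewer refers to is Spark's ANALYZE TABLE command; a sketch with hypothetical table and partition names, assuming a `spark` SparkSession is in scope:

```scala
// NOSCAN computes sizeInBytes without scanning row contents; omit it to also
// gather row counts. Table and partition names here are made up.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2020-01-01') COMPUTE STATISTICS NOSCAN")
```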
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Add the config "spark.sql.statistics.fallBackToFs.maxPartitionNumber" and use it to control whether to calculate statistics through the file system.
Why are the changes needed?
Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of table partitions through the file system (e.g. HDFS), it does not consider the number of partitions. If the number of partitions is huge, calculating the statistics takes a long time, and the result may not be that useful.
It is reasonable to add a config item to cap the number of partitions for which statistics are calculated through the file system.
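A hedged usage sketch: the config name is taken from this PR's description, the threshold value is illustrative, and `spark` is an existing SparkSession:

```scala
// Illustrative threshold: if more partitions than this would need a
// file-system scan for their size, skip the calculation and keep the
// table-level stats instead.
spark.conf.set("spark.sql.statistics.fallBackToFs.maxPartitionNumber", "1000")
```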
Does this PR introduce any user-facing change?
Yes, the statistics of a logical plan may change, which may impact some Spark strategies, e.g. JoinSelection.
How was this patch tested?
Added a new unit test.