[SPARK-30427][SQL] Add config item for limiting partition number when calculating statistics through File System #27129
Conversation
cc: @cloud-fan @wangyum
Test build #116283 has finished for PR 27129 at commit
retest this please
Test build #116294 has finished for PR 27129 at commit
Test build #116362 has finished for PR 27129 at commit
retest this please
Test build #116375 has finished for PR 27129 at commit
retest this please
Test build #116394 has finished for PR 27129 at commit
Test build #116459 has finished for PR 27129 at commit
Test build #116471 has finished for PR 27129 at commit
Test build #116534 has finished for PR 27129 at commit
Test build #116712 has finished for PR 27129 at commit
Test build #116842 has finished for PR 27129 at commit
Test build #117242 has finished for PR 27129 at commit
cc @cloud-fan
Test build #117458 has finished for PR 27129 at commit
@fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistics calculation is very expensive?
4 spaces indent
"@fuwhu InMemoryFileIndex caches all the file statuses on construction. Is it true that the statistics calculation is very expensive?"
Yea, you are right. The refresh0 method calls listLeafFiles to get all file statuses, and it already implements parallel listing when the number of paths exceeds a threshold.
So the statistics calculation here is actually just a sum of the lengths of all leaf files. I will remove the conf check here.
But for Hive tables, there is currently no place to get the file/directory status directly without accessing the file system, so I think it is still necessary to limit the partition number for the statistics calculation. WDYT?
As for parallel statistics calculation, I am not sure whether it is worthwhile to start a distributed job or multiple threads to do it; the benefit of the statistics may not cover the cost of computing them. WDYT?
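For illustration, a minimal sketch of what the cheap path amounts to once the listing is cached; the helper name is hypothetical, not code from this PR:

```scala
import org.apache.hadoop.fs.FileStatus

// Once InMemoryFileIndex has listed and cached the leaf files, the statistics
// calculation reduces to summing the cached lengths; no further FS calls.
def sizeInBytes(leafFiles: Seq[FileStatus]): BigInt =
  leafFiles.map(f => BigInt(f.getLen)).sum
```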
I don't think it is a good idea to limit the number of partitions.
@gengliangwang So you prefer to do the statistics calculation in parallel in case the partition number exceeds the threshold?
@cloud-fan WDYT?
The file listing is parallel, and the statistics calculation is just summing the file sizes. So there is no need to make the statistics calculation parallel.
I updated my comments minutes after I left them.
"the statistics calculation is just summing the file sizes"
This is true in PruneFileSourcePartitions, because InMemoryFileIndex already does the file listing when the object is constructed in the CatalogFileIndex.filterPartitions method, so I already removed the partition-number check in PruneFileSourcePartitions.
But this is not true in PruneHiveTablePartitions: the size of each Hive table partition needs to be calculated via HDFS if it is not available in the metastore.
Please correct me if I am wrong, thanks.
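A hedged sketch of the expensive path under discussion: deriving a Hive partition's size from the file system when the metastore carries no size. The helper and parameter names are illustrative, not code from this PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def partitionSizeViaFs(partitionLocation: String, hadoopConf: Configuration): Long = {
  val path = new Path(partitionLocation)
  val fs = path.getFileSystem(hadoopConf)
  // getContentSummary walks the directory tree on the file system (e.g. the
  // HDFS NameNode), which is what becomes costly across many partitions.
  fs.getContentSummary(path).getLength
}
```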
…han SQLConf.maxPartNumForStatsCalculateViaFS in rule PruneHiveTablePartitions.
…n PruneFileSourcePartitions.
Test build #117616 has started for PR 27129 at commit
```scala
if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
```
Why do we need to do the calculation here? I think it introduces extra cost.
You mean just leave the tableMeta unchanged (which is table-level metadata without partition pruning) if there is at least one partition whose size is not available in the metastore?
yes
Some discussion about this was in #26805 (comment).
retest this please
Test build #117626 has finished for PR 27129 at commit
retest this please
Test build #117648 has finished for PR 27129 at commit
Test build #117665 has finished for PR 27129 at commit
```scala
if (partitionsWithSize.forall(_._2 > 0)) {
  val sizeInBytes = partitionsWithSize.map(_._2).sum
  tableMeta.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
} else if (partitionsWithSize.count(_._2 == 0) <= conf.maxPartNumForStatsCalculateViaFS) {
```
@fuwhu, you're proposing a configuration to automatically calculate the size? Why don't you just manually run the ANALYZE command to calculate the stats? It seems odd to do this based on the number of partitions.
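For context, the manual route the reviewer refers to is Spark's ANALYZE TABLE command; a sketch with hypothetical table and partition names, assuming a `spark` SparkSession is in scope:

```scala
// NOSCAN computes sizeInBytes without scanning row contents; omit it to also
// gather row counts. Table and partition names here are made up.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2020-01-01') COMPUTE STATISTICS NOSCAN")
```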
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Add the config "spark.sql.statistics.fallBackToFs.maxPartitionNumber" and use it to control whether to calculate statistics through the file system.
Why are the changes needed?
Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of table partitions through the file system (e.g. HDFS), it does not consider the number of partitions. If the number of partitions is huge, calculating the statistics takes a long time, and the result may not be that useful.
It is reasonable to add a config item to cap the number of partitions for which statistics are calculated through the file system.
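A hedged usage sketch: the config name is taken from this PR's description, the threshold value is illustrative, and `spark` is an existing SparkSession:

```scala
// Illustrative threshold: if more partitions than this would need a
// file-system scan for their size, skip the calculation and keep the
// table-level stats instead.
spark.conf.set("spark.sql.statistics.fallBackToFs.maxPartitionNumber", "1000")
```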
Does this PR introduce any user-facing change?
Yes, the statistics of a logical plan may change, which may impact some Spark strategies, e.g. JoinSelection.
How was this patch tested?
Added a new unit test.