
Conversation

@lianhuiwang (Contributor):

What changes were proposed in this pull request?

Currently, when only some partitions of a partitioned table are used in a join, we still rely on the table size returned by the metastore to decide whether the join can be converted to a broadcast join.
When a Filter can prune partitions, Hive prunes them before deciding whether to use a broadcast join, based on the HDFS size of the partitions involved in the query. Spark SQL needs the same behavior, which can improve join performance for partitioned tables.
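A minimal sketch of the intended size computation (the names are illustrative, not the exact code in this PR): sum the HDFS size of only the partitions that survive pruning, and feed that into the broadcast-join decision instead of the metastore's table-level size.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sum the on-disk size of only the pruned partitions; `partitionLocations`
// is an illustrative stand-in for the locations of the partitions that
// survive predicate pushdown.
def prunedPartitionsSizeInBytes(
    partitionLocations: Seq[Path],
    hadoopConf: Configuration): Long = {
  partitionLocations.map { location =>
    val fs = location.getFileSystem(hadoopConf)
    fs.getContentSummary(location).getLength
  }.sum
}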

How was this patch tested?

Added unit tests.

@lianhuiwang lianhuiwang changed the title [SPARK-15616] [SQL] Metastore relation should fallback to HDFS size of partitions that are involved in Query for JoinSelection. [SPARK-15616] [SQL] CatalogRelation should fallback to HDFS size of partitions that are involved in Query for JoinSelection. Jun 4, 2017

SparkQA commented Jun 4, 2017

Test build #77711 has finished for PR 18193 at commit 5591a1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DeterminePartitionedTableStats(

val prunedFsRelation =
  fsRelation.copy(location = prunedFileIndex)(sparkSession)
val prunedLogicalRelation = logicalRelation.copy(relation = prunedFsRelation)
val withStats = logicalRelation.catalogTable.map(_.copy(
  stats = Some(CatalogStatistics(sizeInBytes = BigInt(prunedFileIndex.sizeInBytes)))))
Contributor:

Good catch! I think this is a bug and worth a separate PR to fix.

Contributor Author:

Yes, I have created SPARK-20986. Thanks.

customCheckRules
}

/**
Member:

nit: Indentation.

Contributor Author:

Yes, thanks.


SparkQA commented Jun 5, 2017

Test build #77742 has finished for PR 18193 at commit 754af2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 5, 2017

Test build #77745 has finished for PR 18193 at commit c44a589.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 6, 2017

Test build #77764 has finished for PR 18193 at commit 171a9e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case filter @ Filter(condition, relation: CatalogRelation)
        if DDLUtils.isHiveTable(relation.tableMeta) && relation.isPartitioned =>
Contributor:

Is this only for Hive tables? What about data source tables?

Contributor Author:

I think PruneFileSourcePartitions already covers data source tables.

}
}

case class DeterminePartitionedTableStats(
@cloud-fan (Contributor) commented Jun 13, 2017:

Is this kind of a PruneFileSourcePartitions rule for Hive tables?

Contributor Author:

Yes, DeterminePartitionedTableStats is kind of a PruneFileSourcePartitions rule for Hive tables.

Contributor:

Then shall we give it a better name, like PruneHiveTablePartitions?

Contributor Author:

Yes, I will rename it to PruneHiveTablePartitions.

      session.sessionState.conf.sessionLocalTimeZone)
    val hiveTable = HiveClientImpl.toHiveTable(relation.tableMeta)
    val partitions = prunedPartitions.map(HiveClientImpl.toHivePartition(_, hiveTable))
    val sizeInBytes = try {
Contributor:

What if we already have partition-level statistics in the Hive metastore?

Contributor Author:

Even if we already have partition-level statistics, we cannot know the total number of partitions, so we cannot compute the statistics for the pruned partitions.

partitions.map { partition =>
  val fs: FileSystem = partition.getDataLocation.getFileSystem(hadoopConf)
  fs.getContentSummary(partition.getDataLocation).getLength
}.sum
Contributor:

If there are too many partitions, this will be very slow.
Can you add a check for whether the running sum is already larger than the threshold and, if so, break early?
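A sketch of that suggestion (illustrative only; `threshold` would play the role of spark.sql.autoBroadcastJoinThreshold):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Stop scanning HDFS once the running sum already exceeds the broadcast
// threshold: past that point the exact size no longer changes the decision.
def sizeUpToThreshold(
    locations: Seq[Path],
    hadoopConf: Configuration,
    threshold: Long): Long = {
  var sum = 0L
  val it = locations.iterator
  while (it.hasNext && sum <= threshold) {
    val location = it.next()
    sum += location.getFileSystem(hadoopConf).getContentSummary(location).getLength
  }
  sum // may undercount after crossing the threshold, which is fine for this decision
}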

@cloud-fan (Contributor):

The logic looks similar to PruneFileSourcePartitions; can we just merge this new rule into it?

@gatorsmile (Member):

ping @lianhuiwang


SparkQA commented Aug 22, 2017

Test build #80991 has finished for PR 18193 at commit 4344fbd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PruneHiveTablePartitions(

@lianhuiwang (Contributor Author):

@cloud-fan PruneFileSourcePartitions is a rule for data source tables, but we cannot make Hive a data source yet.


SparkQA commented Aug 22, 2017

Test build #80993 has finished for PR 18193 at commit ff59140.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lianhuiwang (Contributor Author):

Retest it please.


SparkQA commented Aug 22, 2017

Test build #80994 has finished for PR 18193 at commit 37b20c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case filter @ Filter(condition, relation: HiveTableRelation)
        if DDLUtils.isHiveTable(relation.tableMeta) && relation.isPartitioned =>
Contributor:

DDLUtils.isHiveTable(relation.tableMeta) is no longer needed

}
}

case class PruneHiveTablePartitions(
Contributor:

Add a TODO that we should merge this rule with PruneFileSourcePartitions once we completely make Hive a data source.

      pruningPredicates,
      session.sessionState.conf.sessionLocalTimeZone)
    val hiveTable = HiveClientImpl.toHiveTable(relation.tableMeta)
    val partitions = prunedPartitions.map(HiveClientImpl.toHivePartition(_, hiveTable))
Contributor:

Do we need to do this? All we need is the partition data location, and we can get it from CatalogTablePartition.storage.locationUri.
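A sketch of what that could look like, assuming prunedPartitions is a Seq[CatalogTablePartition] and storage.locationUri is an Option[java.net.URI]:

// Read each partition's location straight from the catalog, with no
// conversion to Hive partition objects.
val sizeInBytes = prunedPartitions
  .flatMap(_.storage.locationUri) // skip partitions without a location
  .map { uri =>
    val path = new org.apache.hadoop.fs.Path(uri)
    path.getFileSystem(hadoopConf).getContentSummary(path).getLength
  }.sum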

@lianhuiwang (Contributor Author):

@cloud-fan I have addressed your comments. Thanks.


SparkQA commented Aug 25, 2017

Test build #81134 has finished for PR 18193 at commit 118f4bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      pruningPredicates,
      session.sessionState.conf.sessionLocalTimeZone)
    val sizeInBytes = try {
      prunedPartitions.map { part =>
@cenyuhai (Contributor) commented Sep 17, 2017:

I think we should first check whether partition.parameters contains StatsSetupConst.RAW_DATA_SIZE or StatsSetupConst.TOTAL_SIZE. If partition.parameters contains the size of the partition, use it instead of calling getContentSummary against HDFS.

Contributor Author:

@cenyuhai Yes, good idea. I will add it. Thanks.


SparkQA commented Sep 19, 2017

Test build #81937 has finished for PR 18193 at commit fdd63c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 19, 2017

Test build #81939 has finished for PR 18193 at commit fd95fb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val sizeInBytes = try {
  prunedPartitions.map { part =>
    val totalSize = part.parameters.get(StatsSetupConst.TOTAL_SIZE).map(_.toLong)
    val rawDataSize = part.parameters.get(StatsSetupConst.RAW_DATA_SIZE).map(_.toLong)
Contributor:

I think we should use rawDataSize first, because a 1 MB ORC file can hold as much data as a 5 MB text file...
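Continuing the snippet above, the per-partition preference order could look roughly like this (a sketch, not the exact patch):

// Prefer rawDataSize (uncompressed, closest to in-memory size), then
// totalSize (on-disk size from the metastore), then fall back to HDFS.
// `part`, `rawDataSize`, and `totalSize` come from the snippet above;
// the fallback assumes the partition has a location set.
val partSizeInBytes: Long = rawDataSize.filter(_ > 0)
  .orElse(totalSize.filter(_ > 0))
  .getOrElse {
    val path = new org.apache.hadoop.fs.Path(part.storage.locationUri.get)
    path.getFileSystem(hadoopConf).getContentSummary(path).getLength
  }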

Contributor Author:

@cenyuhai Yes, I think what you said is right. Thanks.


SparkQA commented Sep 20, 2017

Test build #81968 has finished for PR 18193 at commit 72f63fa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 29, 2018

Test build #88712 has finished for PR 18193 at commit 72f63fa.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@Achuth17 (Contributor) commented Jul 17, 2018:

@cloud-fan @lianhuiwang @gatorsmile This fix is useful. Is there any update on this?

@cloud-fan (Contributor):

@advancedxy do you want to take over it? The PR looks good but we probably need to fix some tests.

@advancedxy (Contributor):

> @advancedxy do you want to take over it? The PR looks good but we probably need to fix some tests.

OK, I will add it to my backlog. I will take a look at this once #18324 is resolved.

@github-actions (bot):

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 17, 2020
@github-actions github-actions bot closed this Jan 18, 2020
