Conversation

@navis (Contributor) commented Jan 4, 2016

When a directory contains too many (small) files, the whole Spark cluster gets exhausted scheduling the tasks created for each file. A custom input format can handle that, but if you're using the Hive metastore it is hardly an option.
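For context, the same effect can be approximated on the RDD side with Hadoop's stock CombineTextInputFormat rather than this PR's SimpleCombiner; a minimal sketch, where the input path and the 128 MB split cap are illustrative:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Pack many small files into ~128 MB splits so the scheduler sees a few
    // tasks instead of one task per file.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (128L * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/many-small-files",   // illustrative path
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, text) => text.toString }

    println(s"partitions: ${lines.getNumPartitions}")
    sc.stop()
  }
}
```

With the cap set, Hadoop packs co-located small files into a single split, so the job runs a handful of tasks rather than one per file; the drawback, as noted above, is that a Hive metastore table won't pick this input format up.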

@SparkQA commented Jan 4, 2016

Test build #48661 has finished for PR 10572 at commit 055f613.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@SparkQA commented Jan 6, 2016

Test build #48804 has finished for PR 10572 at commit e056332.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@HyukjinKwon (Member) commented

Maybe we should fix the title to match the other PRs, [SPARK-XXXX][SQL] (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).

@davies (Contributor) commented Jun 6, 2016

This is fixed in 2.0; could you close this PR?

@cerisier commented

@davies do you have the commit that fixes this in 2.0?

@HyukjinKwon (Member) commented

Is that #12095?

@jinxing64 commented

@HyukjinKwon
To merge small files, should I tune spark.sql.files.maxPartitionBytes? But IIUC it only works for FileSourceScanExec, so it doesn't take effect when I select from a Hive table. A sketch of what I mean is below.
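A minimal sketch of how those two knobs interact, assuming Spark 2.x and a Parquet-backed Hive table (the table name is hypothetical): spark.sql.hive.convertMetastoreParquet (true by default) converts the scan to the data-source path, which is where spark.sql.files.maxPartitionBytes applies.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-small-files")
  .enableHiveSupport()
  // Pack input files into partitions of up to 128 MB on the file-source path.
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  // Read Parquet-backed Hive tables through the data-source
  // (FileSourceScanExec) path so the setting above takes effect;
  // this is the default in 2.x.
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  .getOrCreate()

val df = spark.sql("SELECT * FROM db.small_files_table")  // hypothetical table
println(df.rdd.getNumPartitions)  // should be far fewer than the file count
```

For tables in formats that aren't converted to the data-source path, the setting indeed has no effect, which matches the behavior described above.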
