Conversation

@navis (Contributor) commented Jan 4, 2016

When a directory contains too many (small) files, the whole Spark cluster gets exhausted scheduling the tasks created for each file. A custom input format can handle that, but if you're using the Hive metastore it is hardly an option.
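For context, the same effect can be approximated on the RDD side with Hadoop's stock CombineTextInputFormat rather than this PR's SimpleCombiner; a minimal sketch, where the input path and the 128 MB split cap are illustrative:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Pack many small files into ~128 MB splits so the scheduler sees a few
    // tasks instead of one task per file.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (128L * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/many-small-files",   // illustrative path
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map { case (_, text) => text.toString }

    println(s"partitions: ${lines.getNumPartitions}")
    sc.stop()
  }
}
```

With the cap set, Hadoop packs co-located small files into a single split, so the job runs a handful of tasks rather than one per file; the drawback, as noted above, is that a Hive metastore table won't pick this input format up.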

@SparkQA commented Jan 4, 2016

Test build #48661 has finished for PR 10572 at commit 055f613.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@SparkQA commented Jan 6, 2016

Test build #48804 has finished for PR 10572 at commit e056332.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@HyukjinKwon (Member) commented

Maybe we should fix the title to match the other PRs, [SPARK-XXXX][SQL] (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).

@davies (Contributor) commented Jun 6, 2016

This is fixed in 2.0; could you close this PR?

@cerisier commented

@davies do you have the commit that fixes this in 2.0?

@HyukjinKwon (Member) commented

Is that #12095?

@jinxing64 commented

@HyukjinKwon
To merge small files, should I tune spark.sql.files.maxPartitionBytes? But IIUC it only works for FileSourceScanExec, so it doesn't take effect when I select from a Hive table. A sketch of what I mean is below.
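A minimal sketch of how those two knobs interact, assuming Spark 2.x and a Parquet-backed Hive table (the table name is hypothetical): spark.sql.hive.convertMetastoreParquet (true by default) converts the scan to the data-source path, which is where spark.sql.files.maxPartitionBytes applies.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-small-files")
  .enableHiveSupport()
  // Pack input files into partitions of up to 128 MB on the file-source path.
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  // Read Parquet-backed Hive tables through the data-source
  // (FileSourceScanExec) path so the setting above takes effect;
  // this is the default in 2.x.
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  .getOrCreate()

val df = spark.sql("SELECT * FROM db.small_files_table")  // hypothetical table
println(df.rdd.getNumPartitions)  // should be far fewer than the file count
```

For tables in formats that aren't converted to the data-source path, the setting indeed has no effect, which matches the behavior described above.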
