[SPARK-22233] [core] Allow user to filter out empty split in HadoopRDD #19464
Conversation
Force-pushed from 3999961 to cf0c350.
Could you please update the title of this PR appropriately? e.g.

@kiszk Any other suggestions, and can this PR be merged?

Interesting. On the one hand, I don't like adding yet another flag that changes behavior, when the user often can't meaningfully decide to set it. There is probably no value in processing an empty partition, sure. Then again, it does change behavior slightly, and I wonder if that impacts assumptions that apps rely on somehow. If there's no reason to expect a downside, we could do this in Spark 3.x, or make the change now but introduce a flag as a safety valve to go back to the old behavior, leaving the default to true. But first, are there any known impacts to skipping the empty partitions?
-   val rawSplits = inputFormat.getSplits(jobContext).toArray
+   var rawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+     rawSplits = rawSplits.filter(_.getLength>0)
Space around operator.
You should filter before making an array.
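For illustration, a minimal sketch of that suggestion in the new-Hadoop-API path (assuming getSplits returns a java.util.List[InputSplit] and an ignoreEmptySplits flag read elsewhere; the names here are illustrative, not the final patch):

import scala.collection.JavaConverters._

// Filter the splits first and keep a single immutable value,
// instead of building an array and reassigning a var afterwards.
val allSplits = inputFormat.getSplits(jobContext).asScala
val rawSplits = if (ignoreEmptySplits) {
  allSplits.filter(_.getLength > 0)
} else {
  allSplits
}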
Does anyone use an empty file to do something?
For example:
sc.textFile("/somepath/*").mapPartitions(....)
Setting this flag to true by default may change the behavior of users' applications.
jiangxb1987
left a comment
This looks reasonable, also cc @cloud-fan
docs/configuration.md
Outdated
This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
</tr>
<tr>
We should add the config to internal/config.
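For example, a boolean entry in core/src/main/scala/org/apache/spark/internal/config/package.scala could look roughly like this (a sketch only; the key name follows a later revision in this thread and the doc text is illustrative):

private[spark] val FILTER_OUT_EMPTY_SPLIT =
  ConfigBuilder("spark.files.filterOutEmptySplit")
    .doc("If true, do not create partitions for input splits that are empty.")
    .booleanConf
    .createWithDefault(false)

The RDD code can then read it with sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT) instead of a raw string key.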
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
How about:
val inputSplits = if (......) {
  inputFormat.getSplits(jobConf, minPartitions).filter(_.getLength > 0)
} else {
  inputFormat.getSplits(jobConf, minPartitions)
}
We should always try not to use var.
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+     inputSplits = inputSplits.filter(_.getLength>0)
nit: spacing around the operator.
assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)

val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
assert(hadoopRDD.partitions.length === 0)
You should clean up the resources you acquired in the test case.
The resources are cleaned up by default in the afterEach function.
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)

val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
We should also add the following test cases (a rough sketch follows after the list):
- Ensure that if no split is empty, we don't lose any splits;
- Ensure that if part of the splits are empty, we remove the splits correctly.
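A rough, untested sketch of those two cases, modeled on the empty-split test above (it assumes the suite's sc has the filtering flag enabled and reuses tempDir and TextOutputFormat from the existing test; the data and expected counts are illustrative):

// No split is empty: we should not lose any splits.
val data = Array(("key1", "a"), ("key2", "a"), ("key3", "b"))
val out1 = new File(tempDir, "output1")
sc.parallelize(data, 2)
  .saveAsHadoopFile[TextOutputFormat[String, String]](out1.getPath)
assert(sc.textFile(new File(out1, "part-*").getPath).partitions.length === 2)

// Some splits are empty: only the empty ones should be dropped.
val out2 = new File(tempDir, "output2")
sc.parallelize(data, 5)  // 3 records spread over 5 slices leaves some part files empty
  .saveAsHadoopFile[TextOutputFormat[String, String]](out2.getPath)
assert(sc.textFile(new File(out2, "part-*").getPath).partitions.length === 3)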
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
I would suggest not using a name that starts with "spark.hadoop"; configurations with this prefix are treated as Hadoop configurations and copied into the Hadoop Configuration, so it would be better to choose another name.
I'd use the spark.files prefix, following spark.files.ignoreCorruptFiles, spark.files.maxPartitionBytes and spark.files.openCostInBytes.
IIUC this issue also existed in

I think the optimisation by

ok to test

Test build #82658 has finished for PR 19464 at commit
…Split and add it to internal/config; add test case
Test build #82672 has finished for PR 19464 at commit
// Ensure that if all of the splits are empty, we remove the splits correctly
val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
Don't hardcode the path separator; use new File(tempDir, "output").
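i.e., something along these lines (a sketch of the same assertions without the hardcoded separator):

val output = new File(tempDir, "output")
val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
assert(new File(output, "part-00000").exists())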
I can't think of any downside, but it's always safe to avoid behavior changes. LGTM
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
Nit: no space after private
This doc is much too verbose for a flag. Just say, "If true, methods like that use HadoopRDD and NewHadoopRDD such as SparkContext.textFiles will not create a partition for input splits that are empty."
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
You can avoid duplicating inputFormat.getSplits(jobConf, minPartitions)
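For example, the getSplits call could be hoisted out of the branch (a sketch; the intermediate name is illustrative):

val allInputSplits = inputFormat.getSplits(jobConf, minPartitions)
val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
  allInputSplits.filter(_.getLength > 0)
} else {
  allInputSplits
}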
}

test("allow user to filter out empty split (old Hadoop API)") {
  val sf = new SparkConf()
sf -> conf. You can fix it above too.
docs/configuration.md
Outdated
then the partitions with small files will be faster than partitions with bigger files.
</td>
</tr>
<tr>
I don't think I'd document this. It should just be a safety-valve flag.

Yeah, we can make this conf an internal conf.
HyukjinKwon
left a comment
LGTM too except for those comments.
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
nit: how about ignoreEmptySplits, to match ignoreCorruptFiles?
}

// Ensure that if all of the splits are empty, we remove the splits correctly
testIgnoreEmptySplits(Array.empty[Tuple2[String, String]], 1, 0, "part-00000", 0)
I'd call it with named arguments, for example,
testIgnoreEmptySplits(
  Array.empty[Tuple2[String, String]],
  numSlices = 1,
  outputSuffix = 0,
  checkPart = "part-00000",
  expectedPartitionNum = 0)

assert(new File(output, checkPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
assert(hadoopRDD.partitions.length === expectedPartitionNum)
}
Could we maybe do something like the below? (not tested)
def testIgnoreEmptySplits(
    data: Array[Tuple2[String, String]],
    actualPartitionNum: Int,
    expectedPart: String,
    expectedPartitionNum: Int): Unit = {
  val output = new File(tempDir, "output")
  sc.parallelize(data, actualPartitionNum)
    .saveAsHadoopFile[TextOutputFormat[String, String]](output.getAbsolutePath)
  assert(new File(output, expectedPart).exists())
  val hadoopRDD = sc.textFile(new File(output, "part-*").getAbsolutePath)
  assert(hadoopRDD.partitions.length === expectedPartitionNum)
}
...
testIgnoreEmptySplits(
  data = Array.empty[Tuple2[String, String]],
  actualPartitionNum = 1,
  expectedPart = "part-00000",
  expectedPartitionNum = 0)
Actually, the previous tests were okay to me as well.
val output = new File(tempDir, "output" + outputSuffix)
dataRDD.saveAsNewAPIHadoopFile[NewTextOutputFormat[String, String]](output.getPath)
assert(new File(output, checkPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-r-*").getPath)
I think we should read it with the new Hadoop API to test NewHadoopRDD, I guess?
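That would mean reading back through something like sc.newAPIHadoopFile, so the assertion exercises NewHadoopRDD rather than HadoopRDD (a sketch, assuming the suite's NewTextInputFormat alias for the new-API TextInputFormat and the usual LongWritable/Text imports):

val hadoopRDD = sc.newAPIHadoopFile(
  new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
assert(hadoopRDD.partitions.length === expectedPartitionNum)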
Test build #82694 has finished for PR 19464 at commit

Test build #82696 has finished for PR 19464 at commit
  .createWithDefault(4 * 1024 * 1024)

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
  .doc("If true, methods like that use HadoopRDD and NewHadoopRDD such as " +
like that -> that
conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
sc = new SparkContext(conf)

def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
nit: one argument per line.
conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
sc = new SparkContext(conf)

def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
ditto.
Test build #82716 has finished for PR 19464 at commit

Test build #82726 has finished for PR 19464 at commit
val output = new File(tempDir, "output")
sc.parallelize(data, actualPartitionNum)
  .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
assert(new File(output, expectedPart).exists() === true)
I don't think we need the expectedPart parameter; just:
for (i <- 0 until actualPartitionNum) {
  assert(new File(output, s"part-0000$i").exists() === true)
}
assert(new File(output, expectedPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
assert(hadoopRDD.partitions.length === expectedPartitionNum)
Utils.deleteRecursively(output)
Maybe:
try {
  ...
} finally {
  Utils.deleteRecursively(output)
}
I don't think we need try...finally here, because Utils.deleteRecursively(output) is only there to ensure the next invocation of testIgnoreEmptySplits succeeds. When the test finishes, whether it passes or not, the tempDir will be deleted in FileSuite.afterEach().
data: Array[Tuple2[String, String]],
actualPartitionNum: Int,
expectedPart: String,
expectedPartitionNum: Int): Unit = {
Indentation..
def testIgnoreEmptySplits(
    data: Array[Tuple2[String, String]],
    ...
    expectedPartitionNum: Int): Unit = {
  val output = new File(tempDir, "output")
  ...
LGTM

Test build #82752 has finished for PR 19464 at commit
HyukjinKwon
left a comment
Let's fix this nit when we change the code around here next time.
}
val hadoopRDD = sc.newAPIHadoopFile(new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text])
  .asInstanceOf[NewHadoopRDD[_, _]]
nit:
val hadoopRDD = sc.newAPIHadoopFile(
  new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).asInstanceOf[NewHadoopRDD[_, _]]
Merged to master.
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
This config should be made internal, and the name should be improved because it's not about spark files.
I'll send a follow-up PR to fix this.
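A sketch of what such a follow-up could look like, marking the entry internal via ConfigBuilder.internal() and moving it out of the spark.files namespace (the exact key name and wording are decided in the follow-up PR, so the ones below are illustrative):

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
  .internal()
  .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
  .booleanConf
  .createWithDefault(false)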
…n HadoopRDD

## What changes were proposed in this pull request?

Update the config `spark.files.ignoreEmptySplits`, rename it and make it internal. This is a followup of apache#19464.

## How was this patch tested?

Existing tests.

Author: Xingbo Jiang <[email protected]>

Closes apache#19504 from jiangxb1987/partitionsplit.
What changes were proposed in this pull request?
Add a flag spark.files.ignoreEmptySplits. When true, methods that use HadoopRDD and NewHadoopRDD, such as SparkContext.textFile, will not create a partition for input splits that are empty.
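A minimal usage sketch of the behavior described above (the input path is hypothetical, and note that the follow-up PR mentioned earlier renames the config and makes it internal):

import org.apache.spark.{SparkConf, SparkContext}

object IgnoreEmptySplitsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ignore-empty-splits-demo")
      .setMaster("local[*]")
      .set("spark.files.ignoreEmptySplits", "true")  // the key as introduced by this PR
    val sc = new SparkContext(conf)

    // Hypothetical directory containing some empty part files.
    val rdd = sc.textFile("/tmp/job-output/part-*")
    // With the flag enabled, empty input splits no longer become empty partitions.
    println(rdd.partitions.length)
    sc.stop()
  }
}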