[SPARK-22233] [core] Allow user to filter out empty split in HadoopRDD #19464
Conversation
Force-pushed from 3999961 to cf0c350.
Could you please update the title of this PR appropriately? e.g.

@kiszk Any other suggestions, and can this PR be merged?

Interesting. On the one hand, I don't like adding yet another flag that changes behavior, when the user often can't meaningfully decide to set it. There is probably no value in processing an empty partition, sure. Then again, it does change behavior slightly, and I wonder if that impacts assumptions that apps rely on somehow. If there's no reason to expect a downside, we could do this in Spark 3.x, or make the change now but introduce a flag as a safety valve to go back to the old behavior, leaving the default to true. But first, are there any known impacts to skipping the empty partitions?
-   val rawSplits = inputFormat.getSplits(jobContext).toArray
+   var rawSplits = inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+     rawSplits = rawSplits.filter(_.getLength>0)
Space around operator.
You should filter before making an array.
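For illustration, a minimal sketch of that suggestion in the new-Hadoop-API path (assuming getSplits returns a java.util.List[InputSplit] and an ignoreEmptySplits flag read elsewhere; the names here are illustrative, not the final patch):

import scala.collection.JavaConverters._

// Filter the splits first and keep a single immutable value,
// instead of building an array and reassigning a var afterwards.
val allSplits = inputFormat.getSplits(jobContext).asScala
val rawSplits = if (ignoreEmptySplits) {
  allSplits.filter(_.getLength > 0)
} else {
  allSplits
}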
Does anyone use an empty file to do something?
For example:
sc.textFile("/somepath/*").mapPartitions(....)
Setting this flag to true by default may change the behavior of users' applications.
jiangxb1987
left a comment
This looks reasonable, also cc @cloud-fan
docs/configuration.md
Outdated
This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
</tr>
<tr>
We should add the config to internal/config.
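For example, a boolean entry in core/src/main/scala/org/apache/spark/internal/config/package.scala could look roughly like this (a sketch only; the key name follows a later revision in this thread and the doc text is illustrative):

private[spark] val FILTER_OUT_EMPTY_SPLIT =
  ConfigBuilder("spark.files.filterOutEmptySplit")
    .doc("If true, do not create partitions for input splits that are empty.")
    .booleanConf
    .createWithDefault(false)

The RDD code can then read it with sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT) instead of a raw string key.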
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
How about:
val inputSplits = if (......) {
  inputFormat.getSplits(jobConf, minPartitions).filter(_.getLength > 0)
} else {
  inputFormat.getSplits(jobConf, minPartitions)
}
We should always try not to use var.
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+     inputSplits = inputSplits.filter(_.getLength>0)
nit: spacing around the operator.
assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)

val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
assert(hadoopRDD.partitions.length === 0)
You should clean up the resources you acquired in the test case.
The resources are cleaned up by default in the afterEach function.
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
assert(new File(tempDir.getPath + "/output/part-00000").exists() === true)

val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-00000")
We should also add the following test cases (a rough sketch follows after the list):
- Ensure that if no split is empty, we don't lose any splits;
- Ensure that if part of the splits are empty, we remove the splits correctly.
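A rough, untested sketch of those two cases, modeled on the empty-split test above (it assumes the suite's sc has the filtering flag enabled and reuses tempDir and TextOutputFormat from the existing test; the data and expected counts are illustrative):

// No split is empty: we should not lose any splits.
val data = Array(("key1", "a"), ("key2", "a"), ("key3", "b"))
val out1 = new File(tempDir, "output1")
sc.parallelize(data, 2)
  .saveAsHadoopFile[TextOutputFormat[String, String]](out1.getPath)
assert(sc.textFile(new File(out1, "part-*").getPath).partitions.length === 2)

// Some splits are empty: only the empty ones should be dropped.
val out2 = new File(tempDir, "output2")
sc.parallelize(data, 5)  // 3 records spread over 5 slices leaves some part files empty
  .saveAsHadoopFile[TextOutputFormat[String, String]](out2.getPath)
assert(sc.textFile(new File(out2, "part-*").getPath).partitions.length === 3)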
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   if (sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
I would suggest not using a name that starts with "spark.hadoop"; configurations with this prefix are treated as Hadoop configurations and copied into the Hadoop Configuration, so it would be better to choose another name.
I'd use the spark.files prefix, following spark.files.ignoreCorruptFiles, spark.files.maxPartitionBytes and spark.files.openCostInBytes.
IIUC this issue also existed in

I think the optimisation by

ok to test

Test build #82658 has finished for PR 19464 at commit
…Split and add it to internal/config; add test case
Test build #82672 has finished for PR 19464 at commit
// Ensure that if all of the splits are empty, we remove the splits correctly
val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](tempDir.getPath + "/output")
Don't hardcode the path separator; use new File(tempDir, "output").
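i.e., something along these lines (a sketch of the same assertions without the hardcoded separator):

val output = new File(tempDir, "output")
val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
emptyRDD.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
assert(new File(output, "part-00000").exists())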
I can't think of any downside, but it's always safe to avoid behavior changes. LGTM
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
Nit: no space after private
This doc is much too verbose for a flag. Just say, "If true, methods like that use HadoopRDD and NewHadoopRDD such as SparkContext.textFiles will not create a partition for input splits that are empty."
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
-   val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+   val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
You can avoid duplicating inputFormat.getSplits(jobConf, minPartitions)
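For example, the getSplits call could be hoisted out of the branch (a sketch; the intermediate name is illustrative):

val allInputSplits = inputFormat.getSplits(jobConf, minPartitions)
val inputSplits = if (sparkContext.getConf.get(FILTER_OUT_EMPTY_SPLIT)) {
  allInputSplits.filter(_.getLength > 0)
} else {
  allInputSplits
}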
}

test("allow user to filter out empty split (old Hadoop API)") {
  val sf = new SparkConf()
sf -> conf. You can fix it above too.
docs/configuration.md
Outdated
then the partitions with small files will be faster than partitions with bigger files.
</td>
</tr>
<tr>
I don't think I'd document this. It should just be a safety-valve flag.

Yeah, we can make this conf an internal conf.
HyukjinKwon
left a comment
LGTM too except for those comments.
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private [spark] val FILTER_OUT_EMPTY_SPLIT = ConfigBuilder("spark.files.filterOutEmptySplit")
nit: how about ignoreEmptySplits, to match ignoreCorruptFiles?
}

// Ensure that if all of the splits are empty, we remove the splits correctly
testIgnoreEmptySplits(Array.empty[Tuple2[String, String]], 1, 0, "part-00000", 0)
I'd call it with named arguments, for example,
testIgnoreEmptySplits(
  Array.empty[Tuple2[String, String]],
  numSlices = 1,
  outputSuffix = 0,
  checkPart = "part-00000",
  expectedPartitionNum = 0)

assert(new File(output, checkPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
assert(hadoopRDD.partitions.length === expectedPartitionNum)
}
Could we maybe do something like the below? (not tested)
def testIgnoreEmptySplits(
    data: Array[Tuple2[String, String]],
    actualPartitionNum: Int,
    expectedPart: String,
    expectedPartitionNum: Int): Unit = {
  val output = new File(tempDir, "output")
  sc.parallelize(data, actualPartitionNum)
    .saveAsHadoopFile[TextOutputFormat[String, String]](output.getAbsolutePath)
  assert(new File(output, expectedPart).exists())
  val hadoopRDD = sc.textFile(new File(output, "part-*").getAbsolutePath)
  assert(hadoopRDD.partitions.length === expectedPartitionNum)
}
...
testIgnoreEmptySplits(
  data = Array.empty[Tuple2[String, String]],
  actualPartitionNum = 1,
  expectedPart = "part-00000",
  expectedPartitionNum = 0)
Actually, the previous tests were okay to me as well.
val output = new File(tempDir, "output" + outputSuffix)
dataRDD.saveAsNewAPIHadoopFile[NewTextOutputFormat[String, String]](output.getPath)
assert(new File(output, checkPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-r-*").getPath)
I think we should read it with the new Hadoop API to test NewHadoopRDD, I guess?
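That would mean reading back through something like sc.newAPIHadoopFile, so the assertion exercises NewHadoopRDD rather than HadoopRDD (a sketch, assuming the suite's NewTextInputFormat alias for the new-API TextInputFormat and the usual LongWritable/Text imports):

val hadoopRDD = sc.newAPIHadoopFile(
  new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
assert(hadoopRDD.partitions.length === expectedPartitionNum)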
Test build #82694 has finished for PR 19464 at commit

Test build #82696 has finished for PR 19464 at commit
  .createWithDefault(4 * 1024 * 1024)

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
  .doc("If true, methods like that use HadoopRDD and NewHadoopRDD such as " +
like that -> that
conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
sc = new SparkContext(conf)

def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
nit: one argument per line.
conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, true)
sc = new SparkContext(conf)

def testIgnoreEmptySplits(data: Array[Tuple2[String, String]], numSlices: Int,
ditto.
Test build #82716 has finished for PR 19464 at commit

Test build #82726 has finished for PR 19464 at commit
val output = new File(tempDir, "output")
sc.parallelize(data, actualPartitionNum)
  .saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
assert(new File(output, expectedPart).exists() === true)
I don't think we need the expectedPart parameter; just:
for (i <- 0 until actualPartitionNum) {
  assert(new File(output, s"part-0000$i").exists() === true)
}
assert(new File(output, expectedPart).exists() === true)
val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
assert(hadoopRDD.partitions.length === expectedPartitionNum)
Utils.deleteRecursively(output)
Maybe:
try {
  ...
} finally {
  Utils.deleteRecursively(output)
}
I don't think we need try...finally here, because Utils.deleteRecursively(output) is only there to ensure the next invocation of testIgnoreEmptySplits succeeds. When the test finishes, whether it passes or not, the tempDir will be deleted in FileSuite.afterEach().
data: Array[Tuple2[String, String]],
actualPartitionNum: Int,
expectedPart: String,
expectedPartitionNum: Int): Unit = {
Indentation..
def testIgnoreEmptySplits(
    data: Array[Tuple2[String, String]],
    ...
    expectedPartitionNum: Int): Unit = {
  val output = new File(tempDir, "output")
  ...
LGTM

Test build #82752 has finished for PR 19464 at commit
HyukjinKwon
left a comment
Let's fix this nit when we change the code around here next time.
}
val hadoopRDD = sc.newAPIHadoopFile(new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text])
  .asInstanceOf[NewHadoopRDD[_, _]]
nit:
val hadoopRDD = sc.newAPIHadoopFile(
  new File(output, "part-r-*").getPath,
  classOf[NewTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).asInstanceOf[NewHadoopRDD[_, _]]
Merged to master.
  .longConf
  .createWithDefault(4 * 1024 * 1024)

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
This config should be made internal, and the name should be improved because it's not about spark files.
I'll send a follow-up PR to fix this.
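A sketch of what such a follow-up could look like, marking the entry internal via ConfigBuilder.internal() and moving it out of the spark.files namespace (the exact key name and wording are decided in the follow-up PR, so the ones below are illustrative):

private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
  .internal()
  .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
  .booleanConf
  .createWithDefault(false)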
…n HadoopRDD

## What changes were proposed in this pull request?

Update the config `spark.files.ignoreEmptySplits`, rename it and make it internal. This is a followup of apache#19464.

## How was this patch tested?

Existing tests.

Author: Xingbo Jiang <[email protected]>

Closes apache#19504 from jiangxb1987/partitionsplit.
What changes were proposed in this pull request?
Add a flag spark.files.ignoreEmptySplits. When true, methods that use HadoopRDD and NewHadoopRDD, such as SparkContext.textFile, will not create a partition for input splits that are empty.
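A minimal usage sketch of the behavior described above (the input path is hypothetical, and note that the follow-up PR mentioned earlier renames the config and makes it internal):

import org.apache.spark.{SparkConf, SparkContext}

object IgnoreEmptySplitsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ignore-empty-splits-demo")
      .setMaster("local[*]")
      .set("spark.files.ignoreEmptySplits", "true")  // the key as introduced by this PR
    val sc = new SparkContext(conf)

    // Hypothetical directory containing some empty part files.
    val rdd = sc.textFile("/tmp/job-output/part-*")
    // With the flag enabled, empty input splits no longer become empty partitions.
    println(rdd.partitions.length)
    sc.stop()
  }
}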