[SPARK-27534][SQL] Do not load content column in binary data source if it is not selected
#24473
Conversation
```
  case CONTENT => readContent
  case name => throw new RuntimeException(s"Unexpected field name: ${name}")
}
InternalRow(values: _*)
```
Do we need to change `inferSchema()`, or should it still return the `content` field with null values? cc: @cloud-fan
What about adding a "keep invalid" option: when a file read fails, fill the `content` column with null? Currently, when a file fails to load, the whole data source load breaks.
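A minimal sketch of how that hypothetical "keep invalid" behavior could look (not implemented in this PR; `readContent` is the content-reading helper from the diffs below, and the option itself is only the commenter's suggestion):
```
import java.io.IOException

// Hypothetical "keep invalid" handling: on a read failure, keep the row and
// emit null for content instead of failing the whole scan.
val content: Array[Byte] =
  try {
    readContent
  } catch {
    case _: IOException => null // unreadable file: content becomes null
  }
```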
```
  case PATH => UTF8String.fromString(path)
  case LENGTH => status.getLen
  case MODIFICATION_TIME => DateTimeUtils.fromMillis(status.getModificationTime)
  case CONTENT => readContent
```
I don't see a strong reason to prune other columns that are inexpensive. The code is much simpler if we only prune `content`.
But I think the current code is simpler. The previous version contained code that is hard to read:
```
val fullOutput = dataSchema.map { f =>
  AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
}
val requiredOutput = fullOutput.filter { a =>
  requiredSchema.fieldNames.contains(a.name)
}
val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)
...
Iterator(requiredColumns(internalRow))
```
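For contrast, a sketch of the pruning approach this PR takes, assembled from the diff hunks quoted above (`PATH`, `LENGTH`, `MODIFICATION_TIME`, and `CONTENT` are the field-name constants, and `readContent` lazily reads the file bytes): only the requested fields are materialized, so no separate projection step is needed.
```
// Build values only for the fields the query actually selected; the file is
// opened only when the CONTENT field is requested.
val values = requiredSchema.fieldNames.map {
  case PATH => UTF8String.fromString(path)
  case LENGTH => status.getLen
  case MODIFICATION_TIME => DateTimeUtils.fromMillis(status.getModificationTime)
  case CONTENT => readContent
  case name => throw new RuntimeException(s"Unexpected field name: ${name}")
}
Iterator.single(InternalRow(values: _*))
```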
```
test("genPrunedRow") {
```
Can we just test buildReader on one file?
Do we need to test (and if so, how) that the file is actually not read when `content` is pruned?
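One possible approach, as a sketch (assuming a ScalaTest suite with a `spark` session and a local test `file: java.io.File`; this mirrors the `setReadable(false)` trick used later in the suite): revoke read permission, so a metadata-only query can only succeed if the file is never opened.
```
// Make any attempt to open the file fail.
file.setReadable(false)

val df = spark.read.format("binaryFile").load(file.getPath)

// A metadata-only query succeeds, implying the file content was never read.
assert(df.select("length").head().getLong(0) === file.length())

// Selecting content now fails, confirming the read is triggered by content.
intercept[org.apache.spark.SparkException] {
  df.select("content").head()
}
```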
Test build #104942 has finished for PR 24473 at commit
Test build #104947 has finished for PR 24473 at commit
@WeichenXu123 I pushed some changes. @cloud-fan Could you help review? Thanks!
Test build #104956 has finished for PR 24473 at commit
retest this please.
```
- Iterator(requiredColumns(internalRow))
+ Iterator.single(InternalRow(values: _*))
```
Why don't we project to an unsafe row like before?
Does projecting to an unsafe row improve performance?
It seems `dataSchema` is not used. Is it possible that the required schema contains fields that don't exist in `dataSchema`?
I recall that we return unsafe rows where possible.
Agreed with @viirya: unsafe rows can be more space-efficient, and if there is no gain from this change, we'd better keep it unchanged.
Updated to use `UnsafeRowWriter`.
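For reference, a minimal sketch of row construction with `UnsafeRowWriter` (the field-name constants, `requiredSchema`, `status`, `path`, and `readContent` are assumed from the surrounding code; this is a sketch, not necessarily the exact merged code):
```
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

// One writer per row; field ordinals follow the pruned (required) schema.
val writer = new UnsafeRowWriter(requiredSchema.length)
writer.resetRowWriter()
requiredSchema.fieldNames.zipWithIndex.foreach {
  case (PATH, i) => writer.write(i, UTF8String.fromString(path))
  case (LENGTH, i) => writer.write(i, status.getLen)
  case (MODIFICATION_TIME, i) =>
    writer.write(i, DateTimeUtils.fromMillis(status.getModificationTime))
  case (CONTENT, i) => writer.write(i, readContent) // file read happens only here
  case (name, _) => throw new RuntimeException(s"Unexpected field name: ${name}")
}
Iterator.single(writer.getRow)
```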
Test build #104959 has finished for PR 24473 at commit
LGTM.
```
- }
- val requiredOutput = fullOutput.filter { a =>
-   requiredSchema.fieldNames.contains(a.name)
+ if (pathGlobPattern.forall(new GlobFilter(_).accept(path))) {
```
Can we make it one if-else while we're here? For instance:
```
val isPatternMatched = pathGlobPattern.forall(new GlobFilter(_).accept(fsPath))
// These vals are intentionally lazy to avoid unnecessary file access via short-circuiting.
lazy val fs = fsPath.getFileSystem(broadcastedHadoopConf.value.value)
lazy val fileStatus = fs.getFileStatus(fsPath)
lazy val shouldNotFilterOut = filterFuncs.forall(_.apply(fileStatus))
if (isPatternMatched && shouldNotFilterOut) {
  ...
  Iterator(requiredColumns(internalRow))
} else {
  Iterator.empty[InternalRow]
}
```
In the current implementation, `getFileStatus` and `filterFuncs` are not touched if the path doesn't match.
Yes, this suggestion doesn't touch either of them, because they are lazy.
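A tiny self-contained illustration of that short-circuiting point (hypothetical names, not from the PR):
```
// The lazy val's body never runs if the left side of && is already false.
lazy val expensiveCheck: Boolean = { println("file accessed"); true }
val patternMatched = false
if (patternMatched && expensiveCheck) println("both passed")
// Nothing is printed: expensiveCheck was never evaluated.
```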
```
  assert(p.asInstanceOf[String].endsWith(file.getAbsolutePath))
}
file.setReadable(false)
withClue("cannot read content") {
```
Looks like both positive and negative cases are within one test. Can we split them?
Is it necessary? The tests are already grouped by function/feature.
Not necessary, but why don't we make the test cases simple and separate? :)
We can still retain the grouping too, for instance, `column pruning - positive` and `column pruning - negative`. Not a big deal, but I don't think it's difficult or too demanding to fix.
Looks fine to me too otherwise.
HyukjinKwon left a comment:
Looks good, since my remaining comments are rather about style and cleanup. I will leave it to you, @WeichenXu123 and @mengxr.
@HyukjinKwon Thanks for the review! I'm merging this into master. I do think the suggested changes are unnecessary. For example, two …
[SPARK-27534][SQL] Do not load content column in binary data source if it is not selected
## What changes were proposed in this pull request?
A follow-up task from SPARK-25348. To save I/O cost, Spark shouldn't attempt to read the file if users didn't request the `content` column. For example:
```
spark.read.format("binaryFile").load(path).filter($"length" < 1000000).count()
```
## How was this patch tested?
Unit test added.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Closes apache#24473 from WeichenXu123/SPARK-27534.
Lead-authored-by: Xiangrui Meng <[email protected]>
Co-authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>