Conversation

@WeichenXu123
Contributor

What changes were proposed in this pull request?

A follow-up task from SPARK-25348. To save I/O cost, Spark shouldn't attempt to read the file if users didn't request the `content` column. For example:

spark.read.format("binaryFile").load(path).filter($"length" < 1000000).count()
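
As a rough sketch of the idea (simplified from the actual reader; `status`, `path`, and `readContent` stand in for the reader's locals, and PATH, LENGTH, etc. are the field-name constants), only the `content` branch touches the file:

```
// Sketch: build each required field by name; file I/O happens only
// when `content` is actually requested.
val values = requiredSchema.fieldNames.map {
  case PATH => UTF8String.fromString(path)
  case LENGTH => status.getLen
  case MODIFICATION_TIME => DateTimeUtils.fromMillis(status.getModificationTime)
  case CONTENT => readContent // the only branch that opens the file
  case other => throw new RuntimeException(s"Unexpected field name: ${other}")
}
InternalRow(values: _*)
```

So the `filter($"length" < 1000000).count()` query above should never need to open any file.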

How was this patch tested?

Unit test added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

case CONTENT => readContent
case name => throw new RuntimeException(s"Unexpected field name: ${name}")
}
InternalRow(values: _*)
Contributor


Do we need to change inferSchema(), or should it still return the content field with null values? cc: @cloud-fan

Contributor Author


What about adding a "keep invalid" option: when a file read error occurs, fill the content column with null? Right now, when a file fails to load, the whole datasource load breaks.
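
Purely to illustrate the proposal (the `keepInvalid` option name and semantics are hypothetical, not an existing API):

```
// Hypothetical: `keepInvalid` does not exist; this only sketches the idea
// that a read error would yield content = null instead of a failed load.
val df = spark.read
  .format("binaryFile")
  .option("keepInvalid", "true")
  .load(path)
```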

case PATH => UTF8String.fromString(path)
case LENGTH => status.getLen
case MODIFICATION_TIME => DateTimeUtils.fromMillis(status.getModificationTime)
case CONTENT => readContent
Contributor


I don't see a strong reason to prune other columns that are inexpensive. Code is much simpler if we only prune content.

Contributor Author


But I think the current code is simpler.
The previous code contained parts that were hard to read:

          val fullOutput = dataSchema.map { f =>
            AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
          }
          val requiredOutput = fullOutput.filter { a =>
            requiredSchema.fieldNames.contains(a.name)
          }
          val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)
          ...
          Iterator(requiredColumns(internalRow))

}


test("genPrunedRow") {
Contributor


Can we just test buildReader on one file?

Contributor Author


Do we need to (and how would we) test that, when content is pruned, the file is actually not read?
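
One way (a rough sketch, assuming a ScalaTest suite with a SparkSession `spark` and a temp file `file` on a local filesystem): revoke read permission on the file, then check that a query without `content` still succeeds while one selecting `content` fails.

```
// If the reader never opens the file, revoking read permission should
// not affect a query that doesn't select `content`.
file.setReadable(false)
val df = spark.read.format("binaryFile").load(file.getPath)
assert(df.count() === 1)  // no content requested, so no file I/O
assert(df.select("length").first().getLong(0) === file.length())
intercept[org.apache.spark.SparkException] {
  df.select("content").first()  // must actually read the file
}
```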

@SparkQA

SparkQA commented Apr 26, 2019

Test build #104942 has finished for PR 24473 at commit d9bfdde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 27, 2019

Test build #104947 has finished for PR 24473 at commit e26053d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Apr 27, 2019

@WeichenXu123 I pushed some changes. @cloud-fan Could you help review? Thanks!

@SparkQA

SparkQA commented Apr 27, 2019

Test build #104956 has finished for PR 24473 at commit 4b02637.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Apr 27, 2019

retest this please.

)

Iterator(requiredColumns(internalRow))
Iterator.single(InternalRow(values: _*))
Member


Why don't we project to an unsafe row like before?

Contributor Author


Does projecting to an unsafe row improve performance?



It seems dataSchema is not used. Is it possible that the required schema contains fields that don't exist in dataSchema?

Member


I recall that we return unsafe rows where possible.



Agree with @viirya: unsafe rows can be more space-efficient, and if there is no gain from this change, we'd better keep it unchanged.

Contributor


Updated to use UnsafeRowWriter.
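
For reference, a minimal sketch of what that looks like (condensed from the snippets quoted in this thread; `status`, `path`, and `readContent` are the reader's locals):

```
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter

// Write the pruned fields straight into an UnsafeRow buffer instead of
// allocating a generic InternalRow.
val writer = new UnsafeRowWriter(requiredSchema.length)
writer.resetRowWriter()
requiredSchema.fieldNames.zipWithIndex.foreach {
  case (PATH, i) => writer.write(i, UTF8String.fromString(path))
  case (LENGTH, i) => writer.write(i, status.getLen)
  case (MODIFICATION_TIME, i) =>
    writer.write(i, DateTimeUtils.fromMillis(status.getModificationTime))
  case (CONTENT, i) => writer.write(i, readContent) // byte[] payload
  case (other, _) => throw new RuntimeException(s"Unexpected field name: ${other}")
}
Iterator.single(writer.getRow)
```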

@SparkQA

SparkQA commented Apr 27, 2019

Test build #104959 has finished for PR 24473 at commit 4b02637.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

LGTM.

}
val requiredOutput = fullOutput.filter { a =>
requiredSchema.fieldNames.contains(a.name)
if (pathGlobPattern.forall(new GlobFilter(_).accept(path))) {
Member


Can we make it one if-else while we're here? For instance,

val isPatternMatched = pathGlobPattern.forall(new GlobFilter(_).accept(fsPath))

// These vals are intentionally lazy to avoid unnecessary file access via short-circuiting. 
lazy val fs = fsPath.getFileSystem(broadcastedHadoopConf.value.value)
lazy val fileStatus = fs.getFileStatus(fsPath)
lazy val shouldNotFilterOut = filterFuncs.forall(_.apply(fileStatus))
      
if (isPatternMatched && shouldNotFilterOut) {
  ...
  Iterator(requiredColumns(internalRow))
} else {
  Iterator.empty[InternalRow]
}

Contributor

@mengxr commented Apr 28, 2019


In the current impl, getFileStatus and filterFuncs are not touched if path doesn't match.

Member


Yes, this suggestion doesn't touch either of them in that case, because they are lazy.

assert(p.asInstanceOf[String].endsWith(file.getAbsolutePath))
}
file.setReadable(false)
withClue("cannot read content") {
Member


Looks like both positive and negative cases are within one test. Can we split them?

Contributor


Is it necessary? The tests are already grouped by function/feature.

Member

@HyukjinKwon commented Apr 28, 2019


Not necessary, but why don't we make the test cases simple and separate :)

Member


We can still retain the grouping too, for instance, `column pruning - positive` and `column pruning - negative`. Not a big deal, but I don't think it's difficult or too demanding to fix.
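
For instance (a sketch; the test bodies are elided):

```
test("column pruning - positive") {
  // queries that don't select `content` succeed without reading the file
}

test("column pruning - negative") {
  // selecting `content` on an unreadable file fails
}
```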

@HyukjinKwon
Member

Looks fine to me too otherwise.

Member

@HyukjinKwon left a comment


Looks good, since my remaining comments are rather style and cleanup. I will leave it to you @WeichenXu123 and @mengxr.

@mengxr
Contributor

mengxr commented Apr 28, 2019

@HyukjinKwon Thanks for the review! I'm merging this into master. I do think the suggested changes are unnecessary; for example, two if branches vs. a couple of lazy vals with a comment to explain why they are lazy.

@asfgit asfgit closed this in 20a3ef7 Apr 28, 2019
@WeichenXu123 WeichenXu123 deleted the SPARK-27534 branch April 28, 2019 18:30
lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
… if it is not selected

Closes apache#24473 from WeichenXu123/SPARK-27534.

Lead-authored-by: Xiangrui Meng <[email protected]>
Co-authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>