Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Apr 17, 2019

What changes were proposed in this pull request?

Support 5 kinds of filters with and/or:

  • LessThan
  • LessThanOrEqual
  • GreatThan
  • GreatThanOrEqual
  • EqualTo

Support filters applied on 2 columns:

  • modificationTime
  • length

Note:
In order to support datasource filter push-down, I flatten schema to be:

val schema = StructType(
    StructField("path", StringType, false) ::
    StructField("modificationTime", TimestampType, false) ::
    StructField("length", LongType, false) ::
    StructField("content", BinaryType, true) :: Nil)

How was this patch tested?

To be added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA
Copy link

SparkQA commented Apr 17, 2019

Test build #104639 has finished for PR 24387 at commit 83380a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [SPARK-27473][SQL][WIP] Support filter push down for status fields in binary file data source [SPARK-27473][SQL] Support filter push down for status fields in binary file data source Apr 18, 2019

private[binaryfile] def createFilterFunctions(filter: Filter): Seq[FileStatus => Boolean] = {
filter match {
case andFilter: And =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case And(left, right) =>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we support Or?

filter match {
case andFilter: And =>
createFilterFunctions(andFilter.left) ++ createFilterFunctions(andFilter.right)
case LessThan("length", value: Long) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of group by filter, could you group them by the field name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

case andFilter: And =>
createFilterFunctions(andFilter.left) ++ createFilterFunctions(andFilter.right)
case LessThan("length", value: Long) =>
Seq((status: FileStatus) => status.getLen < value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Seq => Some

Seq((status: FileStatus) => status.getLen < value)
case LessThan("modificationTime", value: Timestamp) =>
Seq((status: FileStatus) => status.getModificationTime < value.getTime)
case LessThanOrEqual("length", value: Long) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, does SQL always convert the value to Long?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if the data type is long type.

Seq((status: FileStatus) => status.getLen >= value)
case GreaterThanOrEqual("modificationTime", value: Timestamp) =>
Seq((status: FileStatus) => status.getModificationTime >= value.getTime)
case _ => Seq.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: => None

col("year") // this is a partition column
)
if (sqlFilter != null) {
resultDF = resultDF.filter(sqlFilter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test doesn't really test filter push down because without it the test still passes. I think we should unit test the buildReader function alone.

}


def testCreateFilterFunctions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above. We can just unit test the buildReader func, which would look similar to your test here.

@mengxr
Copy link
Contributor

mengxr commented Apr 18, 2019

@WeichenXu123 Can you also handle EqualTo? Are we expecting Or and Not? cc: @cloud-fan

@SparkQA
Copy link

SparkQA commented Apr 18, 2019

Test build #104688 has finished for PR 24387 at commit 8acd36e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2019

Test build #104743 has finished for PR 24387 at commit ecc156d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 19, 2019

Test build #104757 has finished for PR 24387 at commit ecc156d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2019

Test build #104772 has finished for PR 24387 at commit a4aba74.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2019

WeichenXu123#7

Change to unit test `createFilterFunction`, which should be sufficient.
@WeichenXu123
Copy link
Contributor Author

Thanks @mengxr 's update! Looks good to me.

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2019

@WeichenXu123 I pushed some changes to your branch that added more tests.

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2019

LGTM pending Jenkins. @cloud-fan Do you want to make a pass?

@SparkQA
Copy link

SparkQA commented Apr 20, 2019

Test build #104777 has finished for PR 24387 at commit 600aa9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2019

Test build #104778 has finished for PR 24387 at commit 7d9bdb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2019

Test build #104779 has finished for PR 24387 at commit 71db855.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 9793d9e Apr 21, 2019
@WeichenXu123 WeichenXu123 deleted the binary_ds_filter branch April 22, 2019 00:10
case GreaterThanOrEqual(MODIFICATION_TIME, value: Timestamp) =>
_.getModificationTime >= value.getTime
case EqualTo(MODIFICATION_TIME, value: Timestamp) =>
_.getModificationTime == value.getTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks EqualNullSafe and In are missing.

lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
…ry file data source

## What changes were proposed in this pull request?

Support 4 kinds of filters:
- LessThan
- LessThanOrEqual
- GreatThan
- GreatThanOrEqual

Support filters applied on 2 columns:
- modificationTime
- length

Note:
In order to support datasource filter push-down, I flatten schema to be:
```
val schema = StructType(
    StructField("path", StringType, false) ::
    StructField("modificationTime", TimestampType, false) ::
    StructField("length", LongType, false) ::
    StructField("content", BinaryType, true) :: Nil)
```

## How was this patch tested?

To be added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#24387 from WeichenXu123/binary_ds_filter.

Lead-authored-by: WeichenXu <[email protected]>
Co-authored-by: Xiangrui Meng <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants