Commit f5bf861
[SPARK-36760][SQL] Add interface SupportsPushDownV2Filters
Co-Authored-By: DB Tsai d_tsaiapple.com
Co-Authored-By: Huaxin Gao huaxin_gaoapple.com
### What changes were proposed in this pull request?
This is the 2nd PR for V2 Filter support. This PR does the following:
- Add interface SupportsPushDownV2Filters
Future work:
- refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them
- For V2 file source: implement v2 filter -> parquet/orc filter. csv and Json don't have real filters, but also need to change the current code to have v2 filter -> `JacksonParser`/`UnivocityParser`
- For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter
- We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths
The reasons that we have reached the above conclusion:
- The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters.
- We provide this `SupportsPushDownV2Filters` in this PR so V2 data source (e.g. iceberg) can implement it and use V2 Filters
- There are lots of work to implement v2 filters in the V2 file sources because of the following reasons:
possible approaches for implementing V2Filter:
1. keep what we have for file source v1: v1 filter -> parquet/orc filter
file source v2 we will implement v2 filter -> parquet/orc filter
We don't need v1->v2 and v2->v1
problem with this approach: there are lots of code duplication
2. We will implement v2 filter -> parquet/orc filter
file source v1: v1 filter -> v2 filter -> parquet/orc filter
We will need V1 -> V2
This is the approach I am using in apache#33973
In that PR, I have
v2 orc: v2 filter -> orc filter
V1 orc: v1 -> v2 -> orc filter
v2 csv: v2->v1, new UnivocityParser
v1 csv: new UnivocityParser
v2 Json: v2->v1, new JacksonParser
v1 Json: new JacksonParser
csv and Json don't have real filters, they just use filter references, should be OK to use either v1 and v2. Easier to use
v1 because no need to change.
I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have
v2 parquet: v2 filter -> parquet filter
v1 parquet: v1 -> v2 -> parquet filter
Problem with this approach:
1. It's not easy to implement V1->V2 because V2 filter have `LiteralValue` and needs type info. We already lost the type information when we convert Expression filer to v1 filter.
2. parquet is OK
Use Timestamp as example, parquet filter takes long for timestamp
v2 parquet: v2 filter -> parquet filter
timestamp
Expression (Long) -> v2 filter (LiteralValue Long)-> parquet filter (Long)
V1 parquet: v1 -> v2 -> parquet filter
timestamp
Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long)-> parquet filter (Long)
but we have problem for orc because orc filter takes java Timestamp
v2 orc: v2 filter -> orc filter
timestamp
Expression (Long) -> v2 filter (LiteralValue Long)-> parquet filter (Timestamp)
V1 orc: v1 -> v2 -> orc filter
Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long)-> parquet filter (Timestamp)
This defeats the purpose of implementing v2 filters.
3. keep what we have for file source v1: v1 filter -> parquet/orc filter
file source v2: v2 filter -> v1 filter -> parquet/orc filter
We will need V2 -> V1
we have similar problem as approach 2.
So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter
file source v2 we will implement v2 filter -> parquet/orc filter) is better, but there are lots of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them.
### Why are the changes needed?
Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Added new UT
Closes apache#34001 from huaxingao/v2filter.
Lead-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>1 parent 64697b2 commit f5bf861
7 files changed
Lines changed: 593 additions & 11 deletions
File tree
- sql
- catalyst/src/main/java/org/apache/spark/sql/connector
- expressions/filter
- read
- core/src
- main/scala/org/apache/spark/sql/execution/datasources/v2
- test
- java/test/org/apache/spark/sql/connector
- scala/org/apache/spark/sql/connector
Lines changed: 3 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
| 20 | + | |
| 21 | + | |
20 | 22 | | |
21 | 23 | | |
22 | 24 | | |
| |||
27 | 29 | | |
28 | 30 | | |
29 | 31 | | |
30 | | - | |
| 32 | + | |
31 | 33 | | |
32 | 34 | | |
33 | 35 | | |
| |||
Lines changed: 57 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
Lines changed: 163 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
| 21 | + | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
24 | | - | |
| 25 | + | |
| 26 | + | |
25 | 27 | | |
26 | 28 | | |
27 | 29 | | |
28 | 30 | | |
| 31 | + | |
| 32 | + | |
29 | 33 | | |
30 | 34 | | |
31 | 35 | | |
32 | 36 | | |
33 | 37 | | |
34 | | - | |
| 38 | + | |
35 | 39 | | |
36 | 40 | | |
| 41 | + | |
37 | 42 | | |
38 | 43 | | |
| 44 | + | |
39 | 45 | | |
40 | 46 | | |
41 | 47 | | |
| |||
427 | 433 | | |
428 | 434 | | |
429 | 435 | | |
| 436 | + | |
| 437 | + | |
| 438 | + | |
| 439 | + | |
| 440 | + | |
| 441 | + | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
| 457 | + | |
| 458 | + | |
| 459 | + | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
| 467 | + | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
| 484 | + | |
| 485 | + | |
| 486 | + | |
| 487 | + | |
| 488 | + | |
| 489 | + | |
| 490 | + | |
| 491 | + | |
| 492 | + | |
| 493 | + | |
| 494 | + | |
| 495 | + | |
| 496 | + | |
| 497 | + | |
| 498 | + | |
| 499 | + | |
| 500 | + | |
| 501 | + | |
| 502 | + | |
| 503 | + | |
| 504 | + | |
| 505 | + | |
| 506 | + | |
| 507 | + | |
| 508 | + | |
| 509 | + | |
| 510 | + | |
| 511 | + | |
| 512 | + | |
| 513 | + | |
| 514 | + | |
| 515 | + | |
| 516 | + | |
| 517 | + | |
| 518 | + | |
| 519 | + | |
| 520 | + | |
| 521 | + | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
| 530 | + | |
| 531 | + | |
| 532 | + | |
| 533 | + | |
| 534 | + | |
| 535 | + | |
| 536 | + | |
| 537 | + | |
| 538 | + | |
| 539 | + | |
| 540 | + | |
| 541 | + | |
| 542 | + | |
| 543 | + | |
| 544 | + | |
| 545 | + | |
| 546 | + | |
| 547 | + | |
| 548 | + | |
| 549 | + | |
| 550 | + | |
| 551 | + | |
| 552 | + | |
| 553 | + | |
| 554 | + | |
| 555 | + | |
| 556 | + | |
| 557 | + | |
| 558 | + | |
| 559 | + | |
| 560 | + | |
| 561 | + | |
| 562 | + | |
| 563 | + | |
| 564 | + | |
| 565 | + | |
| 566 | + | |
| 567 | + | |
| 568 | + | |
| 569 | + | |
| 570 | + | |
| 571 | + | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
| 576 | + | |
| 577 | + | |
| 578 | + | |
| 579 | + | |
| 580 | + | |
| 581 | + | |
| 582 | + | |
| 583 | + | |
| 584 | + | |
| 585 | + | |
| 586 | + | |
| 587 | + | |
| 588 | + | |
| 589 | + | |
| 590 | + | |
Lines changed: 35 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
27 | | - | |
28 | | - | |
29 | | - | |
30 | | - | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
31 | 30 | | |
32 | 31 | | |
33 | 32 | | |
| |||
40 | 39 | | |
41 | 40 | | |
42 | 41 | | |
43 | | - | |
| 42 | + | |
44 | 43 | | |
45 | 44 | | |
46 | 45 | | |
| |||
69 | 68 | | |
70 | 69 | | |
71 | 70 | | |
72 | | - | |
| 71 | + | |
73 | 72 | | |
74 | | - | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
75 | 103 | | |
76 | 104 | | |
77 | 105 | | |
| |||
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala
Lines changed: 7 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
58 | 58 | | |
59 | 59 | | |
60 | 60 | | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
61 | 67 | | |
62 | 68 | | |
63 | 69 | | |
64 | 70 | | |
65 | 71 | | |
66 | | - | |
| 72 | + | |
67 | 73 | | |
68 | 74 | | |
69 | 75 | | |
| |||
0 commit comments