Commit fe3242f
[SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat
## What changes were proposed in this pull request?
This PR cleans up duplicated checking for file paths in implemented data sources and prevent to attempt to list twice in ORC data source.
apache#14585 handles a problem for the partition column name having `_` and the issue itself is resolved correctly. However, it seems the data sources implementing `FileFormat` are validating the paths duplicately. Assuming from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I guess we don't have to check this duplicately.
Currently, this seems being filtered in `PartitioningAwareFileIndex.shouldFilterOut` and`PartitioningAwareFileIndex.isDataPath`. So, `FileFormat.inferSchema` will always receive leaf files. For example, running to codes below:
``` scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```
gives the paths below without directories but just valid data files:
``` bash
/tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```
to `FileFormat.inferSchema`.
## How was this patch tested?
Unit test added in `HadoopFsRelationTest` and related existing tests.
Author: hyukjinkwon <[email protected]>
Closes apache#14627 from HyukjinKwon/SPARK-16975.1 parent 6e45753 commit fe3242f
4 files changed
Lines changed: 21 additions & 15 deletions
File tree
- sql
- core/src/main/scala/org/apache/spark/sql/execution/datasources
- csv
- json
- parquet
- hive/src/test/scala/org/apache/spark/sql/sources
Lines changed: 2 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
55 | 55 | | |
56 | 56 | | |
57 | 57 | | |
58 | | - | |
59 | 58 | | |
60 | | - | |
61 | | - | |
| 59 | + | |
| 60 | + | |
62 | 61 | | |
63 | 62 | | |
64 | 63 | | |
| |||
Lines changed: 1 addition & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
51 | 51 | | |
52 | 52 | | |
53 | 53 | | |
54 | | - | |
55 | | - | |
56 | | - | |
57 | | - | |
58 | | - | |
59 | 54 | | |
60 | | - | |
| 55 | + | |
61 | 56 | | |
62 | 57 | | |
63 | 58 | | |
| |||
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Lines changed: 1 addition & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
241 | 241 | | |
242 | 242 | | |
243 | 243 | | |
244 | | - | |
245 | | - | |
246 | | - | |
247 | | - | |
248 | | - | |
249 | | - | |
| 244 | + | |
250 | 245 | | |
251 | 246 | | |
252 | 247 | | |
| |||
Lines changed: 17 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
877 | 877 | | |
878 | 878 | | |
879 | 879 | | |
| 880 | + | |
| 881 | + | |
| 882 | + | |
| 883 | + | |
| 884 | + | |
| 885 | + | |
| 886 | + | |
| 887 | + | |
| 888 | + | |
| 889 | + | |
| 890 | + | |
| 891 | + | |
| 892 | + | |
| 893 | + | |
| 894 | + | |
| 895 | + | |
| 896 | + | |
880 | 897 | | |
881 | 898 | | |
882 | 899 | | |
| |||
0 commit comments