Commit 62cf4d4
[SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
### What changes were proposed in this pull request?
This PR proposes a new interface in Spark SQL that lets users query the metadata of input files for all file formats. Spark SQL exposes this metadata as **built-in hidden columns**, meaning **users see them only when they explicitly reference them**. Currently, this PR supports the following metadata columns inside a metadata struct `_metadata`:
| Name | Type | Description | Example |
| ------------- | ------------- | ------------- | ------------- |
| _metadata.file_path | String | The absolute file path of the input file. | file:/tmp/spark-7f600b30-b3ec-43a8-8cd2-686491654f9b/f0.csv |
| _metadata.file_name | String | The name of the input file along with the extension. | f0.csv |
| _metadata.file_size | Long | The length of the input file, in bytes. | 628 |
| _metadata.file_modification_time | Timestamp | The modification timestamp of the file. | 2021-12-20 20:05:21 |
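For readers who think in Spark SQL types, the table above can be transcribed into a `StructType`. This is only a sketch of the struct's shape as described in this PR description; the object and value names (`MetadataStructSketch`, `metadataStruct`) are hypothetical and are not taken from the patch.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType, TimestampType}

// Hypothetical holder object; the name is illustrative only.
object MetadataStructSketch {
  // Shape of the `_metadata` struct, transcribed from the table above.
  val metadataStruct: StructType = new StructType()
    .add("file_path", StringType)                  // absolute path of the input file
    .add("file_name", StringType)                  // file name including extension
    .add("file_size", LongType)                    // length in bytes
    .add("file_modification_time", TimestampType)  // last modification time
}
```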
The proposed hidden file metadata interface behaves as follows:
- **Hidden**: metadata columns are hidden. They do not show up when selecting only data columns or when selecting all columns (`SELECT *`). In other words, they are not returned unless explicitly referenced.
- **Does not overwrite the data schema**: if a metadata column name collides with a data column name, the data column is returned instead of the metadata column. In other words, metadata columns cannot overwrite user data in any case.
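The **Hidden** behavior can be checked with a small local program. The following is a minimal sketch, assuming a Spark build that includes this change; the object name, the temporary directory prefix, and the sample row are illustrative and not part of the patch.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical driver object; requires Spark SQL on the classpath.
object HiddenMetadataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("hidden-metadata-sketch")
      .getOrCreate()

    // Write a tiny CSV file into a temporary directory (illustrative data only).
    val dir = Files.createTempDirectory("metadata-sketch")
    Files.write(dir.resolve("f0.csv"), "Debbie,18\n".getBytes("UTF-8"))

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    val df = spark.read.format("csv").schema(schema).load(dir.toUri.toString)

    // SELECT * returns only the data columns; `_metadata` stays hidden.
    assert(df.select("*").columns.sameElements(Array("name", "age")))

    // Metadata fields appear only when they are referenced explicitly.
    val withMeta = df.select("name", "age", "_metadata.file_name", "_metadata.file_size")
    assert(withMeta.columns.sameElements(Array("name", "age", "file_name", "file_size")))
    withMeta.show(truncate = false)

    spark.stop()
  }
}
```

The sketch deliberately does not exercise the name-collision rule; that case depends on the data schema containing a column with the same name as the metadata struct.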
### Why are the changes needed?
To improve Spark SQL observability for **all file formats** that still use DataSource V1 (DSV1).
### Does this PR introduce _any_ user-facing change?
Yes.
```
// `schema` is a user-defined StructType containing the `name` and `age` columns.
spark.read.format("csv")
  .schema(schema)
  .load("file:/tmp/*")
  .select("name", "age",
    "_metadata.file_path", "_metadata.file_name",
    "_metadata.file_size", "_metadata.file_modification_time")
```
Example return:
| name | age | file_path | file_name | file_size | file_modification_time |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Debbie | 18 | file:/tmp/f0.csv | f0.csv | 12 | 2021-07-02 01:05:21 |
| Frank | 24 | file:/tmp/f1.csv | f1.csv | 11 | 2021-12-20 02:06:21 |
### How was this patch tested?
Added a new test suite: `FileMetadataColumnsSuite`.
Closes #34575 from Yaohua628/spark-37273.
Authored-by: yaohua <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
1 parent 86b9592 commit 62cf4d4
File tree: 13 files changed, +633 -22 lines changed
- sql
  - catalyst/src/main/scala/org/apache/spark/sql
    - catalyst
      - analysis
      - expressions
      - plans/logical
    - execution/datasources/v2
  - core/src
    - main/scala/org/apache/spark/sql/execution
      - datasources
    - test/scala/org/apache/spark/sql/execution/datasources