[SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty string being parsed as null when nullValue is set. #22367
Changes from 7 commits
465ed7a
48e143d
70e2171
8665f93
867c6de
e0cb879
e23098c
40cfa28
a56d001
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1897,6 +1897,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see | |
| - In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser dropped such rows in the DROPMALFORMED mode or output an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column yields a row with one column value 1234, but in Spark 2.3 and earlier the result is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. | ||
| - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. | ||
| - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. | ||
| - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were treated the same as `null` values and didn't produce any characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to an empty (not quoted) string. | ||
|
||
|
|
||
| ## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above | ||
|
|
||
|
|
||
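The write-side change described in the migration note above can be illustrated with a minimal sketch in plain Scala. It does not call Spark and is not Spark's implementation: `EmptyValueSketch`, `renderRow`, and its `emptyValue` parameter are hypothetical names that only model the documented output, where a `null` produces no characters and an empty string is written as the configured `emptyValue`.

```scala
// Hypothetical model of the documented CSV write behavior; not Spark's implementation.
object EmptyValueSketch {
  // Renders one row: null -> empty field, "" -> emptyValue, anything else -> toString.
  def renderRow(row: Seq[Any], emptyValue: String): String =
    row.map {
      case null  => ""          // nulls produce no characters
      case ""    => emptyValue  // empty strings use the configured marker
      case other => other.toString
    }.mkString(",")

  def main(args: Array[String]): Unit = {
    val row = Seq("a", null, "", 1)
    println(renderRow(row, "\"\"")) // quoted empty string, as in the 2.4 note: a,,"",1
    println(renderRow(row, ""))     // emptyValue set to an empty string: a,,,1
  }
}
```

In Spark itself, the note's suggestion would presumably correspond to something like `df.write.option("emptyValue", "").csv(path)`, where `df` and `path` are placeholders.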
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -91,9 +91,10 @@ abstract class CSVDataSource extends Serializable { | |
| } | ||
|
|
||
| row.zipWithIndex.map { case (value, index) => | ||
| if (value == null || value.isEmpty || value == options.nullValue) { | ||
| // When there are empty strings or the values set in `nullValue`, put the | ||
| // index as the suffix. | ||
| if (value == null || value.isEmpty || value == options.nullValue || | ||
| value == options.emptyValueInRead) { | ||
|
||
| // When there are empty strings or the values set in `nullValue` or in `emptyValue`, | ||
| // put the index as the suffix. | ||
| s"_c$index" | ||
| } else if (!caseSensitive && duplicates.contains(value.toLowerCase)) { | ||
| // When there are case-insensitive duplicates, put the index as the suffix. | ||
|
|
||
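As a rough illustration of the first branch in the hunk above, the following standalone sketch gives positional names to header cells that are null, empty, or equal to the configured null or empty marker. `HeaderSketch` and `safeHeader` are hypothetical names, the `nullValue` and `emptyValue` parameters stand in for `options.nullValue` and `options.emptyValueInRead`, and the duplicate-name branches are deliberately left out.

```scala
// Simplified sketch of the placeholder branch only; duplicate-name handling is omitted.
object HeaderSketch {
  // nullValue and emptyValue stand in for options.nullValue and options.emptyValueInRead.
  def safeHeader(row: Array[String], nullValue: String, emptyValue: String): Array[String] =
    row.zipWithIndex.map { case (value, index) =>
      if (value == null || value.isEmpty || value == nullValue || value == emptyValue) {
        s"_c$index" // blank-like header cells get a positional name
      } else {
        value // every other cell is kept as-is in this sketch
      }
    }

  def main(args: Array[String]): Unit = {
    val header = Array("year", "", "N/A", null, "EMPTY")
    println(safeHeader(header, "N/A", "EMPTY").mkString(","))
    // prints: year,_c1,_c2,_c3,_c4
  }
}
```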
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,7 +79,8 @@ private[csv] object CSVInferSchema { | |
| * point checking if it is an Int, as the final type must be Double or higher. | ||
| */ | ||
| def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = { | ||
| if (field == null || field.isEmpty || field == options.nullValue) { | ||
| if (field == null || field.isEmpty || field == options.nullValue || | ||
| field == options.emptyValueInRead) { | ||
|
||
| typeSoFar | ||
| } else { | ||
| typeSoFar match { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| year,make,model,comment,blank | ||
| "2012","Tesla","S","","" | ||
| 1997,Ford,E350,"Go get one now they are going fast", | ||
| 2015,Chevy,Volt,,"" |
Not a big deal at all, but I would avoid abbreviation (`didn't`) in the documentation personally.
avoided