-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter #22654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
dd16ca3
7bf453a
728aac2
983315b
1c2ac25
20856b4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -97,23 +97,21 @@ object CSVUtils { | |
| */ | ||
| @throws[IllegalArgumentException] | ||
| def toChar(str: String): Char = { | ||
| if (str.charAt(0) == '\\') { | ||
| str.charAt(1) | ||
| match { | ||
| case 't' => '\t' | ||
| case 'r' => '\r' | ||
| case 'b' => '\b' | ||
| case 'f' => '\f' | ||
| case '\"' => '\"' // In case user changes quote char and uses \" as delimiter in options | ||
| case '\'' => '\'' | ||
| case 'u' if str == """\u0000""" => '\u0000' | ||
| case _ => | ||
| throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str") | ||
| } | ||
| } else if (str.length == 1) { | ||
| str.charAt(0) | ||
| } else { | ||
| throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str") | ||
| (str: Seq[Char]) match { | ||
| case Seq() => throw new IllegalArgumentException(s"Delimiter cannot be empty string") | ||
| case Seq(c) => c | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer more declarative way and less nested levels of controls. but this is personal opinion. Let's look at your example: you didn't check that str can be null. If it has length 2, how You should produce control chars not just second char. For example: In my approach, everything is simple. One input case is mapped to one output. There is no unnecessary complexity.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yeah good points. This is too clever, off the top of my head. I still wonder if the code here can reduce the duplication of |
||
| case Seq('\\', 't') => '\t' | ||
| case Seq('\\', 'r') => '\r' | ||
| case Seq('\\', 'b') => '\b' | ||
| case Seq('\\', 'f') => '\f' | ||
| // In case user changes quote char and uses \" as delimiter in options | ||
| case Seq('\\', '\"') => '\"' | ||
| case Seq('\\', '\'') => '\'' | ||
MaxGekk marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| case _ if str == """\u0000""" => '\u0000' | ||
| case Seq('\\', _) => | ||
| throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str") | ||
| case _ => | ||
| throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str") | ||
| } | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1820,4 +1820,13 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te | |
| checkAnswer(spark.read.option("multiLine", true).schema(schema).csv(input), Row(null)) | ||
| assert(spark.read.csv(input).collect().toSet == Set(Row())) | ||
| } | ||
|
|
||
| test("using the backward slash as the delimiter") { | ||
| val input = Seq("""abc\1""").toDS() | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't Or is the point that delimiting takes precedence?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Right, if an user specified
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If a user specified
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I prohibited single backslash and throw an exception with a tip of using double backslash. |
||
| checkAnswer(spark.read.option("delimiter", "\\").csv(input), Row("abc", "1")) | ||
| checkAnswer(spark.read.option("inferSchema", true).option("delimiter", "\\").csv(input), | ||
| Row("abc", 1)) | ||
| val schema = new StructType().add("a", StringType).add("b", IntegerType) | ||
| checkAnswer(spark.read.schema(schema).option("delimiter", "\\").csv(input), Row("abc", 1)) | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.