Status: Closed

Commits (22):
c567dcc  Add a test (MaxGekk, Nov 8, 2018)
2b41eba  Fix decimal parsing (MaxGekk, Nov 8, 2018)
cf438ae  Add locale option (MaxGekk, Nov 8, 2018)
f9438c4  Updating the migration guide (MaxGekk, Nov 8, 2018)
3125c23  Fix imports (MaxGekk, Nov 8, 2018)
64a97a2  Merge remote-tracking branch 'origin/master' into decimal-parsing-locale (MaxGekk, Nov 9, 2018)
2f76352  Renaming decimalParser to decimalFormat (MaxGekk, Nov 11, 2018)
3dfce18  Moving the test to UnivocityParserSuite (MaxGekk, Nov 11, 2018)
bdca7c4  Support the SQL config spark.sql.legacy.decimalParsing.enabled (MaxGekk, Nov 12, 2018)
8c5593e  Updating the migration guide. (MaxGekk, Nov 12, 2018)
18470b0  Refactoring (MaxGekk, Nov 12, 2018)
c28b79f  Removing internal (MaxGekk, Nov 12, 2018)
1723da2  Test refactoring (MaxGekk, Nov 12, 2018)
6cdafa5  Added a test for inferring the decimal type (MaxGekk, Nov 13, 2018)
14b5109  Inferring decimals from CSV (MaxGekk, Nov 14, 2018)
bab8fb2  Renaming df to decimalFormat (MaxGekk, Nov 22, 2018)
5236336  Merge remote-tracking branch 'origin/master' into decimal-parsing-locale (MaxGekk, Nov 23, 2018)
0d1a4f0  Merge branch 'master' into decimal-parsing-locale (MaxGekk, Nov 27, 2018)
8b1456c  Merge remote-tracking branch 'origin/master' into decimal-parsing-locale (MaxGekk, Nov 28, 2018)
0859624  Removing SQL config and special handling of Locale.US (MaxGekk, Nov 28, 2018)
e989b77  Merge remote-tracking branch 'fork/decimal-parsing-locale' into decim… (MaxGekk, Nov 28, 2018)
521bd45  Merge remote-tracking branch 'origin/master' into decimal-parsing-locale (MaxGekk, Nov 29, 2018)
docs/sql-migration-guide-upgrade.md (2 additions, 0 deletions)

@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide

## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, to parse decimals in a locale-specific format from CSV, set the `locale` option to the proper value.

- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder; but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder no longer updates the configurations, which matches the Java/Scala API behavior in 2.3 and above. If you want to update them, you need to do so prior to creating a `SparkSession`.

- In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for all data types except `StringType` and `BinaryType`.
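For illustration, a minimal usage sketch of the new `locale` option (not part of the patch; `spark` is assumed to be a SparkSession, and the path and schema are invented for this example):

// Hypothetical usage: read CSV decimals written with German conventions
// (comma as the decimal separator), e.g. "1000,001".
val prices = spark.read
  .schema("price DECIMAL(10,5)")
  .option("locale", "de-DE")
  .option("sep", "|")
  .csv("/tmp/prices.csv")

Note the non-default separator: the default `,` would collide with the locale's decimal separator.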
UnivocityParser.scala

@@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.csv

import java.io.InputStream
import java.math.BigDecimal
import java.text.{DecimalFormat, DecimalFormatSymbols}

import scala.util.Try
import scala.util.control.NonFatal
@@ -104,6 +105,12 @@ class UnivocityParser(
    requiredSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
  }

  // Locale-aware decimal parser; setParseBigDecimal(true) makes parse()
  // return java.math.BigDecimal instances for ordinary numeric input.
  private val decimalParser = {
    val df = new DecimalFormat("", new DecimalFormatSymbols(options.locale))
    df.setParseBigDecimal(true)
    df
  }

  /**
   * Create a converter which converts the string value to a value according to a desired type.
   * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`).

@@ -149,8 +156,8 @@

      case dt: DecimalType => (d: String) =>
        nullSafeDatum(d, name, nullable, options) { datum =>
-         val value = new BigDecimal(datum.replaceAll(",", ""))
-         Decimal(value, dt.precision, dt.scale)
+         val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
Review thread on this line:

Member:

@MaxGekk, is it safe to assume that this Number is a BigDecimal? It looks like there are some possibilities that it can return other types.

Member Author (MaxGekk):

> is it safe to assume that this Number is a BigDecimal?

I am not absolutely sure that it always returns BigDecimal. I found this at https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition):

> If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.

So isParseBigDecimal() returns true when setParseBigDecimal(true) has been called, as in this PR.

> It looks like there are some possibilities that it can return other types.

In that case we just fail with a cast exception and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
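A minimal sketch (not from the PR) of the return-type behaviour the quoted javadoc describes, assuming Locale.US:

import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// With setParseBigDecimal(true), parse() returns java.math.BigDecimal for
// ordinary numbers, but java.lang.Double for the special infinity/NaN
// cases, which is why the asInstanceOf above can throw.
object ParseReturnTypes extends App {
  val symbols = new DecimalFormatSymbols(Locale.US)
  val df = new DecimalFormat("", symbols)
  df.setParseBigDecimal(true)

  println(df.parse("1000.001").getClass)          // class java.math.BigDecimal
  println(df.parse(symbols.getInfinity).getClass) // class java.lang.Double
}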

Member:

Ah, right. The previous code would throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous code handled the sign characters (+ and -) as well.

Let me take a closer look. I think I need to.
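
A minimal sketch (not from the PR) of the sign-handling difference being discussed, using a hypothetical input and Locale.US:

import java.math.BigDecimal
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// The old code stripped grouping commas and used the BigDecimal(String)
// constructor, which accepts an explicit leading '+'. DecimalFormat with
// the default (empty) positive prefix does not consume the '+', so the
// same input now fails to parse.
object SignHandling extends App {
  val datum = "+1000.5"

  val oldWay = new BigDecimal(datum.replaceAll(",", ""))
  println(oldWay) // 1000.5

  val df = new DecimalFormat("", new DecimalFormatSymbols(Locale.US))
  df.setParseBigDecimal(true)
  df.parse(datum) // throws java.text.ParseException
}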

HyukjinKwon (Member), Nov 11, 2018:

For instance, there was a similar attempt to change the date parsing library (#21363). I already know the difference is quite breaking and the workaround is difficult as far as I know, so I suggested adding a configuration or a fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure yet even about this. Anyway, I will take another look shortly.

Member Author (MaxGekk):

> so I suggested to add a configuration or fallback for now ...

What about a SQL config `spark.sql.legacy.decimalParsing.enabled` with default value `false`?

Member:

Sounds good if that's not difficult.

+         Decimal(bigDecimal, dt.precision, dt.scale)
        }

      case _: TimestampType => (d: String) =>
CsvExpressionsSuite.scala

@@ -17,6 +17,7 @@

package org.apache.spark.sql.catalyst.expressions

import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.text.SimpleDateFormat
import java.util.{Calendar, Locale}

@@ -226,4 +227,17 @@ class CsvExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with P
      InternalRow(17836)) // number of days from 1970-01-01
    }
  }

  test("parse decimals using locale") {
    Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach { langTag =>
      val schema = new StructType().add("d", DecimalType(10, 5))
      val options = Map("locale" -> langTag, "sep" -> "|")
      val expected = Decimal(1000.001, 10, 5)
      val df = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag(langTag)))
      val input = df.format(expected.toBigDecimal)
      checkEvaluation(
        CsvToStructs(schema, options, Literal.create(input), gmtId),
        InternalRow(expected))
    }
  }
}
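
The round trip this test relies on can be sketched outside Spark as follows (illustration only, not part of the patch; the exact formatted strings depend on the JDK's locale data, e.g. a comma decimal separator for de-DE):

import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// Mirrors the test's logic: format a decimal using the locale's symbols,
// then parse it back with the same symbols.
object LocaleRoundTrip extends App {
  for (tag <- Seq("en-US", "ko-KR", "ru-RU", "de-DE")) {
    val df = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag(tag)))
    df.setParseBigDecimal(true)
    val text = df.format(new java.math.BigDecimal("1000.001"))
    println(s"$tag: $text -> ${df.parse(text)}")
  }
}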