Skip to content
Closed
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ case class JSONOptions(
allowUnquotedFieldNames: Boolean = false,
allowSingleQuotes: Boolean = true,
allowNumericLeadingZeros: Boolean = false,
allowNonNumericNumbers: Boolean = false) {
allowNonNumericNumbers: Boolean = true) {

/** Sets config options on a Jackson [[JsonFactory]]. */
def setJacksonOptions(factory: JsonFactory): Unit = {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ package org.apache.spark.sql.execution.datasources.json

import java.io.CharArrayWriter

import com.fasterxml.jackson.core.JsonFactory
import com.fasterxml.jackson.core.{JsonGenerator, JsonFactory}
import com.google.common.base.Objects
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
Expand Down Expand Up @@ -161,12 +161,13 @@ private[sql] class JSONRelation(
}

override def prepareJobForWrite(job: Job): OutputWriterFactory = {
val quoteNonNumeric = !options.allowNonNumericNumbers
new OutputWriterFactory {
override def newInstance(
path: String,
dataSchema: StructType,
context: TaskAttemptContext): OutputWriter = {
new JsonOutputWriter(path, dataSchema, context)
new JsonOutputWriter(path, dataSchema, context, quoteNonNumeric)
}
}
}
Expand All @@ -175,12 +176,15 @@ private[sql] class JSONRelation(
private[json] class JsonOutputWriter(
path: String,
dataSchema: StructType,
context: TaskAttemptContext)
context: TaskAttemptContext,
quoteNonNumeric: Boolean)
extends OutputWriter with SparkHadoopMapRedUtil with Logging {

private[this] val writer = new CharArrayWriter()
// create the Generator without separator inserted between 2 records
private[this] val gen = new JsonFactory().createGenerator(writer).setRootValueSeparator(null)
private[this] val factory = new JsonFactory()
factory.configure(JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS, quoteNonNumeric)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd always write them in the same way to avoid heterogeneity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS can only recognize the unquoted non numeric numbers. That is why it causes test cases failed above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then maybe we should still do the special case handling ourselves in order to support quoted strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we still need to do the special case handling ourselves, so looks like we don't want to make allowNonNumericNumbers as true by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but without setting it to true, can we process NaN that's not quoted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we can't.

Since we need to do special case handling for quoted non-numeric numbers, is JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS still useful for us? Because no matter we use it or not, we all need to do some kind of handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be able to process both quoted nans and nans and treat them as numeric values. I think in order to do that we need both special handling for quoted chars, and turn JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin I think it is here if I don't misunderstand it.

private[this] val gen = factory.createGenerator(writer).setRootValueSeparator(null)
private[this] val result = new Text()

private val recordWriter: RecordWriter[NullWritable, Text] = {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -100,35 +100,13 @@ object JacksonParser {
parser.getFloatValue

case (VALUE_STRING, FloatType) =>
// Special case handling for NaN and Infinity.
val value = parser.getText
val lowerCaseValue = value.toLowerCase()
if (lowerCaseValue.equals("nan") ||
lowerCaseValue.equals("infinity") ||
lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
value.toFloat
} else {
sys.error(s"Cannot parse $value as FloatType.")
}
parser.getFloatValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing the special handling for float types here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, should revert it back. BTW, do we actually test "inf" and "-inf" before? Because "inf".toFloat is not legal.


case (VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT, DoubleType) =>
parser.getDoubleValue

case (VALUE_STRING, DoubleType) =>
// Special case handling for NaN and Infinity.
val value = parser.getText
val lowerCaseValue = value.toLowerCase()
if (lowerCaseValue.equals("nan") ||
lowerCaseValue.equals("infinity") ||
lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
value.toDouble
} else {
sys.error(s"Cannot parse $value as DoubleType.")
}
parser.getDoubleValue

case (VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT, dt: DecimalType) =>
Decimal(parser.getDecimalValue, dt.precision, dt.scale)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -93,22 +93,29 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext {
assert(df.first().getLong(0) == 18)
}

// The following two tests are not really working - need to look into Jackson's
// JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS.
ignore("allowNonNumericNumbers off") {
val str = """{"age": NaN}"""
val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.json(rdd)
test("allowNonNumericNumbers off") {
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a test for quoted NaN, inf, etc.

"""{"age": -Infinity}""")

assert(df.schema.head.name == "_corrupt_record")
testCases.foreach { str =>
val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.option("allowNonNumericNumbers", "false").json(rdd)

assert(df.schema.head.name == "_corrupt_record")
}
}

ignore("allowNonNumericNumbers on") {
val str = """{"age": NaN}"""
val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.option("allowNonNumericNumbers", "true").json(rdd)
test("allowNonNumericNumbers on") {
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we still read them if they are quoted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, so if we don't set JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS to false, we can't read them normally.

"""{"age": -Infinity}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity)

assert(df.schema.head.name == "age")
assert(df.first().getDouble(0).isNaN)
testCases.zipWithIndex.foreach { case (str, idx) =>
val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.json(rdd)

assert(df.schema.head.name == "age")
assert(tests(idx)(df.first().getDouble(0)))
}
}
}