Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Here is the problem description from the JIRA.

When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.

SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('1')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('infinity')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('-infinity'), ('infinity')) v(x);
 The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way. 

Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html 

In this PR, the casting code is enhanced to handle these special string literals in case insensitive manner.

How was this patch tested?

Added tests in CastSuite and modified existing test suites.

@dilipbiswal
Copy link
Contributor Author

@maropu
Copy link
Member

maropu commented Aug 1, 2019

ok, I'll check in hours ;)

$evPrim = Float.valueOf($c.toString());
} catch (java.lang.NumberFormatException e) {
$evNull = true;
final String $str = $c.toString().trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is called only in case of exceptions, it would be great to call .toString once by reusing the one in line 1265.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun oops.. sorry.. i will fix.

Float.NaN
} else {
null
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid duplications, shall we make a function for this since we have the same Float logic twice here and line 1268?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun The assignment variable names are different ? Is there an example i can follow where codegen and non-codegen code is refactored to a function ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need an assignment if you handle everything inside the function.
The function will return the right-side value of the assignment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun Ah... thanks .. will try.

}
}

test("SPARK-27768 process infinity, -infinity , nan in case insensitive manner") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not a bug. Instead, this is a behavior change for PostgreSQL compatibility.
Previously, Spark works like Java/Scala/Hive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun OK.. will remove the JIRA id.

buildCast[UTF8String](_, s => try s.toString.toDouble catch {
case _: NumberFormatException => null
case _: NumberFormatException =>
val str = s.trim.toString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its just a check; do we support this case, e.g., ' infinity '? (I know pg does so though). If we support this case, IMO you'd be better to clearly describe in the PR description, which case this PR try to support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Currently i trim spaces from both ends and compare. So this case should work.
Ok.. i will update the description.

case _: NumberFormatException => null
case _: NumberFormatException =>
val str = s.trim.toString
if (str.equalsIgnoreCase("infinity")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a check; we don't support some alternatives, e.g., inf, -inf, ... ?https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L194

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu We don't support inf and -inf, should we support ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another db-like systems supporting this, e.g., DB2?

Copy link
Contributor Author

@dilipbiswal dilipbiswal Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Yes.. db2 does allow inf and -inf. Lets add in spark also ? Thanks a lot for pointing this..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO some users might think inf is more common, e.g., even in python, both inf and infinity can be used;

Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
>>> float('inf')
inf
>>> float('infinity')
inf

Its ok to wait for other reviewer's comments. @gatorsmile @dongjoon-hyun

Copy link
Contributor Author

@dilipbiswal dilipbiswal Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile @dongjoon-hyun Could you please help make a decision on whether to support inf, +inf for positive infinity and '-inf` for negative infinity ? Thanks a lot. Thanks again @maropu for pointing this.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Aug 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @maropu mentioned, since PostgreSQL supports that, I'm +1 for supporting it.

		else if (pg_strncasecmp(num, "inf", 3) == 0)
		{
			val = get_float4_infinity();
			endptr = num + 3;
		}
		else if (pg_strncasecmp(num, "+inf", 4) == 0)
		{
			val = get_float4_infinity();
			endptr = num + 4;
		}
		else if (pg_strncasecmp(num, "-inf", 4) == 0)
		{
			val = -get_float4_infinity();
			endptr = num + 4;
		}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun OK.. thanks.. i will make the change.

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108533 has finished for PR 25331 at commit 68c45e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

buildCast[UTF8String](_, s => try s.toString.toFloat catch {
case _: NumberFormatException => null
case _: NumberFormatException =>
val str = s.trim.toString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge lines 551-560 and 577-586 into one function? Since the return type will be Any, we can merge two cases Float and Double into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK @kiszk

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108557 has finished for PR 25331 at commit b7d9de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Aug 2, 2019

@dongjoon-hyun I have incorporated the changes.. Please take a look when you get a chance. I am not fully sure if the refactoring between codegen and non-codegen is necessary. I guess, since this code path is exercised only in the exception path, its okay to go for code-reuse in this specific case :-)

}
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this newline?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Thx.. will do.

buildCast[UTF8String](_, s => {
val floatStr = s.toString
try floatStr.toFloat catch {
case _: NumberFormatException =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly match strings we want? We can avoid try-catch of an exception for the known strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya So my idea here was that.. since we don't expect these kind of literals a lot i.e its not a common case .. we don't change our normal processing path to add any possible runtime costs. Thats why we keep all these in the exception processing.

Copy link
Member

@maropu maropu Aug 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the @viirya approach for readability, but I understand your concern for the performance. So, could you check actual performance changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Oops.. didn't see this comment. I suppose i have to use the benchmark framework for this ? Appreciate any tip on this..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check the numbers by a simple query?, e.g.,

scala> spark.range(10).selectExpr("CAST(double(id) AS STRING) a").write.save("/tmp/test")
scala> spark.read.load("/tmp/test").selectExpr("CAST(a AS DOUBLE)").write.format("noop").save()

In another pr, I observed that a logic depending on exceptions cause high performance penalties: lz4/lz4-java#143
So, I'm a bit worried that this current logic has the same issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Actually my input data didn't contain any of these special literals. Basically i was testing with the condition when we don't hit the catch block. Basically we are trying optimize for the best case ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just make sure that a small number of inf/-inf does not worse the prformance. If this is true, I think the current code looks ok to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu In the mean time i had tried with 1% of data being 'inf' and i can confirm that it does not hurt the performance :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds nice, too ;)


def resolvableNullability(from: Boolean, to: Boolean): Boolean = !from || to


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an extra empty line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Thx.. will do.

@SparkQA
Copy link

SparkQA commented Aug 3, 2019

Test build #108583 has finished for PR 25331 at commit 928e4c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, where does "infiNity" come up? or "+infinity"? I don't think this is how the JVM or Python renders them?


def processFloatingPointSpecialLiterals(v: String, isFloat: Boolean): Any = {
val str = v.trim
if (str.equalsIgnoreCase("infinity") || str.equalsIgnoreCase("+infinity")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need Locale.ROOT here? it's a really obscure possibility, but "i" can cause problems

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen OK.. will add Locale.ROOT by following how its used in other places.

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Aug 7, 2019

@srowen

Wait, where does "infiNity" come up? or "+infinity"? I don't think this is how the JVM or Python renders them?

May i please request you to refer to the discussions in the associated JIRA ? Basically our intention here is accept these "special" float and double literals in a case insensitive manner as some other database systems allow them ? Do you see a problem here ?


def resolvableNullability(from: Boolean, to: Boolean): Boolean = !from || to

def processFloatingPointSpecialLiterals(v: String, isFloat: Boolean): Any = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually some literals like InFiniTy can be casted, it looks a bit weird. However, postgresql accepts such things. Maybe add a comment here to explain why allowing case insensitive match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Does this look ok ?

We process literals such as 'Infinity', 'Inf', '-Infinity' and 'NaN' in case insensitive manner to be compatible with other database systems such as Postgres and DB2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel ok. Other reviewers may also have advice too.

@SparkQA
Copy link

SparkQA commented Aug 9, 2019

Test build #108861 has finished for PR 25331 at commit e901dc4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Aug 9, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Aug 9, 2019

Test build #108876 has finished for PR 25331 at commit e901dc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 12, 2019

Test build #108982 has finished for PR 25331 at commit 429048b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</tr>
<tr>
<td>
CAST('infinity' TO DOUBLE)<br>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a SQL syntax, TO -> AS?

@dongjoon-hyun
Copy link
Member

Thank you for the decision, @srowen !


/**
* We process literals such as 'Infinity', 'Inf', '-Infinity' and 'NaN' etc in case
* insensitive manner to be compatible with other database systems such as Postgres and DB2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Postgres -> PostgreSQL.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27768][SQL] Infinity, -Infinity, NaN should be recognized in a case insensitive manner. [SPARK-27768][SQL] Support Inf/Infinity/-Inf/-Infinity/NaN float/double literals case-insensitively Aug 13, 2019
@dongjoon-hyun
Copy link
Member

@dilipbiswal . I update the PR title because this PR focuses on only Spark's float/double data types while PostgreSQL have more special cases in other data types like the following date.

postgres=# select cast('infinity' as date);
   date
----------
 infinity
(1 row)

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27768][SQL] Support Inf/Infinity/-Inf/-Infinity/NaN float/double literals case-insensitively [SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively Aug 13, 2019
<b>Operation</b>
</th>
<th>
<b>Result prior to spark 3.0</b>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark -> Spark.

<b>Result prior to spark 3.0</b>
</th>
<th>
<b>Result starting spark 3.0</b>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

</tr>
<tr>
<td>
CAST('nan' TO FLOAT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix all instances of CAST. It seems to wrong as I commented in the above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun oops...thanks.. fixed.

$evPrim = Float.valueOf($floatStr);
} catch (java.lang.NumberFormatException e) {
$evNull = true;
Float f = (Float) Cast.processFloatingPointSpecialLiterals($floatStr, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Float -> final Float?

$evPrim = Double.valueOf($doubleStr);
} catch (java.lang.NumberFormatException e) {
$evNull = true;
Double d = (Double) Cast.processFloatingPointSpecialLiterals($doubleStr, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double -> final Double?

checkEvaluation(cast("-infinity", FloatType), Float.NegativeInfinity)
checkEvaluation(cast("-infiniTy", FloatType), Float.NegativeInfinity)
checkEvaluation(cast(" -infinity ", FloatType), Float.NegativeInfinity)
checkEvaluation(cast(" -inf ", FloatType), Float.NegativeInfinity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have negative cases returning NULL in somewhere else?


checkEvaluation(cast("nan", FloatType), Float.NaN)
checkEvaluation(cast("nAn", FloatType), Float.NaN)
checkEvaluation(cast(" nan ", FloatType), Float.NaN)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Can we have ' -naN ', too?


checkEvaluation(cast("nan", DoubleType), Double.NaN)
checkEvaluation(cast("nAn", DoubleType), Double.NaN)
checkEvaluation(cast(" nan ", DoubleType), Double.NaN)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for asking this, but can we use the following shorter form?

    Seq("inf", "+inf", "infinity", "+infiNity", " infinity ").foreach { value =>
      checkEvaluation(cast(value, FloatType), Float.PositiveInfinity)
    }
    Seq("-infinity", "-infiniTy", "  -infinity  ", "  -inf  ").foreach { value =>
      checkEvaluation(cast(value, FloatType), Float.NegativeInfinity)
    }
    Seq("inf", "+inf", "infinity", "+infiNity", " infinity ").foreach { value =>
      checkEvaluation(cast(value, DoubleType), Double.PositiveInfinity)
    }
    Seq("-infinity", "-infiniTy", "  -infinity  ", "  -inf  ").foreach { value =>
      checkEvaluation(cast(value, DoubleType), Double.NegativeInfinity)
    }
    Seq("nan", "nAn", " nan ").foreach { value =>
      checkEvaluation(cast(value, FloatType), Float.NaN)
    }
    Seq("nan", "nAn", " nan ").foreach { value =>
      checkEvaluation(cast(value, DoubleType), Double.NaN)
    }

struct<CAST(nan AS DOUBLE):double>
-- !query 11 output
NULL
NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove SPARK-28060 in float8.sql.

@dongjoon-hyun
Copy link
Member

Thanks. @dilipbiswal . I'll review once more after you fix the above comments.

@SparkQA
Copy link

SparkQA commented Aug 13, 2019

Test build #109014 has finished for PR 25331 at commit 1a5d978.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 13, 2019

Test build #109025 has finished for PR 25331 at commit 1a5d978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
Thank you, @dilipbiswal , @maropu , @kiszk , @viirya , @srowen.

@dilipbiswal
Copy link
Contributor Author

Thanks a lot @dongjoon-hyun @maropu @kiszk @viirya @srowen for very detailed review. Appreciate it very much !!

@DylanGuedes
Copy link
Contributor

Hey @dilipbiswal , I get the following error for inline tables (although it work great for normal tables):

SELECT a, b,                                                                                                                                                  
       SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)                                                                                       
FROM (VALUES(1,1),(2,2),(3,'NaN'),(4,3),(5,4)) t(a,b); 
org.apache.spark.sql.AnalysisException
incompatible types found in column col2 for inline table; line 3 pos 6

However, as I said, it works fine for normal tables:

insert into numerics values (0, 'NaN', 'NaN', '-1000');
select f_float4, sum(f_float4) over (order by f_float8 rows between 1 preceding and current row) from numerics;

Do you know if this is related to me making use of a window function or it is related to case-insensitively NaN for inline tables?

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Aug 14, 2019

Hello @DylanGuedes

Don't think its related to case sensitivity. I have not looked at it in detail, but from a cursory look it seems like we are not able to promote the col2 values to a common type in this code path. Removing windowing all together, the following fails :

spark-sql> select * from (VALUES(1,1),(2,2),(3,'NaN'),(4,3),(5,4));
Error in query: incompatible types found in column col2 for inline table; line 1 pos 15

What is the table definition for numerics ? In case of normal table, the target type is known and so we try to implicitly convert/cast the input values to the target type. It is slightly different in case of inline tables.

@DylanGuedes
Copy link
Contributor

Thank you for your answer, @dilipbiswal
numerics has the following scheme:

create table numerics (                                                                                                                                     |2
    id int,                                                                                                                                                 |2
    f_float4 float,                                                                                                                                         |2
    f_float8 float,                                                                                                                                         |2
    f_numeric int                                                                                                                                           |2
) using parquet;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants