-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively #25331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok, I'll check in hours ;) |
| $evPrim = Float.valueOf($c.toString()); | ||
| } catch (java.lang.NumberFormatException e) { | ||
| $evNull = true; | ||
| final String $str = $c.toString().trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is called only in case of exceptions, it would be great to call .toString once by reusing the one in line 1265.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun oops.. sorry.. i will fix.
| Float.NaN | ||
| } else { | ||
| null | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid duplications, shall we make a function for this since we have the same Float logic twice here and line 1268?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun The assignment variable names are different ? Is there an example i can follow where codegen and non-codegen code is refactored to a function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need an assignment if you handle everything inside the function.
The function will return the right-side value of the assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Ah... thanks .. will try.
| } | ||
| } | ||
|
|
||
| test("SPARK-27768 process infinity, -infinity , nan in case insensitive manner") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is not a bug. Instead, this is a behavior change for PostgreSQL compatibility.
Previously, Spark works like Java/Scala/Hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun OK.. will remove the JIRA id.
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala
Outdated
Show resolved
Hide resolved
| buildCast[UTF8String](_, s => try s.toString.toDouble catch { | ||
| case _: NumberFormatException => null | ||
| case _: NumberFormatException => | ||
| val str = s.trim.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its just a check; do we support this case, e.g., ' infinity '? (I know pg does so though). If we support this case, IMO you'd be better to clearly describe in the PR description, which case this PR try to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Currently i trim spaces from both ends and compare. So this case should work.
Ok.. i will update the description.
| case _: NumberFormatException => null | ||
| case _: NumberFormatException => | ||
| val str = s.trim.toString | ||
| if (str.equalsIgnoreCase("infinity")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a check; we don't support some alternatives, e.g., inf, -inf, ... ?https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/float.c#L194
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu We don't support inf and -inf, should we support ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another db-like systems supporting this, e.g., DB2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Yes.. db2 does allow inf and -inf. Lets add in spark also ? Thanks a lot for pointing this..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO some users might think inf is more common, e.g., even in python, both inf and infinity can be used;
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
>>> float('inf')
inf
>>> float('infinity')
inf
Its ok to wait for other reviewer's comments. @gatorsmile @dongjoon-hyun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile @dongjoon-hyun Could you please help make a decision on whether to support inf, +inf for positive infinity and '-inf` for negative infinity ? Thanks a lot. Thanks again @maropu for pointing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @maropu mentioned, since PostgreSQL supports that, I'm +1 for supporting it.
else if (pg_strncasecmp(num, "inf", 3) == 0)
{
val = get_float4_infinity();
endptr = num + 3;
}
else if (pg_strncasecmp(num, "+inf", 4) == 0)
{
val = get_float4_infinity();
endptr = num + 4;
}
else if (pg_strncasecmp(num, "-inf", 4) == 0)
{
val = -get_float4_infinity();
endptr = num + 4;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun OK.. thanks.. i will make the change.
|
Test build #108533 has finished for PR 25331 at commit
|
| buildCast[UTF8String](_, s => try s.toString.toFloat catch { | ||
| case _: NumberFormatException => null | ||
| case _: NumberFormatException => | ||
| val str = s.trim.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge lines 551-560 and 577-586 into one function? Since the return type will be Any, we can merge two cases Float and Double into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK @kiszk
|
Test build #108557 has finished for PR 25331 at commit
|
|
@dongjoon-hyun I have incorporated the changes.. Please take a look when you get a chance. I am not fully sure if the refactoring between codegen and non-codegen is necessary. I guess, since this code path is exercised only in the exception path, its okay to go for code-reuse in this specific case :-) |
| } | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this newline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Thx.. will do.
| buildCast[UTF8String](_, s => { | ||
| val floatStr = s.toString | ||
| try floatStr.toFloat catch { | ||
| case _: NumberFormatException => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we explicitly match strings we want? We can avoid try-catch of an exception for the known strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya So my idea here was that.. since we don't expect these kind of literals a lot i.e its not a common case .. we don't change our normal processing path to add any possible runtime costs. Thats why we keep all these in the exception processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the @viirya approach for readability, but I understand your concern for the performance. So, could you check actual performance changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Oops.. didn't see this comment. I suppose i have to use the benchmark framework for this ? Appreciate any tip on this..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we check the numbers by a simple query?, e.g.,
scala> spark.range(10).selectExpr("CAST(double(id) AS STRING) a").write.save("/tmp/test")
scala> spark.read.load("/tmp/test").selectExpr("CAST(a AS DOUBLE)").write.format("noop").save()
In another pr, I observed that a logic depending on exceptions cause high performance penalties: lz4/lz4-java#143
So, I'm a bit worried that this current logic has the same issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Actually my input data didn't contain any of these special literals. Basically i was testing with the condition when we don't hit the catch block. Basically we are trying optimize for the best case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just make sure that a small number of inf/-inf does not worse the prformance. If this is true, I think the current code looks ok to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu In the mean time i had tried with 1% of data being 'inf' and i can confirm that it does not hurt the performance :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds nice, too ;)
|
|
||
| def resolvableNullability(from: Boolean, to: Boolean): Boolean = !from || to | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like an extra empty line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Thx.. will do.
|
Test build #108583 has finished for PR 25331 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, where does "infiNity" come up? or "+infinity"? I don't think this is how the JVM or Python renders them?
|
|
||
| def processFloatingPointSpecialLiterals(v: String, isFloat: Boolean): Any = { | ||
| val str = v.trim | ||
| if (str.equalsIgnoreCase("infinity") || str.equalsIgnoreCase("+infinity")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need Locale.ROOT here? it's a really obscure possibility, but "i" can cause problems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen OK.. will add Locale.ROOT by following how its used in other places.
May i please request you to refer to the discussions in the associated JIRA ? Basically our intention here is accept these "special" float and double literals in a case insensitive manner as some other database systems allow them ? Do you see a problem here ? |
928e4c1 to
e901dc4
Compare
|
|
||
| def resolvableNullability(from: Boolean, to: Boolean): Boolean = !from || to | ||
|
|
||
| def processFloatingPointSpecialLiterals(v: String, isFloat: Boolean): Any = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually some literals like InFiniTy can be casted, it looks a bit weird. However, postgresql accepts such things. Maybe add a comment here to explain why allowing case insensitive match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Does this look ok ?
We process literals such as 'Infinity', 'Inf', '-Infinity' and 'NaN' in case insensitive manner to be compatible with other database systems such as Postgres and DB2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel ok. Other reviewers may also have advice too.
|
Test build #108861 has finished for PR 25331 at commit
|
|
retest this please |
|
Test build #108876 has finished for PR 25331 at commit
|
|
Test build #108982 has finished for PR 25331 at commit
|
docs/sql-migration-guide-upgrade.md
Outdated
| </tr> | ||
| <tr> | ||
| <td> | ||
| CAST('infinity' TO DOUBLE)<br> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a SQL syntax, TO -> AS?
|
Thank you for the decision, @srowen ! |
|
|
||
| /** | ||
| * We process literals such as 'Infinity', 'Inf', '-Infinity' and 'NaN' etc in case | ||
| * insensitive manner to be compatible with other database systems such as Postgres and DB2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postgres -> PostgreSQL.
|
@dilipbiswal . I update the PR title because this PR focuses on only Spark's postgres=# select cast('infinity' as date);
date
----------
infinity
(1 row) |
docs/sql-migration-guide-upgrade.md
Outdated
| <b>Operation</b> | ||
| </th> | ||
| <th> | ||
| <b>Result prior to spark 3.0</b> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark -> Spark.
docs/sql-migration-guide-upgrade.md
Outdated
| <b>Result prior to spark 3.0</b> | ||
| </th> | ||
| <th> | ||
| <b>Result starting spark 3.0</b> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
docs/sql-migration-guide-upgrade.md
Outdated
| </tr> | ||
| <tr> | ||
| <td> | ||
| CAST('nan' TO FLOAT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix all instances of CAST. It seems to wrong as I commented in the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun oops...thanks.. fixed.
| $evPrim = Float.valueOf($floatStr); | ||
| } catch (java.lang.NumberFormatException e) { | ||
| $evNull = true; | ||
| Float f = (Float) Cast.processFloatingPointSpecialLiterals($floatStr, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float -> final Float?
| $evPrim = Double.valueOf($doubleStr); | ||
| } catch (java.lang.NumberFormatException e) { | ||
| $evNull = true; | ||
| Double d = (Double) Cast.processFloatingPointSpecialLiterals($doubleStr, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double -> final Double?
| checkEvaluation(cast("-infinity", FloatType), Float.NegativeInfinity) | ||
| checkEvaluation(cast("-infiniTy", FloatType), Float.NegativeInfinity) | ||
| checkEvaluation(cast(" -infinity ", FloatType), Float.NegativeInfinity) | ||
| checkEvaluation(cast(" -inf ", FloatType), Float.NegativeInfinity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have negative cases returning NULL in somewhere else?
|
|
||
| checkEvaluation(cast("nan", FloatType), Float.NaN) | ||
| checkEvaluation(cast("nAn", FloatType), Float.NaN) | ||
| checkEvaluation(cast(" nan ", FloatType), Float.NaN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. Can we have ' -naN ', too?
|
|
||
| checkEvaluation(cast("nan", DoubleType), Double.NaN) | ||
| checkEvaluation(cast("nAn", DoubleType), Double.NaN) | ||
| checkEvaluation(cast(" nan ", DoubleType), Double.NaN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for asking this, but can we use the following shorter form?
Seq("inf", "+inf", "infinity", "+infiNity", " infinity ").foreach { value =>
checkEvaluation(cast(value, FloatType), Float.PositiveInfinity)
}
Seq("-infinity", "-infiniTy", " -infinity ", " -inf ").foreach { value =>
checkEvaluation(cast(value, FloatType), Float.NegativeInfinity)
}
Seq("inf", "+inf", "infinity", "+infiNity", " infinity ").foreach { value =>
checkEvaluation(cast(value, DoubleType), Double.PositiveInfinity)
}
Seq("-infinity", "-infiniTy", " -infinity ", " -inf ").foreach { value =>
checkEvaluation(cast(value, DoubleType), Double.NegativeInfinity)
}
Seq("nan", "nAn", " nan ").foreach { value =>
checkEvaluation(cast(value, FloatType), Float.NaN)
}
Seq("nan", "nAn", " nan ").foreach { value =>
checkEvaluation(cast(value, DoubleType), Double.NaN)
}| struct<CAST(nan AS DOUBLE):double> | ||
| -- !query 11 output | ||
| NULL | ||
| NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove SPARK-28060 in float8.sql.
|
Thanks. @dilipbiswal . I'll review once more after you fix the above comments. |
|
Test build #109014 has finished for PR 25331 at commit
|
|
retest this please |
|
Test build #109025 has finished for PR 25331 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master.
Thank you, @dilipbiswal , @maropu , @kiszk , @viirya , @srowen.
|
Thanks a lot @dongjoon-hyun @maropu @kiszk @viirya @srowen for very detailed review. Appreciate it very much !! |
|
Hey @dilipbiswal , I get the following error for inline tables (although it work great for normal tables): SELECT a, b,
SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (VALUES(1,1),(2,2),(3,'NaN'),(4,3),(5,4)) t(a,b); However, as I said, it works fine for normal tables: insert into numerics values (0, 'NaN', 'NaN', '-1000');
select f_float4, sum(f_float4) over (order by f_float8 rows between 1 preceding and current row) from numerics;Do you know if this is related to me making use of a window function or it is related to case-insensitively NaN for inline tables? |
|
Hello @DylanGuedes Don't think its related to case sensitivity. I have not looked at it in detail, but from a cursory look it seems like we are not able to promote the col2 values to a common type in this code path. Removing windowing all together, the following fails : What is the table definition for |
|
Thank you for your answer, @dilipbiswal create table numerics ( |2
id int, |2
f_float4 float, |2
f_float8 float, |2
f_numeric int |2
) using parquet; |
What changes were proposed in this pull request?
Here is the problem description from the JIRA.
In this PR, the casting code is enhanced to handle these
specialstring literals in case insensitive manner.How was this patch tested?
Added tests in CastSuite and modified existing test suites.