

@johnhany97 (Member) commented Dec 2, 2019

apache#26738
apache#26749

### What changes were proposed in this pull request?

Depend on type coercion when building the replace query. This solves an edge case where, when trying to replace NaNs, 0s would get replaced too.
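
As a rough illustration of the idea (a minimal sketch with a hypothetical helper name, not the actual `DataFrameNaFunctions` change): when the value being replaced is `NaN`, only columns of a fractional type should take part in the replacement at all, since integral columns can never contain `NaN`.

```
import org.apache.spark.sql.types._

// Sketch only: decide per column whether a NaN replacement key even applies,
// instead of blindly casting NaN to the column's type (which would yield 0).
def nanReplacementApplies(colType: DataType): Boolean = colType match {
  case DoubleType | FloatType                        => true  // NaN is representable
  case ByteType | ShortType | IntegerType | LongType => false // NaN would become 0
  case _                                             => false
}
```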

### Why are the changes needed?

This Scala code snippet:

```
import scala.math;

println(Double.NaN.toLong)
```

returns `0`, which is problematic: if you run the following Spark code, `0`s get replaced as well:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```
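
The same truncation happens for every JVM integral type, which is why the integer `value` column above gets its `0`s swept up in the `NaN` replacement: on that column the intended `NaN` key effectively becomes `0`. This is standard Scala/JVM behavior:

```
// Converting NaN to any integral type silently yields 0, so a NaN replacement
// key on an integer column effectively becomes the key 0.
println(Double.NaN.toInt)   // 0
println(Double.NaN.toLong)  // 0
println(Double.NaN.toShort) // 0
println(Double.NaN.toByte)  // 0
println(Float.NaN.toInt)    // 0
```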

### Does this PR introduce any user-facing change?

Yes. After this PR, running the same code snippet as above returns the expected results:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

Additionally, query results change as a result of now depending on Scala's type coercion rules.

### How was this patch tested?

Added unit tests to verify replacing NaN only affects columns of type Float and Double.
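
For illustration, a rough sketch of what such a test could look like in Scala (assuming a local `SparkSession` and the same toy data as above; this is not the exact test added by the patch):

```
import org.apache.spark.sql.SparkSession

// Sketch only: replace NaN across all columns and check that the Double
// column's NaN is replaced while the integer column's zeros are untouched.
val spark = SparkSession.builder().master("local[*]").appName("nan-replace-test").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 0), (0.0, 3), (Double.NaN, 0)).toDF("index", "value")
val result = df.na.replace("*", Map(Double.NaN -> 2.0))

// Sort collected values so the check does not depend on row order.
val indexes = result.select("index").as[Double].collect().sorted.toSeq
val values  = result.select("value").as[Int].collect().sorted.toSeq

assert(indexes == Seq(0.0, 1.0, 2.0)) // the NaN became 2.0
assert(values == Seq(0, 0, 3))        // the zeros were not replaced

spark.stop()
```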

@palantirtech (Member)

Thanks for your interest in palantir/spark, @johnhany97! Before we can accept your pull request, you need to sign our contributor license agreement - just visit https://cla.palantir.com/ and follow the instructions. Once you sign, I'll automatically update this pull request.

@johnhany97 force-pushed the SPARK-30082-palantir branch from 6d31ede to a491e1f on December 2, 2019 16:10
@johnhany97 changed the title from "[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[WIP[SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[WIP[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[WIP][SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[WIP][SPARK-30082] Do not replace Zeros when replacing NaNs" to "[SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs" on Dec 3, 2019
@johnhany97 changed the title from "[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs" to "[SPARK-30082][SQL] Depend on Scala type coercion when building replace query" on Jan 10, 2020
@johnhany97 (Member, Author)

@mccheah, can you take a look?

### What changes were proposed in this pull request?
Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. Casting `NaN` to those types results in a `0`, which erroneously replaces `0`s when only `NaN`s should be replaced.
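
Schematically, the failure mode before this change is plain integral coercion of the replacement key (hypothetical variable names, purely illustrative):

```
// Casting the NaN key to an integer column's type turns the intended
// mapping NaN -> 2 into 0 -> 2 for that column.
val toReplace: Double   = Double.NaN
val replacement: Double = 2.0

val keyForIntColumn: Long = toReplace.toLong        // 0, not NaN
println(Map(keyForIntColumn -> replacement.toLong)) // Map(0 -> 2)
```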

### Why are the changes needed?
This Scala code snippet:
```
import scala.math;

println(Double.NaN.toLong)
```
returns `0`, which is problematic: if you run the following Spark code, `0`s get replaced as well:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```

### Does this PR introduce any user-facing change?
Yes. After this PR, running the same code snippet as above returns the expected results:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

### How was this patch tested?

Added unit tests to verify that replacing `NaN` only affects columns of type `Float` and `Double`.

Closes apache#26749 from johnhany97/SPARK-30082-2.4.

Authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@johnhany97 force-pushed the SPARK-30082-palantir branch from ea4d09e to 82e0669 on January 15, 2020 14:00
@mccheah commented Jan 15, 2020

Approving as this is a cherry-pick

bulldozer-bot (bot) merged commit f3200ec into palantir:master on Jan 15, 2020