

@johnhany97 (Member) commented Dec 2, 2019

apache#26738
apache#26749

### What changes were proposed in this pull request?

Depend on type coercion when building the replace query. This solves an edge case where, when trying to replace NaNs, 0s would get replaced too.
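
As a rough illustration of the idea (a minimal sketch with a hypothetical helper name, not the actual `DataFrameNaFunctions` change): when the value being replaced is `NaN`, only columns of a fractional type should take part in the replacement at all, since integral columns can never contain `NaN`.

```
import org.apache.spark.sql.types._

// Sketch only: decide per column whether a NaN replacement key even applies,
// instead of blindly casting NaN to the column's type (which would yield 0).
def nanReplacementApplies(colType: DataType): Boolean = colType match {
  case DoubleType | FloatType                        => true  // NaN is representable
  case ByteType | ShortType | IntegerType | LongType => false // NaN would become 0
  case _                                             => false
}
```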

### Why are the changes needed?

This Scala code snippet:

```
import scala.math;

println(Double.NaN.toLong)
```

returns `0`, which is problematic: if you run the following Spark code, `0`s get replaced as well:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```
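
The same truncation happens for every JVM integral type, which is why the integer `value` column above gets its `0`s swept up in the `NaN` replacement: on that column the intended `NaN` key effectively becomes `0`. This is standard Scala/JVM behavior:

```
// Converting NaN to any integral type silently yields 0, so a NaN replacement
// key on an integer column effectively becomes the key 0.
println(Double.NaN.toInt)   // 0
println(Double.NaN.toLong)  // 0
println(Double.NaN.toShort) // 0
println(Double.NaN.toByte)  // 0
println(Float.NaN.toInt)    // 0
```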

### Does this PR introduce any user-facing change?

Yes. After this PR, running the same code snippet as above returns the expected results:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

Additionally, query results change as a result of now depending on Scala's type coercion rules.

### How was this patch tested?

Added unit tests to verify replacing NaN only affects columns of type Float and Double.
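
For illustration, a rough sketch of what such a test could look like in Scala (assuming a local `SparkSession` and the same toy data as above; this is not the exact test added by the patch):

```
import org.apache.spark.sql.SparkSession

// Sketch only: replace NaN across all columns and check that the Double
// column's NaN is replaced while the integer column's zeros are untouched.
val spark = SparkSession.builder().master("local[*]").appName("nan-replace-test").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 0), (0.0, 3), (Double.NaN, 0)).toDF("index", "value")
val result = df.na.replace("*", Map(Double.NaN -> 2.0))

// Sort collected values so the check does not depend on row order.
val indexes = result.select("index").as[Double].collect().sorted.toSeq
val values  = result.select("value").as[Int].collect().sorted.toSeq

assert(indexes == Seq(0.0, 1.0, 2.0)) // the NaN became 2.0
assert(values == Seq(0, 0, 3))        // the zeros were not replaced

spark.stop()
```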

@palantirtech (Member)

Thanks for your interest in palantir/spark, @johnhany97! Before we can accept your pull request, you need to sign our contributor license agreement - just visit https://cla.palantir.com/ and follow the instructions. Once you sign, I'll automatically update this pull request.

@johnhany97 force-pushed the SPARK-30082-palantir branch from 6d31ede to a491e1f on December 2, 2019 16:10
@johnhany97 changed the title from "[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[WIP[SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[WIP[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[WIP][SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[WIP][SPARK-30082] Do not replace Zeros when replacing NaNs" to "[SPARK-30082] Do not replace Zeros when replacing NaNs" on Dec 2, 2019
@johnhany97 changed the title from "[SPARK-30082] Do not replace Zeros when replacing NaNs" to "[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs" on Dec 3, 2019
@johnhany97 changed the title from "[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs" to "[SPARK-30082][SQL] Depend on Scala type coercion when building replace query" on Jan 10, 2020
@johnhany97 (Member, Author)

@mccheah, can you take a look?

### What changes were proposed in this pull request?
Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. Casting `NaN` to those types results in a `0`, which erroneously replaces `0`s when only `NaN`s should be replaced.
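
Schematically, the failure mode before this change is plain integral coercion of the replacement key (hypothetical variable names, purely illustrative):

```
// Casting the NaN key to an integer column's type turns the intended
// mapping NaN -> 2 into 0 -> 2 for that column.
val toReplace: Double   = Double.NaN
val replacement: Double = 2.0

val keyForIntColumn: Long = toReplace.toLong        // 0, not NaN
println(Map(keyForIntColumn -> replacement.toLong)) // Map(0 -> 2)
```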

### Why are the changes needed?
This Scala code snippet:
```
import scala.math;

println(Double.NaN.toLong)
```
returns `0`, which is problematic: if you run the following Spark code, `0`s get replaced as well:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```

### Does this PR introduce any user-facing change?
Yes. After this PR, running the same code snippet as above returns the expected results:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

### How was this patch tested?

Added unit tests to verify that replacing `NaN` only affects columns of type `Float` and `Double`.

Closes apache#26749 from johnhany97/SPARK-30082-2.4.

Authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@johnhany97 force-pushed the SPARK-30082-palantir branch from ea4d09e to 82e0669 on January 15, 2020 14:00
@mccheah commented Jan 15, 2020

Approving as this is a cherry-pick

bulldozer-bot (bot) merged commit f3200ec into palantir:master on Jan 15, 2020