[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result in an array of null when creating DataFrame using Python #14198
Conversation
Python's array module has more types than Python itself; for example, Python only has float, while array supports both 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference.
For example, for the code:
```
from pyspark.sql.types import _infer_type
from array import array

a = array('f', [1, 2, 3, 4, 5, 6])
_infer_type(a)
```
we get ArrayType(DoubleType,true) before this change, but ArrayType(FloatType,true) after it.
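For illustration, a typecode-driven inference helper in the spirit of this change might look like the sketch below. The mapping table and helper name are hypothetical, not the exact code in this patch (the real change extends the inference logic in pyspark.sql.types):
```
from array import array
from pyspark.sql.types import (ArrayType, ByteType, ShortType,
                               IntegerType, LongType, FloatType, DoubleType)

# Hypothetical typecode -> Spark SQL element type table.
_TYPECODE_TO_SPARK_TYPE = {
    'b': ByteType(), 'h': ShortType(), 'i': IntegerType(), 'l': LongType(),
    'f': FloatType(), 'd': DoubleType(),
}

def infer_array_type(arr):
    """Infer the element type from array.typecode rather than from the values."""
    return ArrayType(_TYPECODE_TO_SPARK_TYPE[arr.typecode], True)

print(infer_array_type(array('f', [1, 2, 3])))  # ArrayType(FloatType,true)
```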
Oh interesting, thanks for working on this @zasdfgbnm, and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in, you generally need some form of automated tests; let me know if you would like some help adding tests for this issue.
I'd love the help, @holdenk.
Something to mention is that there is still one problem I'm not sure I solved correctly: Python's array supports unsigned types, but the JVM does not. The solution in this PR is to convert each unsigned type to a larger signed type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception instead.
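For illustration, a widening map in the spirit described above might look like this; the table and name are hypothetical, and the PR's actual handling may differ:
```
from pyspark.sql.types import ShortType, IntegerType, LongType

# Hypothetical widening of unsigned array typecodes to the smallest signed
# JVM-representable type. Note that 'L' (unsigned long) can still overflow
# a signed long on platforms where it is 64-bit, which is why rejecting it
# with an exception is the alternative discussed above.
UNSIGNED_WIDENING = {
    'B': ShortType(),    # unsigned char, 0..255
    'H': IntegerType(),  # unsigned short, 0..65535
    'I': LongType(),     # unsigned int
    'L': LongType(),     # unsigned long
}
```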
Hi @holdenk, I think I'm done. I created a test for this issue, and from the test I found that Spark has the same issue not only for float but also for byte and short. After several commits, to be clear, I need to say that only arrays with typecode … Would you, or any other developer, review my code and get it merged?
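A minimal round-trip check in the spirit of that test might look like the sketch below. This is illustrative only; the actual test lives in pyspark's test suite, and the assertion assumes the fixed behavior:
```
from array import array
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# byte ('b'), short ('h'), and float ('f') arrays all exhibited the bug:
# without the fix they come back as arrays of null.
for typecode in ['b', 'h', 'f']:
    df = spark.createDataFrame([(array(typecode, [1, 2, 3]),)], ["arr"])
    assert df.first().arr == [1, 2, 3], "round-trip failed for %r" % typecode
```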
cc @ueshin
@zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
@ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago, and I have been waiting a year for review. If it is guaranteed that a reviewer will be assigned to this soon, I will resolve the conflicts; otherwise, I don't want to maintain a PR forever just to wait for review.
@zasdfgbnm I think you can ping @ueshin to review.
@zasdfgbnm Please reopen the PR, and @ueshin can help review it. Thanks!
Reopened at #18444.
retest this please |
[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result in an array of null when creating DataFrame using Python

## What changes were proposed in this pull request?

This is the reopen of #14198, with merge conflicts resolved. @ueshin, could you please take a look at my code?

Fix bugs about types that result in an array of null when creating DataFrame using Python. Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple piece of code to reproduce this bug is:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```
which has the output
```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test case added.

Author: Xiang Gao <[email protected]>
Author: Gao, Xiang <[email protected]>
Author: Takuya UESHIN <[email protected]>

Closes #18444 from zasdfgbnm/fix_array_infer.
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances as in apache#18017. I believe the author of apache#14807 removed his account.

Closes apache#7075, apache#8927, apache#9202, apache#9366, apache#10861, apache#11420, apache#12356, apache#13028, apache#13506, apache#14191, apache#14198, apache#14330, apache#14807, apache#15839, apache#16225, apache#16685, apache#16692, apache#16995, apache#17181, apache#17211, apache#17235, apache#17237, apache#17248, apache#17341, apache#17708, apache#17716, apache#17721, apache#17937

Added: Closes apache#14739, apache#17139, apache#17445, apache#18042, apache#18359
Added: Closes apache#16450, apache#16525, apache#17738
Added: Closes apache#16458, apache#16508, apache#17714
Added: Closes apache#17830, apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
What changes were proposed in this pull request?
Fix bugs about types that result in an array of null when creating DataFrame using Python.
Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple piece of code to reproduce this bug is:
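```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```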
which has the output
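```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```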
How was this patch tested?
Tested manually.