[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result in an array of null when creating DataFrame using Python #14198
Conversation
Python's array module has more types than Python itself; for example, Python only has float, while array supports both 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference.
For example, for the code:
```
from pyspark.sql.types import _infer_type
from array import array

a = array('f', [1, 2, 3, 4, 5, 6])
_infer_type(a)
```
we get ArrayType(DoubleType,true) before this change, but ArrayType(FloatType,true) after it.
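For illustration, a typecode-driven inference helper in the spirit of this change might look like the sketch below. The mapping table and helper name are hypothetical, not the exact code in this patch (the real change extends the inference logic in pyspark.sql.types):
```
from array import array
from pyspark.sql.types import (ArrayType, ByteType, ShortType,
                               IntegerType, LongType, FloatType, DoubleType)

# Hypothetical typecode -> Spark SQL element type table.
_TYPECODE_TO_SPARK_TYPE = {
    'b': ByteType(), 'h': ShortType(), 'i': IntegerType(), 'l': LongType(),
    'f': FloatType(), 'd': DoubleType(),
}

def infer_array_type(arr):
    """Infer the element type from array.typecode rather than from the values."""
    return ArrayType(_TYPECODE_TO_SPARK_TYPE[arr.typecode], True)

print(infer_array_type(array('f', [1, 2, 3])))  # ArrayType(FloatType,true)
```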
Oh interesting, thanks for working on this @zasdfgbnm, and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in, you generally need some form of automated tests; let me know if you would like some help adding tests for this issue.
I'd love the help, @holdenk.
Something to mention is that there is still one problem I'm not sure I solved correctly: Python's array supports unsigned types, but the JVM does not. The solution in this PR is to convert each unsigned type to a larger signed type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception instead.
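For illustration, a widening map in the spirit described above might look like this; the table and name are hypothetical, and the PR's actual handling may differ:
```
from pyspark.sql.types import ShortType, IntegerType, LongType

# Hypothetical widening of unsigned array typecodes to the smallest signed
# JVM-representable type. Note that 'L' (unsigned long) can still overflow
# a signed long on platforms where it is 64-bit, which is why rejecting it
# with an exception is the alternative discussed above.
UNSIGNED_WIDENING = {
    'B': ShortType(),    # unsigned char, 0..255
    'H': IntegerType(),  # unsigned short, 0..65535
    'I': LongType(),     # unsigned int
    'L': LongType(),     # unsigned long
}
```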
Hi @holdenk, I think I'm done. I created a test for this issue, and from the test I found that Spark has the same issue not only for float but also for byte and short. After several commits, to be clear, I need to say that only arrays with typecode … Would you, or any other developer, review my code and get it merged?
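A minimal round-trip check in the spirit of that test might look like the sketch below. This is illustrative only; the actual test lives in pyspark's test suite, and the assertion assumes the fixed behavior:
```
from array import array
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# byte ('b'), short ('h'), and float ('f') arrays all exhibited the bug:
# without the fix they come back as arrays of null.
for typecode in ['b', 'h', 'f']:
    df = spark.createDataFrame([(array(typecode, [1, 2, 3]),)], ["arr"])
    assert df.first().arr == [1, 2, 3], "round-trip failed for %r" % typecode
```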
cc @ueshin
@zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
@ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago, and I have been waiting a year for review. If it is guaranteed that a reviewer will be assigned to this soon, I will resolve the conflicts; otherwise, I don't want to maintain a PR forever just to wait for review.
@zasdfgbnm I think you can ping @ueshin to review.
@zasdfgbnm Please reopen the PR, and @ueshin can help review it. Thanks!
Reopened at #18444.
retest this please |
[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result in an array of null when creating DataFrame using Python

## What changes were proposed in this pull request?

This is the reopen of #14198, with merge conflicts resolved. @ueshin, could you please take a look at my code?

Fix bugs about types that result in an array of null when creating DataFrame using Python. Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple piece of code to reproduce this bug is:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```
which has the output
```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test case added.

Author: Xiang Gao <[email protected]>
Author: Gao, Xiang <[email protected]>
Author: Takuya UESHIN <[email protected]>

Closes #18444 from zasdfgbnm/fix_array_infer.
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances as in apache#18017. I believe the author of apache#14807 removed his account.

Closes apache#7075, apache#8927, apache#9202, apache#9366, apache#10861, apache#11420, apache#12356, apache#13028, apache#13506, apache#14191, apache#14198, apache#14330, apache#14807, apache#15839, apache#16225, apache#16685, apache#16692, apache#16995, apache#17181, apache#17211, apache#17235, apache#17237, apache#17248, apache#17341, apache#17708, apache#17716, apache#17721, apache#17937

Added: Closes apache#14739, apache#17139, apache#17445, apache#18042, apache#18359
Added: Closes apache#16450, apache#16525, apache#17738
Added: Closes apache#16458, apache#16508, apache#17714
Added: Closes apache#17830, apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
What changes were proposed in this pull request?
Fix bugs about types that result in an array of null when creating DataFrame using Python.
Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple piece of code to reproduce this bug is:
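```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```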
which has the output
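```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```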
How was this patch tested?
Tested manually.