
Conversation


@ueshin ueshin commented Sep 13, 2017

Added tests from apache#19147 and adjusted.

Currently these tests produce 1 failure and 8 errors:

  • failure
    • test_vectorized_udf_invalid_length
  • error
    • test_vectorized_udf_null_boolean
    • test_vectorized_udf_null_byte
    • test_vectorized_udf_null_short
    • test_vectorized_udf_null_int
    • test_vectorized_udf_null_long
    • test_vectorized_udf_null_string
    • test_vectorized_udf_zero_parameter
    • test_vectorized_udf_datatype_string


ueshin commented Sep 13, 2017

As for the null-related errors, the cause is that pandas changes a column's dtype when an integral-type column contains None (e.g. an int column with None is promoted to float64). In some cases Arrow then throws an ArrowNotImplementedError.
I found a workaround like https://github.com/apache/spark/pull/19147/files#diff-e954728a2630cfd7c824e97405d08aafR599 when I implemented this before.
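The dtype promotion can be reproduced with plain pandas, independent of Spark (a minimal sketch):

```python
import pandas as pd

# An integral column without nulls keeps its integer dtype.
s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Introducing None forces pandas to promote the column to float64
# (None becomes NaN), which no longer matches the declared Spark type.
s_null = pd.Series([1, 2, None])
print(s_null.dtype)  # float64
```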

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType
import pandas as pd

df = self.spark.range(100000)
# The UDF takes only keyword arguments and uses the 'size' hint
# to build a result Series of the expected length.
f0 = pandas_udf(lambda **kwargs: pd.Series(1).repeat(kwargs['size']), LongType())
ueshin (Author):

I assumed you would use the **kwargs approach for the size hint.
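For illustration, the kwargs-style UDF body can be exercised directly with plain pandas, outside Spark (a minimal sketch; the `size` hint is passed explicitly here, whereas in Spark the worker would supply it):

```python
import pandas as pd

# The same lambda as in the snippet above: it ignores positional inputs
# and builds a Series of ones whose length comes from the 'size' hint.
f = lambda **kwargs: pd.Series(1).repeat(kwargs['size'])

result = f(size=5)
print(len(result))     # 5
print(result.iloc[0])  # 1
```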

BryanCutler (Owner):

I'll give it a try and we can see how it turns out, then we can discuss.

@BryanCutler
Owner

Thanks @ueshin , this looks good. I'll merge it now and work on the errors.

> As for the null-related errors, the cause is that pandas changes the dtype when an integral-type column contains None. In some cases Arrow then throws an ArrowNotImplementedError.

Thanks for the pointer, I'll take a look at what's not implemented and see if I can help push that along.

BryanCutler pushed a commit that referenced this pull request Sep 15, 2017
Modify test_vectorized_udf_datatype_string not to fail by unrelated error.

closes #26
@BryanCutler
Owner

Thanks, merged now. I still need to fix the null-related errors with your suggestion. Would you mind if I used the toPandasSchema you wrote here https://github.com/apache/spark/pull/19147/files#diff-c1cf83efe5a5b1f2f1f770589988e997R1608?
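The actual toPandasSchema lives in the linked PR; purely as an illustration of the idea, here is a hypothetical sketch that maps Spark SQL type names to NumPy dtype strings and coerces a promoted column back when it contains no nulls (all names below are assumptions for illustration, not the real helper):

```python
import pandas as pd

# Hypothetical mapping from Spark SQL type names to NumPy dtype strings.
# The real toPandasSchema in the linked PR may differ.
SPARK_TO_PANDAS_DTYPE = {
    'byte': 'int8',
    'short': 'int16',
    'int': 'int32',
    'long': 'int64',
    'boolean': 'bool',
}

def coerce_result(series, spark_type):
    # Coerce only when there are no nulls; a null forces pandas to keep
    # the promoted (float64/object) dtype.
    dtype = SPARK_TO_PANDAS_DTYPE.get(spark_type)
    if dtype is not None and not series.isnull().any():
        return series.astype(dtype)
    return series

print(coerce_result(pd.Series([1.0, 2.0]), 'long').dtype)  # int64
```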


ueshin commented Sep 16, 2017

@BryanCutler Sure, go ahead and use it. Thanks!

