
Conversation


@ueshin ueshin commented Sep 13, 2017

Added tests from apache#19147 and adjusted.

Currently these tests produce 1 failure and 8 errors:

  • failure
    • test_vectorized_udf_invalid_length
  • error
    • test_vectorized_udf_null_boolean
    • test_vectorized_udf_null_byte
    • test_vectorized_udf_null_short
    • test_vectorized_udf_null_int
    • test_vectorized_udf_null_long
    • test_vectorized_udf_null_string
    • test_vectorized_udf_zero_parameter
    • test_vectorized_udf_datatype_string


ueshin commented Sep 13, 2017

As for the null-related errors, the cause is that pandas changes a column's dtype when an integral-type column contains None (e.g. an int column with None is promoted to float64). In some cases Arrow then throws an ArrowNotImplementedError.
I found a workaround like https://github.com/apache/spark/pull/19147/files#diff-e954728a2630cfd7c824e97405d08aafR599 when I implemented this before.
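The dtype promotion can be reproduced with plain pandas, independent of Spark (a minimal sketch):

```python
import pandas as pd

# An integral column without nulls keeps its integer dtype.
s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Introducing None forces pandas to promote the column to float64
# (None becomes NaN), which no longer matches the declared Spark type.
s_null = pd.Series([1, 2, None])
print(s_null.dtype)  # float64
```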

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType
import pandas as pd

df = self.spark.range(100000)
# The UDF takes only keyword arguments and uses the 'size' hint
# to build a result Series of the expected length.
f0 = pandas_udf(lambda **kwargs: pd.Series(1).repeat(kwargs['size']), LongType())
ueshin (Author):

I assumed you would use the **kwargs approach for the size hint.
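For illustration, the kwargs-style UDF body can be exercised directly with plain pandas, outside Spark (a minimal sketch; the `size` hint is passed explicitly here, whereas in Spark the worker would supply it):

```python
import pandas as pd

# The same lambda as in the snippet above: it ignores positional inputs
# and builds a Series of ones whose length comes from the 'size' hint.
f = lambda **kwargs: pd.Series(1).repeat(kwargs['size'])

result = f(size=5)
print(len(result))     # 5
print(result.iloc[0])  # 1
```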

BryanCutler (Owner):

I'll give it a try and we can see how it turns out, then we can discuss.

@BryanCutler
Owner

Thanks @ueshin , this looks good. I'll merge it now and work on the errors.

> As for the null-related errors, the cause is that pandas changes the dtype when an integral-type column contains None. In some cases Arrow then throws an ArrowNotImplementedError.

Thanks for the pointer, I'll take a look at what's not implemented and see if I can help push that along.

BryanCutler pushed a commit that referenced this pull request Sep 15, 2017
Modify test_vectorized_udf_datatype_string not to fail by unrelated error.

closes #26
@BryanCutler
Owner

Thanks, merged now. I still need to fix the null-related errors with your suggestion. Would you mind if I used the toPandasSchema you wrote here https://github.com/apache/spark/pull/19147/files#diff-c1cf83efe5a5b1f2f1f770589988e997R1608?
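The actual toPandasSchema lives in the linked PR; purely as an illustration of the idea, here is a hypothetical sketch that maps Spark SQL type names to NumPy dtype strings and coerces a promoted column back when it contains no nulls (all names below are assumptions for illustration, not the real helper):

```python
import pandas as pd

# Hypothetical mapping from Spark SQL type names to NumPy dtype strings.
# The real toPandasSchema in the linked PR may differ.
SPARK_TO_PANDAS_DTYPE = {
    'byte': 'int8',
    'short': 'int16',
    'int': 'int32',
    'long': 'int64',
    'boolean': 'bool',
}

def coerce_result(series, spark_type):
    # Coerce only when there are no nulls; a null forces pandas to keep
    # the promoted (float64/object) dtype.
    dtype = SPARK_TO_PANDAS_DTYPE.get(spark_type)
    if dtype is not None and not series.isnull().any():
        return series.astype(dtype)
    return series

print(coerce_result(pd.Series([1.0, 2.0]), 'long').dtype)  # int64
```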


ueshin commented Sep 16, 2017

@BryanCutler Sure, go ahead and use it. Thanks!

