Commit ccb0a59

HyukjinKwon authored and gatorsmile committed
[SPARK-23446][PYTHON] Explicitly check supported types in toPandas
## What changes were proposed in this pull request? This PR explicitly specifies and checks the types we supported in `toPandas`. This was a hole. For example, we haven't finished the binary type support in Python side yet but now it allows as below: ```python spark.conf.set("spark.sql.execution.arrow.enabled", "false") df = spark.createDataFrame([[bytearray("a")]]) df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", "true") df.toPandas() ``` ``` _1 0 [97] _1 0 a ``` This should be disallowed. I think the same things also apply to nested timestamps too. I also added some nicer message about `spark.sql.execution.arrow.enabled` in the error message. ## How was this patch tested? Manually tested and tests added in `python/pyspark/sql/tests.py`. Author: hyukjinkwon <[email protected]> Closes #20625 from HyukjinKwon/pandas_convertion_supported_type. (cherry picked from commit c5857e4) Signed-off-by: gatorsmile <[email protected]>
1 parent 75bb19a · commit ccb0a59
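For reference, a minimal sketch of the post-patch behavior, assuming a running SparkSession bound to `spark` (not part of this diff): with Arrow enabled, `toPandas` on a binary column now fails fast with the new hint instead of returning incorrect data.

```python
# Minimal sketch of the post-patch behavior; assumes a running
# SparkSession bound to `spark` (not part of this diff).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame([(bytearray(b"a"),)], schema="a binary")

try:
    df.toPandas()  # binary is not a supported Arrow conversion type yet
except RuntimeError as e:
    # The wrapped error now ends with the hint added by this patch:
    # "Note: toPandas attempted Arrow optimization because
    #  'spark.sql.execution.arrow.enabled' is set to true. Please set it
    #  to false to disable this."
    print(e)
```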

File tree: 2 files changed (+17, −7 lines)

python/pyspark/sql/dataframe.py

Lines changed: 9 additions & 6 deletions
```diff
@@ -1943,10 +1943,11 @@ def toPandas(self):
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
             try:
                 from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                    _check_dataframe_localize_timestamps, to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                import pyarrow
+                to_arrow_schema(self.schema)
                 tables = self._collectAsArrow()
                 if tables:
                     table = pyarrow.concat_tables(tables)
@@ -1955,10 +1956,12 @@ def toPandas(self):
                     return _check_dataframe_localize_timestamps(pdf, timezone)
                 else:
                     return pd.DataFrame.from_records([], columns=self.columns)
-            except ImportError as e:
-                msg = "note: pyarrow must be installed and available on calling Python process " \
-                      "if using spark.sql.execution.arrow.enabled=true"
-                raise ImportError("%s\n%s" % (_exception_message(e), msg))
+            except Exception as e:
+                msg = (
+                    "Note: toPandas attempted Arrow optimization because "
+                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
+                    "to disable this.")
+                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
         else:
             pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
```
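The key addition is the `to_arrow_schema(self.schema)` call before `_collectAsArrow()`: it converts each Spark field to an Arrow type and raises on unsupported ones, so the failure happens on the driver before any data is collected, and the widened `except Exception` then attaches the configuration hint. A rough sketch of that fail-fast check in isolation (`to_arrow_schema` is an internal helper in `pyspark.sql.types` at this version, pyarrow must be installed, and the exact error wording is an assumption here):

```python
# Rough sketch of the fail-fast schema check in isolation.
# to_arrow_schema is an internal helper in pyspark.sql.types (Spark 2.3.x);
# pyarrow must be installed. Exact error wording is an assumption.
from pyspark.sql.types import (BinaryType, LongType, StructField,
                               StructType, to_arrow_schema)

supported = StructType([StructField("n", LongType())])
print(to_arrow_schema(supported))  # converts fine, e.g. n: int64

unsupported = StructType([StructField("b", BinaryType())])
to_arrow_schema(unsupported)       # raises, e.g. "Unsupported type ..."
```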

python/pyspark/sql/tests.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -3443,7 +3443,14 @@ def test_unsupported_datatype(self):
         schema = StructType([StructField("map", MapType(StringType(), IntegerType()), True)])
         df = self.spark.createDataFrame([(None,)], schema=schema)
         with QuietTest(self.sc):
-            with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
+            with self.assertRaisesRegexp(Exception, 'Unsupported type'):
+                df.toPandas()
+
+        df = self.spark.createDataFrame([(None,)], schema="a binary")
+        with QuietTest(self.sc):
+            with self.assertRaisesRegexp(
+                    Exception,
+                    'Unsupported type.*\nNote: toPandas attempted Arrow optimization because'):
                 df.toPandas()
 
     def test_null_conversion(self):
```
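One note on the new assertion: `assertRaisesRegexp` matches with `re.search` against the stringified exception, and `.` does not cross newlines by default, so the explicit `\n` in the pattern pins the appended Note to the line after the original Arrow error. A small self-contained illustration, where the first line of `msg` is an assumed example of the underlying error text:

```python
# Why the multi-line pattern matches; the first line of `msg` is an
# assumed example of the underlying Arrow error text.
import re

msg = ("Unsupported type in conversion to Arrow: BinaryType\n"
       "Note: toPandas attempted Arrow optimization because "
       "'spark.sql.execution.arrow.enabled' is set to true. ...")
assert re.search('Unsupported type.*\nNote: toPandas attempted Arrow optimization because', msg)
```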
