You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-22221][DOCS] Adding User Documentation for Arrow
## What changes were proposed in this pull request?
Adding user facing documentation for working with Arrow in Spark
Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19575 from BryanCutler/arrow-user-docs-SPARK-2221.
(cherry picked from commit 0d60b32)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
1738
+
`ArrayType` of `TimestampType`, and nested `StructType`.
1739
+
1740
+
### Setting Arrow Batch Size
1741
+
1742
+
Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to
1743
+
high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow
1744
+
record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch"
1745
+
to an integer that will determine the maximum number of rows for each batch. The default value is
1746
+
10,000 records per batch. If the number of columns is large, the value should be adjusted
1747
+
accordingly. Using this limit, each data partition will be made into 1 or more record batches for
1748
+
processing.
1749
+
1750
+
### Timestamp with Time Zone Semantics
1751
+
1752
+
Spark internally stores timestamps as UTC values, and timestamp data that is brought in without
1753
+
a specified time zone is converted as local time to UTC with microsecond resolution. When timestamp
1754
+
data is exported or displayed in Spark, the session time zone is used to localize the timestamp
1755
+
values. The session time zone is set with the configuration 'spark.sql.session.timeZone' and will
1756
+
default to the JVM system local time zone if not set. Pandas uses a `datetime64` type with nanosecond
1757
+
resolution, `datetime64[ns]`, with optional time zone on a per-column basis.
1758
+
1759
+
When timestamp data is transferred from Spark to Pandas it will be converted to nanoseconds
1760
+
and each column will be converted to the Spark session time zone then localized to that time
1761
+
zone, which removes the time zone and displays values as local time. This will occur
1762
+
when calling `toPandas()` or `pandas_udf` with timestamp columns.
1763
+
1764
+
When timestamp data is transferred from Pandas to Spark, it will be converted to UTC microseconds. This
1765
+
occurs when calling `createDataFrame` with a Pandas DataFrame or when returning a timestamp from a
1766
+
`pandas_udf`. These conversions are done automatically to ensure Spark will have data in the
1767
+
expected format, so it is not necessary to do any of these conversions yourself. Any nanosecond
1768
+
values will be truncated.
1769
+
1770
+
Note that a standard UDF (non-Pandas) will load timestamp data as Python datetime objects, which is
1771
+
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
1772
+
working with timestamps in `pandas_udf`s to get the best performance, see
1773
+
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
1774
+
1643
1775
# Migration Guide
1644
1776
1645
1777
## Upgrading From Spark SQL 2.2 to 2.3
@@ -1788,7 +1920,7 @@ options.
1788
1920
Note that, for <b>DecimalType(38,0)*</b>, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type.
1789
1921
- In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.
1790
1922
- In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
1791
-
- In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame.
1923
+
- In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame.
1792
1924
- Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
1793
1925
- Since Spark 2.3, when all inputs are binary, `functions.concat()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`.
1794
1926
- Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.
0 commit comments