
Commit ca89dbf

Address comments
1 parent 185b7ee commit ca89dbf

4 files changed

Lines changed: 102 additions & 12 deletions


dev/lint-python

Lines changed: 3 additions & 3 deletions
```diff
@@ -197,17 +197,17 @@ function sphinx_test {
     fi

     # TODO(SPARK-32666): Install nbsphinx in Jenkins machines
-    PYTHON_HAS_NBSPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("sphinx") is not None)')
+    PYTHON_HAS_NBSPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("nbsphinx") is not None)')
     if [[ "$PYTHON_HAS_NBSPHINX" == "False" ]]; then
-        echo "$PYTHON_HAS_NBSPHINX does not have nbsphinx installed. Skipping Sphinx build for now."
+        echo "$PYTHON_EXECUTABLE does not have nbsphinx installed. Skipping Sphinx build for now."
         echo
         return
     fi

     # TODO(SPARK-32666): Install ipython in Jenkins machines
     PYTHON_HAS_IPYTHON=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("ipython") is not None)')
     if [[ "$PYTHON_HAS_IPYTHON" == "False" ]]; then
-        echo "$PYTHON_HAS_IPYTHON does not have ipython installed. Skipping Sphinx build for now."
+        echo "$PYTHON_EXECUTABLE does not have ipython installed. Skipping Sphinx build for now."
         echo
         return
     fi
```
postBuild

Lines changed: 3 additions & 0 deletions
```diff
@@ -17,5 +17,8 @@
 # limitations under the License.
 #

+# This file is used for Binder integration to install PySpark available in
+# Jupyter notebook.
+
 VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
 pip install "pyspark[sql,ml,mllib]<=$VERSION"
```

python/docs/source/conf.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -50,7 +50,8 @@
     'IPython.sphinxext.ipython_console_highlighting'
 ]

-# Links
+# Links used globally in the RST files.
+# These are defined here to allow link substitutions dynamically.
 rst_epilog = """
 .. |binder| replace:: Live Notebook
 .. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
```
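Sphinx appends `rst_epilog` to every source file, which makes the `|binder|` substitution available everywhere. The `{0}` placeholder suggests the template is filled with a branch or tag name via `str.format` elsewhere in `conf.py` (an assumption; only the template appears in this hunk). A stand-alone sketch of that substitution:

```python
# Sketch of filling the {0} placeholder; "master" is a hypothetical branch name
epilog_template = """
.. |binder| replace:: Live Notebook
.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
"""

rst_epilog = epilog_template.format("master")
print(rst_epilog)
```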

python/docs/source/getting_started/quickstart.ipynb

Lines changed: 94 additions & 8 deletions
```diff
@@ -6,7 +6,7 @@
    "source": [
     "# Quickstart\n",
     "\n",
-    "This is a short introduction and quickstart for PySpark DataFrame. PySpark DataFrame is lazily evaludated and implemented on thetop of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview). When the data is [transformed](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations), it does not actually compute but plans how to compute later. When the [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
+    "This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated. They are implemented on top of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s. When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not immediately compute the transformation but plans how to compute later. When [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
     "This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
     "\n",
     "There are also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
```
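The lazy-evaluation behaviour the rewritten cell describes, where transformations only build a plan and actions trigger computation, can be illustrated outside Spark with a plain Python generator (an analogy only, not the PySpark API):

```python
def transform(data):
    # Like a Spark transformation: creating this generator computes nothing yet
    for x in data:
        yield x * 2

plan = transform([1, 2, 3])   # no computation has happened so far
result = list(plan)           # like an action such as collect(): runs the plan
print(result)                 # [2, 4, 6]
```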
```diff
@@ -242,7 +242,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Alternatively, you can enable `spark.sql.repl.eagerEval.enabled` configuration to enable the eager evaluation of PySpark DataFrame in notebooks such as Jupyter."
+    "Alternatively, you can enable the `spark.sql.repl.eagerEval.enabled` configuration for eager evaluation of PySpark DataFrames in notebooks such as Jupyter. The number of rows to show can be controlled via the `spark.sql.repl.eagerEval.maxNumRows` configuration."
    ]
   },
   {
```
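Both options named in the cell above are standard Spark SQL configurations; in a notebook they can be set when building the session, for example (a sketch, not part of this commit):

```python
from pyspark.sql import SparkSession

# spark.sql.repl.eagerEval.enabled turns on eager HTML rendering of DataFrames;
# spark.sql.repl.eagerEval.maxNumRows caps how many rows are shown.
spark = SparkSession.builder \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.sql.repl.eagerEval.maxNumRows", 20) \
    .getOrCreate()
```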
@@ -309,7 +309,7 @@
309309
"cell_type": "markdown",
310310
"metadata": {},
311311
"source": [
312-
"Its schema and column names can be shown as below:"
312+
"You can see the DataFrame's schema and column names as follows:"
313313
]
314314
},
315315
{
```diff
@@ -392,7 +392,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`DataFrame.collect()` collects the distributed data to the driver side as Python premitive representation. Note that this can throw out-of-memory error when the dataset is too larget to fit in the driver side because it collects all the data from executors to the driver side."
+    "`DataFrame.collect()` collects the distributed data to the driver side as local data in Python. Note that this can throw an out-of-memory error when the dataset is too large to fit on the driver side, because it collects all the data from the executors to the driver."
    ]
   },
   {
```
```diff
@@ -448,7 +448,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "PySpark DataFrame also provides the conversion back to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in order to leverage pandas APIs."
+    "PySpark DataFrame also provides the conversion back to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) in order to leverage pandas APIs. Note that `toPandas` also collects all data into the driver side, which can easily cause an out-of-memory error when the data is too large to fit on the driver side."
    ]
   },
   {
```
```diff
@@ -562,7 +562,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In fact, most of column-weise operations return `Column`s."
+    "In fact, most column-wise operations return `Column`s."
    ]
   },
   {
```
```diff
@@ -685,7 +685,7 @@
    "source": [
     "## Applying a Function\n",
     "\n",
-    "PySpark supports various UDFs and APIs to allow users to execute Python native functions. See also Pandas UDFs and Pandas Function APIs in User Guide. For instance, the example below allows users to directly use the APIs in [a pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) within Python native function."
+    "PySpark supports various UDFs and APIs to allow users to execute Python native functions. See also the latest [Pandas UDFs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs) and [Pandas Function APIs](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-function-apis). For instance, the example below allows users to directly use the APIs in [a pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) within a Python native function."
    ]
   },
   {
```
```diff
@@ -918,7 +918,7 @@
    "\n",
    "CSV is straightforward and easy to use. Parquet and ORC are efficient and compact file formats to read and write faster.\n",
    "\n",
-    "There are many other data sources available in PySpark such as JDBC, text, binaryFile, Avro, etc. See also \"Spark SQL, DataFrames and Datasets Guide\" in Apache Spark documentation."
+    "There are many other data sources available in PySpark such as JDBC, text, binaryFile, Avro, etc. See also the latest [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) in Apache Spark documentation."
    ]
   },
   {
```
```diff
@@ -1063,6 +1063,92 @@
    "df.createOrReplaceTempView(\"tableA\")\n",
    "spark.sql(\"SELECT count(*) from tableA\").show()"
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In addition, UDFs can be registered and invoked in SQL out of the box:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+-----------+\n",
+      "|add_one(v1)|\n",
+      "+-----------+\n",
+      "|          2|\n",
+      "|          3|\n",
+      "|          4|\n",
+      "|          5|\n",
+      "|          6|\n",
+      "|          7|\n",
+      "|          8|\n",
+      "|          9|\n",
+      "+-----------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "@pandas_udf(\"integer\")\n",
+    "def add_one(s: pd.Series) -> pd.Series:\n",
+    "    return s + 1\n",
+    "\n",
+    "spark.udf.register(\"add_one\", add_one)\n",
+    "spark.sql(\"SELECT add_one(v1) FROM tableA\").show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "These SQL expressions can directly be mixed and used as PySpark columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+-----------+\n",
+      "|add_one(v1)|\n",
+      "+-----------+\n",
+      "|          2|\n",
+      "|          3|\n",
+      "|          4|\n",
+      "|          5|\n",
+      "|          6|\n",
+      "|          7|\n",
+      "|          8|\n",
+      "|          9|\n",
+      "+-----------+\n",
+      "\n",
+      "+--------------+\n",
+      "|(count(1) > 0)|\n",
+      "+--------------+\n",
+      "|          true|\n",
+      "+--------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from pyspark.sql.functions import expr\n",
+    "\n",
+    "df.selectExpr('add_one(v1)').show()\n",
+    "df.select(expr('count(*)') > 0).show()"
+   ]
  }
 ],
 "metadata": {
```
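The `add_one` function in the new notebook cell is ordinary vectorized pandas arithmetic that the `pandas_udf` decorator merely wraps for Spark. Outside Spark, the same logic can be tried directly (a sketch assuming pandas is installed):

```python
import pandas as pd

def add_one(s: pd.Series) -> pd.Series:
    # Element-wise addition: the same logic the pandas_udf in the cell wraps
    return s + 1

print(add_one(pd.Series([1, 2, 3])).tolist())  # [2, 3, 4]
```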
