[SPARK-32320][PYSPARK] Remove mutable default arguments #29122

Fokko · 2020-07-15T08:27:53Z

This is bad practice, and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/
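For anyone who has not run into this before, here is a minimal sketch of the failure mode. The function `append_item` is hypothetical and not part of PySpark; it only illustrates that a default value is created once, at definition time, and then shared by every call.

```python
def append_item(item, bucket=[]):  # hypothetical helper, not PySpark code
    """Append item to bucket and return the bucket."""
    # The default list was created once, when the function was defined.
    bucket.append(item)
    return bucket


first = append_item("a")
second = append_item("b")

# Both calls returned the very same list object, so the result of the
# first call was silently modified by the second call.
print(first is second)  # True
print(first)            # ['a', 'b']
```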

fokkodriesprong@Fan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):

fokkodriesprong@Fan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):

What changes were proposed in this pull request?

Removing the mutable default arguments.

Why are the changes needed?

Removing the mutable default arguments, and changing the signature to Optional[...].
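The usual replacement pattern can be sketched as follows. This is an illustration only: `build_options` is a hypothetical stand-in for functions such as `to_json(col, options=None)`, and the option keys used below are arbitrary.

```python
from typing import Dict, Optional


def build_options(options: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: return a per-call copy of the given options."""
    # None is the sentinel; a fresh dict is created on every call instead
    # of sharing a single default instance between calls.
    return {} if options is None else dict(options)


a = build_options()
a["extra"] = "1"     # mutating one result ...
b = build_options()  # ... does not leak into the next call
print(b)             # {}
```

Copying the caller's dict is a design choice made for this sketch; it also keeps the function from mutating an argument the caller still owns.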

Does this PR introduce any user-facing change?

No 👍

How was this patch tested?

Using the Flake8 bugbear code analysis plugin.

Fokko · 2020-10-20T10:26:25Z

Fixed the violations:

./python/pyspark/ml/regression.py:1743:40: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
                 quantileProbabilities=list([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]),
                                       ^
./python/pyspark/ml/regression.py:1761:41: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
                  quantileProbabilities=list([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]),
                                        ^
./python/pyspark/ml/tuning.py:511:46: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, bestModel, avgMetrics=[], subModels=None):
                                             ^
./python/pyspark/ml/tuning.py:871:53: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, bestModel, validationMetrics=[], subModels=None):
                                                    ^
./python/pyspark/resource/profile.py:35:63: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
                                                              ^
./python/pyspark/resource/profile.py:35:77: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
                                                                            ^
./python/pyspark/sql/functions.py:2479:36: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_json(col, schema, options={}):
                                   ^
./python/pyspark/sql/functions.py:2527:26: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def to_json(col, options={}):
                         ^
./python/pyspark/sql/functions.py:2567:34: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def schema_of_json(json, options={}):
                                 ^
./python/pyspark/sql/functions.py:2597:32: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def schema_of_csv(csv, options={}):
                               ^
./python/pyspark/sql/functions.py:2623:25: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def to_csv(col, options={}):
                        ^
./python/pyspark/sql/functions.py:2934:35: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_csv(col, schema, options={}):
                                  ^
./python/pyspark/sql/avro/functions.py:29:47: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_avro(data, jsonFormatSchema, options={}):
                                              ^
./dev/sparktestsupport/modules.py:34:97: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, name, dependencies, source_file_regexes, build_profile_flags=(), environ={},
                                                                                                ^
14    B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
14
1

Furthermore:

Added flake8-bugbear to the CI. This is a well known additional set of rules next to flake8: https://github.com/pycqa/flake8-bugbear
Added the B006 rule, which is the mutable arguments rule :)
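To give an idea of what a B006-style check looks for, here is a rough sketch built on the standard `ast` module. It is not how flake8-bugbear is implemented, and `find_mutable_defaults` is a hypothetical name. It only detects list, dict and set literals, whereas the plugin output above shows that calls such as `list([...])` are flagged as well.

```python
import ast

# Literal node types treated as mutable defaults in this sketch.
MUTABLE_LITERALS = (ast.List, ast.Dict, ast.Set)


def find_mutable_defaults(source):
    """Return (function name, line number) pairs for mutable literal defaults."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional defaults plus keyword-only defaults (None = no default).
            defaults = node.args.defaults + [
                d for d in node.args.kw_defaults if d is not None
            ]
            for default in defaults:
                if isinstance(default, MUTABLE_LITERALS):
                    hits.append((node.name, default.lineno))
    return hits


sample = (
    "def to_json(col, options={}):\n"
    "    pass\n"
    "def to_csv(col, options=None):\n"
    "    pass\n"
)
print(find_mutable_defaults(sample))  # [('to_json', 1)]
```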

This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ Add bugbear to check it in the CI

HyukjinKwon · 2020-10-21T00:40:41Z

ok to test

HyukjinKwon · 2020-10-21T00:41:43Z

cc @huaxingao and @zhengruifeng FYI

SparkQA · 2020-10-21T01:25:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34670/

SparkQA · 2020-10-21T01:48:57Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34670/

SparkQA · 2020-10-21T02:39:30Z

Test build #130061 has finished for PR 29122 at commit e699c49.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-21T18:52:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34710/

SparkQA · 2020-10-21T19:13:34Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34710/

SparkQA · 2020-10-21T19:15:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34712/

SparkQA · 2020-11-23T08:03:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36142/

SparkQA · 2020-11-23T08:05:02Z

Test build #131539 has finished for PR 29122 at commit 1b5a7aa.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-23T08:29:07Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36142/

zero323 · 2020-11-23T13:22:25Z

python/pyspark/sql/functions.py

-def to_json(col, options={}):
+def to_json(col, options=None):

Seems like we still have to modify a few annotations, right?

Probably something like functions.pyi.patch.txt

Good catch, I've added them

Not sure why implicit optional is still allowed 🤔
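For context on the implicit optional remark, the two annotation styles can be contrasted as below. Both functions are hypothetical stand-ins with simplified parameter types, not the actual PySpark stubs. Whether a type checker accepts the first form depends on its configuration (for mypy, the `no_implicit_optional` setting).

```python
from typing import Dict, Optional


def to_json_implicit(col: str, options: Dict[str, str] = None) -> str:
    # Implicit Optional: the annotation omits Optional although the
    # default value is None.
    return col


def to_json_explicit(col: str, options: Optional[Dict[str, str]] = None) -> str:
    # Explicit Optional: the annotation states that None is accepted.
    return col


# At runtime both behave the same; only the recorded annotations differ.
print(to_json_implicit.__annotations__["options"])
print(to_json_explicit.__annotations__["options"])
```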

SparkQA · 2020-11-24T04:30:56Z

Test build #131606 has finished for PR 29122 at commit 1b5a7aa.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-24T14:27:45Z

Test build #131658 has finished for PR 29122 at commit 0e372ca.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

shaneknapp · 2020-11-24T22:14:45Z

test this please

### What changes were proposed in this pull request?

This pull request:

- Adds the following flags to the main mypy configuration:
  - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional)
  - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional)
  - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls)

  These flags are enabled only for public API and disabled for tests and internal modules.

Additionally, this PR fixes missing annotations.

### Why are the changes needed?

The primary reason to propose these changes is to use the standard configuration used by the typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example #29122 (review).

Additionally, it will allow us to detect cases where annotations have been unintentionally omitted.

### Does this PR introduce _any_ user-facing change?

Annotations only.

### How was this patch tested?

`dev/lint-python`.

Closes #30382 from zero323/SPARK-33457.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

SparkQA · 2020-11-25T00:55:00Z

Test build #131701 has finished for PR 29122 at commit b628156.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-25T11:57:32Z

Test build #131765 has finished for PR 29122 at commit c75cd57.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ShowColumns(

zhengruifeng

@Fokko please update the description of this PR.

Fokko · 2020-12-04T10:28:14Z

@zhengruifeng done! :)

SparkQA · 2020-12-04T12:01:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36825/

SparkQA · 2020-12-04T12:23:59Z

Test build #132225 has finished for PR 29122 at commit 11f3790.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-04T12:30:13Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36825/

HyukjinKwon · 2020-12-06T03:29:52Z

Just for the record and FWIW I don't mind merging this PR. I will leave it to other committers here.

zhengruifeng · 2020-12-07T01:49:25Z

retest this please

SparkQA · 2020-12-07T03:54:03Z

Test build #132313 has finished for PR 29122 at commit 11f3790.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-12-07T06:00:02Z

retest this please

SparkQA · 2020-12-07T08:49:08Z

Test build #132338 has finished for PR 29122 at commit 11f3790.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-12-07T12:26:08Z

@zhengruifeng the tests are broken globally for an unknown reason. It's being investigated at #30645 and it's hardly related to the current change. I think it's fine to merge if the test results bother you.

zhengruifeng · 2020-12-08T01:33:04Z

@HyukjinKwon Thanks for your explanation!

zhengruifeng · 2020-12-08T01:37:06Z

Merged to master, thanks all!

probot-autolabeler bot added AVRO ML PYTHON SQL labels Jul 15, 2020

HyukjinKwon reviewed Jul 15, 2020

python/pyspark/sql/functions.py

zero323 reviewed Jul 16, 2020

python/pyspark/ml/tuning.py

ueshin reviewed Jul 16, 2020

python/pyspark/sql/functions.py

huaxingao mentioned this pull request Jul 20, 2020

[SPARK-32310][ML][PySpark][3.0] ML params default value parity #29159

Closed

Fokko force-pushed the SPARK-32320 branch from 9c31d2f to 72e8190 Compare October 20, 2020 10:23

[SPARK-32320][PYSPARK] Remove mutable default arguments

6b0f39f

Fokko force-pushed the SPARK-32320 branch from 72e8190 to 6b0f39f Compare October 20, 2020 12:22