@Fokko
Contributor

@Fokko Fokko commented Jul 15, 2020

Using mutable default arguments is bad practice and might lead to unexpected behaviour:
https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/

fokkodriesprong@Fan spark % grep -R "={}" python | grep def

python/pyspark/resource/profile.py:    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
python/pyspark/sql/functions.py:def from_json(col, schema, options={}):
python/pyspark/sql/functions.py:def to_json(col, options={}):
python/pyspark/sql/functions.py:def schema_of_json(json, options={}):
python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}):
python/pyspark/sql/functions.py:def to_csv(col, options={}):
python/pyspark/sql/functions.py:def from_csv(col, schema, options={}):
python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}):
fokkodriesprong@Fan spark % grep -R "=\[\]" python | grep def
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, avgMetrics=[], subModels=None):
python/pyspark/ml/tuning.py:    def __init__(self, bestModel, validationMetrics=[], subModels=None):
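
To make the problem with the signatures above concrete, here is a minimal sketch. `append_metric` is a hypothetical function, not part of PySpark; it only illustrates how a default list is shared between calls.

```python
# Hypothetical helper, not PySpark code: the default list is created once,
# when the function is defined, and reused by every call that omits `metrics`.
def append_metric(value, metrics=[]):
    """Append value to metrics and return the list."""
    metrics.append(value)
    return metrics


first = append_metric(0.1)
second = append_metric(0.2)

# Both names refer to the same default list, so the first result changed too.
print(first is second)  # True
print(first)            # [0.1, 0.2]
```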

What changes were proposed in this pull request?

Removing the mutable default arguments and changing the affected signatures to Optional[...].

Why are the changes needed?

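Mutable default arguments are created once, at function definition time, and every call reuses that same instance, so changes persist between calls. Replacing them with None and changing the signature to Optional[...] avoids this.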
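
A minimal sketch of the usual replacement pattern, assuming a `None` sentinel and an explicit `Optional[...]` annotation. `with_default_option` and its `example_key` entry are hypothetical and are not taken from the PySpark sources.

```python
from typing import Dict, Optional


def with_default_option(options: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: fill in one default entry and return the options."""
    if options is None:
        # A fresh dict is created on every call, so nothing is shared.
        options = {}
    options.setdefault("example_key", "example_value")
    return options


a = with_default_option()
a["extra"] = "1"          # mutating one result ...
b = with_default_option()
print(b)                  # ... does not leak into the next call
```

Callers that pass their own dict still get it back unchanged apart from the added default entry, which matches how the mutable-default version behaved for explicit arguments.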

Does this PR introduce any user-facing change?

No 👍

How was this patch tested?

Using the Flake8 bugbear code analysis plugin.
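
For readers unfamiliar with this kind of check, the following is a rough sketch of how such a lint can work, using only the standard `ast` module. It is not flake8-bugbear's implementation: `find_mutable_defaults` is a hypothetical name, and the sketch only recognises list, dict, and set literals, not calls such as `list([...])`.

```python
import ast

# Literal node types treated as mutable in this sketch.
MUTABLE_LITERALS = (ast.List, ast.Dict, ast.Set)


def find_mutable_defaults(source: str):
    """Yield (function name, line, 0-based column) for each mutable literal default."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional defaults plus keyword-only defaults (None means "no default").
            defaults = node.args.defaults + [
                d for d in node.args.kw_defaults if d is not None
            ]
            for default in defaults:
                if isinstance(default, MUTABLE_LITERALS):
                    yield node.name, default.lineno, default.col_offset


sample = "def to_json(col, options={}):\n    pass\n"
hits = list(find_mutable_defaults(sample))
print(hits)  # [('to_json', 1, 25)]
```

The column is 0-based here, whereas Flake8 reports 1-based columns.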

@Fokko
Contributor Author

Fokko commented Oct 20, 2020

Fixed the violations:

./python/pyspark/ml/regression.py:1743:40: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
                 quantileProbabilities=list([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]),
                                       ^
./python/pyspark/ml/regression.py:1761:41: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
                  quantileProbabilities=list([0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]),
                                        ^
./python/pyspark/ml/tuning.py:511:46: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, bestModel, avgMetrics=[], subModels=None):
                                             ^
./python/pyspark/ml/tuning.py:871:53: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, bestModel, validationMetrics=[], subModels=None):
                                                    ^
./python/pyspark/resource/profile.py:35:63: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
                                                              ^
./python/pyspark/resource/profile.py:35:77: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}):
                                                                            ^
./python/pyspark/sql/functions.py:2479:36: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_json(col, schema, options={}):
                                   ^
./python/pyspark/sql/functions.py:2527:26: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def to_json(col, options={}):
                         ^
./python/pyspark/sql/functions.py:2567:34: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def schema_of_json(json, options={}):
                                 ^
./python/pyspark/sql/functions.py:2597:32: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def schema_of_csv(csv, options={}):
                               ^
./python/pyspark/sql/functions.py:2623:25: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def to_csv(col, options={}):
                        ^
./python/pyspark/sql/functions.py:2934:35: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_csv(col, schema, options={}):
                                  ^
./python/pyspark/sql/avro/functions.py:29:47: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
def from_avro(data, jsonFormatSchema, options={}):
                                              ^
./dev/sparktestsupport/modules.py:34:97: B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
    def __init__(self, name, dependencies, source_file_regexes, build_profile_flags=(), environ={},
                                                                                                ^
14    B006 Do not use mutable data structures for argument defaults.  They are created during function definition time. All calls to the function reuse this one instance of that data structure, persisting changes between them.
14
1

Furthermore:

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

cc @huaxingao and @zhengruifeng FYI

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34670/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34670/

@SparkQA

SparkQA commented Oct 21, 2020

Test build #130061 has finished for PR 29122 at commit e699c49.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34710/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34710/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34712/

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36142/

@SparkQA

SparkQA commented Nov 23, 2020

Test build #131539 has finished for PR 29122 at commit 1b5a7aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36142/



-def to_json(col, options={}):
+def to_json(col, options=None):
Member

Seems like we still have to modify a few annotations, right?

Probably something like functions.pyi.patch.txt

Contributor Author

Good catch, I've added them

Contributor Author

Not sure why implicit optional is still allowed 🤔
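
For context, "implicit optional" refers to a signature whose annotation omits `Optional` even though the default is `None`. The sketch below contrasts the two spellings; `count_implicit` and `count_explicit` are hypothetical names. As far as I know, mypy rejects the first form only when its `no_implicit_optional` option is in effect.

```python
from typing import Dict, Optional


# Implicit Optional: the annotation says Dict, yet the default is None.
def count_implicit(options: Dict[str, str] = None) -> int:
    return len(options or {})


# Explicit Optional: the annotation admits that None is a valid value.
def count_explicit(options: Optional[Dict[str, str]] = None) -> int:
    return len(options or {})


# Both behave identically at runtime; only the type checker sees a difference.
print(count_implicit(), count_explicit({"a": "b"}))
```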

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131606 has finished for PR 29122 at commit 1b5a7aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131658 has finished for PR 29122 at commit 0e372ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

test this please

HyukjinKwon pushed a commit that referenced this pull request Nov 25, 2020
### What changes were proposed in this pull request?

This pull request:

- Adds the following flags to the main mypy configuration:
  - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional)
  - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional)
  - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_defs)

These flags are enabled only for the public API and are disabled for tests and internal modules.

Additionally, this PR fixes missing annotations.
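
As a rough illustration of what two of these flags are meant to catch, consider the sketch below. All function names are hypothetical, and the comments describe mypy's behaviour as I understand it rather than output from an actual run.

```python
from typing import Dict, Optional


def lookup(options: Dict[str, str], key: str) -> Optional[str]:
    """Return the option value, or None when the key is missing."""
    return options.get(key)


def option_length(options: Dict[str, str], key: str) -> int:
    value = lookup(options, key)
    # With strict_optional, mypy expects this None check before len(value).
    if value is None:
        return 0
    return len(value)


# With disallow_untyped_defs, mypy would report a definition like this one,
# because it carries no annotations at all.
def untyped_lookup(options, key):
    return options.get(key)


print(option_length({"mode": "fast"}, "mode"), option_length({}, "mode"))
```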

### Why are the changes needed?

The primary reason for proposing these changes is to use the standard configuration used by the typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example #29122 (review)

Additionally, it will allow us to detect cases where annotations have been unintentionally omitted.

### Does this PR introduce _any_ user-facing change?

Annotations only.

### How was this patch tested?

`dev/lint-python`.

Closes #30382 from zero323/SPARK-33457.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131701 has finished for PR 29122 at commit b628156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131765 has finished for PR 29122 at commit c75cd57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ShowColumns(

Contributor

@zhengruifeng zhengruifeng left a comment

@Fokko please update the description of this PR.

@Fokko
Contributor Author

Fokko commented Dec 4, 2020

@zhengruifeng done! :)

@SparkQA

SparkQA commented Dec 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36825/

@SparkQA

SparkQA commented Dec 4, 2020

Test build #132225 has finished for PR 29122 at commit 11f3790.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36825/

@HyukjinKwon
Member

Just for the record and FWIW I don't mind merging this PR. I will leave it to other committers here.

@zhengruifeng
Contributor

retest this please

@SparkQA

SparkQA commented Dec 7, 2020

Test build #132313 has finished for PR 29122 at commit 11f3790.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

retest this please

@SparkQA

SparkQA commented Dec 7, 2020

Test build #132338 has finished for PR 29122 at commit 11f3790.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@zhengruifeng the tests are broken globally for an unknown reason. It's being investigated at #30645 and it's hardly related to the current change. I think it's fine to merge if the test results bother you.

@zhengruifeng
Contributor

@HyukjinKwon Thanks for your explanation!

@zhengruifeng
Contributor

Merged to master, thanks all!
