-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None #16225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2653750
2eac8b9
7949292
0b15c8f
2c532c3
5ab39cc
43fb6bd
b5424d9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1336,8 +1336,8 @@ def replace(self, to_replace, value=None, subset=None): | |
| Value to be replaced. | ||
| If the value is a dict, then `value` is ignored and `to_replace` must be a | ||
| mapping between a value and a replacement. | ||
| :param value: int, long, float, string, or list. | ||
| The replacement value must be an int, long, float, or string. If `value` is a | ||
| :param value: int, long, float, string, list or None. | ||
| The replacement value must be an int, long, float, string or None. If `value` is a | ||
| list, `value` should be of the same length and type as `to_replace`. | ||
| If `value` is a scalar and `to_replace` is a sequence, then `value` is | ||
| used as a replacement for each item in `to_replace`. | ||
|
|
@@ -1356,6 +1356,16 @@ def replace(self, to_replace, value=None, subset=None): | |
| |null| null| null| | ||
| +----+------+-----+ | ||
|
|
||
| >>> df4.na.replace('Alice', None).show() | ||
| +----+------+----+ | ||
| | age|height|name| | ||
| +----+------+----+ | ||
| | 10| 80|null| | ||
| | 5| null| Bob| | ||
| |null| null| Tom| | ||
| |null| null|null| | ||
| +----+------+----+ | ||
|
|
||
| >>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show() | ||
| +----+------+----+ | ||
| | age|height|name| | ||
|
|
@@ -1391,9 +1401,10 @@ def all_of_(xs): | |
| "to_replace should be a float, int, long, string, list, tuple, or dict. " | ||
| "Got {0}".format(type(to_replace))) | ||
|
|
||
| if not isinstance(value, valid_types) and not isinstance(to_replace, dict): | ||
| if not isinstance(value, valid_types) and value is not None \ | ||
| and not isinstance(to_replace, dict): | ||
| raise ValueError("If to_replace is not a dict, value should be " | ||
| "a float, int, long, string, list, or tuple. " | ||
| "a float, int, long, string, list, tuple or None. " | ||
| "Got {0}".format(type(value))) | ||
|
|
||
| if isinstance(to_replace, (list, tuple)) and isinstance(value, (list, tuple)): | ||
|
|
@@ -1409,7 +1420,7 @@ def all_of_(xs): | |
| if isinstance(to_replace, (float, int, long, basestring)): | ||
| to_replace = [to_replace] | ||
|
|
||
| if isinstance(value, (float, int, long, basestring)): | ||
| if isinstance(value, (float, int, long, basestring)) or value is None: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same as above |
||
| value = [value for _ in range(len(to_replace))] | ||
|
|
||
| if isinstance(to_replace, dict): | ||
|
|
@@ -1423,7 +1434,9 @@ def all_of_(xs): | |
| subset = [subset] | ||
|
|
||
| # Verify we were not passed in mixed type generics." | ||
| if not any(all_of_type(rep_dict.keys()) and all_of_type(rep_dict.values()) | ||
| if not any(all_of_type(rep_dict.keys()) | ||
| and (all_of_type(rep_dict.values()) | ||
| or list(rep_dict.values()).count(None) == len(rep_dict)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the Scala code null is allowed in to be the replacement value for some of the elements but not in the Python code. Is this intentional? If so we should document it clearly and expand on the error message bellow (otherwise we should make it more flexible). |
||
| for all_of_type in [all_of_bool, all_of_str, all_of_numeric]): | ||
| raise ValueError("Mixed type replacements are not supported") | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So slightly jet lagged style question: would this be clearer if we just add type(None) to L1398? I know PEP8 says we should only use
isis notfor checking if something is none rather than depending on the implicit conversion to boolean -- but since really checking the type here we aren't really in danger of that. (This is just a suggestion to make it easier to read - if others think its easier to read this way thats fine :)). @davies ?