-
Notifications
You must be signed in to change notification settings - Fork 29k
StatCounter on NumPy arrays [PYSPARK][SPARK-2012] #1725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- If NumPy is installed, use maximum/minimum/sqry so that StatCounters work on NumPy arrays - Otherwise, fall back on scalar operators
|
QA tests have started for PR 1725. This patch merges cleanly. |
python/pyspark/statcounter.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to have ImportError here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about do in this way:
try:
from numpy import maximum, minimum, sqrt
except ImportError:
maximum = max
minimum = min
sqrt = math.sqrt
This will simplify later codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is much better, updating the PR now...
|
Thanks for contributing this patch, it will be cool to merge it in 1.1 release. PS: code freeze will be happen tonight:) |
|
QA results for PR 1725: |
- Fall back using ImportError - Assign functions to avoid conditionals
|
QA tests have started for PR 1725. This patch merges cleanly. |
|
QA results for PR 1725: |
python/pyspark/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just try to import numpy, this array will overwrite array.array, make other unit tests fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies thanks, good catch, should be fixed now!
|
lgtm @JoshRosen Could you help to take a look at this? |
|
This looks good. At first, I was concerned that element-wise operations might change behavior for calling >>> from numpy import maximum
>>> maximum([1, 0, 1], [4, -1, 4])
array([4, 0, 4])
>>> max([1, 0, 1], [4, -1, 4])
[4, -1, 4]I ran the PySpark tests locally and they passed, so I've merged this. Thanks Jeremy! |
|
@JoshRosen @davies great, thanks guys! |
These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012). If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy. New unit tests added, along with a check for NumPy in the tests. Author: Jeremy Freeman <the.freeman.lab@gmail.com> Closes apache#1725 from freeman-lab/numpy-max-statcounter and squashes the following commits: fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests 7f0e397 [Jeremy Freeman] Refactored check for numpy 8e764dd [Jeremy Freeman] Explicit numpy imports 875414c [Jeremy Freeman] Fixed indents 1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays 176a127 [Jeremy Freeman] Use numpy arrays in StatCounter
…1727) This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson. This is the same change as apache#1725 and apache#1726
This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson.
These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012).
If NumPy is installed, the NumPy functions
maximum,minimum, andsqrt, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy.New unit tests added, along with a check for NumPy in the tests.