@freeman-lab
Contributor

These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012).

If NumPy is installed, the NumPy functions maximum, minimum, and sqrt, which operate element-wise on arrays, are used to merge statistics. If not, we fall back on the scalar operators, so StatCounters work on arrays when NumPy is available and still work on scalars without it.

New unit tests added, along with a check for NumPy in the tests.
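
To make the merging step concrete, here is a minimal sketch of a running-statistics class that relies only on `maximum`, `minimum`, and `sqrt`. The class `RunningStats` and its attribute names are hypothetical stand-ins for illustration, not PySpark's actual `StatCounter` code. It uses the standard online update and pairwise merge formulas for mean and variance.

```python
import math

try:
    # Element-wise versions that accept arrays as well as scalars.
    from numpy import maximum, minimum, sqrt
except ImportError:
    # Scalar fallbacks when NumPy is not installed.
    maximum, minimum, sqrt = max, min, math.sqrt


class RunningStats:
    """Hypothetical stand-in for a stat counter; not PySpark's StatCounter."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.max_value = float("-inf")
        self.min_value = float("inf")

    def add(self, value):
        # Online update; every operation also works element-wise on arrays.
        self.n += 1
        delta = value - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (value - self.mean)
        self.max_value = maximum(self.max_value, value)
        self.min_value = minimum(self.min_value, value)
        return self

    def merge(self, other):
        # Combine two partial results computed on separate chunks of data.
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            self.max_value, self.min_value = other.max_value, other.min_value
            return self
        total = self.n + other.n
        delta = other.mean - self.mean
        self.mean = self.mean + delta * other.n / total
        self.m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / total
        self.n = total
        self.max_value = maximum(self.max_value, other.max_value)
        self.min_value = minimum(self.min_value, other.min_value)
        return self

    def stdev(self):
        # Population standard deviation.
        return sqrt(self.m2 / self.n)


left = RunningStats().add(1.0).add(2.0)
right = RunningStats().add(3.0).add(4.0)
merged = left.merge(right)
```

With NumPy installed, the same class should accept arrays of equal shape in `add`, because every arithmetic step and the three imported functions apply per element.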

- If NumPy is installed, use maximum/minimum/sqrt so that StatCounters
work on NumPy arrays
- Otherwise, fall back on scalar operators
@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1725. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull

Contributor

It's better to catch ImportError here.

Contributor

How about doing it this way:

import math

try:
    from numpy import maximum, minimum, sqrt
except ImportError:
    maximum = max
    minimum = min
    sqrt = math.sqrt

This will simplify the later code.
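
As a sketch of why binding the names once simplifies later code: any function that uses `maximum`, `minimum`, or `sqrt` no longer needs a NumPy check of its own. The helper `merge_extremes` below is hypothetical and exists only to illustrate the pattern.

```python
import math

try:
    # NumPy versions operate element-wise on arrays and also accept scalars.
    from numpy import maximum, minimum, sqrt
except ImportError:
    # Scalar fallbacks from the standard library.
    maximum = max
    minimum = min
    sqrt = math.sqrt


def merge_extremes(a, b):
    """Hypothetical helper: combine two (min, max) pairs without branching."""
    return minimum(a[0], b[0]), maximum(a[1], b[1])


lo, hi = merge_extremes((1.0, 5.0), (-2.0, 3.0))
root = sqrt(16.0)
```

The trade-off is that the fallback names only handle scalars, so array inputs still require NumPy to be installed.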

Contributor Author

Nice! This is much better, updating the PR now...

@davies
Contributor

davies commented Aug 1, 2014

Thanks for contributing this patch; it would be great to merge it into the 1.1 release.

PS: the code freeze happens tonight :)

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1725:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull

- Fall back using ImportError
- Assign functions to avoid conditionals
@SparkQA

SparkQA commented Aug 2, 2014

QA tests have started for PR 1725. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull

@SparkQA

SparkQA commented Aug 2, 2014

QA results for PR 1725:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull

Contributor

Just try to import numpy here; this array import will overwrite array.array and make other unit tests fail.

Contributor Author

@davies thanks, good catch, should be fixed now!
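
The shadowing problem raised in the review can be avoided by probing for NumPy with a plain module import. The following is a minimal sketch of that idea, not the exact code in the PR; the flag name `have_numpy` is an assumption made for illustration.

```python
import array as array_module
from array import array  # `array` is bound to the standard-library type

# Probe for NumPy by importing the module only. Unlike
# `from numpy import array`, this does not rebind the name `array`.
try:
    import numpy
    have_numpy = True
except ImportError:
    have_numpy = False

# The stdlib binding is intact whether or not NumPy is installed.
still_stdlib = array is array_module.array
values = array("i", [1, 2, 3])
```

Tests that need NumPy can then be guarded by the flag, while tests written against `array.array` keep seeing the standard-library type.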

@davies
Contributor

davies commented Aug 2, 2014

lgtm

@JoshRosen Could you take a look at this?

@JoshRosen
Contributor

This looks good. At first, I was concerned that element-wise operations might change behavior for calling stats() on an RDD of Python lists of numbers (sc.parallelize([[1, 0, 1], [4, -1, 4]]).stats()), but that currently crashes in Spark 1.0, so this patch won't change users' results.

>>> from numpy import maximum
>>> maximum([1, 0, 1], [4, -1, 4])
array([4, 0, 4])
>>> max([1, 0, 1], [4, -1, 4])
[4, -1, 4]

I ran the PySpark tests locally and they passed, so I've merged this. Thanks Jeremy!
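
The two results in the session above differ because the built-in `max` compares whole lists lexicographically and returns one of them unchanged, whereas an element-wise maximum compares items position by position. A small NumPy-free sketch of the distinction, with variable names chosen only for illustration:

```python
a = [1, 0, 1]
b = [4, -1, 4]

# Built-in max compares the two lists as whole sequences (lexicographically),
# so it returns one of its arguments unchanged.
whole_list_max = max(a, b)

# Element-wise maximum written without NumPy, pairing up items by position.
elementwise_max = [max(x, y) for x, y in zip(a, b)]

print(whole_list_max)   # [4, -1, 4]
print(elementwise_max)  # [4, 0, 4]
```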

@freeman-lab
Contributor Author

@JoshRosen @davies great, thanks guys!

@asfgit asfgit closed this in 4bc3bb2 Aug 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here  (https://issues.apache.org/jira/browse/SPARK-2012).

If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy.

New unit tests added, along with a check for NumPy in the tests.

Author: Jeremy Freeman <the.freeman.lab@gmail.com>

Closes apache#1725 from freeman-lab/numpy-max-statcounter and squashes the following commits:

fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests
7f0e397 [Jeremy Freeman] Refactored check for numpy
8e764dd [Jeremy Freeman] Explicit numpy imports
875414c [Jeremy Freeman] Fixed indents
1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays
176a127 [Jeremy Freeman] Use numpy arrays in StatCounter
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…1727)

This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson.
This is the same change as apache#1725 and apache#1726
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson.