StatCounter on NumPy arrays [PYSPARK][SPARK-2012] #1725

freeman-lab · 2014-08-01T23:08:59Z

These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012).

If NumPy is installed, the NumPy functions maximum, minimum, and sqrt, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy.

New unit tests added, along with a check for NumPy in the tests.

- If NumPy is installed, use maximum/minimum/sqrt so that StatCounters work on NumPy arrays
- Otherwise, fall back on scalar operators
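As a rough illustration of the description above, here is a minimal sketch of how a running-statistics counter can merge values with `maximum`, `minimum`, and `sqrt` so that the same code handles scalars and NumPy arrays. `MiniStatCounter` and its attribute names are hypothetical and simplified; this is not the code in python/pyspark/statcounter.py, and the import fallback follows the pattern suggested later in this thread.

```python
import math

try:
    # Element-wise versions: these accept scalars and NumPy arrays.
    from numpy import maximum, minimum, sqrt
except ImportError:
    # Scalar fallbacks when NumPy is not installed.
    maximum = max
    minimum = min
    sqrt = math.sqrt


class MiniStatCounter(object):
    """Hypothetical, simplified running-statistics counter (illustration only)."""

    def __init__(self):
        self.n = 0                         # number of values merged so far
        self.mu = 0.0                      # running mean
        self.m2 = 0.0                      # running sum of squared deviations
        self.max_value = float("-inf")
        self.min_value = float("inf")

    def merge(self, value):
        # Welford-style update; every operation is valid for scalars
        # and, when NumPy is available, element-wise for arrays.
        delta = value - self.mu
        self.n += 1
        self.mu = self.mu + delta / self.n
        self.m2 = self.m2 + delta * (value - self.mu)
        self.max_value = maximum(self.max_value, value)
        self.min_value = minimum(self.min_value, value)
        return self

    def stdev(self):
        # Population standard deviation; NaN if nothing was merged.
        if self.n == 0:
            return float("nan")
        return sqrt(self.m2 / self.n)


counter = MiniStatCounter()
for v in [1.0, 2.0, 3.0, 4.0]:
    counter.merge(v)
```

With NumPy installed, merging arrays of equal shape through the same `merge` method would track the statistics per element, which is the behavior the description is aiming for. Without NumPy, only scalar values are supported by this sketch.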

SparkQA · 2014-08-01T23:14:41Z

QA tests have started for PR 1725. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull

davies · 2014-08-01T23:50:39Z

python/pyspark/statcounter.py

It's better to catch ImportError here.

How about doing it this way:

    import math

    try:
        from numpy import maximum, minimum, sqrt
    except ImportError:
        maximum = max
        minimum = min
        sqrt = math.sqrt

This will simplify the later code.

Nice! This is much better, updating the PR now...

davies · 2014-08-01T23:57:46Z

Thanks for contributing this patch, it would be cool to merge it in the 1.1 release.

PS: the code freeze will happen tonight :)

SparkQA · 2014-08-02T00:16:04Z

QA results for PR 1725:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull

- Fall back using ImportError
- Assign functions to avoid conditionals

SparkQA · 2014-08-02T01:59:15Z

QA tests have started for PR 1725. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull

SparkQA · 2014-08-02T03:02:16Z

QA results for PR 1725:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull

davies · 2014-08-02T03:41:32Z

python/pyspark/tests.py

Just try to import numpy here. Importing `array` from numpy overwrites `array.array` and makes other unit tests fail.

@davies thanks, good catch, should be fixed now!

davies · 2014-08-02T05:07:20Z

lgtm

@JoshRosen Could you help to take a look at this?

JoshRosen · 2014-08-02T05:33:58Z

This looks good. At first, I was concerned that element-wise operations might change behavior for calling stats() on an RDD of Python lists of numbers (sc.parallelize([[1, 0, 1], [4, -1, 4]]).stats()), but that currently crashes in Spark 1.0, so this patch won't change users' results.

>>> from numpy import maximum
>>> maximum([1, 0, 1], [4, -1, 4])
array([4, 0, 4])
>>> max([1, 0, 1], [4, -1, 4])
[4, -1, 4]

I ran the PySpark tests locally and they passed, so I've merged this. Thanks Jeremy!

freeman-lab · 2014-08-02T05:39:36Z

@JoshRosen @davies great, thanks guys!

These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here (https://issues.apache.org/jira/browse/SPARK-2012).

If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so it will work on arrays with NumPy, but will also work without NumPy.

New unit tests added, along with a check for NumPy in the tests.

Author: Jeremy Freeman <the.freeman.lab@gmail.com>

Closes apache#1725 from freeman-lab/numpy-max-statcounter and squashes the following commits:

fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests
7f0e397 [Jeremy Freeman] Refactored check for numpy
8e764dd [Jeremy Freeman] Explicit numpy imports
875414c [Jeremy Freeman] Fixed indents
1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays
176a127 [Jeremy Freeman] Use numpy arrays in StatCounter

…1727) This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson. This is the same change as apache#1725 and apache#1726

freeman-lab added 4 commits August 1, 2014 18:47

Use numpy arrays in StatCounter

176a127

- If NumPy is installed, use maximum/minimum/sqrt so that StatCounters work on NumPy arrays
- Otherwise, fall back on scalar operators

Unit tests for StatCounter with NumPy arrays

1c8a832

Fixed indents

875414c

Explicit numpy imports

8e764dd

davies reviewed Aug 1, 2014

Refactored check for numpy

7f0e397

- Fall back using ImportError
- Assign functions to avoid conditionals

davies reviewed Aug 2, 2014

Avoid duplicate array import in tests

fe973b1

asfgit closed this in 4bc3bb2 Aug 2, 2014

freeman-lab mentioned this pull request Aug 23, 2014

data methods not working? thunder-project/thunder#16

Closed

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

rdar://107596423 Add BosonSort handling in JoinSuite (apache#1725)

bcea387

This PR adds BosonSort handling in JoinSuite so that the tests will pass with Boson.
