Conversation

@kamalbanga kamalbanga commented Aug 14, 2019

No description provided.

@HyukjinKwon

Can you show the benchmark? This is a critical path and performance is prioritized. Also, please file a JIRA; see https://spark.apache.org/contributing.html

@dongjoon-hyun changed the title from "Simpler countByValue using collections' Counter" to "[PYSPARK] Simpler countByValue using collections' Counter" on Aug 14, 2019

srowen commented Aug 15, 2019

This also needs a JIRA. #25429

SparkQA commented Aug 15, 2019

Test build #4830 has finished for PR 25449 at commit 08ea5a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kamalbanga commented Aug 16, 2019

I benchmarked it, and the existing implementation is faster 🤦‍♂

from pyspark import SparkContext, SparkConf
from random import choices
from collections import defaultdict, Counter
from contextlib import contextmanager
from operator import add
import time

MAX_NUM = int(1e9)
RDD_SIZE = int(1e7)
NUM_PARTITIONS = 4

@contextmanager
def timethis(snippet):
    start = time.time()
    yield
    print(f'Duration of {snippet}: {(time.time() - start):.1f} seconds')

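# Proposed approach (this PR): count a partition's elements with collections.Counter.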
def countval1(iterator):
    yield Counter(iterator)

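# Existing approach: the defaultdict counting loop currently used by countByValue.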
def countval2(iterator):
    counts = defaultdict(int)
    for k in iterator:
        counts[k] += 1
    yield counts

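# Combiner from the existing implementation: fold m1's counts into m2.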
def mergeMaps(m1, m2):
    for k, v in m1.items():
        m2[k] += v
    return m2


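# Draw RDD_SIZE integers, with replacement, from [0, MAX_NUM).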
random_integers = choices(population=range(MAX_NUM), k=RDD_SIZE)

sc = SparkContext(conf=SparkConf().setAppName('Benchmark'))
random_rdd = sc.parallelize(random_integers, NUM_PARTITIONS)

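# Counters are merged with operator.add, since Counter supports +.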
with timethis('Spark Counter'):
    agg1 = random_rdd.mapPartitions(countval1).reduce(add)

with timethis('Spark defaultdict'):
    agg2 = random_rdd.mapPartitions(countval2).reduce(mergeMaps)

@srowen srowen left a comment

OK thanks for checking. Sounds like we shouldn't make this change.
If you're into optimizing this... I wonder if it helps to check whether m1 or m2 is larger in mergeMaps, and iterate over the smaller one only? If that's better you can reopen with that change.

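For reference, a minimal sketch of that idea (mergeMapsSmaller is an illustrative name, not code from this PR), assuming both arguments are the per-partition defaultdicts yielded by countval2:

def mergeMapsSmaller(m1, m2):
    # Swap so the loop iterates over whichever map has fewer keys;
    # both inputs are per-partition defaultdict(int)s, so += on a
    # missing key still works after the swap.
    if len(m1) > len(m2):
        m1, m2 = m2, m1
    for k, v in m1.items():
        m2[k] += v
    return m2

Swapping this in for mergeMaps in the benchmark above would show whether the size check pays for itself.
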
@srowen srowen closed this Aug 16, 2019