[SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating #1822

guowei2 · 2014-08-07T02:10:16Z

Use ExternalAppendOnlyMap to avoid OOM errors when aggregating.
The "spark.shuffle.spill" setting turns this behavior on or off.
Hive UDAFs are not supported yet, because the UDAF needs to be Serializable.

Join has the same problem, but using ExternalAppendOnlyMap the way CoGroupedRDD does seems to reduce performance. I am trying another approach that also uses ExternalAppendOnlyMap, but it needs testing. I will commit it in another batch.
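
For readers unfamiliar with the approach, here is a minimal plain-Python sketch of the general spill-to-disk aggregation idea that ExternalAppendOnlyMap is built around. It is not Spark's implementation: the class name `SpillableAggregator`, the `max_keys_in_memory` threshold, and the `spill_enabled` flag (standing in for "spark.shuffle.spill") are all hypothetical. It also assumes keys are orderable and that a single associative function can merge both raw values and partial results.

```python
import heapq
import os
import pickle
import tempfile
from functools import reduce
from itertools import groupby
from operator import itemgetter


class SpillableAggregator:
    """Toy illustration of spill-to-disk aggregation (hypothetical, not Spark code)."""

    def __init__(self, merge_value, max_keys_in_memory=1000, spill_enabled=True):
        self.merge_value = merge_value          # assumed associative, e.g. addition
        self.max_keys = max_keys_in_memory      # hypothetical memory budget, in keys
        self.spill_enabled = spill_enabled      # stands in for "spark.shuffle.spill"
        self.current = {}                       # in-memory partial aggregates
        self.spill_files = []                   # paths of sorted runs on disk

    def insert(self, key, value):
        if key in self.current:
            self.current[key] = self.merge_value(self.current[key], value)
            return
        # A new key would exceed the budget: flush the map to disk first.
        if self.spill_enabled and len(self.current) >= self.max_keys:
            self._spill()
        self.current[key] = value

    def _spill(self):
        """Write the in-memory map to a temp file, sorted by key, then clear it."""
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            for item in sorted(self.current.items(), key=itemgetter(0)):
                pickle.dump(item, f)
        self.spill_files.append(path)
        self.current = {}

    @staticmethod
    def _read_run(path):
        """Stream (key, partial) pairs back from one spilled run."""
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def results(self):
        """Merge all sorted runs with the in-memory map, combining equal keys."""
        runs = [self._read_run(p) for p in self.spill_files]
        runs.append(iter(sorted(self.current.items(), key=itemgetter(0))))
        merged = heapq.merge(*runs, key=itemgetter(0))
        try:
            for key, group in groupby(merged, key=itemgetter(0)):
                yield key, reduce(self.merge_value, (v for _, v in group))
        finally:
            for path in self.spill_files:
                os.remove(path)
            self.spill_files = []


# Usage: a SUM-style aggregation that is forced to spill after two distinct keys.
agg = SpillableAggregator(lambda a, b: a + b, max_keys_in_memory=2)
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5), ("d", 6)]:
    agg.insert(k, v)
print(list(agg.results()))  # [('a', 5), ('b', 7), ('c', 3), ('d', 6)]
```

With `spill_enabled=False` the same class keeps everything in one dictionary, which is the behavior that can run out of memory when the number of distinct grouping keys is large.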

AmplabJenkins · 2014-08-07T02:12:51Z

Can one of the admins verify this patch?

… partition fix bug of countApproxDistinct() when have more than one partition Author: Davies Liu <[email protected]> Closes #1812 from davies/approx and squashes the following commits: bf757ce [Davies Liu] fix bug of countApproxDistinct() when have more than one partition

Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs). Added factory classes for Algo and Impurity, but made private[mllib]. CC: mengxr dorx Please let me know if there are other changes which would help with API consistency---thanks! Author: Joseph K. Bradley <[email protected]> Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits: 6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD. ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types) 00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency fe6dbfa [Joseph K. Bradley] removed unnecessary imports e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs). c699850 [Joseph K. Bradley] a few doc comments eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters

pwendell · 2014-08-07T06:29:02Z

@guowei2 can you add [SQL] to the title here so it gets sorted correctly? Thanks!

… repos .. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.) Author: Sean Owen <[email protected]> Closes #1828 from srowen/SPARK-2879.2 and squashes the following commits: 639f495 [Sean Owen] .. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.)

Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters. CC mengxr Author: Joseph K. Bradley <[email protected]> Closes #1821 from jkbradley/dt-robustness and squashes the following commits: 4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness 7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters

This is part of SPARK-2828: 1. separate IDF model from IDF algorithm (which generates a model) 2. separate StandardScaler model from StandardScaler CC: dbtsai Author: Xiangrui Meng <[email protected]> Closes #1814 from mengxr/feature-api-update and squashes the following commits: 40d863b [Xiangrui Meng] move mean and variance to model 48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm 89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm

Author: Oleg Danilov <[email protected]> Closes #1835 from dosoft/SPARK-2905 and squashes the following commits: 4df423c [Oleg Danilov] SPARK-2905 Fixed path sbin => bin

The reason for this bug was the introduction of the OldDeps project. It had to be excluded to prevent unidocs from trying to put it on the "docs compile" classpath. Author: Prashant Sharma <[email protected]> Closes #1830 from ScrapCodes/doc-fix and squashes the following commits: e5d52e6 [Prashant Sharma] SPARK-2899 Doc generation is back to working in new SBT Build.

… no sorting/aggregation and # partitions is small As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file. On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases. Author: Matei Zaharia <[email protected]> Closes #1799 from mateiz/SPARK-2787 and squashes the following commits: 88cf26a [Matei Zaharia] Fix rebase 10233af [Matei Zaharia] Review comments 398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them d0ae3c5 [Matei Zaharia] Fix some comments 90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests 31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
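
The bypass path described in the commit message above can be illustrated with a small plain-Python sketch: each record is serialized once into a per-partition file, and the files are then concatenated byte for byte into a single output ordered by partition. This is only an illustration under assumptions, not the ExternalSorter code; the function names `write_partitioned` and `read_partition`, the pickle format, and the returned list of segment lengths are all hypothetical.

```python
import io
import os
import pickle
import shutil
import tempfile


def write_partitioned(records, num_partitions, partitioner, out_path):
    """Write records to one temp file per partition, then concatenate them.

    Returns the byte length of each partition's segment in out_path.
    """
    tmp_dir = tempfile.mkdtemp()
    paths = [os.path.join(tmp_dir, f"part-{i}") for i in range(num_partitions)]
    files = [open(p, "wb") for p in paths]
    try:
        for key, value in records:
            # Each record is serialized exactly once, straight to its partition.
            pickle.dump((key, value), files[partitioner(key)])
    finally:
        for f in files:
            f.close()

    lengths = []
    with open(out_path, "wb") as out:
        for p in paths:
            lengths.append(os.path.getsize(p))
            with open(p, "rb") as part:
                shutil.copyfileobj(part, out)  # raw byte copy, no deserialization
            os.remove(p)
    os.rmdir(tmp_dir)
    return lengths


def read_partition(out_path, lengths, index):
    """Read back the records of one partition using the segment lengths."""
    with open(out_path, "rb") as f:
        f.seek(sum(lengths[:index]))
        data = f.read(lengths[index])
    buf = io.BytesIO(data)
    records = []
    while buf.tell() < len(data):
        records.append(pickle.load(buf))
    return records


# Usage: six records hashed into three partitions by key modulo 3.
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, "shuffle.data")
seg_lengths = write_partitioned(
    [(k, k * 10) for k in range(6)], 3, lambda k: k % 3, out_file
)
print(read_partition(out_file, seg_lengths, 1))  # [(1, 10), (4, 40)]
shutil.rmtree(out_dir)
```

The trade-off is one open file per output partition, which is why the commit message limits this path to cases where the number of partitions is small.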

Author: Sandy Ryza <[email protected]> Closes #1507 from sryza/sandy-spark-2565 and squashes the following commits: 74dad41 [Sandy Ryza] SPARK-2565. Update ShuffleReadMetrics as blocks are fetched

Author: Kousuke Saruta <[email protected]> Closes #1834 from sarutak/SPARK-2904 and squashes the following commits: 38e7d45 [Kousuke Saruta] Removed non-used variable in SparkSubmitArguments

Author: Erik Erlandson <[email protected]> Closes #1841 from erikerlandson/spark-2911-pr and squashes the following commits: 4699e2f [Erik Erlandson] [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD

JIRA: https://issues.apache.org/jira/browse/SPARK-2888 Author: Yin Huai <[email protected]> Closes #1817 from yhuai/fixAddColumnMetadataToConf and squashes the following commits: fba728c [Yin Huai] Fix addColumnMetadataToConf.

marmbrus · 2014-08-08T18:04:12Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/AggregatesSuite.scala

Add a space after `:`. Can you add some Scaladoc about what this is checking?

…nto sql-memory-patch Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala

mengxr Remove transform(dataset: RDD[String]) from public API. Author: Liquan Pei <[email protected]> Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits: 17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API

guowei2 · 2014-08-18T13:10:57Z

I'm very sorry. I just rebased my branch onto spark/master. What should I do to fix this?

guowei2 · 2014-08-18T13:25:05Z

May I close this PR and create a new PR?

witgo · 2014-08-18T13:30:11Z

Try this:

git commit -m "Big-ass commit" --allow-empty
git rebase -i master
git push origin sql-memory-patch -f

guowei2 · 2014-08-19T01:56:40Z

I have to close this PR and open a new one.

guowei added 11 commits July 3, 2014 12:38

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

749b632

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

e1f9978

DStream run tasks only when dstream has partition items

b03ad14

DStream run tasks only when dstream has partition items

290b1a1

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

87627e7

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

f889700

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

21b5735

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

d2be832

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

e3a88b1

Merge branch 'sql-memory-patch' of https://github.com/guowei2/spark i…

2a4786a

…nto sql-memory-patch

[SPARK-2873] use ExternalAppendOnlyMap to resolve aggregate's OOM

475da9d

davies and others added 2 commits August 6, 2014 21:22

srowen and others added 2 commits August 7, 2014 00:04

guowei2 changed the title ~~[SPARK-2873] using ExternalAppendOnlyMap to resolve OOM when aggregating~~ [SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating Aug 7, 2014

mengxr and others added 8 commits August 7, 2014 11:28

SPARK-2905 Fixed path sbin => bin

80ec5ba

Author: Oleg Danilov <[email protected]> Closes #1835 from dosoft/SPARK-2905 and squashes the following commits: 4df423c [Oleg Danilov] SPARK-2905 Fixed path sbin => bin

SPARK-2565. Update ShuffleReadMetrics as blocks are fetched

4c51098

Author: Sandy Ryza <[email protected]> Closes #1507 from sryza/sandy-spark-2565 and squashes the following commits: 74dad41 [Sandy Ryza] SPARK-2565. Update ShuffleReadMetrics as blocks are fetched

[SPARK-2904] Remove non-used local variable in SparkSubmitArguments

9de6a42

Author: Kousuke Saruta <[email protected]> Closes #1834 from sarutak/SPARK-2904 and squashes the following commits: 38e7d45 [Kousuke Saruta] Removed non-used variable in SparkSubmitArguments

[SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD

9a54de1

Author: Erik Erlandson <[email protected]> Closes #1841 from erikerlandson/spark-2911-pr and squashes the following commits: 4699e2f [Erik Erlandson] [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD

marmbrus reviewed Aug 8, 2014

guowei2 and others added 21 commits August 18, 2014 14:50

Merge branch 'sql-memory-patch' of https://github.com/guowei2/spark i…

b821577

…nto sql-memory-patch Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala

numbers of improves

4df2b6c

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

f611ea9

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

6463c19

DStream run tasks only when dstream has partition items

9ef744f

DStream run tasks only when dstream has partition items

800e230

Merge branch 'master' of https://github.com/guowei2/spark

46e59c9

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

bb6c6da

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

60fff2a

Merge branch 'master' of https://github.com/guowei2/spark

492f682

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

49cd405

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

3a97745

DStream run tasks only when dstream has partition items

013ff03

DStream run tasks only when dstream has partition items

3e9d50a

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

7a77562

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

efb8545

SITUATION: ShuffledDStream run tasks whether dstream has partition it…

3e36c9b

…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items

DStream run tasks only when dstream has partition items

7309d29

Merge branch 'master' of https://github.com/guowei2/spark

13e435c

Merge branch 'sql-memory-patch' of https://github.com/guowei2/spark i…

f2e9ce6

…nto sql-memory-patch

Merge branch 'sql-memory-patch' of https://github.com/guowei2/spark i…

5ae43f8

…nto sql-memory-patch

fix Big-ass commit

03dea6b

guowei2 closed this Aug 19, 2014

marmbrus mentioned this pull request Sep 3, 2014

SPARK-1627: Support external aggregation by using Aggregator in Spark SQL #867

Closed
