@guowei2
Contributor

@guowei2 guowei2 commented Aug 7, 2014

Use ExternalAppendOnlyMap to avoid OOM errors when aggregating.
The "spark.shuffle.spill" setting turns this behavior on or off.
Hive UDAFs are not supported yet, because the UDAF needs to be Serializable.

Join has the same problem, but using ExternalAppendOnlyMap the way CoGroupedRDD does seems to reduce performance. I am trying another approach that also uses ExternalAppendOnlyMap, but it needs testing. I will commit it in another batch.
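
To illustrate the idea of spilling during aggregation, here is a minimal Python sketch. It is not Spark's ExternalAppendOnlyMap and not the code in this patch: the class name `SpillableAggMap`, the `max_keys_in_memory` threshold (a stand-in for a real memory estimate), and the `spill` flag (playing the role of "spark.shuffle.spill") are all assumptions made for the example. It also assumes keys are sortable so that spilled runs can be merged.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


class SpillableAggMap:
    """Toy spill-to-disk aggregation map (hypothetical, not Spark's class)."""

    def __init__(self, merge_value, max_keys_in_memory=1000, spill=True):
        self.merge_value = merge_value      # combines two partial aggregates
        self.max_keys = max_keys_in_memory  # stand-in for a memory limit
        self.spill = spill                  # plays the role of spark.shuffle.spill
        self.current = {}                   # in-memory partial aggregates
        self.spill_files = []               # paths of sorted runs on disk

    def insert(self, key, value):
        if key in self.current:
            self.current[key] = self.merge_value(self.current[key], value)
            return
        # A new key would grow the map; spill first if we are at the limit.
        if self.spill and len(self.current) >= self.max_keys:
            self._spill()
        self.current[key] = value

    def _spill(self):
        # Write the in-memory map as one key-sorted run, then start over.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            for item in sorted(self.current.items(), key=itemgetter(0)):
                pickle.dump(item, f)
        self.spill_files.append(path)
        self.current = {}

    @staticmethod
    def _read_run(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def items(self):
        """Merge all sorted runs and combine values that share a key."""
        runs = [self._read_run(p) for p in self.spill_files]
        runs.append(iter(sorted(self.current.items(), key=itemgetter(0))))
        merged = heapq.merge(*runs, key=itemgetter(0))
        try:
            for key, group in groupby(merged, key=itemgetter(0)):
                values = [v for _, v in group]
                acc = values[0]
                for v in values[1:]:
                    acc = self.merge_value(acc, v)
                yield key, acc
        finally:
            # Remove the temporary run files once the merge is done.
            for p in self.spill_files:
                os.remove(p)
            self.spill_files = []
```

With a small threshold, e.g. `SpillableAggMap(lambda a, b: a + b, max_keys_in_memory=2)`, inserting a stream of `(word, 1)` pairs spills several runs to disk and `items()` still yields one total per word. With `spill=False` everything stays in memory, which is the behavior that can run out of memory on large aggregations.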

@AmplabJenkins

Can one of the admins verify this patch?

davies and others added 2 commits August 6, 2014 21:22
… partition

fix bug of countApproxDistinct() when have more than one partition

Author: Davies Liu <[email protected]>

Closes #1812 from davies/approx and squashes the following commits:

bf757ce [Davies Liu] fix bug of countApproxDistinct() when have more than one partition
Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).

Added factory classes for Algo and Impurity, but made private[mllib].

CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!

Author: Joseph K. Bradley <[email protected]>

Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:

6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
fe6dbfa [Joseph K. Bradley] removed unnecessary imports
e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
c699850 [Joseph K. Bradley] a few doc comments
eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
@pwendell
Contributor

pwendell commented Aug 7, 2014

@guowei2 can you add [SQL] to the title here so it gets sorted correctly? Thanks!

srowen and others added 2 commits August 7, 2014 00:04
… repos

.. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.)

Author: Sean Owen <[email protected]>

Closes #1828 from srowen/SPARK-2879.2 and squashes the following commits:

639f495 [Sean Owen] .. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.)
Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters.
CC mengxr

Author: Joseph K. Bradley <[email protected]>

Closes #1821 from jkbradley/dt-robustness and squashes the following commits:

4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness
7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters
@guowei2 guowei2 changed the title [SPARK-2873] using ExternalAppendOnlyMap to resolve OOM when aggregating [SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating Aug 7, 2014
mengxr and others added 8 commits August 7, 2014 11:28
This is part of SPARK-2828:

1. separate IDF model from IDF algorithm (which generates a model)
2. separate StandardScaler model from StandardScaler
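
The model/algorithm split described in the two points above can be sketched as follows. This is a hedged Python illustration, not MLlib's Scala API: the class names mirror the commit message, but the method signatures, the dict-based document representation, and the smoothed formula log((n + 1) / (df + 1)) are assumptions chosen for the example.

```python
import math
from collections import Counter


class IDFModel:
    """Fitted state only: knows how to transform, not how to fit."""

    def __init__(self, idf):
        self.idf = idf  # term -> inverse document frequency

    def transform(self, term_freqs):
        # Terms never seen during fitting get weight 0.0 (a choice made
        # for this sketch).
        return {t: tf * self.idf.get(t, 0.0) for t, tf in term_freqs.items()}


class IDF:
    """Algorithm only: holds no fitted state, fit() returns a model."""

    def fit(self, docs):
        docs = list(docs)
        n = len(docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for doc in docs for term in doc)
        idf = {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}
        return IDFModel(idf)
```

Because `fit` returns a separate `IDFModel`, the same `IDF` instance can be fitted on several datasets without one fit overwriting another, and a model can be passed around without carrying the fitting logic.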

CC: dbtsai

Author: Xiangrui Meng <[email protected]>

Closes #1814 from mengxr/feature-api-update and squashes the following commits:

40d863b [Xiangrui Meng] move mean and variance to model
48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm
89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm
Author: Oleg Danilov <[email protected]>

Closes #1835 from dosoft/SPARK-2905 and squashes the following commits:

4df423c [Oleg Danilov] SPARK-2905 Fixed path sbin => bin
The reason for this bug was the introduction of the OldDeps project. It had to be excluded to prevent unidocs from trying to put it on the "docs compile" classpath.

Author: Prashant Sharma <[email protected]>

Closes #1830 from ScrapCodes/doc-fix and squashes the following commits:

e5d52e6 [Prashant Sharma] SPARK-2899 Doc generation is back to working in new SBT Build.
… no sorting/aggregation and # partitions is small

As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.

On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.
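
The bypass path described above can be sketched roughly as follows. This is a toy Python illustration under stated assumptions, not the ExternalSorter code from the patch: the function names, the use of `pickle` as the serializer, and the returned offsets index are all hypothetical. The point it shows is that each record is serialized once, into its partition's file, and the final file is produced by copying bytes in partition order.

```python
import os
import pickle
import shutil
import tempfile


def write_bypassing_merge_sort(records, num_partitions, out_path, partition_of=hash):
    """Write (key, value) records into one file grouped by partition id.

    Returns byte offsets: partition p occupies offsets[p]:offsets[p + 1].
    """
    tmp_dir = tempfile.mkdtemp(prefix="bypass-")
    paths = [os.path.join(tmp_dir, f"part-{p}") for p in range(num_partitions)]
    files = [open(path, "wb") for path in paths]
    try:
        # Each record is serialized exactly once, into its partition's file.
        for key, value in records:
            pickle.dump((key, value), files[partition_of(key) % num_partitions])
    finally:
        for f in files:
            f.close()

    offsets = [0]
    with open(out_path, "wb") as out:
        # Concatenate in partition order; raw bytes are copied, so nothing
        # is deserialized or re-serialized here.
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)
            offsets.append(out.tell())
    shutil.rmtree(tmp_dir)
    return offsets


def read_partition(out_path, offsets, p):
    """Read back the records of partition p using the offsets index."""
    records = []
    with open(out_path, "rb") as f:
        f.seek(offsets[p])
        while f.tell() < offsets[p + 1]:
            records.append(pickle.load(f))
    return records
```

This approach keeps one open file per output partition, which is why it only makes sense when the number of partitions is small, as the description above notes.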

Author: Matei Zaharia <[email protected]>

Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:

88cf26a [Matei Zaharia] Fix rebase
10233af [Matei Zaharia] Review comments
398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
d0ae3c5 [Matei Zaharia] Fix some comments
90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
Author: Sandy Ryza <[email protected]>

Closes #1507 from sryza/sandy-spark-2565 and squashes the following commits:

74dad41 [Sandy Ryza] SPARK-2565. Update ShuffleReadMetrics as blocks are fetched
Author: Kousuke Saruta <[email protected]>

Closes #1834 from sarutak/SPARK-2904 and squashes the following commits:

38e7d45 [Kousuke Saruta] Removed non-used variable in SparkSubmitArguments
Author: Erik Erlandson <[email protected]>

Closes #1841 from erikerlandson/spark-2911-pr and squashes the following commits:

4699e2f [Erik Erlandson] [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
JIRA: https://issues.apache.org/jira/browse/SPARK-2888

Author: Yin Huai <[email protected]>

Closes #1817 from yhuai/fixAddColumnMetadataToConf and squashes the following commits:

fba728c [Yin Huai] Fix addColumnMetadataToConf.
Contributor

Space after ":". Can you add some Scaladoc about what this is checking?

guowei2 and others added 21 commits August 18, 2014 14:50
…nto sql-memory-patch

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
mengxr
Remove transform(dataset: RDD[String]) from public API.

Author: Liquan Pei <[email protected]>

Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:

17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
…ems or not

FIX: ShuffledDStream run tasks only when dstream has partition items
@guowei2
Contributor Author

guowei2 commented Aug 18, 2014

I'm very sorry. I just rebased my branch onto spark/master. What should I do to fix this?

@guowei2
Contributor Author

guowei2 commented Aug 18, 2014

May I close this PR and create a new PR?

@witgo
Contributor

witgo commented Aug 18, 2014

Try this:
git commit -m "Big-ass commit" --allow-empty
git rebase -i master
git push origin sql-memory-patch -f

@guowei2
Contributor Author

guowei2 commented Aug 19, 2014

I have to close this PR and open a new one.
