-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating #1822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…nto sql-memory-patch
|
Can one of the admins verify this patch? |
… partition fix bug of countApproxDistinct() when have more than one partition Author: Davies Liu <[email protected]> Closes #1812 from davies/approx and squashes the following commits: bf757ce [Davies Liu] fix bug of countApproxDistinct() when have more than one partition
Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs). Added factory classes for Algo and Impurity, but made private[mllib]. CC: mengxr dorx Please let me know if there are other changes which would help with API consistency---thanks! Author: Joseph K. Bradley <[email protected]> Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits: 6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD. ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types) 00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency fe6dbfa [Joseph K. Bradley] removed unnecessary imports e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs). c699850 [Joseph K. Bradley] a few doc comments eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
|
@guowei2 can you add |
… repos .. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.) Author: Sean Owen <[email protected]> Closes #1828 from srowen/SPARK-2879.2 and squashes the following commits: 639f495 [Sean Owen] .. and use canonical repo1.maven.org Maven Central repo. (And make sure snapshots are disabled for plugins from Maven Central.)
Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters. CC mengxr Author: Joseph K. Bradley <[email protected]> Closes #1821 from jkbradley/dt-robustness and squashes the following commits: 4dc449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-robustness 7a61f7b [Joseph K. Bradley] Added some checks to Strategy to print out meaningful error messages when given invalid DecisionTree parameters
This is part of SPARK-2828: 1. separate IDF model from IDF algorithm (which generates a model) 2. separate StandardScaler model from StandardScaler CC: dbtsai Author: Xiangrui Meng <[email protected]> Closes #1814 from mengxr/feature-api-update and squashes the following commits: 40d863b [Xiangrui Meng] move mean and variance to model 48a0fff [Xiangrui Meng] separate Model from StandardScaler algorithm 89f3486 [Xiangrui Meng] update IDF to separate Model from Algorithm
Author: Oleg Danilov <[email protected]> Closes #1835 from dosoft/SPARK-2905 and squashes the following commits: 4df423c [Oleg Danilov] SPARK-2905 Fixed path sbin => bin
The reason for this bug was introduciton of OldDeps project. It had to be excluded to prevent unidocs from trying to put it on "docs compile" classpath. Author: Prashant Sharma <[email protected]> Closes #1830 from ScrapCodes/doc-fix and squashes the following commits: e5d52e6 [Prashant Sharma] SPARK-2899 Doc generation is back to working in new SBT Build.
… no sorting/aggregation and # partitions is small As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file. On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases. Author: Matei Zaharia <[email protected]> Closes #1799 from mateiz/SPARK-2787 and squashes the following commits: 88cf26a [Matei Zaharia] Fix rebase 10233af [Matei Zaharia] Review comments 398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them d0ae3c5 [Matei Zaharia] Fix some comments 90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests 31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
Author: Sandy Ryza <[email protected]> Closes #1507 from sryza/sandy-spark-2565 and squashes the following commits: 74dad41 [Sandy Ryza] SPARK-2565. Update ShuffleReadMetrics as blocks are fetched
Author: Kousuke Saruta <[email protected]> Closes #1834 from sarutak/SPARK-2904 and squashes the following commits: 38e7d45 [Kousuke Saruta] Removed non-used variable in SparkSubmitArguments
Author: Erik Erlandson <[email protected]> Closes #1841 from erikerlandson/spark-2911-pr and squashes the following commits: 4699e2f [Erik Erlandson] [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
JIRA: https://issues.apache.org/jira/browse/SPARK-2888 Author: Yin Huai <[email protected]> Closes #1817 from yhuai/fixAddColumnMetadataToConf and squashes the following commits: fba728c [Yin Huai] Fix addColumnMetadataToConf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after :. Can you add some scala doc about what this is checking?
…nto sql-memory-patch Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
mengxr Remove transform(dataset: RDD[String]) from public API. Author: Liquan Pei <[email protected]> Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits: 17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…ems or not FIX: ShuffledDStream run tasks only when dstream has partition items
…nto sql-memory-patch
|
i'm very sorry. i just rebase my brach to spark/master. what should i do to fix this . |
…nto sql-memory-patch
|
may i close this PR and create a new PR? |
|
Try this: |
|
i have to close this PR and close a new one . |
Using ExternalAppendOnlyMap to resolve OOM when aggregating.
Using "spark.shuffle.spill" to open it or not
Hive udaf does not support yet for udaf need Serializable
Join has the same problem. but using ExternalAppendOnlyMap as CoGroupedRDD seems to reduce performance. i try another way by using ExternalAppendOnlyMap. but it needs testing .i will commit it in another batch.