[SPARK-2253] Add logic to decide when to do key/value pre-aggregation on Map side. #16
Conversation
Replace println with logInfo
I noticed this block of code in HashShuffleReader.read(), which runs on the reduce side: how is this case handled, given that we turn off map-side combine even though dep.mapSideCombine is true?
We should write a unit test at the RDD level as well. Use RDD.combineByKey() and verify end-to-end that the results are correct, given the partial aggregation logic added here. I think you can just add another test in RDDSuite.scala.
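A sketch of the kind of end-to-end test being suggested here, assuming the usual RDDSuite setup (a SparkFunSuite mixing in SharedSparkContext so `sc` is available); the test name and data are illustrative:

```scala
test("combineByKey gives correct results with the partial-aggregation heuristic") {
  // 1000 rows, 10 unique keys: results must be the same regardless of whether
  // map-side pre-aggregation was actually performed.
  val pairs = sc.parallelize(1 to 1000, 4).map(i => (i % 10, 1))
  val counts = pairs.combineByKey(
    (v: Int) => v,
    (c: Int, v: Int) => c + v,
    (c1: Int, c2: Int) => c1 + c2).collectAsMap()
  assert(counts.size == 10)
  assert(counts.values.forall(_ == 100))
}
```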
Regarding your first comment, I THINK the answer is:
About the second one: I'll do that soon!
My question is how the data gets to the reducer with type (K, C). If it arrives as (K, C), then it's definitely okay. I thought map-side combine was also responsible for creating the initial Cs, but I can see how the map side could do that without map-side combine as well - I just want some code locations to look at to know for sure.
ShuffleManager:
ShuffleReader:
The first iteration of the disable-map-side-combine-on-pre-aggregation feature failed to take into account that the reduce side expects to read values of type (K, C) when mapSideCombine is set to true. This patch makes SortFileWriter and HashFileWriter handle that case by explicitly applying the user's createCombiner function to all rows when the user requested mapSideCombine but we determined it was not the best option in this stage.
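A minimal sketch (not the actual patch code) of the idea described above: when a stage skips map-side combine even though the dependency requested it, each value still has to be lifted to the combiner type C so the reduce side keeps seeing (K, C) records. The helper name is hypothetical:

```scala
// Hypothetical helper, illustrating the idea only: apply the user's
// createCombiner to every value instead of aggregating, so downstream
// mergeCombiners still operates on the expected (K, C) type.
def liftToCombiner[K, V, C](
    records: Iterator[Product2[K, V]],
    createCombiner: V => C): Iterator[(K, C)] = {
  records.map(pair => (pair._1, createCombiner(pair._2)))
}
```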
Some perf testing

Cluster resources:
Spark command: ./bin/spark-shell --executor-memory 16G --master spark://master:7077 --conf spark.partialAgg.interval=YY spark.partialAgg.reduction=XX -i MyTest.txt

MyTest.txt code:
```scala
val cc = sc.textFile("hdfs://file-200GB")
val headerAndRows = cc.map(line => line.split("\\|"))
val pairs = headerAndRows.map(line => (line(0), 1))
val combined = pairs.combineByKey(
  (value: Int) => (value, 1),
  (x: (Int, Int), value: Int) => (x._1 + value, x._2 + 1),
  (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))
combined.count
```

Result of the above code: Long = 30000

Time: 6.5 min
Time: 8.3 min
Time: 8.4 min
Time: 6.3 min
Looks like testing at a glance shows some speedup =) So we have a dataset with 30,000 unique keys; that's what this test shows us, right? How does this compare to the total number of rows in the dataset? Also, what are the implications of the numbers you found? Can you explain how they line up with the behavior you're expecting?
Yep, there is a speed-up when doing pre-aggregation compared to not doing it, mainly because there are 30,000 unique keys and over 700,000,000 rows in the test file.
" (~1133 calls to write() from SortShuffleWriter)" How is this influenced by the pre-aggregation logic? |
It isn't influenced by the pre-aggregation logic. I posted that number to show how many times the ShuffleAggregationManager class is instantiated and the pre-aggregation logic is applied (iterating through the Map, etc.).
I thought this would be influenced by how much partial aggregation is done on the map side. For example, if you stop partial aggregation partway through, you have more unique keys and values to write to the SortShuffleWriter, right? This is mostly just for understanding what the end behavior actually was.
Hmm, actually no. SortShuffleWriter is called with an Iterator over some (K, V) pairs, and that is where it decides whether to do the pre-aggregation or not. To make that decision, it instantiates ShuffleAggregationManager and iterates through the first spark.partialAgg.interval elements of that iterator (and this is why the number of calls to write() is important: each call allocates a Map and iterates through it).
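A rough sketch of the sampling step described above, under the assumption that the heuristic simply buffers the first `interval` records and compares unique keys against rows seen; the names and the exact decision rule are illustrative, not the PR's actual ShuffleAggregationManager API:

```scala
import scala.collection.mutable

// interval ~ spark.partialAgg.interval, reductionTarget ~ spark.partialAgg.reduction
def decidePreAggregation[K, V](
    records: Iterator[(K, V)],
    interval: Int,
    reductionTarget: Double): (Boolean, Iterator[(K, V)]) = {
  val buffered = mutable.ArrayBuffer.empty[(K, V)]
  val uniqueKeys = mutable.HashSet.empty[K]
  while (records.hasNext && buffered.size < interval) {
    val kv = records.next()
    buffered += kv
    uniqueKeys += kv._1
  }
  // Pre-aggregate only if the sample shows enough key reduction.
  val preAggregate =
    buffered.nonEmpty && uniqueKeys.size.toDouble / buffered.size <= reductionTarget
  // The sampled records must be replayed ahead of the remaining input.
  (preAggregate, buffered.iterator ++ records)
}
```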
You should remove this. I don't see this header on any other files in Spark.
Yup, sorry! :-D
CC @punya for a final sign-off / review. Note that this is basically just adapting apache#1191.
noob question: why Product2[K, V] rather than (K, V)?
I've tested the code against an LZO compressed table (with the same Spark code I wrote in a previous comment here) with 1,000,000,000 rows and 1,000,000,000 unique keys. The table looks like this:
```
key1|vlad1
key2|vlad2
key3|vlad3
...
```
It seems that there is a large performance improvement when disabling pre-aggregation (which our code does). Here are the results:
Conclusion: when pre-aggregation is disabled we get 8 minutes instead of 13. Also, spark.partialAgg.interval doesn't affect the time very much (Test 2 has a big interval and Test 3 a small one, and there isn't a significant difference between them).
iteratedElements could potentially OOM if we're storing 10k large items in memory. Perhaps we should store them in a size-tracking collection and stop sampling when we either hit 10k items or the size-tracking collection gets too big?
Talking to @mccheah, the other way is to do it "inline"; you can talk to him directly if you want some insight into that.
The inline logic is what the original commit did, but that's pretty hard to do. I like the size-tracking idea, but size tracking in and of itself isn't free, so you should benchmark that. Size tracking is most expensive when you have an RDD of composite objects (i.e. not primitives; think of an RDD of HashSet objects).
Hi! Yep, we were aware of that, and I think there are two possible solutions:
- I think I have a mathematical algorithm (using probabilities, and assuming keys are uniformly distributed) that can stop iterating after a bounded number of steps (in this case, at most 10k). I have to think about it a little more (actually, to remember it :-) ) and I'll post it here.
- Instead of using a MutableList, maybe we can switch to ExternalAppendOnlyMap or ExternalList (which Matt created in some of his previous commits).
I'll be thinking about that :-)
You should just use a size tracking collection. The collection doesn't have to spill, significantly simplifying your implementation. The size tracking collection will be able to report its size in memory, and then when it hits some memory threshold you take that as the sample to conduct your heuristic.
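A sketch of the size-capped sampling buffer being discussed, assuming Spark's SizeEstimator utility for the per-record size estimate; the class name and caps are made up for illustration, and nothing here spills:

```scala
import scala.collection.mutable
import org.apache.spark.util.SizeEstimator

class CappedSampleBuffer[K, V](maxItems: Int, maxBytes: Long) {
  private val buf = mutable.ArrayBuffer.empty[(K, V)]
  private var bytes = 0L

  // Returns true while the buffer is still accepting sample records;
  // once either cap is reached, sampling stops.
  def offer(kv: (K, V)): Boolean = {
    if (buf.size >= maxItems || bytes >= maxBytes) {
      false
    } else {
      buf += kv
      // Estimating every record is itself not free; a real implementation
      // would re-estimate only every N insertions, as SizeTrackingVector does.
      bytes += SizeEstimator.estimate(kv)
      true
    }
  }

  def sampled: Iterator[(K, V)] = buf.iterator
}
```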
Thank you for your comment, Justin! Do you mean implementing a mechanism like the one in the Spillable class (set the initial memory threshold based on spark.shuffle.spill.initialMemoryThreshold and then tryToAcquire() more memory until it reaches myMemoryThreshold)? Or do you think it's okay to do what I've just done in my last commit...? :-)
Np! I think it's okay to just set it to be as large as initialMemoryThreshold, and then stop if it ever gets that big. This is mainly a safety mechanism, so we don't need to be very smart about how we acquire more! Haha, I think being lazy might be warranted here.
Ok! Thank you for your time! (thumbsup)
This is what SizeTrackingVector actually does: it recomputes the size once every N elements added, and estimates it after each addition.
It would be good to get some performance numbers on this; can you do that?
Yes, I can. |
I've done a quick test, and it seems there isn't a big perf difference between the method with the size-tracking collection and the older one. I tested with a file of 10,000,000 rows, each row containing a key-value pair (both strings). Both methods ran locally in about 28 seconds.
## What changes were proposed in this pull request?
This PR brings the support for chained Python UDFs, for example
```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```
Directly chained unary Python UDFs are also put into a single batch of Python UDFs; others may require multiple batches.
For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#10 AS double(double(1))#9]
: +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
+- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
: +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
+- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
+- !BatchPythonEvaluation double(1), [pythonUDF#17]
+- Scan OneRowRelation[]
```
TODO: will support multiple unrelated Python UDFs in one batch (another PR).
## How was this patch tested?
Added new unit tests for chained UDFs.
Author: Davies Liu <davies@databricks.com>
Closes apache#12014 from davies/py_udfs.
* Documentation for the current state of the world.
* Adding navigation links from other pages
* Address comments, add TODO for things that should be fixed
* Address comments, mostly making images section clearer
* Virtual runtime -> container runtime
Disable partial aggregation automatically when the reduction factor is low. Once we have seen enough rows during partial aggregation and don't observe any reduction, the Aggregator should just turn off partial aggregation. This reduces memory usage for high-cardinality aggregations.
This one is for Spark core.
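A rough, illustrative sketch of that behavior (not Spark's actual Aggregator): aggregate into a hash map, and after a fixed number of input rows check whether the map is meaningfully smaller than the input; if not, stop inserting and pass the rest of the rows through unaggregated. The parameter names and thresholds are made up:

```scala
import scala.collection.mutable

def partialAggregate[K, V](
    rows: Iterator[(K, V)],
    mergeValue: (V, V) => V,
    checkAfterRows: Int = 100000,
    minReduction: Double = 0.5): Iterator[(K, V)] = {
  val map = mutable.HashMap.empty[K, V]
  var seen = 0
  var enabled = true
  while (rows.hasNext && enabled) {
    val (k, v) = rows.next()
    map.update(k, map.get(k).map(c => mergeValue(c, v)).getOrElse(v))
    seen += 1
    // Low reduction so far (almost every row produced a new key): give up.
    if (seen == checkAfterRows && map.size.toDouble / seen > minReduction) {
      enabled = false
    }
  }
  // Emit whatever was aggregated, then the untouched remainder of the input.
  map.iterator ++ rows
}
```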