
Conversation

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 3, 2020

What changes were proposed in this pull request?

This PR aims to add StorageLevel.DISK_ONLY_3 as a built-in StorageLevel.

Why are the changes needed?

In a YARN cluster, HDFS usually provides storage with a replication factor of 3. So, technically, we could achieve the effect of StorageLevel.DISK_ONLY_3 by saving the result to HDFS. However, disaggregated clusters and clusters without storage services are on the rise. Previously, in that situation, users could use the similar MEMORY_AND_DISK_2 or a user-created StorageLevel. This PR aims to support those use cases officially for better UX.

Does this PR introduce any user-facing change?

Yes. This provides a new built-in option.

How was this patch tested?

Pass the GitHub Action or Jenkins with the revised test cases.

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126955 has finished for PR 29331 at commit 0cf67c4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Aug 3, 2020

retest this please

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126970 has finished for PR 29331 at commit 0cf67c4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thanks, @HyukjinKwon and @maropu . Yes. Official support is more beneficial to users. DISK_ONLY_3 is better than magic code like new StorageLevel(true, false, false, false, 3). Also, this PR includes test coverage for DISK_ONLY_3, which helps users feel safe.
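As an illustration of why a named constant beats the raw constructor call, here is a minimal pure-Python model of the five StorageLevel flags. This is a sketch, not the actual org.apache.spark.storage.StorageLevel class; the field names mirror the Scala constructor's parameters.

```python
# A minimal pure-Python model of Spark's StorageLevel flags.
# NOTE: illustrative sketch only, not the real
# org.apache.spark.storage.StorageLevel class.
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageLevel:
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1

# The named constant is self-describing ...
DISK_ONLY_3 = StorageLevel(True, False, False, False, 3)

# ... while the positional "magic code" form forces the reader to
# remember what each boolean means, even though both are equal:
assert DISK_ONLY_3 == StorageLevel(
    use_disk=True, use_memory=False, use_off_heap=False,
    deserialized=False, replication=3)
```

The named constant also makes the replication factor greppable, which the bare boolean tuple does not.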

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126996 has finished for PR 29331 at commit 480a480.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Rebased to master.

@SparkQA

SparkQA commented Aug 3, 2020

Test build #127006 has finished for PR 29331 at commit cc1a7a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 4, 2020

Interesting. Since the last commit only changes R/Python/doc and one Java file, CachedTableSuite should not be affected. I'll take a look at what is different from the last run.

Locally, CachedTableSuite passed. I'm still looking at Jenkins.

@dongjoon-hyun
Member Author

I found the root cause of random failures in the master branch. Here is the PR.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Aug 4, 2020

Test build #127054 has finished for PR 29331 at commit cc1a7a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

All tests passed again. Could you review this, @HyukjinKwon , @maropu , @viirya , @dbtsai , @holdenk ?

Member

@srowen left a comment

I get the value of 3x replication for persistent data; this is in theory persistence for data that is already recreatable, right? cached data? or am I totally forgetting where else this can be used?

If so, this doesn't seem as necessary, and even DISK_ONLY_2 feels like overkill.
I suppose one argument we've made in the past is that the 2x replication is to make the cached data available as local data in more places, to improve locality. That could be an argument.

I don't feel strongly about it either way. But would MEMORY_AND_DISK_3 then make sense?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 5, 2020

I get the value of 3x replication for persistent data; this is in theory persistence for data that is already recreatable, right?

Right, @srowen .

cached data? or am I totally forgetting where else this can be used?

Yes. This cuts the lineage and works like an HDFS replacement. Previously, this could be achieved by storing the RDD back to HDFS. Now, that is difficult in a disaggregated cluster.

If so, this doesn't seem as necessary, and even DISK_ONLY_2 feels like overkill.

It's not overkill. HDFS replication is not only for reliability: 3x HDFS replication also improves read throughput 3x. Are you sure one executor can serve that traffic, @srowen ?

I suppose one argument we've made in the past is that the 2x replication is to make the cached data available as local data in more places, to improve locality. That could be an argument.

Improving locality is just a small fraction of the benefit. The throughput improvement and reduced FetchFailedExceptions are the real benefits. If we don't have HDFS, this is the only viable option.

But would MEMORY_AND_DISK_3 then make sense?

MEMORY_AND_DISK_3 is not recommended here because it carries the additional assumption that all the data fits into memory. It turned out that this has a severe side effect when executor memory is not enough. Why load the data into memory if Spark will push it back down to disk anyway?

This PR aims to serve data, conceptually like HDFS, inside Spark to support an HDFS-service-free ecosystem.
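The distinction argued here can be sketched with illustrative flag tuples modeled after Spark's StorageLevel fields. This is not the Spark API, and MEMORY_AND_DISK_3 is a hypothetical level that does not exist as a built-in:

```python
# Illustrative flag tuples modeled after Spark's StorageLevel fields
# (not the Spark API; MEMORY_AND_DISK_3 is hypothetical, not a built-in).
from collections import namedtuple

Level = namedtuple("Level", "use_disk use_memory deserialized replication")

DISK_ONLY_3 = Level(use_disk=True, use_memory=False,
                    deserialized=False, replication=3)
MEMORY_AND_DISK_3 = Level(use_disk=True, use_memory=True,
                          deserialized=True, replication=3)  # hypothetical

# The objection above: if the data will end up spilled to disk anyway,
# the use_memory=True flag only adds memory pressure on the executors.
assert DISK_ONLY_3.use_memory is False
assert MEMORY_AND_DISK_3.use_memory is True
```

The two levels differ only in the memory-related flags, which is exactly the memory-competition side effect the comment describes.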

@holdenk
Contributor

holdenk commented Aug 5, 2020

I think the motivation here was to deal with a workload with a lot of executor failures and to avoid a lot of recomputes, more than locality.

@dongjoon-hyun
Member Author

Thank you, @holdenk !

@Ngone51
Member

Ngone51 commented Aug 6, 2020

The throughput improvement and reduced FetchFailedExceptions are the real benefits. If we don't have HDFS, this is the only viable option.

IIUC, FetchFailedException is only raised when we try to fetch shuffle blocks, while StorageLevel is only related to RDD blocks. So I have no idea how DISK_ONLY_3 could help reduce FetchFailedException.

And how do we get the throughput improvement by using DISK_ONLY_3? Higher task parallelism? Or something else?

I think the motivation here was to deal with a workload with a lot of executor failures and to avoid a lot of recomputes, more than locality.

If that's the case, I think we should care more about shuffle data, which has only a single copy on disk. Shuffle data loss leads to a stage recompute, which is worse than the task recompute caused by RDD block loss.

I don't object to the change here, but I just want to figure out what real case it tries to improve.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 6, 2020

Thank you, @Ngone51 . The user scenario looks like this. The job has a very long lineage. In a disaggregated cluster, executors sometimes die for various reasons (including maintenance and preemption), causing bad effects like FetchFailedException and frequent retries (not only of the direct parent, but also of the ancestors). This is the same as you wrote. So, the user tries to cut the lineage by using cache after the shuffle stage. But it turns out that cache can cause memory competition as a side effect. Although Spark can spill to disk, they don't want to load the data into memory from the beginning. They inevitably decided to choose disk only. In short, they are using DISK_ONLY_1 and DISK_ONLY_2 and are currently asking for DISK_ONLY_3. It depends on their decision for each individual dataset.

The rationale for DISK_ONLY_3 is that they want the same concept as the existing HDFS service.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Aug 7, 2020

Hi, @HyukjinKwon , @maropu , @srowen , @Ngone51 .
Please let me know if you have any other concerns. This is just a new alias which doesn't cause any negative effects on the existing Spark ecosystem.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 7, 2020

Test build #127208 has finished for PR 29331 at commit cc1a7a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

cc @tgravescs too

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon .

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 9, 2020

Test build #127242 has finished for PR 29331 at commit cc1a7a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Aug 10, 2020

Test build #127244 has finished for PR 29331 at commit cc1a7a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

Member

@Ngone51 left a comment

Thank you for your explanation @dongjoon-hyun . LGTM.

@dongjoon-hyun
Member Author

Thank you so much, @Ngone51

@SparkQA

SparkQA commented Aug 10, 2020

Test build #127257 has finished for PR 29331 at commit cc1a7a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Do we consider writing out the DataFrame as parquet/orc/... to reliable storage to cut the RDD lineage?

@dongjoon-hyun
Member Author

Do we consider writing out the DataFrame as parquet/orc/... to reliable storage to cut the RDD lineage?

Thank you for the review, @cloud-fan . No, this approach doesn't consider additional reliable storage. This PR depends on pure Spark features only.

@cloud-fan
Contributor

Since we already have DISK_ONLY_2, I'm fine with adding DISK_ONLY_3.

I'm just giving a different proposal for this use case. The RDD lineage model relies on recomputing, so Spark can cache data on unreliable storage. I think caching with multiple copies diverges from the original idea. If you don't want to trigger recomputing, you can save data to reliable storage, which is usually better than 3 full copies (an object store is cheaper, and HDFS has erasure coding to save space).
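The storage-cost comparison behind this proposal can be made concrete with back-of-the-envelope arithmetic. This is illustrative only (not a Spark or HDFS API); RS(6,3) is used here as an example Reed-Solomon erasure-coding layout:

```python
# Back-of-the-envelope storage-cost arithmetic behind this trade-off
# (illustrative only, not a Spark or HDFS API).

def replication_overhead(copies: int) -> float:
    """Bytes stored per byte of data when keeping n full copies."""
    return float(copies)

def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Bytes stored per byte of data under Reed-Solomon erasure coding."""
    return (data_units + parity_units) / data_units

# Three full copies cost 3.0x the data size ...
assert replication_overhead(3) == 3.0
# ... while an RS(6,3) erasure-coding layout costs only 1.5x.
assert erasure_coding_overhead(6, 3) == 1.5
```

This is why reliable storage with erasure coding can be cheaper per byte than 3x in-cluster replication, at the cost of depending on an external storage service.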

@dongjoon-hyun
Member Author

Thank you for the advice, @cloud-fan . Of course, we know that the architectural goal of Apache Spark is not to be a reliable storage system. Reliable storage will always be considered the first solution when it's available.

@dongjoon-hyun
Member Author

Thank you all. This is nothing new, but an alias of an existing Spark feature. This will not cause much confusion about what Apache Spark aims for. Merged to master.

dongjoon-hyun deleted the SPARK-32517 branch August 10, 2020 14:33
Contributor

@tgravescs left a comment

Sorry for coming in late here; a few questions/nits:

**Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library,
so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`,
`MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, and `DISK_ONLY_2`.*
`MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2`, and `DISK_ONLY_3`.*
Contributor

it looks like we need to update the table above as well?
it might be nice to say what happens if you specify a level > 1 but you don't have that many executors.

Member Author

Let me rephrase the request.

  1. Adding both DISK_ONLY_2 and DISK_ONLY_3 to the above table.
  2. Adding a description about the corner case for MEMORY_ONLY_2, MEMORY_AND_DISK_2, DISK_ONLY_2, DISK_ONLY_3

Is there something more I can do, @tgravescs ?

Contributor

That's it.

"caching in memory, serialized, replicated" -> StorageLevel.MEMORY_ONLY_SER_2,
"caching on disk, replicated" -> StorageLevel.DISK_ONLY_2,
"caching on disk, replicated 2" -> StorageLevel.DISK_ONLY_2,
"caching on disk, replicated 3" -> StorageLevel.DISK_ONLY_3,
Contributor

so what happens if there aren't 3 executors? do we have a test that needs updating?

Member Author

so what happens if there aren't 3 executors?

The number of copies becomes 2 and this test case fails, as expected.

do we have a test that needs updating?

Yes. This test suite is updated at line 41.
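A simplified model of the best-effort behavior described above (an illustration with a hypothetical helper, not Spark's actual replication code): a block cannot have more copies than there are executors to hold them.

```python
# A simplified model of best-effort replication (an illustration, not
# Spark's actual implementation): a block cannot have more copies than
# there are executors available to hold them.

def effective_replication(requested: int, num_executors: int) -> int:
    """Hypothetical helper: cap the requested copies at cluster size."""
    return min(requested, num_executors)

# With only 2 executors, a DISK_ONLY_3 request yields 2 copies,
# which is why the test fails when fewer than 3 executors exist.
assert effective_replication(3, 2) == 2
# With enough executors, all 3 requested copies can be placed.
assert effective_replication(3, 5) == 3
```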

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020

Closes apache#29331 from dongjoon-hyun/SPARK-32517.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b421bf0)
Signed-off-by: Dongjoon Hyun <[email protected]>