[SPARK-15355] [CORE] Proactive block replication #14412
```diff
@@ -1130,15 +1130,48 @@ private[spark] class BlockManager(
     }
   }

+  /**
+   * Called for pro-active replenishment of blocks lost due to executor failures
+   *
+   * @param blockId blockId being replicate
+   * @param replicas existing block managers that have a replica
+   * @param maxReps maximum replicas needed
+   * @return
+   */
+  def replicateBlock(blockId: BlockId, replicas: Set[BlockManagerId], maxReps: Int): Boolean = {
```
**Member:** How about something like this for better readability?

```scala
def replicateBlock(blockId: BlockId, existingReplicas: Set[BlockManagerId], maxReplicas: Int)
```

Also, is there a reason this returns a `Boolean`?

**Contributor (Author):** This doesn't need to return a boolean. Changing the return type to `Unit`. Also changing the variable names.

**Member:** nit: this still needs fixing

**Contributor (Author):** @sameeragarwal This code is being removed as a part of this PR. The code replacing it has this fixed.
```diff
+    logInfo(s"Pro-actively replicating $blockId")
+    val infoForReplication = blockInfoManager.lockForReading(blockId).map { info =>
```
**Contributor:** This call acquires a read lock on the block, but when is that lock released? Per the Scaladoc of `lockForReading`, the caller is responsible for releasing the lock. I think what you want to do is acquire the lock, immediately call `doGetLocalBytes`, and then release the lock.

**Contributor:** Also, I don't think there's a need to have separate …

**Contributor (Author):** Yes, that makes sense.

**Member:** Nice catch! Can we also assert that all locks are released somewhere in …

**Contributor:** I think that we can set …
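The acquire/read/release pattern the reviewers converge on can be sketched outside Spark with a plain JDK read-write lock. This is a minimal illustration, not Spark's `BlockInfoManager` API; `readBlockSnapshot` is a hypothetical stand-in for the lock-then-read-then-release sequence:

```scala
// Minimal sketch of acquire / read / release with a JDK read-write lock.
import java.util.concurrent.locks.ReentrantReadWriteLock

val rwLock = new ReentrantReadWriteLock()
val blockData = Array[Byte](1, 2, 3)

// Acquire the read lock, copy what we need, and release in `finally`
// so the lock is never held past this method, even if reading fails.
def readBlockSnapshot(): Array[Byte] = {
  rwLock.readLock().lock()
  try {
    blockData.clone() // read while holding the lock
  } finally {
    rwLock.readLock().unlock() // always released, mirroring an explicit unlock
  }
}

val snapshot = readBlockSnapshot()
```

The `finally` block is the key point of the review comment: releasing outside a `finally` would leak the read lock on any failure during the read.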
```diff
+      val data = doGetLocalBytes(blockId, info)
+      val storageLevel = StorageLevel(
+        info.level.useDisk,
```
**Contributor:** Minor nit, but a problem with the … Thus I'd probably write each line like …

**Contributor (Author):** Done
```diff
+        info.level.useMemory,
+        info.level.useOffHeap,
+        info.level.deserialized,
+        maxReps)
+      (data, storageLevel, info.classTag)
+    }
+    infoForReplication.foreach { case (data, storageLevel, classTag) =>
+      replicate(blockId, data, storageLevel, classTag, replicas)
+    }
+    true
+  }
```
```diff
   /**
    * Replicate block to another node. Note that this is a blocking call that returns after
    * the block has been replicated.
    *
    * @param blockId
    * @param data
    * @param level
    * @param classTag
+   * @param existingReplicas
    */
   private def replicate(
-      blockId: BlockId,
-      data: ChunkedByteBuffer,
-      level: StorageLevel,
-      classTag: ClassTag[_]): Unit = {
+      blockId: BlockId,
+      data: ChunkedByteBuffer,
+      level: StorageLevel,
+      classTag: ClassTag[_],
+      existingReplicas: Set[BlockManagerId] = Set.empty): Unit = {

     val maxReplicationFailures = conf.getInt("spark.storage.maxReplicationFailures", 1)
     val tLevel = StorageLevel(
```
```diff
@@ -1152,20 +1185,25 @@ private[spark] class BlockManager(
     val startTime = System.nanoTime

-    var peersReplicatedTo = mutable.HashSet.empty[BlockManagerId]
+    var peersReplicatedTo = mutable.HashSet.empty ++ existingReplicas
     var peersFailedToReplicateTo = mutable.HashSet.empty[BlockManagerId]
     var numFailures = 0

+    val initialPeers = {
```
**Member:** can this not be just:

```scala
val initialPeers = getPeers(false).filterNot(existingReplicas.contains(_))
```

**Contributor (Author):** Done
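The reviewer's `filterNot` one-liner computes the same set as the diff's `if/else` version, because `filterNot` against an empty set is a no-op. A self-contained sketch, with strings standing in for `BlockManagerId`s (the peer names are made up for illustration):

```scala
// Strings stand in for BlockManagerIds: filterNot drops peers that
// already hold a replica, and keeps everything when the set is empty,
// so the if/else guard in the diff is unnecessary.
val peers = Seq("bm-1", "bm-2", "bm-3", "bm-4")
val existingReplicas = Set("bm-2", "bm-4")

val initialPeers = peers.filterNot(existingReplicas.contains)
// equivalent to:
// if (existingReplicas.isEmpty) peers else peers.filter(!existingReplicas.contains(_))
```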
```diff
+      val peers = getPeers(false)
+      if(existingReplicas.isEmpty) peers else peers.filter(!existingReplicas.contains(_))
+    }

     var peersForReplication = blockReplicationPolicy.prioritize(
       blockManagerId,
-      getPeers(false),
-      mutable.HashSet.empty,
+      initialPeers,
+      peersReplicatedTo,
       blockId,
       numPeersToReplicateTo)

     while(numFailures <= maxReplicationFailures &&
-      !peersForReplication.isEmpty &&
-      peersReplicatedTo.size != numPeersToReplicateTo) {
+      !peersForReplication.isEmpty &&
+      peersReplicatedTo.size < numPeersToReplicateTo) {
```
**Member:** I think it's still valid to replace the inequality with a strictly-less-than check, but just out of curiosity, can the number of `peersReplicatedTo` ever exceed `numPeersToReplicateTo`?

**Contributor (Author):** One scenario I can think of is if an executor with the block being replicated is lost (due to, say, a delayed heartbeat) and joins back again. The current implementation would recognize that the block manager needs to re-register and will report all blocks. The probability of this happening increases with pro-active replication, I think.
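The scenario described above (a lost executor re-registering and re-reporting its blocks) is exactly why `<` is safer than `!=` as the loop bound: if the replicated-peer count can jump past the target, `<` stops immediately while `!=` keeps consuming candidate peers. A minimal sketch, assuming made-up peer names and a count that already exceeds the target:

```scala
// Hypothetical sketch: the replicated set already overshot the target
// (e.g. a re-registered executor re-reported its blocks). With `<` as
// the bound the loop never runs; with `!=` it would keep replicating.
var peersReplicatedTo = Set("bm-1", "bm-2", "bm-3") // already 3 replicas
var peersForReplication = List("bm-4", "bm-5")
val numPeersToReplicateTo = 2
var attempts = 0

while (peersForReplication.nonEmpty &&
       peersReplicatedTo.size < numPeersToReplicateTo) {
  peersReplicatedTo += peersForReplication.head
  peersForReplication = peersForReplication.tail
  attempts += 1
}
```

With `!=` in place of `<`, the loop above would replicate to both remaining peers even though the target was already exceeded.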
```diff
       val peer = peersForReplication.head
       try {
         val onePeerStartTime = System.nanoTime
```
```diff
@@ -22,6 +22,7 @@ import java.util.{HashMap => JHashMap}
 import scala.collection.mutable
 import scala.collection.JavaConverters._
 import scala.concurrent.{ExecutionContext, Future}
+import scala.util.Random

 import org.apache.spark.SparkConf
 import org.apache.spark.annotation.DeveloperApi
```
```diff
@@ -188,24 +189,45 @@ class BlockManagerMasterEndpoint(
   }

   private def removeBlockManager(blockManagerId: BlockManagerId) {
+    val proactivelyReplicate = conf.get("spark.storage.replication.proactive", "false").toBoolean
+
     val info = blockManagerInfo(blockManagerId)

     // Remove the block manager from blockManagerIdByExecutor.
     blockManagerIdByExecutor -= blockManagerId.executorId
```
**Member:** why did you move this to the end (i.e., after replicating the blocks and updating …)?
```diff
-    // Remove it from blockManagerInfo and remove all the blocks.
-    blockManagerInfo.remove(blockManagerId)
     val iterator = info.blocks.keySet.iterator
     while (iterator.hasNext) {
       val blockId = iterator.next
       val locations = blockLocations.get(blockId)
       locations -= blockManagerId
       if (locations.size == 0) {
         blockLocations.remove(blockId)
         logWarning(s"No more replicas available for $blockId !")
+      } else if ((blockId.isRDD || blockId.isInstanceOf[TestBlockId]) && proactivelyReplicate) {
+        // we only need to proactively replicate RDD blocks
+        // we also need to replicate this behavior for test blocks for unit tests
+        // we send a message to a randomly chosen executor location to replicate block
+        // assuming single executor failure, we find out how many replicas existed before failure
+        val maxReplicas = locations.size + 1
```
**Member:** What happens if multiple executors are removed simultaneously? Depending on the invocation sequence, is it possible for …

**Contributor (Author):** Yes, that's a tough one. So the way replication is implemented, the correct storage level is only available with one of the blocks at the BlockManager layer (we don't have access to the RDD that this block is a part of, so we can't extract information from there). The remaining blocks all have storage levels set to 1. So I use the locations size to get an approximation for the storage level.
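The single-failure approximation described above can be shown with plain values; the helper name and peer names here are hypothetical, standing in for the inline `locations.size + 1` in the diff:

```scala
// Hypothetical sketch of the single-executor-failure approximation:
// the surviving locations plus the one copy that was just lost give
// the assumed pre-failure replica count, used as the replication target.
def approxMaxReplicas(survivingLocations: Set[String]): Int =
  survivingLocations.size + 1

val locations = Set("bm-1", "bm-3") // "bm-2" was just lost
val maxReplicas = approxMaxReplicas(locations)
```

As the author notes, this is only an approximation: if two executors fail at once, each removal sees a smaller surviving set and underestimates the original replica count.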
```diff
+
+        val i = (new Random(blockId.hashCode)).nextInt(locations.size)
```
**Contributor:** Why do we need to use a fixed random seed here? Testing? Also, isn't there a …

**Contributor (Author):** Scala's Random API doesn't have a choice method. And Spark's Utils class has methods to shuffle, but not a random choice.
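Since `scala.util.Random` has no choice method, the diff picks an index with `nextInt`; seeding with `blockId.hashCode` makes the pick deterministic for a given block, which is what makes it testable. A self-contained sketch (the `seededChoice` helper is made up for illustration):

```scala
import scala.util.Random

// Pick one element from a sequence by a deterministic, seed-derived index,
// mimicking `(new Random(blockId.hashCode)).nextInt(locations.size)`.
def seededChoice[T](seed: Int, items: Seq[T]): T = {
  val i = new Random(seed).nextInt(items.size) // always in [0, items.size)
  items(i)
}

val locations = Seq("bm-1", "bm-2", "bm-3")
val pick = seededChoice("rdd_0_0".hashCode, locations)
// the same seed always yields the same pick
val samePick = seededChoice("rdd_0_0".hashCode, locations)
```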
```diff
+        val blockLocations = locations.toSeq
+        val candidateBMId = blockLocations(i)
+        val blockManager = blockManagerInfo.get(candidateBMId)
+        if(blockManager.isDefined) {
+          val remainingLocations = locations.toSeq.filter(bm => bm != candidateBMId)
```
**Contributor:** Is it possible for this list to be empty in certain corner-cases? What happens if …

**Contributor (Author):** If we are at this point, there would be at least one location with the block, which will get chosen as the candidate here.
```diff
+          val replicateMsg = ReplicateBlock(blockId, remainingLocations, maxReplicas)
+          blockManager.get.slaveEndpoint.ask[Boolean](replicateMsg)
+        }
+      }
+    }
+    // Remove it from blockManagerInfo and remove all the blocks.
+    blockManagerInfo.remove(blockManagerId)

     listenerBus.post(SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId))
     logInfo(s"Removing block manager $blockManagerId")
   }

   private def removeExecutor(execId: String) {
```
You can omit this `@return` since this method doesn't have a return value.