[SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver #4363
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,6 +22,7 @@ import org.apache.spark.executor.TaskMetrics | |
| import org.apache.spark.storage.BlockManagerId | ||
| import org.apache.spark.scheduler.TaskScheduler | ||
| import org.apache.spark.util.ActorLogReceive | ||
| import org.apache.spark.scheduler.ExecutorLossReason | ||
|
|
||
| /** | ||
| * A heartbeat from executors to the driver. This is a shared message used by several internal | ||
|
|
@@ -32,18 +33,56 @@ private[spark] case class Heartbeat( | |
| taskMetrics: Array[(Long, TaskMetrics)], // taskId -> TaskMetrics | ||
| blockManagerId: BlockManagerId) | ||
|
|
||
| private[spark] case object ExpireDeadHosts | ||
|
|
||
| private[spark] case class HeartbeatResponse(reregisterBlockManager: Boolean) | ||
|
|
||
| /** | ||
| * Lives in the driver to receive heartbeats from executors.. | ||
| */ | ||
| private[spark] class HeartbeatReceiver(scheduler: TaskScheduler) | ||
| private[spark] class HeartbeatReceiver(sc: SparkContext, scheduler: TaskScheduler) | ||
|
Contributor
I would recommend limiting the number of things we pass to this receiver to the following set of smaller things instead of the whole The
Contributor
Author
That would not be easy to understand; on the other hand, the SparkContext is used in a lot of places.
Contributor
Yes, but it's a bad design pattern to pass in the whole Anyway, not a big deal since this is private. I will merge it as is. |
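As an illustration of the trade-off discussed above, here is a minimal sketch of a receiver that is handed only the values it uses rather than the whole context. Everything in it is an assumption made for illustration: `NarrowDependencies`, `resolveTimeoutMs`, and `NarrowReceiver` are hypothetical names, a plain `Map` stands in for the Spark configuration, and no Akka or Spark classes are involved. The timeout default mirrors the expression in the diff (three heartbeat intervals, but at least 120000 ms).

```scala
import scala.collection.mutable

// Sketch only: all names here are hypothetical and not part of Spark.
object NarrowDependencies {
  // Resolve the expiry timeout the same way the diff does, from a plain Map.
  def resolveTimeoutMs(conf: Map[String, String]): Long = {
    val heartbeatIntervalMs =
      conf.get("spark.executor.heartbeatInterval").map(_.toInt).getOrElse(10000)
    val defaultMs = math.max(heartbeatIntervalMs * 3, 120000).toLong
    conf.get("spark.storage.blockManagerSlaveTimeoutMs").map(_.toLong).getOrElse(defaultMs)
  }

  // The receiver depends on a timeout, a clock, and a callback, nothing more.
  class NarrowReceiver(timeoutMs: Long, clock: () => Long, onExecutorLost: String => Unit) {
    // executor ID -> timestamp (ms) of the most recent heartbeat
    private val executorLastSeen = mutable.HashMap.empty[String, Long]

    def heartbeat(executorId: String): Unit =
      executorLastSeen.update(executorId, clock())

    def expireDeadHosts(): Unit = {
      val now = clock()
      // Collect first, then mutate, so the map is not modified while iterating.
      val expired = executorLastSeen.collect {
        case (id, lastSeenMs) if now - lastSeenMs > timeoutMs => id
      }.toList
      expired.foreach { id =>
        executorLastSeen.remove(id)
        onExecutorLost(id)
      }
    }
  }
}
```

A side effect of this shape is that the expiry logic can be exercised with a fake clock and a recording callback, without starting a context.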
||
| extends Actor with ActorLogReceive with Logging { | ||
|
|
||
| val executorLastSeen = new mutable.HashMap[String, Long] | ||
|
Contributor
Please add a comment on what the keys and values are: |
||
|
|
||
| import context.dispatcher | ||
| var timeoutCheckingTask = context.system.scheduler.schedule(0.seconds, | ||
|
Contributor
too many spaces after |
||
| 10.milliseconds, self, ExpireDeadHosts) | ||
|
|
||
| val slaveTimeout = sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", | ||
| math.max(sc.conf.getInt("spark.executor.heartbeatInterval", 10000) * 3, 120000)) | ||
|
|
||
| override def receiveWithLogging = { | ||
| case Heartbeat(executorId, taskMetrics, blockManagerId) => | ||
| val response = HeartbeatResponse( | ||
| !scheduler.executorHeartbeatReceived(executorId, taskMetrics, blockManagerId)) | ||
|
Contributor
not your code, but can you put this in a separate val: |
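The snippet that accompanied this comment is not preserved here. As a rough sketch of what pulling the negated call into its own named value could look like, the following uses stand-ins only: `NamedResponse`, `respond`, and `unknownExecutor` are hypothetical names, and the stub `executorHeartbeatReceived` simply assumes that a `true` result means the driver already knows the executor's block manager.

```scala
// Sketch only: stand-in types, not the Spark classes from the diff.
object NamedResponse {
  final case class HeartbeatResponse(reregisterBlockManager: Boolean)

  // Stub for scheduler.executorHeartbeatReceived; assumed to return true
  // when the driver already knows about the executor's block manager.
  def executorHeartbeatReceived(knownExecutors: Set[String], executorId: String): Boolean =
    knownExecutors.contains(executorId)

  def respond(knownExecutors: Set[String], executorId: String): HeartbeatResponse = {
    // Naming the negated result makes the meaning of the flag explicit.
    val unknownExecutor = !executorHeartbeatReceived(knownExecutors, executorId)
    HeartbeatResponse(reregisterBlockManager = unknownExecutor)
  }
}
```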
||
| heartbeatReceived(executorId) | ||
| sender ! response | ||
| case ExpireDeadHosts => | ||
| expireDeadHosts() | ||
|
|
||
|
Contributor
very minor nit: remove space |
||
| } | ||
|
|
||
| private def heartbeatReceived(executorId: String) = { | ||
| executorLastSeen(executorId) = System.currentTimeMillis() | ||
|
Contributor
This can just be moved into the one place that it's called from |
||
| } | ||
|
|
||
| private def expireDeadHosts() { | ||
| logTrace("Checking for hosts with no recent heart beats in HeartbeatReceiver.") | ||
|
Contributor
nit: remove space between "heart" and "beats" |
||
| val now = System.currentTimeMillis() | ||
| val minSeenTime = now - slaveTimeout | ||
| for ((executorId, lastSeenMs) <- executorLastSeen) { | ||
| if (lastSeenMs < minSeenTime) { | ||
|
Contributor
I think it's easier to read if it's the following instead: |
||
| val msg = "Removing Executor " + executorId + " with no recent heart beats: " + | ||
| (now - lastSeenMs) + "ms exceeds " + slaveTimeout + "ms" | ||
| logWarning(msg) | ||
| if (scheduler.isInstanceOf[org.apache.spark.scheduler.TaskSchedulerImpl]) { | ||
|
Contributor
I think it would be better to add a method to |
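The target of this suggestion is cut off above, so the following is only one possible reading: declaring the lost-executor hook on the scheduler abstraction itself, so the caller needs neither `isInstanceOf` nor `asInstanceOf`. The trait and classes are hypothetical stand-ins (the real `TaskScheduler` has many more members), and a plain `String` replaces `ExecutorLossReason`.

```scala
import scala.collection.mutable

// Sketch only: hypothetical stand-ins for the scheduler types in the diff.
object SchedulerDispatch {
  trait Scheduler {
    // Default no-op, so schedulers that cannot react to a lost executor need no changes.
    def executorLost(executorId: String, reason: String): Unit = ()
  }

  class RecordingScheduler extends Scheduler {
    val lost = mutable.ListBuffer.empty[(String, String)]
    override def executorLost(executorId: String, reason: String): Unit = {
      lost += ((executorId, reason))
      ()
    }
  }

  class OtherScheduler extends Scheduler

  // The caller dispatches through the trait; no type test or cast is required.
  def expire(scheduler: Scheduler, executorId: String, idleMs: Long): Unit =
    scheduler.executorLost(executorId, s"no heartbeat for $idleMs ms")
}
```

Passing a descriptive reason string, rather than an empty one, also gives whoever reads the scheduler logs a hint about why the executor was dropped.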
||
| scheduler.asInstanceOf[org.apache.spark.scheduler.TaskSchedulerImpl] | ||
| .executorLost(executorId, new ExecutorLossReason("")) | ||
| } | ||
| sc.killExecutor(executorId) | ||
|
Contributor
we can't do this since it's only available in YARN at the moment |
||
| executorLastSeen.remove(executorId) | ||
|
Contributor
What happens if a heartbeat from the executor gets delivered after we kill / remove it?
Contributor
Author
Because the Akka connection is still alive, we can kill the executor by sending a kill message to the ApplicationMaster.
Contributor
I believe this code will correctly expire that executor as a dead host after a timeout. |
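The scenario in this thread can be replayed with a small model. This is a sketch under stated assumptions, not the code from the diff: `LateHeartbeat.Tracker` is a hypothetical class, timestamps are passed in explicitly instead of being read from `System.currentTimeMillis()`, and the scheduler and kill calls are left out. The expiry condition (`lastSeenMs < now - timeout`) matches the one in `expireDeadHosts`.

```scala
import scala.collection.mutable

// Sketch only: a hypothetical tracker with explicit timestamps for deterministic replay.
object LateHeartbeat {
  class Tracker(timeoutMs: Long) {
    // executor ID -> timestamp (ms) of the most recent heartbeat
    val executorLastSeen = mutable.HashMap.empty[String, Long]

    def heartbeat(executorId: String, nowMs: Long): Unit =
      executorLastSeen.update(executorId, nowMs)

    // Removes and returns the executors whose last heartbeat is older than the timeout.
    def expireDeadHosts(nowMs: Long): List[String] = {
      val expired = executorLastSeen.collect {
        case (id, lastSeenMs) if lastSeenMs < nowMs - timeoutMs => id
      }.toList
      expired.foreach(id => executorLastSeen.remove(id))
      expired
    }
  }
}
```

In this model, a heartbeat that arrives after removal puts the executor back into the map, and the entry is expired again once no further heartbeat arrives within the timeout.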
||
| } | ||
| } | ||
| } | ||
| } | ||
We should add a header comment that communicates that one of the functions of the HeartbeatReceiver is to expire executors.
Hi Sryza, thanks for your review.