@@ -17,11 +17,13 @@

package org.apache.spark.executor

import java.lang.management.ManagementFactory
import java.util.concurrent.ThreadPoolExecutor

import scala.collection.JavaConverters._

import com.codahale.metrics.{Gauge, MetricRegistry}
import com.sun.management.OperatingSystemMXBean
Member

Is this com.sun class going to be available in all JDKs? Thinking of OpenJDK and IBM JDKs.

Member

Good point.
This class cannot be loaded at least on IBM JDK as reported here.

Contributor Author

Indeed, this is a very good point that I had overlooked. I have now checked directly, and this appears to work fine on OpenJDK (and on the Oracle JVM, of course). In addition, I tested manually with the IBM JDK (IBM J9 VM, Java 1.8.0_181), where one would indeed suspect incompatibilities, and surprisingly it appears to work in that case too. I believe this may come from recent work by IBM to make com.ibm.lang.management.OperatingSystemMXBean.getProcessCpuTime compatible with com.sun.management.OperatingSystemMXBean.getProcessCpuTime? See also this link.

If this is confirmed, we should be fine on a large fraction of the commonly used JDKs. In addition, we could handle the exception when getProcessCpuTime is not available on the platform where the executor is running, for example by returning -1 for this gauge in that case. Any thoughts/suggestions on this proposal?
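For illustration only, a minimal sketch of that fallback (not code from this patch); it assumes the com.sun class still loads on the platform and simply degrades the gauge value to -1 when the call fails:

import java.lang.management.ManagementFactory
import com.codahale.metrics.Gauge
import com.sun.management.OperatingSystemMXBean

// Hypothetical gauge: report the process CPU time, or -1 if the bean is unusable.
val processCpuTimeGauge = new Gauge[Long] {
  override def getValue: Long =
    try {
      ManagementFactory.getOperatingSystemMXBean
        .asInstanceOf[OperatingSystemMXBean]
        .getProcessCpuTime()
    } catch {
      // e.g. ClassCastException on a JVM whose OS bean does not implement the com.sun interface
      case _: Exception => -1L
    }
}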

Member

I think it's safest to use a little reflection here to make sure this doesn't cause the whole app to crash every time.
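A rough sketch of the reflection-based guard suggested here (an assumption about the shape, not the final patch): look getProcessCpuTime up by name at runtime, so a JDK without the com.sun class only loses the metric instead of failing at class load.

import java.lang.management.ManagementFactory
import com.codahale.metrics.Gauge

// Hypothetical reflective gauge: no compile-time dependency on com.sun.management.
val reflectiveCpuTimeGauge = new Gauge[Long] {
  override def getValue: Long =
    try {
      val osBean = ManagementFactory.getOperatingSystemMXBean
      // Resolve the method at runtime; JVMs without it make getMethod throw.
      val getProcessCpuTime = osBean.getClass.getMethod("getProcessCpuTime")
      getProcessCpuTime.setAccessible(true)
      getProcessCpuTime.invoke(osBean).asInstanceOf[Long]
    } catch {
      case _: Exception => -1L  // method missing or not accessible on this JDK
    }
}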

Contributor Author

I have refactored the code with a different approach using the platform MBeanServer, which should address the comments about the availability of com.sun.management.OperatingSystemMXBean across different JDKs.
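As a hedged sketch of that MBeanServer-based approach (the JMX ObjectName and attribute are standard names, but this is not necessarily the exact refactored code): query the platform MBean server for the OperatingSystem bean's ProcessCpuTime attribute, avoiding any compile-time reference to com.sun.management.

import java.lang.management.ManagementFactory
import javax.management.ObjectName
import com.codahale.metrics.Gauge

// Hypothetical MBeanServer-based gauge; ProcessCpuTime is reported in nanoseconds
// by the JVMs that expose it (HotSpot/OpenJDK, and OpenJ9/IBM in recent versions).
val mBeanServer = ManagementFactory.getPlatformMBeanServer
val osBeanName = new ObjectName("java.lang:type=OperatingSystem")

val mBeanCpuTimeGauge = new Gauge[Long] {
  override def getValue: Long =
    try {
      mBeanServer.getAttribute(osBeanName, "ProcessCpuTime").asInstanceOf[Long]
    } catch {
      case _: Exception => -1L  // attribute not exposed on this platform
    }
}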

import org.apache.hadoop.fs.FileSystem

import org.apache.spark.metrics.source.Source
@@ -73,6 +75,13 @@ class ExecutorSource(threadPool: ThreadPoolExecutor, executorId: String) extends
registerFileSystemStat(scheme, "write_ops", _.getWriteOps(), 0)
}

// Dropwizard metrics gauge measuring the executor's process (JVM) CPU time.
// The value is reported in nanoseconds; the method returns -1 if the operation is not supported.
val osMXBean = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
metricRegistry.register(MetricRegistry.name("executorCPUTime" ), new Gauge[Long] {
Member

A little confused with the existing cpuTime. How about jvmCpuTime?

Member

nit: name("executorCPUTime" ) -> name("executorCPUTime")

override def getValue: Long = osMXBean.getProcessCpuTime()
Member (@maropu, Aug 25, 2018)

Is this metric useful for users? Is the task CPU time metric not enough?

Contributor Author

I believe the proposed metric tracking the executor CPU time is useful and adds information and convenience on top of the task CPU metric implemented in SPARK-22190. A couple of considerations from recent findings and experimentation support this:

  • the process CPU time covers all the CPU consumed by the JVM, notably including the CPU spent on garbage collection, which can be significant in some cases and is definitely something we want to measure and analyze
  • the CPU time collected from the tasks is "harder to consume" in a dashboard, as the value is only updated at the end of a successful task execution, which is problematic for long-running tasks. In contrast, the executor process CPU time "dropwizard gauge" gives an up-to-date value of the CPU consumed by the executor at any time, since it is read directly from the OS.

})

// Expose executor task metrics using the Dropwizard metrics system.
// The list is taken from TaskMetrics.scala
val METRIC_CPU_TIME = metricRegistry.counter(MetricRegistry.name("cpuTime"))