[SPARK-19968][SS] Use a cached instance of KafkaProducer instead of creating one every batch.
#17308
Conversation
Test build #74644 has started for PR 17308 at commit
Ideally this should not have been changed; any implementation of java.util.AbstractMap has a correctly working hashCode(). Here they are changed to HashMap to avoid converting or casting them later. It could actually just be java.util.Map, but then we cannot guarantee the outcome of hashCode().
Test build #74646 has finished for PR 17308 at commit
Test build #74658 has finished for PR 17308 at commit
@tdas ping!
I can further confirm from the logs that a KafkaProducer instance is created almost every instant.
Just curious: is it a good idea to key the producer map by the hash code of a map whose values are Objects?
It is not a good idea to do it like that; I would like my understanding to be corrected. As far as I understand, in this particular case Spark does not let the user specify a key or value serializer/deserializer, so the Object can only be a String, an int, or a Long, and for those hashCode works correctly. I am also contemplating a better way to do it now.
True, my bad; I thought KafkaSink was a public API.
Yeah, I don't think this is a good key for the hashmap; there could be collisions. We should either assign a unique ID to the sink and thread that through, or come up with some way to canonicalize the set of parameters that create the sink. The latter might be better, since you could maybe reuse the same producer for more than one query.
Just to throw in my two cents, a change like this is definitely needed, as is made clear by the second sentence of the docs (http://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html): "The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances."
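To make the point concrete, here is a minimal sketch of the caching idea under discussion (the object and key names are illustrative, not the PR's actual code): one producer per distinct configuration, shared across batches and threads.

```scala
import java.{util => ju}

import scala.collection.JavaConverters._
import scala.collection.concurrent.TrieMap

import org.apache.kafka.clients.producer.KafkaProducer

object SharedProducerSketch {
  type Producer = KafkaProducer[Array[Byte], Array[Byte]]

  // One producer per distinct configuration, reused across batches.
  private val producers = new TrieMap[Seq[(String, Object)], Producer]()

  def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = {
    // Sort entries so logically equal configs map to the same key.
    val key = kafkaParams.asScala.toSeq.sortBy(_._1)
    // KafkaProducer is thread safe, so sharing one instance across
    // writer threads is fine per the Kafka javadocs quoted above.
    producers.getOrElseUpdate(key, new KafkaProducer[Array[Byte], Array[Byte]](kafkaParams))
  }
}
```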
Taking a look.
marmbrus left a comment
I agree this is an important optimization we need to do. I have some concerns about the life cycle of the producer as it's implemented here.
This is only closing the producer on the driver, right? Do we even create one there?
That's correct. As I now understand it, close requires a bit of rethinking; I don't see a straightforward way of doing it. If you agree, the close-related implementation can be taken out of this PR and taken up in a separate JIRA and PR?
nit: Typically we wrap the arguments rather than the return type
uber nit: maybe we can define a type alias

```scala
type Producer = KafkaProducer[Array[Byte], Array[Byte]]
```

so that we don't have to write that whole thing over and over.
maybe put this in the create function instead
don't need the s and $ here, right?
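For context, this nit is about Scala string interpolation; a toy illustration (the messages here are made up, not the PR's log lines):

```scala
object InterpolationNit extends App {
  val paramsSeq = Seq("bootstrap.servers" -> "localhost:9092")
  println(s"Creating producer for $paramsSeq") // interpolates: s and $ needed
  println("Creating producer")                 // plain literal: no s or $ needed
}
```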
Force-pushed from 932d563 to 2a15afe.
Test build #76868 has finished for PR 17308 at commit
Test build #76869 has finished for PR 17308 at commit
Test build #76872 has finished for PR 17308 at commit
[SPARK-19968][SS] Use a cached instance of KafkaProducer instead of creating one every batch.
Force-pushed from ea9592a to 8224596.
Test build #76890 has finished for PR 17308 at commit
SPARK-20737 has been created to look into the cleanup mechanism in a separate JIRA.
Test build #76932 has finished for PR 17308 at commit
…inks on executor shutdown. Add a standard way of cleanup during shutdown of executors for structured streaming sinks in general and KafkaSink in particular.
```scala
def close(sc: SparkContext, kafkaParams: ju.Map[String, Object]): Unit = {
  sc.parallelize(1 to 10000).foreachPartition { iter =>
    CachedKafkaProducer.close(kafkaParams)
  }
}
```
This would cause CachedKafkaProducer.close to be executed on each executor. I am thinking of a better way here.
Any help would be appreciated.
AFAIK the KafkaSource also faces the same issue of not being able to close consumers. Can we use a guava cache with a (configurable) timeout? I guess that's the safest way to make sure that they'll eventually get closed.
Using a Guava cache, we can close producers that have not been used for a certain time. Shall we skip closing them during a shutdown?
In the particular case of the Kafka producer, I do not see a direct problem with that, since we do a producer.flush() on each batch. I was just wondering, for streaming sinks in general, what should our strategy be?
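A sketch of the Guava-cache approach being discussed (the key type, timeout value, and object name are assumptions): entries unused for the timeout are evicted, and a removal listener closes the evicted producer.

```scala
import java.util.concurrent.TimeUnit

import com.google.common.cache.{Cache, CacheBuilder, RemovalListener, RemovalNotification}
import org.apache.kafka.clients.producer.KafkaProducer

object ExpiringProducerCacheSketch {
  type Producer = KafkaProducer[Array[Byte], Array[Byte]]

  // Close producers as they are evicted; flush() per batch means
  // nothing should be pending when this runs.
  private val removalListener = new RemovalListener[Seq[(String, Object)], Producer] {
    override def onRemoval(n: RemovalNotification[Seq[(String, Object)], Producer]): Unit = {
      Option(n.getValue).foreach(_.close())
    }
  }

  private val cache: Cache[Seq[(String, Object)], Producer] = CacheBuilder.newBuilder()
    .expireAfterAccess(10, TimeUnit.MINUTES) // assumed default; would be configurable
    .removalListener(removalListener)
    .build[Seq[(String, Object)], Producer]()
}
```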
Test build #77293 has finished for PR 17308 at commit
```scala
private val guavaCache: Cache[String, Producer] = CacheBuilder.newBuilder()
  .recordStats()
```
Do we use the stats?
```scala
import java.{util => ju}

import org.apache.kafka.common.serialization.ByteArraySerializer
import org.scalatest.PrivateMethodTester
```
Do we use this import?
Ah, an oversight. Thanks!
```diff
 private val cacheExpireTimeout: Long =
-  System.getProperty("spark.kafka.guava.cache.timeout", "10").toLong
+  System.getProperty("spark.kafka.guava.cache.timeout.minutes", "10").toLong
```
Don't we need to get this from SparkEnv, by the way? I don't know if the properties get populated properly.
Also, adding minutes to the conf name makes it kinda long, right? I think we can also replace guava with producer.
I think it may also be better to use this, so that we get rid of minutes and users can actually provide arbitrary durations (hours if they want). I think that's what we generally use for duration-type confs.
Thanks, you are right!
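If I read the suggestion right, the idea is to parse the value as a duration string rather than bake the unit into the conf name; a sketch using Spark's JavaUtils (the conf name here is an assumption):

```scala
import org.apache.spark.network.util.JavaUtils

object CacheTimeoutConfSketch {
  // "10m", "2h", "600s" all parse; the result is milliseconds.
  private val cacheExpireTimeoutMs: Long =
    JavaUtils.timeStringAsMs(System.getProperty("spark.kafka.producer.cache.timeout", "10m"))
}
```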
Test build #77297 has finished for PR 17308 at commit
```scala
  .build[String, Producer]()

ShutdownHookManager.addShutdownHook { () =>
  clear()
}
```
Do we really need to stop producers in a shutdown hook? I'm asking because stopping a producer is a blocking call and may prevent other shutdown hooks from running.
+1, this seems complicated. What exactly does shutdown do? Is it just cleaning up thread pools?
I think it will close connections as well. That's really not necessary since the process is being shut down.
marmbrus left a comment
Thanks for working on this. It's a huge latency improvement and I'll be using it next week at Spark Summit.
I just have one suggestion about the cache implementation.
```scala
private def createKafkaProducer(
    producerConfiguration: ju.Map[String, Object]): Producer = {
```
nit: indent 4 here
```scala
 * not exist. This is done to ensure we have only one set of kafka parameters associated with a
 * unique ID.
 */
private[kafka010] object CanonicalizeKafkaParams extends Logging {
```
This seems kind of complicated also, since we know these are always coming from Data[Stream/Frame]Writer, which will always give you a Map[String, String] (and we expect the number of options here to be small). Could we just make the key for the cache a sorted Seq[(String, String)] rather than invent another GUID?
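A sketch of that suggestion: derive the cache key by sorting the small options map, so two logically identical configurations hit the same entry without minting an ID (the helper name is illustrative):

```scala
import java.{util => ju}

import scala.collection.JavaConverters._

object ParamsKeySketch extends App {
  // Sorting makes the key independent of the map's iteration order.
  def paramsToKey(kafkaParams: ju.Map[String, Object]): Seq[(String, Object)] =
    kafkaParams.asScala.toSeq.sortBy(_._1)

  val a = Map[String, Object]("acks" -> "1", "bootstrap.servers" -> "h:9092").asJava
  val b = Map[String, Object]("bootstrap.servers" -> "h:9092", "acks" -> "1").asJava
  assert(paramsToKey(a) == paramsToKey(b)) // structural equality, unlike a GUID per sink
}
```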
Test build #77357 has finished for PR 17308 at commit
Jenkins, retest this please!
Test build #77358 has finished for PR 17308 at commit
The build is failing due to "Our attempt to download sbt locally to build/sbt-launch-0.13.13.jar failed. Please install sbt manually from http://www.scala-sbt.org/".
Jenkins, retest this please!
Test build #77410 has finished for PR 17308 at commit
@marmbrus Thank you for taking a look again. Surely, a shutdown hook is not ideal for closing Kafka producers; in fact, for the Kafka sink it might be correct to skip the cleanup step. I have tried to address your comments.
```scala
  Option(guavaCache.getIfPresent(paramsSeq)).getOrElse(createKafkaProducer(kafkaParams))
}

def paramsToSeq(kafkaParams: ju.Map[String, Object]): Seq[(String, Object)] = {
```
nit: seems this can be private?
```scala
CachedKafkaProducer.close(kafkaParams)
val map2 = CachedKafkaProducer.invokePrivate(cacheMap())
assert(map2.size == 1)
```
This assert only tells us there is one KafkaProducer left. Seems we should also verify that we closed the correct one?
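One way to tighten the test, as a sketch (kafkaParams1/kafkaParams2 are hypothetical param maps; cacheMap and invokePrivate are as used above): hold on to both producers and assert on identity rather than size alone.

```scala
import scala.collection.JavaConverters._

val producer1 = CachedKafkaProducer.getOrCreate(kafkaParams1)
val producer2 = CachedKafkaProducer.getOrCreate(kafkaParams2)
CachedKafkaProducer.close(kafkaParams1)
val map2 = CachedKafkaProducer.invokePrivate(cacheMap())
assert(map2.size == 1)
assert(map2.values.asScala.toSeq == Seq(producer2)) // producer1 was the one closed
```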
```scala
def paramsToSeq(kafkaParams: ju.Map[String, Object]): Seq[(String, Object)] = {
  val paramsSeq: Seq[(String, Object)] =
    kafkaParams.asScala.toSeq.sortBy(x => (x._1, x._2.toString))
```
nit: as it is a map, seems we can just sort by x._1?
LGTM with a few minor comments.
zsxwing left a comment
I suggest using LoadingCache to simplify the code. Otherwise, looks good.
```scala
checkForErrors
producer.close()
checkForErrors
producer = null
```
nit: please keep producer = null for double-close
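The pattern being asked for, sketched as a fragment (Producer and checkForErrors come from the surrounding class in the snippet above):

```scala
private var producer: Producer = _

def close(): Unit = synchronized {
  if (producer != null) {
    checkForErrors
    producer.close()
    checkForErrors
    producer = null // guard: a second close() becomes a no-op
  }
}
```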
```scala
 */
private[kafka010] def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = synchronized {
  val paramsSeq: Seq[(String, Object)] = paramsToSeq(kafkaParams)
  Option(guavaCache.getIfPresent(paramsSeq)).getOrElse(createKafkaProducer(kafkaParams))
}
```
Remove synchronized and also throw the inner exception instead, after changing to use LoadingCache, such as:

```scala
private[kafka010] def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = {
  val paramsSeq: Seq[(String, Object)] = paramsToSeq(kafkaParams)
  try {
    guavaCache.get(paramsSeq)
  } catch {
    case e @ (_: ExecutionException | _: UncheckedExecutionException | _: ExecutionError)
        if e.getCause != null =>
      throw e.getCause
  }
}
```
```scala
private lazy val guavaCache: Cache[Seq[(String, Object)], Producer] = CacheBuilder.newBuilder()
  .expireAfterAccess(cacheExpireTimeout, TimeUnit.MILLISECONDS)
  .removalListener(removalListener)
  .build[Seq[(String, Object)], Producer]()
```
nit: Use build(CacheLoader<? super K1, V1> loader) to use LoadingCache, then getOrCreate will be very simple.
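A sketch of the LoadingCache variant (types follow the snippet above; createKafkaProducer stands in for the existing factory method): the loader builds the producer on a cache miss, so getOrCreate collapses to a single get.

```scala
import java.{util => ju}
import java.util.concurrent.TimeUnit

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

private val cacheLoader = new CacheLoader[Seq[(String, Object)], Producer] {
  override def load(params: Seq[(String, Object)]): Producer = {
    // Rebuild the ju.Map that the producer factory expects from the key.
    val configMap = new ju.HashMap[String, Object]()
    params.foreach { case (k, v) => configMap.put(k, v) }
    createKafkaProducer(configMap)
  }
}

private lazy val guavaCache: LoadingCache[Seq[(String, Object)], Producer] =
  CacheBuilder.newBuilder()
    .expireAfterAccess(cacheExpireTimeout, TimeUnit.MILLISECONDS)
    .removalListener(removalListener)
    .build[Seq[(String, Object)], Producer](cacheLoader)
```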
Test build #77499 has finished for PR 17308 at commit
LGTM. Merging to master and 2.2. Thanks!
Author: Prashant Sharma <[email protected]>
Closes #17308 from ScrapCodes/cached-kafka-producer.
(cherry picked from commit 96a4d1d)
Signed-off-by: Shixiong Zhu <[email protected]>
What changes were proposed in this pull request?
In summary, the cost of recreating a KafkaProducer for writing every batch is high, as it starts a lot of threads, makes connections, and then closes them. A KafkaProducer instance is promised to be thread safe in the Kafka docs, and reuse of a KafkaProducer instance while writing via multiple threads is encouraged.
Furthermore, I have measured a performance improvement of 10x in latency with this patch.
Times that addBatch took (in ms) without applying this patch: [chart]
Times that addBatch took (in ms) after applying this patch: [chart]
How was this patch tested?
Running distributed benchmarks comparing runs with this patch and without it.
Added relevant unit tests.