Skip to content

Conversation

@ScrapCodes
Copy link
Member

@ScrapCodes ScrapCodes commented Mar 16, 2017

What changes were proposed in this pull request?

In summary, cost of recreating a KafkaProducer for writing every batch is high as it starts a lot threads and make connections and then closes them. A KafkaProducer instance is promised to be thread safe in Kafka docs. Reuse of KafkaProducer instance while writing via multiple threads is encouraged.

Furthermore, I have performance improvement of 10x in latency, with this patch.

These are times that addBatch took in ms. Without applying this patch

with-out_patch

These are times that addBatch took in ms. After applying this patch

with_patch

How was this patch tested?

Running distributed benchmarks comparing runs with this patch and without it.
Added relevant unit tests.

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74644 has started for PR 17308 at commit febf387.

Copy link
Member Author

@ScrapCodes ScrapCodes Mar 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this should not have been changed. And any implementation of java.util.AbstractMap, has the correct working for hashCode().

Here, they are changed to HashMap, to avoid converting or casting them later. It can actually be just java.util.Map, but then we can not guarantee the outcome of hashCode().

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74646 has finished for PR 17308 at commit 830d76d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74658 has finished for PR 17308 at commit 0b5e2e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CachedKafkaProducerSuite extends SharedSQLContext with PrivateMethodTester

@ScrapCodes
Copy link
Member Author

Please take a look, @tcondie @zsxwing !

@ScrapCodes
Copy link
Member Author

@tcondie and @zsxwing any comments on this patch. I would be happy, if this bug is fixed before 2.2 is released.

@ScrapCodes
Copy link
Member Author

@tdas ping !

@ScrapCodes
Copy link
Member Author

I can further confirm, that in logs, a kafkaproducer instance is created almost every instant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious if this a good idea to key the producer map by the hash code of a map for which the values are Objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a good idea to do like that.

I had like my understanding to be corrected, as much as I understood. Since in this particular case Spark does not let user specify a key or value serializer/deserializer. So Object can be either a String, int or Long and for these hashcode would work correctly. I am also contemplating a better way to do it, now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, my bad I thought KafkaSink was a public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think this is a good key for the hashmap. There could be collisions. We should either assign a unique ID to the sink and thread that through, or come up with some way to canoncicalize the set of parameters that create the sink. The latter might better since you could maybe reuse the same producer for more than one query.

@koeninger
Copy link
Contributor

Just to throw in my two cents, a change like this is definitely needed, as is made clear by the second sentence of the docs

http://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html

"The producer is thread safe and sharing a single producer instance across threads will generally be faster than having multiple instances."

@koeninger
Copy link
Contributor

@marmbrus @zsxwing @tdas This needs attention from someone

@brkyvz
Copy link
Contributor

brkyvz commented May 4, 2017

Taking a look

Copy link
Contributor

@marmbrus marmbrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is an important optimization we need to do. I have some concerns about the life cycle of the producer as its implemented here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think this is a good key for the hashmap. There could be collisions. We should either assign a unique ID to the sink and thread that through, or come up with some way to canoncicalize the set of parameters that create the sink. The latter might better since you could maybe reuse the same producer for more than one query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only closing the producer on the driver, right? Do we even create on there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, I have understood, close requires a bit of rethinking, I am unable to see a straight forward way of doing it. If you agree, close related implementation can be taken out from this PR and be taken up in a separate JIRA and PR ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Typically we wrap the arguments rather than the return type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uber nit: maybe we can define a type alias

type Producer = KafkaProducer[Array[Byte], Array[Byte]]

so that we don't have to write that whole thing over and over

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this in the create function instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the s and $ right?

@ScrapCodes
Copy link
Member Author

Hi @marmbrus and @brkyvz, Thanks a lot of taking a look.

@marmbrus You are right, we should have another way to canonicalize kafka params. I can only think of appending a unique id to kafka params and somehow ensuring a particular set of params get the same uid everytime.

@ScrapCodes ScrapCodes force-pushed the cached-kafka-producer branch 2 times, most recently from 932d563 to 2a15afe Compare May 12, 2017 12:42
@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76868 has finished for PR 17308 at commit 932d563.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76869 has finished for PR 17308 at commit 2a15afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76872 has finished for PR 17308 at commit ea9592a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes ScrapCodes force-pushed the cached-kafka-producer branch from ea9592a to 8224596 Compare May 13, 2017 05:37
@SparkQA
Copy link

SparkQA commented May 13, 2017

Test build #76890 has finished for PR 17308 at commit 8224596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes
Copy link
Member Author

SPARK-20737 is created to look into cleanup mechanism in a separate JIRA.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76932 has finished for PR 17308 at commit e07e77e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…inks on executor shutdown.

Add a standard way of cleanup during shutdown of executors for structured streaming sinks in general and KafkaSink in particular.
@ScrapCodes ScrapCodes changed the title [SPARK-19968][SS] Use a cached instance of KafkaProducer instead of creating one every batch. [SPARK-19968][SPARK-20737][SS] Use a cached instance of KafkaProducer instead of creating one every batch. May 17, 2017
def close(sc: SparkContext, kafkaParams: ju.Map[String, Object]): Unit = {
sc.parallelize(1 to 10000).foreachPartition { iter =>
CachedKafkaProducer.close(kafkaParams)
}
Copy link
Member Author

@ScrapCodes ScrapCodes May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would cause CachedKafkaProducer.close to be executed on each executor. I am thinking of a better way here.
Any help would be appreciated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the KafkaSource also faces the same issue of not being able to close consumers. Can we use a guava cache with a (configurable) timeout? I guess that's the safest way to make sure that they'll eventually get closed.

Copy link
Member Author

@ScrapCodes ScrapCodes May 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using guave cache, we can close if not used for a certain time. Shall we ignore closing them during a shutdown ?
In the particular case of kafka producer, I do not see a direct problem with that. Since we do a producer.flush() on each batch. I was just wondering, with streaming sinks in general - what should be our strategy ?

@SparkQA
Copy link

SparkQA commented May 24, 2017

Test build #77293 has finished for PR 17308 at commit 039d063.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

private val guavaCache: Cache[String, Producer] = CacheBuilder.newBuilder()
.recordStats()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use the stats?

import java.{util => ju}

import org.apache.kafka.common.serialization.ByteArraySerializer
import org.scalatest.PrivateMethodTester
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this import?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, oversight. Thanks !


private val cacheExpireTimeout: Long =
System.getProperty("spark.kafka.guava.cache.timeout", "10").toLong
System.getProperty("spark.kafka.guava.cache.timeout.minutes", "10").toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to get this from SparkEnv by the way? I don't know if the properties get populated properly.
Also, adding minutes to the conf makes it kinda long right? I think we can also replace guava with producer.
I think it may also be better to use this so that we get rid of minutes and users can actually provide arbitrary durations (hours if they want). I think that's what we generally use for duration type confs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you are right !

@SparkQA
Copy link

SparkQA commented May 24, 2017

Test build #77297 has finished for PR 17308 at commit 1c9f892.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Copy link
Contributor

brkyvz commented May 24, 2017

LGTM! @marmbrus @viirya do you have any more feedback?

.build[String, Producer]()

ShutdownHookManager.addShutdownHook { () =>
clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to stop producers in a shutdown hook? I'm asking because stopping a producer is a blocking call and may prevent other shutdown hooks to run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this seems complicated. What exactly does shutdown do? Is it just cleaning up thread pools?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will close connections as well. That's really not necessary since the process is being shut down.

Copy link
Contributor

@marmbrus marmbrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Its a huge latency improvement and I'll be using it next week at Spark Summit.

I just have one suggestion about the cache implementation.

}

private def createKafkaProducer(
producerConfiguration: ju.Map[String, Object]): Producer = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent 4 here

* not exist. This is done to ensure, we have only one set of kafka parameters associated with a
* unique ID.
*/
private[kafka010] object CanonicalizeKafkaParams extends Logging {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems kind of complicated also. Since we know these are always coming from Data[Stream/Frame]Writer and that will always give you Map[String, String] (and we expect the number of options here to be small). Could we just make the key for the cache a sorted Seq[(String, String)] rather than invent another GUID?

@ScrapCodes ScrapCodes changed the title [SPARK-19968][SPARK-20737][SS] Use a cached instance of KafkaProducer instead of creating one every batch. [SPARK-19968][SS] Use a cached instance of KafkaProducer instead of creating one every batch. May 25, 2017
@SparkQA
Copy link

SparkQA commented May 25, 2017

Test build #77357 has finished for PR 17308 at commit 588fa03.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes
Copy link
Member Author

Jenkins, retest this please !

@SparkQA
Copy link

SparkQA commented May 25, 2017

Test build #77358 has finished for PR 17308 at commit 588fa03.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes
Copy link
Member Author

Build is failing due to "Our attempt to download sbt locally to build/sbt-launch-0.13.13.jar failed. Please install sbt manually from http://www.scala-sbt.org/"

@ScrapCodes
Copy link
Member Author

Jenkins, retest this please !

@SparkQA
Copy link

SparkQA commented May 26, 2017

Test build #77410 has finished for PR 17308 at commit 588fa03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes
Copy link
Member Author

@marmbrus Thank you for taking a look again. Surely, shut down hook is not ideal for closing kafka producers. In fact, for the case of kafka sink, it might be correct to skip cleanup step. I have tried to address your comments.

Option(guavaCache.getIfPresent(paramsSeq)).getOrElse(createKafkaProducer(kafkaParams))
}

def paramsToSeq(kafkaParams: ju.Map[String, Object]): Seq[(String, Object)] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems can be private?


CachedKafkaProducer.close(kafkaParams)
val map2 = CachedKafkaProducer.invokePrivate(cacheMap())
assert(map2.size == 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just know there is one KP by this assert. Seems we should also verify if we close the correct KP?


def paramsToSeq(kafkaParams: ju.Map[String, Object]): Seq[(String, Object)] = {
val paramsSeq: Seq[(String, Object)] =
kafkaParams.asScala.toSeq.sortBy(x => (x._1, x._2.toString))
Copy link
Member

@viirya viirya May 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as it is a map, seems we can just sort by x._1?

@viirya
Copy link
Member

viirya commented May 26, 2017

LGTM and few minor comments.

Copy link
Member

@zsxwing zsxwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using LoadingCache to simplify the codes. Otherwise, looks good.

checkForErrors
producer.close()
checkForErrors
producer = null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please keep producer = null for double-close

*/
private[kafka010] def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = synchronized {
val paramsSeq: Seq[(String, Object)] = paramsToSeq(kafkaParams)
Option(guavaCache.getIfPresent(paramsSeq)).getOrElse(createKafkaProducer(kafkaParams))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove synchronized and also throw inner exception instead after changing to use LoadingCache, such as

  private[kafka010] def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = {
    val paramsSeq: Seq[(String, Object)] = paramsToSeq(kafkaParams)
    try {
      guavaCache.get(paramsSeq)
    } catch {
      case e @ (_: ExecutionException | _: UncheckedExecutionException | _: ExecutionError)
        if e.getCause != null =>
          throw e.getCause
    }
  }

private lazy val guavaCache: Cache[Seq[(String, Object)], Producer] = CacheBuilder.newBuilder()
.expireAfterAccess(cacheExpireTimeout, TimeUnit.MILLISECONDS)
.removalListener(removalListener)
.build[Seq[(String, Object)], Producer]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use build(CacheLoader<? super K1, V1> loader) to use LoadingCache, then getOrCreate will be very simple.

@ScrapCodes
Copy link
Member Author

Thanks @viirya and @zsxwing. I have tried to address you comments.

@SparkQA
Copy link

SparkQA commented May 29, 2017

Test build #77499 has finished for PR 17308 at commit a10276a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member

zsxwing commented May 30, 2017

LGTM. Merging to master and 2.2. Thanks!

asfgit pushed a commit that referenced this pull request May 30, 2017
… creating one every batch.

## What changes were proposed in this pull request?

In summary, cost of recreating a KafkaProducer for writing every batch is high as it starts a lot threads and make connections and then closes them. A KafkaProducer instance is promised to be thread safe in Kafka docs. Reuse of KafkaProducer instance while writing via multiple threads is encouraged.

Furthermore, I have performance improvement of 10x in latency, with this patch.

### These are times that addBatch took in ms. Without applying this patch
![with-out_patch](https://cloud.githubusercontent.com/assets/992952/23994612/a9de4a42-0a6b-11e7-9d5b-7ae18775bee4.png)
### These are times that addBatch took in ms. After applying this patch
![with_patch](https://cloud.githubusercontent.com/assets/992952/23994616/ad8c11ec-0a6b-11e7-8634-2266ebb5033f.png)

## How was this patch tested?
Running distributed benchmarks comparing runs with this patch and without it.
Added relevant unit tests.

Author: Prashant Sharma <[email protected]>

Closes #17308 from ScrapCodes/cached-kafka-producer.

(cherry picked from commit 96a4d1d)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 96a4d1d May 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants