> @slaunay
> with my local laptop (macOS 10.15.7), I think I reproduced the hang problem again :). I suppose your fix does work and covers many scenarios, which makes the hang problem much harder to trigger.
Thanks for providing those details.
I am not sure if this is linked to the current issue (deadlock regression) but something is not right indeed and we should probably create another issue for that particular scenario.
Here are some things I found:
- the `brokerProducer` is blocked sending a success:

```
3 @ 0x1038d16 0x1006925 0x10064dd 0x1301574 0x130038a 0x1326142 0x12ff72f 0x12ff65f 0x12fe85f 0x132e9de 0x10688a1
# 0x1301573 github.com/Shopify/sarama.(*asyncProducer).returnSuccesses+0xb3 /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/async_producer.go:1155
# 0x1300389 github.com/Shopify/sarama.(*brokerProducer).handleSuccess.func1+0x469 /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/async_producer.go:974
# 0x1326141 github.com/Shopify/sarama.(*produceSet).eachPartition+0x101 /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/produce_set.go:211
# 0x12ff72e github.com/Shopify/sarama.(*brokerProducer).handleSuccess+0x8e /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/async_producer.go:950
# 0x12ff65e github.com/Shopify/sarama.(*brokerProducer).handleResponse+0x3e /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/async_producer.go:938
# 0x12fe85e github.com/Shopify/sarama.(*brokerProducer).run+0x1be /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/async_producer.go:872
# 0x132e9dd github.com/Shopify/sarama.withRecover+0x3d /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/utils.go:43
```
- the `syncProducer` successes goroutine is blocked forwarding a success (and therefore blocking the `brokerProducer`):

```
1 @ 0x1038d16 0x10065f6 0x10064dd 0x132da0d 0x132e9de 0x10688a1
# 0x132da0c github.com/Shopify/sarama.(*syncProducer).handleSuccesses+0x8c /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/sync_producer.go:132
# 0x132e9dd github.com/Shopify/sarama.withRecover+0x3d /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/utils.go:43
```
- producer goroutines are blocked waiting for the current success/error:

```
8 @ 0x1038d16 0x10077cc 0x10071f8 0x132d910 0x1334d50 0x1334a0c 0x10688a1
# 0x132d90f github.com/Shopify/sarama.(*syncProducer).SendMessage+0x8f /Users/tiger/go/pkg/mod/github.com/slaunay/[email protected]/sync_producer.go:96
# 0x1334d4f main.(*SyncProducer).SendMessage+0x22f /Users/tiger/Downloads/working/kafka-cluster/producer/producer.go:41
# 0x1334a0b main.main.func3+0x1cb /Users/tiger/Downloads/working/kafka-cluster/producer/main.go:59
```
Now what is really interesting is that the `expectation` field on a `ProducerMessage` used by the `syncProducer` is a channel that is always buffered with a capacity of 1.
So in theory it should never block the `syncProducer` successes goroutine, yet that seems to be what is happening.
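To illustrate why the capacity-1 buffer should make this safe, here is a minimal sketch (the element type and helper name are illustrative, not sarama's actual API): a single send on a buffered channel of capacity 1 completes immediately even when no receiver is ready.

```go
package main

import "fmt"

// newExpectation sketches how a per-message expectation channel could be
// allocated: buffered with capacity 1, so the single success/error send
// from the successes goroutine never blocks.
func newExpectation() chan error {
	return make(chan error, 1)
}

func main() {
	expectation := newExpectation()

	// A non-blocking send proves the buffer has room with no receiver.
	select {
	case expectation <- nil:
		fmt.Println("buffered send succeeded without a receiver")
	default:
		fmt.Println("send would have blocked")
	}
}
```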
The null-key, null-value record you see in the topic makes me think that:
- a mostly "empty" `ProducerMessage` ends up being sent to the remote broker
- the same message, with a `nil` `expectation` field, ends up being sent as a success to the `syncProducer` successes goroutine
- sending to a `nil` channel blocks forever, therefore blocking the `brokerProducer` and preventing more records from being produced
Now, such an "empty" `ProducerMessage` used by the `AsyncProducer` can be:
- a `shutdown` message, but those do not traverse the `dispatcher`:
  https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/async_producer.go#L337-L340
- a `syn` message, but those should not end up in a `produceSet`:
  https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/async_producer.go#L818-L827
- a `fin` message, and those could end up in a `produceSet` somehow I suppose:
  https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/async_producer.go#L829-L861
As `fin` messages are used during retries, they might be the root cause of the hang if somehow they escape the `AsyncProducer` and end up on the broker and in a success channel.
It would be great to confirm this is the case and ideally have a simple unit test for that scenario.
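Until such a test exists, one defensive shape for the forwarding path could look like the following. This is purely a hypothetical sketch (the `forwardSuccess` helper and `chan error` type are mine, not sarama's code): instead of blocking forever on a `nil` `expectation`, the successes goroutine could detect and drop the leaked internal message.

```go
package main

import "fmt"

// forwardSuccess is a hypothetical guard for the successes goroutine:
// if an internal message (e.g. a leaked fin) arrives with a nil
// expectation channel, drop it instead of deadlocking on the send.
// It reports whether the success was actually delivered.
func forwardSuccess(expectation chan error) bool {
	if expectation == nil {
		return false // leaked internal message: drop, do not block
	}
	expectation <- nil // capacity-1 buffer: never blocks for one send
	return true
}

func main() {
	fmt.Println(forwardSuccess(make(chan error, 1))) // delivered
	fmt.Println(forwardSuccess(nil))                 // dropped, no deadlock
}
```

The real fix is presumably to stop `fin` messages from reaching a `produceSet` at all; the guard above would only make the failure visible instead of a silent hang.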
If you reduce `connections.max.idle.ms` (broker side) and `Net.ReadTimeout` (client side) you might be able to reproduce it faster.
@hxiaodon Would you mind creating another issue?
Originally posted by @slaunay in #2133 (comment)