Faster OneToOneConcurrentArrayQueue #360
I collected perf data for the old queue and for the new queue. The number of instructions retired is clearly a lot higher with the new queue: the FasterOneToOneConcurrentArrayQueueBenchmark is retiring 4 instructions per cycle, while the old queue was doing 0.47 instructions per cycle. So the old queue is mostly stalling, while the new queue is reaching peak efficiency. If we normalize the number of cache misses per 1000 cycles, the old queue is running into almost 20x more local cache misses than the new queue. This is caused by contention: a different core owns the cache line.
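For reference, the two derived metrics mentioned above (instructions per cycle and events per 1000 cycles) can be computed as in the minimal sketch below. The class name, method names, and counter values are made up for illustration; they are not the measurements from this PR.

```java
public final class PerfNormalization {

    /** Instructions retired per CPU cycle (IPC). */
    public static double instructionsPerCycle(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    /** Number of events (for example cache misses) per 1000 CPU cycles. */
    public static double perThousandCycles(long events, long cycles) {
        return events * 1000.0 / cycles;
    }

    public static void main(String[] args) {
        // Hypothetical counter values, NOT taken from the perf runs in this PR.
        long cycles = 2_000_000L;
        long instructions = 940_000L;
        long cacheMisses = 5_000L;

        System.out.println(instructionsPerCycle(instructions, cycles)); // prints 0.47
        System.out.println(perThousandCycles(cacheMisses, cycles));     // prints 2.5
    }
}
```

Normalizing by cycles makes runs of different lengths comparable, which is why the comparison is given per 1000 cycles rather than as raw counts.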
        throw new NullPointerException("Null is not a valid element");
    }

    final long currentTail = UnsafeApi.getLongOpaque(this, TAIL_OFFSET);
The only point of interaction with mutable parts of the queue is a slot in the buffer.
For the exchange of the item in the queue, there is a proper release store performed by the producer, and an acquire load performed by the consumer.
For the exchange of the null slot, there is just an opaque store and an opaque load because the surrounding loads/stores do not need to be ordered.
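The ordering described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: it uses `VarHandle` instead of `UnsafeApi`, keeps `head` and `tail` as plain thread-private fields, and the class name `SlotOnlySpscQueue` is hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

/** Illustrative single-producer/single-consumer queue; hypothetical, not Agrona's class. */
public final class SlotOnlySpscQueue<E> {
    private static final VarHandle SLOT = MethodHandles.arrayElementVarHandle(Object[].class);

    private final Object[] buffer;
    private final int mask;
    private long tail; // touched only by the producer thread
    private long head; // touched only by the consumer thread

    public SlotOnlySpscQueue(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        buffer = new Object[capacity];
        mask = capacity - 1;
    }

    /** Producer side: only looks at the slot it wants to publish to. */
    public boolean offer(E e) {
        if (e == null) {
            throw new NullPointerException("Null is not a valid element");
        }
        int index = (int) (tail & mask);
        // Opaque load: we only need to know whether the consumer has cleared the slot.
        if (SLOT.getOpaque(buffer, index) != null) {
            return false; // slot still occupied, queue is full
        }
        // Release store: publishes the element together with its prior initialization.
        SLOT.setRelease(buffer, index, e);
        tail++;
        return true;
    }

    /** Consumer side: only looks at the slot it wants to take from. */
    @SuppressWarnings("unchecked")
    public E poll() {
        int index = (int) (head & mask);
        // Acquire load: pairs with the producer's release store.
        Object e = SLOT.getAcquire(buffer, index);
        if (e == null) {
            return null; // nothing published yet
        }
        // Opaque store: hands the empty slot back to the producer.
        SLOT.setOpaque(buffer, index, (Object) null);
        head++;
        return (E) e;
    }
}
```

In this sketch the producer never reads `head` and the consumer never reads `tail`, so the buffer slot is the only memory location both threads touch.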
Can't believe...
The miss on the head (contention) is also visible when running JMH with '-prof perfasm' and looking at the generated assembly. In the relevant section, the head is first loaded and then incremented, and the increment needs to wait for the load of the head to complete.
In Akka/Pekko, there is an SPSC queue too. I think we can leverage this optimization there as well; it looks like a good fit for Akka/Pekko streams, where the buffer always has a specified size.
Places where the OneToOneConcurrentArrayQueue is used:
aae955e to 1c5a4f4
I have added some JCStress tests to ensure that the queue implementation is correct. More tests will follow.
1c5a4f4 to e1a6fb8
The key change is dropping the head cache. The only thing the producer needs to know is whether the slot it needs to publish to is empty. It never needs to know the current head, and therefore there will never be any contention on the head.
As long as the producer and consumer are active on different parts of the queue, there won't be any contention with the new queue. With the old queue, there will always be contention when the producer depletes the head cache and needs to get the next batch.
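To make the contention point concrete, here is a simplified sketch of a head-cache style queue. It is an illustration of the general technique, not the actual code of the old OneToOneConcurrentArrayQueue; the class name and the `headReloads` counter are hypothetical and exist only to show when the producer has to read the consumer's `head`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Illustrative head-cache SPSC queue; hypothetical, not Agrona's implementation. */
public final class HeadCacheSpscQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int capacity;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // written by the consumer
    private final AtomicLong tail = new AtomicLong(); // written by the producer
    private long headCache;   // producer-private snapshot of head
    private long headReloads; // producer-private counter, for illustration only

    public HeadCacheSpscQueue(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        this.capacity = capacity;
        this.mask = capacity - 1;
        this.buffer = new AtomicReferenceArray<>(capacity);
    }

    public boolean offer(E e) {
        if (e == null) {
            throw new NullPointerException("Null is not a valid element");
        }
        long currentTail = tail.get();
        if (currentTail - headCache >= capacity) {
            // The cached head is depleted: read the real head, which lives on a
            // cache line that the consumer core keeps modifying.
            headCache = head.get();
            headReloads++;
            if (currentTail - headCache >= capacity) {
                return false; // queue is really full
            }
        }
        buffer.lazySet((int) (currentTail & mask), e); // release store of the element
        tail.lazySet(currentTail + 1);
        return true;
    }

    public E poll() {
        long currentHead = head.get();
        int index = (int) (currentHead & mask);
        E e = buffer.get(index);
        if (e == null) {
            return null;
        }
        buffer.lazySet(index, null);
        head.lazySet(currentHead + 1); // this store is what the producer later has to fetch
        return e;
    }

    /** Number of times the producer had to read the shared head. */
    public long headReloads() {
        return headReloads;
    }
}
```

Each increment of `headReloads` corresponds to the producer reading a value the consumer has been writing, which is the access that can miss in the local cache.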
Apart from dropping the head cache, I also minimized the memory ordering.
Note: I didn't change the code of the original queue to make it easier to benchmark the different implementations.
Initial JMH results (throughput mode)
I'll post more performance data tomorrow including perf results.
The performance of the old queue is very unreliable. It starts with high throughput and then, during the warmup iterations, it starts to collapse. The new queue doesn't suffer from this problem.
Although the OneToOneConcurrentArrayQueue isn't on the hot path in Agrona, this change is part of a larger effort to minimize memory ordering/contention so that Agrona/Aeron run faster on architectures with a weaker memory model, such as ARM.