Faster OneToOneConcurrentArrayQueue by pveentjer · Pull Request #360 · aeron-io/agrona

pveentjer · 2026-04-19T17:20:46Z

The key change is dropping the head cache. The only thing the producer needs to know is if the slot it needs to publish to, is empty. It never needs to know about the current head and therefor therefor there will never be any contention on the head.

As long as the producer and consumer are active on different parts of the queue, there won't be any contention with the new queue. With the old queue, there will always be contention when the producer depletes the head cache and needs to get the next batch.

Apart from dropping the head cache, also minimized memory ordering.

Note: I didn't change the code of the original queue to make it easier to benchmark the different implementations.

Initial JMH results (throughput mode)


Benchmark                                         (impl)   Mode  Cnt    Score    Error   Units
OneToOneConcurrentArrayQueueBenchmark.spsc           old  thrpt   30   89.645 ±  7.133  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:offer     old  thrpt   30   44.855 ±  3.566  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:poll      old  thrpt   30   44.790 ±  3.567  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc           new  thrpt   30  813.109 ± 10.895  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:offer     new  thrpt   30  406.586 ±  5.447  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:poll      new  thrpt   30  406.523 ±  5.447  ops/us

I'll post more performance data tomorrow including perf results.

The performance of the old queue is very unreliable. In the warmup it start with high throughput and then during the warmup iterations, it starts to collapse. The new queue doesn't suffer from this problem.

Although the OneToOneConcurrentArrayQueue isn't on the hot path in Agrona, it is part of a larger effort to minimize memory ordering/contention to optimize Agrona/Aeron to run faster on architectures with a weaker memory order like ARM.

pveentjer · 2026-04-20T03:17:38Z

Some perf data for the old queue:

sudo perf stat -p $(pgrep -f ForkedMain) -e \
cycles,instructions,\
ls_any_fills_from_sys.int_cache,\
l2_cache_misses_from_dc_misses \
-- sleep 30

 Performance counter stats for process id '385978':

   263,664,341,114      cycles                                                                
   124,625,368,086      instructions                     #    0.47  insn per cycle            
     1,970,688,756      ls_any_fills_from_sys.int_cache                                       
     1,951,070,317      l2_cache_misses_from_dc_misses                                        

      30.001810798 seconds time elapsed

And for the new queue

sudo perf stat -p $(pgrep -f ForkedMain) -e cycles,instructions,ls_any_fills_from_sys.int_cache,l2_cache_misses_from_dc_misses -- sleep 30

 Performance counter stats for process id '386441':

   235,237,650,857      cycles                                                                
   945,180,780,358      instructions                     #    4.02  insn per cycle            
       757,590,912      ls_any_fills_from_sys.int_cache                                       
       754,058,042      l2_cache_misses_from_dc_misses                                        

      30.002165093 seconds time elapsed

It is obvious that the number of instructions retired is a lot higher and the FasterOneToOneConcurrentArrayQueueBenchmark is returning 4 instructions per second while the old queue was doing 0.47 instructions per cycle. So the old queue is mostly stalling, while the new queue is reaching peak efficiency.

If we normalize the number of cache misses per 1000 cycle we get:

                            old queue        new queue
int_cache                        15.8             0.80
l2_misses_from_dc                15.7             0.80

The old queue is running into almost 20x more local cache misses than the new queue. This is caused by contention; a different core owning the cacheline.

pveentjer · 2026-04-20T03:26:44Z

+            throw new NullPointerException("Null is not a valid element");
+        }
+
+        final long currentTail = UnsafeApi.getLongOpaque(this, TAIL_OFFSET);


The only point of interaction with mutable parts of the queue is a slot in the buffer.

For the exchange of the item in the queue, there is a proper release store performed by the producer, and and acquire load performed by the consumer.

For the exchange of the null slot, there is just an opaque store and an opaque load because the surrounding loads/stores do not need to be ordered.

He-Pin · 2026-04-20T03:29:49Z

Can't belive...

pveentjer · 2026-04-20T03:36:57Z

The miss on the head (contention) is also visible when running with JMH '-prof perfasm' and looking at the generated assembly:

mov    QWORD PTR [rbx+0x58], rcx         ;*invokevirtual putReferenceRelease
 0.75%   mov    ecx, DWORD PTR [rbx+0xf0]
 0.56%   dec    r8d
 0.11%   lea    r9, [r12+rcx*8]                   ;*getfield buffer
 0.09%   movsxd r8, r8d
 2.18%   and    r8, rbp
 0.04%   lea    r13, [r9+r8*4+0x10]
 0.37%   cmp    BYTE PTR [r15+0x38], 0x0
 0.10%   jne    <slow>
 0.03%   mov    DWORD PTR [r13+0x0], 0xfff8f859   ;   {oop(Integer 42)}
 4.00%   add    rbp, 0x1
 0.04%   movabs r8, 0x7ffc7c2c8
 0.98%   mov    r9, r13
         xor    r8, r9
 0.09%   shr    r8, 0x16
         test   r8, r8
 0.05%   je     <skip card>
         shr    r9, 0x9
         movabs rdi, 0x74bbed600000
 1.24%   add    rdi, r9
         cmp    BYTE PTR [rdi], 0x4
 5.40%   jne    <card mark slow>
         mov    QWORD PTR [rbx+0x50], rbp         ;*ifne
 0.18%   movzx  r9d, BYTE PTR [rdx+0x94]          ;*invokevirtual putReferenceRelease
 0.01%   mov    r8, QWORD PTR [r15+0x348]
 0.03%   add    r14, 0x1
         test   DWORD PTR [r8], eax               ;   {poll}
 0.31%   test   r9d, r9d
 0.02%   jne    <return>
 2.06%   movzx  r8d, BYTE PTR [r11+0x91]          ;*getfield stopMeasurement
 0.01%   test   r8d, r8d
         jne    <exit>
 0.07%   mov    r9d, DWORD PTR [r10+0x10]         ;*getfield queue
 0.01%   mov    r8d, DWORD PTR [r12+r9*8+0x8]
 0.08%   cmp    r8d, 0xc271d0                     ;   {OneToOneConcurrentArrayQueue}
 0.23%   jne    <type miss>
         lea    rbx, [r12+r9*8]                   ;*invokeinterface offer
 0.20%   mov    rbp, QWORD PTR [rbx+0x50]         ;*getfield tail
 0.25%   mov    r8d, DWORD PTR [rbx+0xec]         ;*getfield capacity
 2.13%   movsxd r9, r8d
 0.02%   mov    rcx, r9
 0.11%   add    rcx, QWORD PTR [rbx+0x58]         ;*getfield tail (headCache+capacity)
 0.62%   cmp    rbp, rcx
 0.78%   jl     <fast: write slot>
 0.01%   mov    rcx, QWORD PTR [rbx+0xa8]         ;*getfield head
14.18%   add    r9, rcx
 0.16%   cmp    rbp, r9
 0.15%   jl     <write slot>
         pause                                    ;*invokevirtual putReferenceRelease
 2.07%   mov    r8, QWORD PTR [r15+0x348]
 8.24%   test   DWORD PTR [r8], eax               ;   {poll}
 0.90%   movzx  r8d, BYTE PTR [r11+0x91]          ;*getfield stopMeasurement
 0.04%   test   r8d, r8d
         jne    <exit>
         jmp    <retry>

The relevant section

 0.01%   mov    rcx, QWORD PTR [rbx+0xa8]         ;*getfield head
14.18%   add    r9, rcx

So first the load is executed and then incremented and the increment needs to wait for the load of the head to complete.

He-Pin · 2026-04-20T03:40:53Z

In Akka/Pekko, there is a SPSC queue too. I think we can leverage this optimization too, the current one I think is quite fit for Akka/Pekko stream, where the buffer size is always sepcified sized.

pveentjer · 2026-04-28T05:16:14Z

Places where the OneToOneConcurrentArrayQueue is used:

Aeron MediaDriver receiverCommandQueue, senderCommandQueue, asyncTaskQueue
Sequencer ReplayService commandQueue

pveentjer · 2026-05-02T03:02:03Z

I have added some JCStress tests to ensure that the queue implementation is correct. More tests will follow.

Faster OneToOneConcurrentArrayQueue

524b0c2

pveentjer commented Apr 20, 2026

View reviewed changes

pveentjer force-pushed the performance/FasterOneToOneConcurrentArrayQueue branch from aae955e to 1c5a4f4 Compare May 2, 2026 02:59

Added some jcstress tests

e1a6fb8

pveentjer force-pushed the performance/FasterOneToOneConcurrentArrayQueue branch from 1c5a4f4 to e1a6fb8 Compare May 2, 2026 03:26

Merged the faster queue into the main implementation

e83b25c

pveentjer changed the title ~~[WIP] Faster OneToOneConcurrentArrayQueue~~ Faster OneToOneConcurrentArrayQueue May 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster OneToOneConcurrentArrayQueue#360

Faster OneToOneConcurrentArrayQueue#360
pveentjer wants to merge 3 commits into
aeron-io:masterfrom
pveentjer:performance/FasterOneToOneConcurrentArrayQueue

pveentjer commented Apr 19, 2026 •

edited

Loading

Uh oh!

pveentjer commented Apr 20, 2026 •

edited

Loading

Uh oh!

pveentjer Apr 20, 2026

Uh oh!

He-Pin commented Apr 20, 2026

Uh oh!

pveentjer commented Apr 20, 2026 •

edited

Loading

Uh oh!

He-Pin commented Apr 20, 2026

Uh oh!

pveentjer commented Apr 28, 2026

Uh oh!

pveentjer commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pveentjer commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pveentjer commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pveentjer Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

He-Pin commented Apr 20, 2026

Uh oh!

pveentjer commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

He-Pin commented Apr 20, 2026

Uh oh!

pveentjer commented Apr 28, 2026

Uh oh!

pveentjer commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pveentjer commented Apr 19, 2026 •

edited

Loading

pveentjer commented Apr 20, 2026 •

edited

Loading

pveentjer commented Apr 20, 2026 •

edited

Loading