Skip to content

Faster OneToOneConcurrentArrayQueue#360

Open
pveentjer wants to merge 3 commits into
aeron-io:masterfrom
pveentjer:performance/FasterOneToOneConcurrentArrayQueue
Open

Faster OneToOneConcurrentArrayQueue#360
pveentjer wants to merge 3 commits into
aeron-io:masterfrom
pveentjer:performance/FasterOneToOneConcurrentArrayQueue

Conversation

@pveentjer
Copy link
Copy Markdown
Contributor

@pveentjer pveentjer commented Apr 19, 2026

The key change is dropping the head cache. The only thing the producer needs to know is if the slot it needs to publish to, is empty. It never needs to know about the current head and therefor therefor there will never be any contention on the head.

As long as the producer and consumer are active on different parts of the queue, there won't be any contention with the new queue. With the old queue, there will always be contention when the producer depletes the head cache and needs to get the next batch.

Apart from dropping the head cache, also minimized memory ordering.

Note: I didn't change the code of the original queue to make it easier to benchmark the different implementations.

Initial JMH results (throughput mode)


Benchmark                                         (impl)   Mode  Cnt    Score    Error   Units
OneToOneConcurrentArrayQueueBenchmark.spsc           old  thrpt   30   89.645 ±  7.133  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:offer     old  thrpt   30   44.855 ±  3.566  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:poll      old  thrpt   30   44.790 ±  3.567  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc           new  thrpt   30  813.109 ± 10.895  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:offer     new  thrpt   30  406.586 ±  5.447  ops/us
OneToOneConcurrentArrayQueueBenchmark.spsc:poll      new  thrpt   30  406.523 ±  5.447  ops/us

I'll post more performance data tomorrow including perf results.

The performance of the old queue is very unreliable. In the warmup it start with high throughput and then during the warmup iterations, it starts to collapse. The new queue doesn't suffer from this problem.

Although the OneToOneConcurrentArrayQueue isn't on the hot path in Agrona, it is part of a larger effort to minimize memory ordering/contention to optimize Agrona/Aeron to run faster on architectures with a weaker memory order like ARM.

@pveentjer
Copy link
Copy Markdown
Contributor Author

pveentjer commented Apr 20, 2026

Some perf data for the old queue:

sudo perf stat -p $(pgrep -f ForkedMain) -e \
cycles,instructions,\
ls_any_fills_from_sys.int_cache,\
l2_cache_misses_from_dc_misses \
-- sleep 30

 Performance counter stats for process id '385978':

   263,664,341,114      cycles                                                                
   124,625,368,086      instructions                     #    0.47  insn per cycle            
     1,970,688,756      ls_any_fills_from_sys.int_cache                                       
     1,951,070,317      l2_cache_misses_from_dc_misses                                        

      30.001810798 seconds time elapsed

And for the new queue

sudo perf stat -p $(pgrep -f ForkedMain) -e cycles,instructions,ls_any_fills_from_sys.int_cache,l2_cache_misses_from_dc_misses -- sleep 30

 Performance counter stats for process id '386441':

   235,237,650,857      cycles                                                                
   945,180,780,358      instructions                     #    4.02  insn per cycle            
       757,590,912      ls_any_fills_from_sys.int_cache                                       
       754,058,042      l2_cache_misses_from_dc_misses                                        

      30.002165093 seconds time elapsed

It is obvious that the number of instructions retired is a lot higher and the FasterOneToOneConcurrentArrayQueueBenchmark is returning 4 instructions per second while the old queue was doing 0.47 instructions per cycle. So the old queue is mostly stalling, while the new queue is reaching peak efficiency.

If we normalize the number of cache misses per 1000 cycle we get:

                            old queue        new queue
int_cache                        15.8             0.80
l2_misses_from_dc                15.7             0.80

The old queue is running into almost 20x more local cache misses than the new queue. This is caused by contention; a different core owning the cacheline.

throw new NullPointerException("Null is not a valid element");
}

final long currentTail = UnsafeApi.getLongOpaque(this, TAIL_OFFSET);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only point of interaction with mutable parts of the queue is a slot in the buffer.

For the exchange of the item in the queue, there is a proper release store performed by the producer, and and acquire load performed by the consumer.

For the exchange of the null slot, there is just an opaque store and an opaque load because the surrounding loads/stores do not need to be ordered.

@He-Pin
Copy link
Copy Markdown

He-Pin commented Apr 20, 2026

Can't belive...

@pveentjer
Copy link
Copy Markdown
Contributor Author

pveentjer commented Apr 20, 2026

The miss on the head (contention) is also visible when running with JMH '-prof perfasm' and looking at the generated assembly:

mov    QWORD PTR [rbx+0x58], rcx         ;*invokevirtual putReferenceRelease
 0.75%   mov    ecx, DWORD PTR [rbx+0xf0]
 0.56%   dec    r8d
 0.11%   lea    r9, [r12+rcx*8]                   ;*getfield buffer
 0.09%   movsxd r8, r8d
 2.18%   and    r8, rbp
 0.04%   lea    r13, [r9+r8*4+0x10]
 0.37%   cmp    BYTE PTR [r15+0x38], 0x0
 0.10%   jne    <slow>
 0.03%   mov    DWORD PTR [r13+0x0], 0xfff8f859   ;   {oop(Integer 42)}
 4.00%   add    rbp, 0x1
 0.04%   movabs r8, 0x7ffc7c2c8
 0.98%   mov    r9, r13
         xor    r8, r9
 0.09%   shr    r8, 0x16
         test   r8, r8
 0.05%   je     <skip card>
         shr    r9, 0x9
         movabs rdi, 0x74bbed600000
 1.24%   add    rdi, r9
         cmp    BYTE PTR [rdi], 0x4
 5.40%   jne    <card mark slow>
         mov    QWORD PTR [rbx+0x50], rbp         ;*ifne
 0.18%   movzx  r9d, BYTE PTR [rdx+0x94]          ;*invokevirtual putReferenceRelease
 0.01%   mov    r8, QWORD PTR [r15+0x348]
 0.03%   add    r14, 0x1
         test   DWORD PTR [r8], eax               ;   {poll}
 0.31%   test   r9d, r9d
 0.02%   jne    <return>
 2.06%   movzx  r8d, BYTE PTR [r11+0x91]          ;*getfield stopMeasurement
 0.01%   test   r8d, r8d
         jne    <exit>
 0.07%   mov    r9d, DWORD PTR [r10+0x10]         ;*getfield queue
 0.01%   mov    r8d, DWORD PTR [r12+r9*8+0x8]
 0.08%   cmp    r8d, 0xc271d0                     ;   {OneToOneConcurrentArrayQueue}
 0.23%   jne    <type miss>
         lea    rbx, [r12+r9*8]                   ;*invokeinterface offer
 0.20%   mov    rbp, QWORD PTR [rbx+0x50]         ;*getfield tail
 0.25%   mov    r8d, DWORD PTR [rbx+0xec]         ;*getfield capacity
 2.13%   movsxd r9, r8d
 0.02%   mov    rcx, r9
 0.11%   add    rcx, QWORD PTR [rbx+0x58]         ;*getfield tail (headCache+capacity)
 0.62%   cmp    rbp, rcx
 0.78%   jl     <fast: write slot>
 0.01%   mov    rcx, QWORD PTR [rbx+0xa8]         ;*getfield head
14.18%   add    r9, rcx
 0.16%   cmp    rbp, r9
 0.15%   jl     <write slot>
         pause                                    ;*invokevirtual putReferenceRelease
 2.07%   mov    r8, QWORD PTR [r15+0x348]
 8.24%   test   DWORD PTR [r8], eax               ;   {poll}
 0.90%   movzx  r8d, BYTE PTR [r11+0x91]          ;*getfield stopMeasurement
 0.04%   test   r8d, r8d
         jne    <exit>
         jmp    <retry>

The relevant section

 0.01%   mov    rcx, QWORD PTR [rbx+0xa8]         ;*getfield head
14.18%   add    r9, rcx

So first the load is executed and then incremented and the increment needs to wait for the load of the head to complete.

@He-Pin
Copy link
Copy Markdown

He-Pin commented Apr 20, 2026

In Akka/Pekko, there is a SPSC queue too. I think we can leverage this optimization too, the current one I think is quite fit for Akka/Pekko stream, where the buffer size is always sepcified sized.

@pveentjer
Copy link
Copy Markdown
Contributor Author

Places where the OneToOneConcurrentArrayQueue is used:

  • Aeron MediaDriver receiverCommandQueue, senderCommandQueue, asyncTaskQueue
  • Sequencer ReplayService commandQueue

@pveentjer pveentjer force-pushed the performance/FasterOneToOneConcurrentArrayQueue branch from aae955e to 1c5a4f4 Compare May 2, 2026 02:59
@pveentjer
Copy link
Copy Markdown
Contributor Author

I have added some JCStress tests to ensure that the queue implementation is correct. More tests will follow.

@pveentjer pveentjer force-pushed the performance/FasterOneToOneConcurrentArrayQueue branch from 1c5a4f4 to e1a6fb8 Compare May 2, 2026 03:26
@pveentjer pveentjer changed the title [WIP] Faster OneToOneConcurrentArrayQueue Faster OneToOneConcurrentArrayQueue May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants