
Conversation

@daljit46 (Member) commented Oct 3, 2025

This PR fixes the longstanding performance issues in the viewer on macOS when visualising tractography streamlines (using pseudotubes).
The root cause turned out not to be geometry-shader emulation per se, but the cost of submitting many draws via glMultiDrawArrays. On Apple’s stack, glMultiDraw* does not seem to guarantee hardware batching, and it can result in significant CPU/driver overhead. Details below.

After working on #3095, I decided to investigate the performance issues raised in #2247. As discussed there, the culprit seemed to be the use of geometry shaders in our OpenGL code. This was a logical suspicion, since the OpenGL drivers on macOS are built on top of Metal, which doesn't support geometry shaders. I spent a significant amount of time tinkering with the code, but it seemed unreasonable that emulating geometry shaders (via compute shaders) alone could be so slow that the app would completely freeze or even crash.

Investigating with the Xcode profiler, I noticed that the application essentially just hung when calling glMultiDrawArrays:

1.54 s 40.0% 400.00 µs GLDContextRec::flushContext(bool) 
1.47 s 38.3% 600.00 µs -[_MTLCommandQueue submitCommandBuffer:] 
1.47 s 38.3% 400.00 µs _dispatch_lane_barrier_sync_invoke_and_complete 
1.47 s 38.3% 300.00 µs _dispatch_client_callout 
1.47 s 38.3% 400.00 µs __40-[_MTLCommandQueue submitCommandBuffer:]_block_invoke 
1.47 s 38.3% 400.00 µs -[_MTLCommandQueue _submitAvailableCommandBuffers] 
1.46 s 38.0% 264.54 µs -[IOGPUMetalCommandQueue submitCommandBuffers:count:] 
1.46 s 38.0% 700.00 µs -[IOGPUMetalCommandQueue _submitCommandBuffers:count:] 
1.40 s 36.5% 700.00 µs IOGPUCommandQueueSubmitCommandBuffers 
1.40 s 36.5% 0 s IOConnectCallMethod 
1.40 s 36.5% 3.20 ms io_connect_method 
1.40 s 36.4% 300.00 µs mach_msg2_internal 
1.40 s 36.3% 1.40 s mach_msg2_trap

which made me realise that we are most likely CPU-bound on command submission. The OpenGL-on-Metal implementation seems to be submitting too many draw calls, which results in a hang.
Looking around, I found this great article which mentions:

Even functions like glMultiDrawArrays and glMultiDrawElements don’t always help because the graphics hardware may not implement these functions directly, and so OpenGL essentially has to convert them to multiple calls to glDrawArrays internally anyway.

glMultiDrawArrays is the command we use in tractogram.cpp to instruct the GPU to draw a chunk of streamlines. In theory, the command should be hardware-batched, but as the article mentions, this depends on the GPU driver's implementation and isn't guaranteed (as appears to be the case with OpenGL on Metal).

The article also offers a better alternative to ensure the commands are batched. The idea is to supply an index buffer that is one long sequence of vertex indices, with a special sentinel value between line strips (a "primitive-restart index") that tells the GPU where each track starts or terminates. Once the index buffer is built, we can submit everything in one go via glDrawElements(GL_LINE_STRIP, num_elements, GL_UNSIGNED_INT, nullptr).

Suppose we have 3 tracks with 4 elements each: [0 1 2 3] [4 5 6 7] [8 9 10 11]. With glMultiDrawArrays we do:

starting_indices = {0, 4, 8};
element_counts = {4, 4, 4};
glMultiDrawArrays(GL_LINE_STRIP, starting_indices, element_counts, 3);

while with the indexed approach:

// Use 0xFFFFFFFF as the sentinel value
indices = {0,1,2,3, 0xFFFFFFFF, 4,5,6,7, 0xFFFFFFFF, 8,9,10,11, 0xFFFFFFFF};
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFFFFFFu);
glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);
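In practice the tracks have varying numbers of vertices, so the index buffer has to be generated from the per-track vertex counts. A minimal sketch of that construction (a hypothetical helper, not the exact code in tractogram.cpp):

#include <cstdint>
#include <vector>

constexpr uint32_t restart_index = 0xFFFFFFFFu;

// track_sizes[i] = number of vertices of track i, in vertex-buffer order
std::vector<uint32_t> build_indices(const std::vector<uint32_t>& track_sizes) {
  std::vector<uint32_t> indices;
  uint32_t vertex = 0;
  for (uint32_t count : track_sizes) {
    for (uint32_t n = 0; n < count; ++n)
      indices.push_back(vertex++);
    indices.push_back(restart_index); // terminate this line strip
  }
  return indices;
}

// Upload once per chunk, then issue a single draw call:
// glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(uint32_t), indices.data(), GL_STATIC_DRAW);
// glEnable(GL_PRIMITIVE_RESTART);
// glPrimitiveRestartIndex(restart_index);
// glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);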

I implemented this idea in the code, and it fixed the performance issues while visualising tractograms on macOS! On my MacBook M2, I now see framerates between 100 and 150 FPS (dropping to 60-90 with volume rendering) when displaying a track file with 200k streamlines. Previously, mrview would struggle to even load the file.
As a bonus, on Linux with my modest laptop (11th-gen Intel i5-1135G7), I'm also seeing up to 25% improvements in framerates.

Notes:

  • There is a slight disadvantage in using the new approach, as the index buffer consumes some additional memory.
  • I have only enabled the index buffer code path for pseudotubes and lines. I preferred to keep the code changes to a minimum, as this PR is targeting master and its primary aim is to fix the performance issues on macOS. But we can enable this for points too if we want that (either in this PR or a separate one, perhaps targeting dev).

@MRtrix3/mrtrix3-devs I believe this change is correct and safe, but I'd appreciate it if you could test it on your own machines to see whether you encounter any issues, and maybe compare against mrview built from master to see if you notice any performance improvements. Also, strictly speaking this PR is not a correctness fix, so I'd be happy to target dev instead if preferred (I think there are other easy opportunities that could improve performance further).

Instead of using `glMultiDrawArrays`, which doesn't guarantee that draw 
calls will be executed in a batched fashion, we use `glDrawElements` 
with an index buffer. This fixes the performance issues seen on macOS 
when visualising tractograms.

See https://programming4.us/multimedia/8302.aspx
@daljit46 daljit46 requested a review from a team October 3, 2025 06:49
@daljit46 daljit46 self-assigned this Oct 3, 2025
@Lestropie (Member)

This performance issue is a common complaint, so I've no issue with it targeting master even if it's technically not fixing erroneous behaviour.
Would not be concerned with the extra memory usage of 32 bits per streamline.

The tractography data as stored on the filesystem already use a delimiter approach to separate streamlines.
I'm sure at some point we would have attempted including the delimiters in the vertex data and drawing accordingly.
I've not seen delimiters used in a GL indexing array like that, or glPrimitiveRestartIndex() specifically.
It's elegant, but given knowledge of this capability, persisting with the index array that is just incrementing integers with the occasional delimiter seems wasteful... is there not a way to compute and pass an array of delimiting indices? The TRX storage format already basically does that.

Will provide testing on Linux, MSYS2, WSL2.

@daljit46 (Member, Author) commented Oct 3, 2025

Would not be concerned with the extra memory usage of 32 bits per streamline.

It's actually worse than that: it's an extra 4 bytes per vertex (one index per vertex) in a given streamline. Generally speaking, I think the extra memory consumption is still dwarfed by position data (and other factors). From my testing on my Intel laptop, loading a tck file with 10 million streamlines increases GPU memory usage from 6.0 GB to 7.1 GB with this PR, nearly a 20% increase.
One way to mitigate this is to use 16-bit indices whenever we know that the index count is below 2^16 - 1, but then we would need to limit the "chunk" upload size (currently set to 32 MB) to get the benefit. I'm not sure about the performance impact here, but it may be worth trying.
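For reference, a rough sketch of what that mitigation could look like, assuming indices was built as in the sketch above and the chunk's total vertex count is known (the names here are hypothetical):

// If the chunk references fewer than 0xFFFF vertices, the 32-bit indices
// (including the 0xFFFFFFFF sentinels, which truncate to 0xFFFF) fit in 16 bits.
if (vertex_count_in_chunk < 0xFFFFu) {
  std::vector<uint16_t> indices16(indices.begin(), indices.end()); // narrowing copy
  glPrimitiveRestartIndex(0xFFFFu);
  glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices16.size() * sizeof(uint16_t), indices16.data(), GL_STATIC_DRAW);
  glDrawElements(GL_LINE_STRIP, indices16.size(), GL_UNSIGNED_SHORT, nullptr);
} else {
  glPrimitiveRestartIndex(0xFFFFFFFFu);
  glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(uint32_t), indices.data(), GL_STATIC_DRAW);
  glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);
}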

is there not a way to compute and pass an array of delimiting indices?

I think there may be a way with newer OpenGL versions (4.3+) using something like glMultiDrawArraysIndirect, but we can't use that on macOS.

One thing we could try is to skip the index buffer entirely (by calling glDrawArrays on the entire chunk of streamlines) and instead use something like an is_beginning attribute per vertex, letting the geometry shader discard bridge segments (those connecting different streamlines). If this attribute is of type GL_UNSIGNED_BYTE, we could reduce the extra memory consumption by 4x. I haven't actually tried this approach to see if it works, though.
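If anyone wants to experiment with that untried idea, a very rough sketch of the attribute setup could look like the following (the attribute location and buffer name are made up, and the geometry shader would then need to drop any segment whose second vertex has is_beginning set):

// One byte per vertex: 1 marks the first vertex of a streamline, 0 otherwise.
std::vector<uint8_t> is_beginning;
for (uint32_t count : track_sizes) {
  is_beginning.push_back(1);
  is_beginning.insert(is_beginning.end(), count - 1, 0);
}

glBindBuffer(GL_ARRAY_BUFFER, flag_buffer); // hypothetical buffer object
glBufferData(GL_ARRAY_BUFFER, is_beginning.size(), is_beginning.data(), GL_STATIC_DRAW);
glVertexAttribIPointer(3, 1, GL_UNSIGNED_BYTE, 0, nullptr); // attribute location 3 is assumed
glEnableVertexAttribArray(3);
// Draw the whole chunk with glDrawArrays(GL_LINE_STRIP, 0, total_vertices);
// the geometry shader discards the "bridge" segments whose second vertex is a beginning.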

@bjeurissen (Member) previously approved these changes Oct 4, 2025, and left a comment:


Works and fixes the issue! Tested on macOS 26.0.1 (25A362) on Apple M1

@bjeurissen bjeurissen mentioned this pull request Oct 4, 2025
@bjeurissen (Member)

Would be good to check if the same fix applies to #2379. At least the symptoms look identical, so I would think there is a good chance.

@bjeurissen (Member) commented Oct 6, 2025

Also works on macOS 15.6.1 (24G90) on Apple M3 Max, so it works on the last two iterations of macOS.

@daljit46 (Member, Author) commented Oct 6, 2025

Would be good to check if the same fix applies to #2379. At least the symptoms look identical, so I would think there is a good chance.

Yes, I think it's the same issue. I'll add the same change for that too.

@daljit46 (Member, Author) commented Oct 6, 2025

Updated the PR with changes to fix #2379. @bjeurissen can you check if this fixes the issue on your machines?

@bjeurissen (Member)

Updated the PR with changes to fix #2379. @bjeurissen can you check if this fixes the issue on your machines?

Yes, it also fixes the fixel rendering issue!

@jdtournier (Member)

👍 Works fine on Ubuntu 22.04, running on a 12GB NVIDIA GeForce RTX 4070. Get roughly 50 FPS for 1 million streamlines with streamtube lighting in volume render mode.

I'll try it on my home system when I get home, just to check with an AMD GPU.

Use index buffer + glDrawElements instead of glMultiDrawArrays
to ensure that the rendering calls are hardware batched.
Motivation is the same as in e9899a1.
@jdtournier (Member)

Also works fine on my AMD-based home system 👍

@jdtournier jdtournier added this pull request to the merge queue Oct 8, 2025
Merged via the queue into master with commit b4b44cf Oct 8, 2025
5 checks passed
@jdtournier jdtournier deleted the fix_track_perf branch October 8, 2025 09:27
@jdtournier jdtournier mentioned this pull request Oct 20, 2025