
Conversation

@daljit46 (Member) commented Oct 3, 2025

This PR fixes the longstanding performance issues in the viewer on macOS when visualising tractography streamlines (using pseudotubes).
The root cause turned out not to be geometry-shader emulation per se, but the cost of submitting many draws via glMultiDrawArrays. On Apple’s stack, glMultiDraw* does not seem to guarantee hardware batching, and it can result in significant CPU/driver overhead. Details below.

After working on #3095, I decided to investigate the performance issues raised in #2247. As discussed there, the culprit seemed to be the use of geometry shaders in our OpenGL code. This was a logical suspicion, since the OpenGL drivers on macOS are built on top of Metal, which doesn't support geometry shaders. I spent a significant amount of time tinkering with the code, but it seemed unreasonable that emulating geometry shaders (via compute shaders) alone could be so slow that the app would completely freeze or even crash.

Investigating with the Xcode profiler, I noticed that the application essentially just hung when calling glMultiDrawArrays:

1.54 s 40.0% 400.00 µs GLDContextRec::flushContext(bool) 
1.47 s 38.3% 600.00 µs -[_MTLCommandQueue submitCommandBuffer:] 
1.47 s 38.3% 400.00 µs _dispatch_lane_barrier_sync_invoke_and_complete 
1.47 s 38.3% 300.00 µs _dispatch_client_callout 
1.47 s 38.3% 400.00 µs __40-[_MTLCommandQueue submitCommandBuffer:]_block_invoke 
1.47 s 38.3% 400.00 µs -[_MTLCommandQueue _submitAvailableCommandBuffers] 
1.46 s 38.0% 264.54 µs -[IOGPUMetalCommandQueue submitCommandBuffers:count:] 
1.46 s 38.0% 700.00 µs -[IOGPUMetalCommandQueue _submitCommandBuffers:count:] 
1.40 s 36.5% 700.00 µs IOGPUCommandQueueSubmitCommandBuffers 
1.40 s 36.5% 0 s IOConnectCallMethod 
1.40 s 36.5% 3.20 ms io_connect_method 
1.40 s 36.4% 300.00 µs mach_msg2_internal 
1.40 s 36.3% 1.40 s mach_msg2_trap

which made me realise that we are most likely CPU-bound on command submission. The OpenGL-on-Metal implementation seems to be submitting too many draw calls, which results in a hang.
Looking around, I found this great article which mentions:

Even functions like glMultiDrawArrays and glMultiDrawElements don’t always help because the graphics hardware may not implement these functions directly, and so OpenGL essentially has to convert them to multiple calls to glDrawArrays internally anyway.

glMultiDrawArrays is the command we use in tractogram.cpp to instruct the GPU to draw a chunk of streamlines. In theory, the command should be hardware-batched, but as the article mentions, this depends on the GPU driver's implementation and isn't guaranteed (as appears to be the case with OpenGL on Metal).

The article also offers a better alternative to ensure the commands are batched. The idea is to supply an index buffer that is one long sequence of vertex indices, with a special sentinel value between line strips (a "primitive-restart index") that tells the GPU where each track starts or terminates. Once the index buffer is built, we can submit everything in one go via glDrawElements(GL_LINE_STRIP, num_elements, GL_UNSIGNED_INT, nullptr).

Suppose we have 3 tracks with 4 elements each: [0 1 2 3] [4 5 6 7] [8 9 10 11]. With glMultiDrawArrays we do:

starting_indices = {0, 4, 8};
element_counts = {4, 4, 4};
glMultiDrawArrays(GL_LINE_STRIP, starting_indices, element_counts, 3);

while with the indexed approach:

// Use 0xFFFFFFFF as the sentinel value
indices = {0,1,2,3, 0xFFFFFFFF, 4,5,6,7, 0xFFFFFFFF, 8,9,10,11, 0xFFFFFFFF};
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFFFFFFu);
glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);
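In practice the tracks have varying numbers of vertices, so the index buffer has to be generated from the per-track vertex counts. A minimal sketch of that construction (a hypothetical helper, not the exact code in tractogram.cpp):

#include <cstdint>
#include <vector>

constexpr uint32_t restart_index = 0xFFFFFFFFu;

// track_sizes[i] = number of vertices of track i, in vertex-buffer order
std::vector<uint32_t> build_indices(const std::vector<uint32_t>& track_sizes) {
  std::vector<uint32_t> indices;
  uint32_t vertex = 0;
  for (uint32_t count : track_sizes) {
    for (uint32_t n = 0; n < count; ++n)
      indices.push_back(vertex++);
    indices.push_back(restart_index); // terminate this line strip
  }
  return indices;
}

// Upload once per chunk, then issue a single draw call:
// glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(uint32_t), indices.data(), GL_STATIC_DRAW);
// glEnable(GL_PRIMITIVE_RESTART);
// glPrimitiveRestartIndex(restart_index);
// glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);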

I implemented this idea in the code, and it fixed the performance issues while visualising tractograms on macOS! On my MacBook M2, I now see framerates between 100 and 150 FPS (dropping to 60-90 with volume rendering) when displaying a track file with 200k streamlines. Previously, mrview would struggle to even load the file.
As a bonus, on Linux with my modest laptop (11th-gen Intel i5-1135G7), I'm also seeing up to 25% improvements in framerates.

Notes:

  • There is a slight disadvantage in using the new approach, as the index buffer consumes some additional memory.
  • I have only enabled the index buffer code path for pseudotubes and lines. I preferred to keep the code changes to a minimum, as this PR is targeting master and its primary aim is to fix the performance issues on macOS. But we can enable this for points too if we want that (either in this PR or a separate one, perhaps targeting dev).

@MRtrix3/mrtrix3-devs I believe this change is correct and safe, but I'd appreciate it if you could test it on your own machines to see whether you encounter any issues, and maybe compare against mrview built from master to see if you notice any performance improvements. Also, strictly speaking this PR is not a correctness fix, so I'd be happy to target dev instead if preferred (I think there are other easy opportunities that could improve performance further).

Instead of using `glMultiDrawArrays`, which doesn't guarantee that draw 
calls will be executed in a batched fashion, we use `glDrawElements` 
with an index buffer. This fixes the performance issues seen on macOS 
when visualising tractograms.

See https://programming4.us/multimedia/8302.aspx
@daljit46 daljit46 requested a review from a team October 3, 2025 06:49
@daljit46 daljit46 self-assigned this Oct 3, 2025
@Lestropie (Member)

This performance issue is a common complaint, so I've no issue with it targeting master even if it's technically not fixing erroneous behaviour.
Would not be concerned with the extra memory usage of 32 bits per streamline.

The tractography data as stored on the filesystem already use a delimiter approach to separate streamlines.
I'm sure at some point we would have attempted including the delimiters in the vertex data and drawing accordingly.
I've not seen delimiters used in a GL indexing array like that, or glPrimitiveRestartIndex() specifically.
It's elegant, but given knowledge of this capability, persisting with the index array that is just incrementing integers with the occasional delimiter seems wasteful... is there not a way to compute and pass an array of delimiting indices? The TRX storage format already basically does that.

Will provide testing on Linux, MSYS2, WSL2.

@daljit46 (Member, Author) commented Oct 3, 2025

Would not be concerned with the extra memory usage of 32 bits per streamline.

It's actually worse than that: it's an extra 4 bytes per vertex (one index per vertex) in a given streamline. Generally speaking, I think the extra memory consumption is still dwarfed by position data (and other factors). From my testing on my Intel laptop, loading a tck file with 10 million streamlines increases GPU memory usage from 6.0 GB to 7.1 GB with this PR, nearly a 20% increase.
One way to mitigate this is to use 16-bit indices whenever we know that the index count is below 2^16 - 1, but then we would need to limit the "chunk" upload size (currently set to 32 MB) to get the benefit. I'm not sure about the performance impact here, but it may be worth trying.
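For reference, a rough sketch of what that mitigation could look like, assuming indices was built as in the sketch above and the chunk's total vertex count is known (the names here are hypothetical):

// If the chunk references fewer than 0xFFFF vertices, the 32-bit indices
// (including the 0xFFFFFFFF sentinels, which truncate to 0xFFFF) fit in 16 bits.
if (vertex_count_in_chunk < 0xFFFFu) {
  std::vector<uint16_t> indices16(indices.begin(), indices.end()); // narrowing copy
  glPrimitiveRestartIndex(0xFFFFu);
  glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices16.size() * sizeof(uint16_t), indices16.data(), GL_STATIC_DRAW);
  glDrawElements(GL_LINE_STRIP, indices16.size(), GL_UNSIGNED_SHORT, nullptr);
} else {
  glPrimitiveRestartIndex(0xFFFFFFFFu);
  glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(uint32_t), indices.data(), GL_STATIC_DRAW);
  glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);
}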

is there not a way to compute and pass an array of delimiting indices?

I think there may be a way with newer OpenGL versions (4.3+) using something like glMultiDrawArraysIndirect, but we can't use that on macOS.

One thing we could try is to skip the index buffer entirely (by calling glDrawArrays on the entire chunk of streamlines) and instead use something like an is_beginning attribute per vertex, letting the geometry shader discard bridge segments (those connecting different streamlines). If this attribute is of type GL_UNSIGNED_BYTE, we could reduce the extra memory consumption by 4x. I haven't actually tried this approach to see if it works, though.
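If anyone wants to experiment with that untried idea, a very rough sketch of the attribute setup could look like the following (the attribute location and buffer name are made up, and the geometry shader would then need to drop any segment whose second vertex has is_beginning set):

// One byte per vertex: 1 marks the first vertex of a streamline, 0 otherwise.
std::vector<uint8_t> is_beginning;
for (uint32_t count : track_sizes) {
  is_beginning.push_back(1);
  is_beginning.insert(is_beginning.end(), count - 1, 0);
}

glBindBuffer(GL_ARRAY_BUFFER, flag_buffer); // hypothetical buffer object
glBufferData(GL_ARRAY_BUFFER, is_beginning.size(), is_beginning.data(), GL_STATIC_DRAW);
glVertexAttribIPointer(3, 1, GL_UNSIGNED_BYTE, 0, nullptr); // attribute location 3 is assumed
glEnableVertexAttribArray(3);
// Draw the whole chunk with glDrawArrays(GL_LINE_STRIP, 0, total_vertices);
// the geometry shader discards the "bridge" segments whose second vertex is a beginning.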

@bjeurissen (Member) previously approved these changes Oct 4, 2025, and left a comment:


Works and fixes the issue! Tested on macOS 26.0.1 (25A362) on Apple M1

@bjeurissen bjeurissen mentioned this pull request Oct 4, 2025
@bjeurissen (Member)

Would be good to check if the same fix applies to #2379. At least the symptoms look identical, so I would think there is a good chance.

@bjeurissen (Member) commented Oct 6, 2025

Also works on macOS 15.6.1 (24G90) on Apple M3 Max, so it works on the last two iterations of macOS.

@daljit46 (Member, Author) commented Oct 6, 2025

Would be good to check if the same fix applies to #2379. At least the symptoms look identical, so I would think there is a good chance.

Yes, I think it's the same issue. I'll add the same change for that too.

@daljit46 (Member, Author) commented Oct 6, 2025

Updated the PR with changes to fix #2379. @bjeurissen can you check if this fixes the issue on your machines?

@bjeurissen (Member)

Updated the PR with changes to fix #2379. @bjeurissen can you check if this fixes the issue on your machines?

Yes, it also fixes the fixel rendering issue!

@jdtournier (Member)

👍 Works fine on Ubuntu 22.04, running on a 12GB NVIDIA GeForce RTX 4070. Get roughly 50 FPS for 1 million streamlines with streamtube lighting in volume render mode.

I'll try it on my home system when I get home, just to check with an AMD GPU.

Use index buffer + glDrawElements instead of glMultiDrawArrays
to ensure that the rendering calls are hardware batched.
Motivation is the same as in e9899a1.
@jdtournier (Member)

Also works fine on my AMD-based home system 👍

@jdtournier jdtournier added this pull request to the merge queue Oct 8, 2025
Merged via the queue into master with commit b4b44cf Oct 8, 2025
5 checks passed
@jdtournier jdtournier deleted the fix_track_perf branch October 8, 2025 09:27
@jdtournier jdtournier mentioned this pull request Oct 20, 2025