Fix tractography rendering performance on macOS #3197
Instead of using `glMultiDrawArrays`, which doesn't guarantee that draw calls will be executed in a batched fashion, we use `glDrawElements` with an index buffer. This fixes the performance issues seen on macOS when visualising tractograms. See https://programming4.us/multimedia/8302.aspx
This performance issue is a common complaint, so I've no issue with it targeting The tractography data as stored on the filesystem already use a delimiter approach to separate streamlines. Will provide testing on Linux, MSYS2, WSL2. |
It's actually worse than that. It's an extra 4 bytes per vertex (for each index) in a given streamline. I think generally speaking, the extra memory consumption is still dwarfed by position data (and other factors). From my testing on my Intel laptop, I see that for loading a tck file with 10 million streamlines, this PR increases the GPU memory usage from 6.0GB to 7.1GB, so a nearly 20% increase.
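As a rough back-of-the-envelope sketch of the overhead described above: the helpers below are hypothetical (they are not part of the MRtrix3 code) and assume one 32-bit index per vertex, one extra restart index per streamline, and positions stored as three 32-bit floats per vertex. Other buffers (colours, scalars, and so on) are ignored, which is presumably part of why the measured total increase is smaller than the ratio of these two numbers.

```cpp
#include <cstdint>

// Hypothetical helper: bytes of index data when every vertex gets one 32-bit
// index and every streamline gets one trailing restart index (assumption).
std::uint64_t index_buffer_bytes(std::uint64_t num_vertices,
                                 std::uint64_t num_streamlines) {
    return sizeof(std::uint32_t) * (num_vertices + num_streamlines);
}

// Hypothetical helper: bytes of position data, assuming three 32-bit floats
// (x, y, z) per vertex.
std::uint64_t position_buffer_bytes(std::uint64_t num_vertices) {
    return 3 * sizeof(float) * num_vertices;
}
```

For example, 3 streamlines of 4 vertices each would need 60 bytes of indices on top of 144 bytes of positions under these assumptions.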
I think there may be with newer OpenGL versions (4.3 and later), with something like `glMultiDrawArraysIndirect`, but we can't use that on macOS. One thing we could try is to skip the index buffer entirely (by calling
bjeurissen
left a comment
Works and fixes the issue! Tested on macOS 26.0.1 (25A362) on Apple M1
Would be good to check if the same fix applies to #2379. At least the symptoms look identical, so I would think there is a good chance.
Also works on macOS 15.6.1 (24G90) on Apple M3 Max, so it works on the two last iterations of macOS.
Yes, I think it's the same issue. I'll add the same change for that too.
Updated the PR with changes to fix #2379. @bjeurissen can you check if this fixes the issue on your machines?
Yes, it also fixes the fixel rendering issue!
👍 Works fine on Ubuntu 22.04, running on a 12GB NVIDIA GeForce RTX 4070. Get roughly 50 FPS for 1 million streamlines with streamtube lighting in volume render mode. I'll try it on my home system when I get home, just to check with an AMD GPU.
Use index buffer + glDrawElements instead of glMultiDrawArrays to ensure that the rendering calls are hardware batched. Motivation is the same as in e9899a1.
fdba968 to 22e94fc
Also works fine on my AMD-based home system 👍
This PR fixes the longstanding performance issues in the viewer on macOS when visualising tractography streamlines (using pseudotubes).
The root cause turned out not to be geometry-shader emulation per se, but the cost of submitting many draws via `glMultiDrawArrays`. On Apple's stack, `glMultiDraw*` does not seem to guarantee hardware batching, and it can result in significant CPU/driver overhead. Details below.

After working on #3095, I decided to investigate the performance issues raised in #2247. As discussed there, the culprit seemed to be the use of geometry shaders in our OpenGL code. This was a logical suspicion since the OpenGL drivers on macOS are built upon Metal, which doesn't support geometry shaders. I spent a significant amount of time tinkering with the code, but it seemed unreasonable to assume that emulation of geometry shaders (via compute shaders) could be so slow that the app would either completely freeze or even crash.
Investigating with the Xcode profiler, I noticed that the application essentially just hung when calling `glMultiDrawArrays`, which made me realise that most likely we are CPU-bound on command submission. The OpenGL on Metal implementation seems to be submitting too many draw calls, which results in a hang.
Looking around, I found this great article which mentions:
`glMultiDrawArrays` is the command that we use in tractogram.cpp to instruct the GPU to draw a chunk of streamlines. In theory, the command should be hardware-batched, but as the article mentions, this is up to the GPU driver's implementation and is not guaranteed (as seems to be the case with OpenGL on Metal).

The article also offers a better alternative to ensure the commands are batched. The idea is to supply an index buffer that is one long sequence of vertex indices, with a special sentinel value between line strips (a "primitive-restart index") that tells the GPU where each track starts or terminates. Once the index buffer is built, we can submit everything in one go via `glDrawElements(GL_LINE_STRIP, num_elements, GL_UNSIGNED_INT, nullptr)`.

Suppose we have 3 tracks with 4 elements each: [0 1 2 3] [4 5 6 7] [8 9 10 11]. With `glMultiDrawArrays` we pass a start offset and a vertex count for each track, whereas with the indexed approach we pass a single index list in which the tracks are separated by the restart index.
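To make the two submission styles concrete, here is a minimal CPU-side sketch of the data each one needs for the 3-track example. It is not the code from tractogram.cpp: `MultiDrawArgs`, `build_multi_draw_args`, `build_restart_indices` and `kRestartIndex` are hypothetical names, the choice of `0xFFFFFFFF` as the sentinel is an assumption, and the actual OpenGL calls appear only in comments so that the snippet compiles without a GL context.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed sentinel value. With a GL context it would be registered via
//   glEnable(GL_PRIMITIVE_RESTART);
//   glPrimitiveRestartIndex(kRestartIndex);
constexpr std::uint32_t kRestartIndex =
    std::numeric_limits<std::uint32_t>::max();

// Hypothetical container for the per-track arrays consumed by
//   glMultiDrawArrays(GL_LINE_STRIP, first.data(), count.data(), first.size());
struct MultiDrawArgs {
    std::vector<std::int32_t> first;  // start vertex of each track
    std::vector<std::int32_t> count;  // number of vertices in each track
};

MultiDrawArgs build_multi_draw_args(const std::vector<std::uint32_t>& track_sizes) {
    MultiDrawArgs args;
    std::int32_t offset = 0;
    for (std::uint32_t n : track_sizes) {
        args.first.push_back(offset);
        args.count.push_back(static_cast<std::int32_t>(n));
        offset += static_cast<std::int32_t>(n);
    }
    return args;
}

// Builds one index list for a single
//   glDrawElements(GL_LINE_STRIP, indices.size(), GL_UNSIGNED_INT, nullptr);
// call (with the list uploaded to the bound GL_ELEMENT_ARRAY_BUFFER).
// A restart index is appended after every track, including the last one.
std::vector<std::uint32_t> build_restart_indices(const std::vector<std::uint32_t>& track_sizes) {
    std::vector<std::uint32_t> indices;
    std::uint32_t next_vertex = 0;
    for (std::uint32_t n : track_sizes) {
        // Real vertex ids must never collide with the sentinel.
        assert(n <= kRestartIndex - next_vertex);
        for (std::uint32_t i = 0; i < n; ++i) indices.push_back(next_vertex++);
        indices.push_back(kRestartIndex);
    }
    return indices;
}
```

For track sizes {4, 4, 4}, the first helper yields `first = [0 4 8]` and `count = [4 4 4]`, while the second yields `[0 1 2 3 R 4 5 6 7 R 8 9 10 11 R]`, where `R` is the restart index. Whether to append a restart index after the final track is a design choice in this sketch; it simply ends a strip that has nothing after it.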
I implemented this idea in the code, and it fixed the performance issues while visualising tractograms on macOS! On my MacBook M2, I now see framerates between 100 and 150 FPS (dropping to 60-90 with volume rendering) when displaying a track file with 200k streamlines. Previously, `mrview` would even struggle to load the file.

As a bonus on Linux, on my modest laptop with an 11th-gen Intel i5-1135G7, I'm also seeing up to 25% improvements in framerates.
Notes:
`master` and its primary aim is to fix the performance issues on macOS. But we can enable this for points too if we want that (either in this PR or a separate one, perhaps targeting `dev`).

@MRtrix3/mrtrix3-devs I believe this change is correct and safe, but I'd appreciate it if you could test this on your own machines to see if you encounter any issues and maybe compare with `mrview` built from `master` to see if you notice any performance improvements. Also, this PR is not strictly a correctness fix, so I'd be happy if we want to target `dev` instead (I think other easy opportunities may potentially improve performance further).