@imreddyTeja imreddyTeja commented Nov 24, 2025

When a low-resolution simulation is run, the solver, which runs one thread per column, does not saturate the GPU. With the current launch configuration, a portion of the multiprocessors remains unused. This adds a function that detects that case and uses smaller block sizes.
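
To illustrate the idea, here is a minimal Python sketch of that kind of heuristic. It is not the code in this PR: the names `pick_block_size` and `launch_config`, the default of 256 threads per block, and the halving strategy are all assumptions made for illustration. It assumes `max_block` is a power-of-two multiple of the warp size.

```python
import math

def pick_block_size(n_threads, n_sms, max_block=256, warp=32):
    """Hypothetical heuristic: halve the block size until the launch has at
    least one block per multiprocessor, never going below one warp."""
    block = max_block
    while block > warp and math.ceil(n_threads / block) < n_sms:
        block //= 2
    return block

def launch_config(n_threads, n_sms, max_block=256, warp=32):
    """Return (threads_per_block, n_blocks) covering all n_threads."""
    block = pick_block_size(n_threads, n_sms, max_block, warp)
    return block, math.ceil(n_threads / block)

# Low resolution: few columns, so the block size shrinks to spread work out.
small = launch_config(5_400, n_sms=108)
# High resolution: plenty of blocks already, so the default size is kept.
large = launch_config(1_000_000, n_sms=108)
```

With 5,400 columns and 108 multiprocessors (both numbers are made up for the example), a 256-thread block would give only 22 blocks, so the sketch falls back to 32 threads per block and 169 blocks. The large case keeps 256 threads per block.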

Threads for stencils and pointwise operations were being launched by converting a single block index and a single grid index into a linear index, and then converting that into a Cartesian index ordered (I, J, V, H), whereas the data is stored as VIJFH. This adds special cases for v=63 and v=64, when i=j=4, that use multiple block and grid dimensions and index so that adjacent threads are likely to be in the same column. Profiling shows this substantially improves memory access efficiency:
Before: 2.0 of the 32 bytes transmitted per sector are utilized (L2)
After: 26.2 of the 32 bytes transmitted per sector are utilized (L2)
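
The effect of the index ordering can be seen in a toy model. The Python sketch below is an illustration, not the kernel code from this PR. It assumes that VIJFH means V is the fastest-varying dimension in memory, 4-byte elements, a single field (Nf=1), 32-byte sectors, and 32-thread warps; all function names are hypothetical. It counts how many distinct sectors one warp touches under each thread-to-index mapping.

```python
def offset_vijfh(v, i, j, f, h, Nv, Ni, Nj, Nf):
    """Element offset in a VIJFH array, assuming V varies fastest (0-based)."""
    return v + Nv * (i + Ni * (j + Nj * (f + Nf * h)))

def thread_to_ijvh(t, Nv, Ni, Nj):
    """Old ordering: linear thread index unpacked as (I, J, V, H), I fastest."""
    i = t % Ni; t //= Ni
    j = t % Nj; t //= Nj
    v = t % Nv; t //= Nv
    return v, i, j, t

def thread_to_vijh(t, Nv, Ni, Nj):
    """Column-friendly ordering: V fastest, so neighbours share a column."""
    v = t % Nv; t //= Nv
    i = t % Ni; t //= Ni
    j = t % Nj; t //= Nj
    return v, i, j, t

def warp_sectors(mapping, Nv=64, Ni=4, Nj=4, Nf=1,
                 elem_bytes=4, sector_bytes=32, warp=32):
    """Number of distinct memory sectors touched by the first warp."""
    sectors = set()
    for t in range(warp):
        v, i, j, h = mapping(t, Nv, Ni, Nj)
        byte = offset_vijfh(v, i, j, 0, h, Nv, Ni, Nj, Nf) * elem_bytes
        sectors.add(byte // sector_bytes)
    return len(sectors)

old_sectors = warp_sectors(thread_to_ijvh)
new_sectors = warp_sectors(thread_to_vijh)
```

In this model the (I, J, V, H) ordering makes one warp touch 16 sectors to read 128 bytes, while the V-fastest ordering touches 4 fully used sectors. The model is far simpler than a real kernel and is not expected to reproduce the profiler figures above; it only shows the direction of the effect.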

TODO:
Test with more ClimaAtmos and coupled simulations.

  • Code follows the style guidelines OR N/A.
  • Unit tests are included OR N/A.
  • Code is exercised in an integration test OR N/A.
  • Documentation has been added/updated OR N/A.

@imreddyTeja imreddyTeja force-pushed the tr/mem-access-patterns branch from bde8907 to 33a1c4b Compare December 3, 2025 18:07