Skip to content

Conversation

@lroberts36
Copy link
Collaborator

@lroberts36 lroberts36 commented Jun 24, 2025

PR Summary

Does something similar to #1196.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@lroberts36 lroberts36 changed the title WIP: Add more teams per communication buffer Add more teams per communication buffer Jun 26, 2025
@lroberts36 lroberts36 requested review from bprather and pgrete June 26, 2025 22:32
@lroberts36
Copy link
Collaborator Author

@pgrete: This should now be at the point you can test and see what sort of a performance impact this has for you.

@lroberts36
Copy link
Collaborator Author

Based on the tests, it looks like there is some issue on gpu. I will take a look next week.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this improves performance as we learned in the hackathon, LGTM

@pgrete
Copy link
Collaborator

pgrete commented Jul 4, 2025

I'll test this next week. I finally have the benchmark/performance pipeline setup, which will make this easier

@pgrete
Copy link
Collaborator

pgrete commented Jul 9, 2025

Alright, here are the first numbers and they're looking good!

image

Without any finetuning (randomly picking 8 teams) the total performance (including hydro) went up by 50% in the single block case and is still up by 10% for the previously optimal case (with 8 blocks per device).

@lroberts36
Copy link
Collaborator Author

@pgrete: I didn't think carefully about nteams_per_boundary_buffer when I added it. Perhaps it would make more sense to have a parameter minimum_number_of_teams_for_boundary_kernel and then use that to calculate nteams_per_boundary_buffer.

@pgrete
Copy link
Collaborator

pgrete commented Jul 9, 2025

Here's even more detailed timing info (based on the 4.5 year old WIP PR #388) 😄

image

Apart from the obvious changes, I'm also surprised about the impact of the reduction across many blocks.

Given that I now have a test env, I'll do some more test to get a sense of what's most relevant (e.g., with regard to the choice on whether to set a fixed number or a minumum number as you say).
It'd also be good to get some info from other codes as this seems to be performance critical.

@lroberts36
Copy link
Collaborator Author

Apart from the obvious changes, I'm also surprised about the impact of the reduction across many blocks.

Do you mean the reduction in the SendBounds kernel?

@lroberts36
Copy link
Collaborator Author

@pgrete: What is the status of your review here?

@pgrete
Copy link
Collaborator

pgrete commented Jul 28, 2025

Apart from the obvious changes, I'm also surprised about the impact of the reduction across many blocks.

Do you mean the reduction in the SendBounds kernel?

Sorry, this was not clear. I was referring to the EstimateTimestep piece, that gets about 2-3x more expensive when going from a single to 64 blocks per device.
So there's probably also some room for improvement (completely independent of this PR).

@pgrete
Copy link
Collaborator

pgrete commented Jul 28, 2025

@pgrete: What is the status of your review here?

I'm in favor of some "min number of teams" variable (as it's more generally applicable) and we may even be able to tie it to some hardware info that we query from Kokkos.
I'll ask in the Kokkos Slack about this.
Regarding the "min number of teams" change, do you want to go for it or should I make the changes to this branch?

@lroberts36
Copy link
Collaborator Author

@pgrete: What is the status of your review here?

I'm in favor of some "min number of teams" variable (as it's more generally applicable) and we may even be able to tie it to some hardware info that we query from Kokkos. I'll ask in the Kokkos Slack about this. Regarding the "min number of teams" change, do you want to go for it or should I make the changes to this branch?

@pgrete: If you want to take a crack at it, that sounds good to me.

@lroberts36
Copy link
Collaborator Author

@pgrete: What is the status of this PR? Are you still thinking about changing how you specify the number of teams per buffer?

@pgrete
Copy link
Collaborator

pgrete commented Sep 5, 2025

@pgrete: What is the status of this PR? Are you still thinking about changing how you specify the number of teams per buffer?

Yes, just didn't get to it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants