{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0 #23923
Conversation
Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use …

Test report by @Flamefire

Test report by @Flamefire

Test report by @Flamefire

Test report by @Flamefire

Test report by @Flamefire

Test report by @boegel

Test report by @boegel

Seemingly changed by mistake. Fixed.

Test report by @boegel

Test report by @boegel
The H100 failures are mostly from … With the (now) default of 10 allowed failures, that should be enough to pass.

As for the V100: I already had more failures on A100, suggesting they don't test on "older" GPUs anymore... If you can attach the log of the test step, I'll take a look at the failures.
Test report by @Flamefire: SUCCESS on rerun, but upload failed due to expired token: …

Test report by @Flamefire

Test report by @Flamefire

Test report by @boegel
4 (of 8) failures are in test_cpu_select_algorithm and test_select_algorithm, which I assume have the same cause. However, the errors are not in the gist, so I can't tell. Is it possibly this one? …

Then I have a patch for that. In any case: I removed the "allowed failures = 6" setting, so the default of 10 is used now, which would make your run pass.
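For context, the allowed-failures knob is set in the easyconfig itself. A minimal sketch, assuming the PyTorch easyblock exposes it as a custom parameter named `max_failed_tests` (check the easyblock version in use before relying on this name):

```python
# Sketch of an easyconfig fragment (not a full file); assumes the PyTorch
# easyblock's custom parameter for tolerated test failures is 'max_failed_tests'.
# Omitting the line would fall back to the easyblock default (now 10).
max_failed_tests = 10
```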
The issue is this: …

Can I see the full log?
Unfortunately the log file wasn't retained... I can trigger it again and make sure the log file is retained.
@boegelbot please test @ jsc-zen3-a100
@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command: '…'

Test results coming soon (I hope)...

Details: notification for comment with ID 3725616778 processed. Message to humans: this is just bookkeeping information for me.
If you still have access to the XML files (test_reports folder), I can check why it associates …
Test report by @boegel

Mostly …
@Flamefire Found this in the log: …

I'll share the whole log with you (via Slack).
Weird, it looks like that failure was written to the XML file: … But still: …

So it seems the parser didn't pick up the failure. Almost all the rest seem to be specific to V100 and its non-support of BF16 (I reported it at pytorch/pytorch#172085): …

If we can fix the missed …

It would still be good to know why the parser didn't find it, to avoid similar issues. It could be related, though: the faulty close might cause the writing of the XML file to fail. Patching the failing tests is also possible if this does indeed return false for V100s (could use a pip-installed pytorch to test): …
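As a quick check of the V100/BF16 point, a minimal sketch using a pip-installed torch on the node in question; `torch.cuda.is_bf16_supported()` is the standard PyTorch query, though whether the failing tests gate on exactly this call is an assumption:

```python
# Minimal sketch: report whether PyTorch considers BF16 supported on the
# visible GPU (e.g. run with a pip-installed torch on a V100 node).
import torch

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device visible")
```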
Test report by @Flamefire

Test report by @boegelbot
Test report by @Flamefire

Maybe we should make this a warning only.

Warning by default, but with a way to make it a hard error, perhaps? @Flamefire In any case, I don't think we need to block this PR any further, what do you think?

An EC option … In my report the cause is a timeout, after which the test process gets killed without writing an XML entry. However, the test has "rerun" entries, so we could use that: if a test only shows up as "rerun" but not as "success", it is an error.

Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs?
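A minimal sketch of that "rerun without success" idea, assuming JUnit-style XML reports in the test_reports folder with rerun attempts recorded as `<rerun...>` child elements; the exact tag names depend on the pytest plugins PyTorch uses, so treat this as an illustration, not the easyblock's actual parser:

```python
# Sketch: flag tests that appear with rerun markers but never with a clean
# (non-failing) result, which could indicate the test process was killed
# before a final XML entry was written.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def rerun_without_success(report_dir):
    seen_rerun, seen_clean = set(), set()
    for xml_file in Path(report_dir).rglob("*.xml"):
        for case in ET.parse(xml_file).getroot().iter("testcase"):
            name = f"{case.get('classname')}.{case.get('name')}"
            tags = {child.tag for child in case}
            if any(tag.startswith("rerun") for tag in tags):
                seen_rerun.add(name)
            if not tags & {"failure", "error", "skipped"}:
                seen_clean.add(name)
    return sorted(seen_rerun - seen_clean)

if __name__ == "__main__":
    report_dir = sys.argv[1] if len(sys.argv) > 1 else "test_reports"
    for test in rerun_without_success(report_dir):
        print("rerun without a final success:", test)
```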
@Flamefire I'm in favor of allowing some more failures by default, maybe even up to … The issues about not finding the result of a test should be less fatal too, but that's work for the easyblock, so it doesn't need to block this PR.
Test report by @Flamefire
My builds are currently on day 7+ of running PyTorch tests. Do you have any suggestions to make them run faster? Should I just always build PyTorch on a full node?
7 days is certainly too much. With 2.9.1 I identified an issue that caused an infinite hang, but that exact issue is not present in 2.7. Maybe check if any sub-process has been hanging for days, or if the tests are just very slow on your machine. I do indeed use a full node.
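One way to look for a hung sub-process rather than merely slow tests; a sketch assuming psutil is available and that the test workers show up as python processes with "test" somewhere in their command line:

```python
# Sketch: list python test processes that have been running for more than a day.
import time
import psutil  # assumed to be installed (pip install psutil)

ONE_DAY = 24 * 3600

for proc in psutil.process_iter(["pid", "name", "cmdline", "create_time"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        age = time.time() - proc.info["create_time"]
        if "python" in (proc.info["name"] or "") and "test" in cmdline and age > ONE_DAY:
            print(f"PID {proc.info['pid']}: running {age / 3600:.0f} h: {cmdline[:120]}")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```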
Oh, now the first one has finished. So this is an issue for the easyblock, I guess. The slow test is probably …, which has been running since Jan 13 in the other build. Anyway, I'm happy with the state of this PR, so go ahead and merge when you are happy with it.
This here is an issue worth checking: …

Can you attach the full log and ideally the …?
boegel left a comment:
lgtm
It's high time that we get this merged.
There will probably be follow-up PRs (especially for the PyTorch easyblock), but this has proven to be mature across a variety of systems.
@Flamefire Thanks a lot for all the effort on this!
Going in, thanks @Flamefire!

(created using eb --new-pr)

Requires: Bundle generic easyblock to support use of post-install patches (easybuild-easyblocks#3887)

I included the easyconfigs here for convenience.