@lexming
Contributor

@lexming lexming commented Mar 2, 2022

(created using eb --new-pr)

Workaround for issue pytorch/pytorch#72516

Some tests fail even though the code under test works as intended, because the highlight-based assertion fails to locate the highlighted part of the error message. Replacing those assertions with a simple regex match solves the issue.
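
For illustration only, the idea of asserting on the error text with a regex, instead of on where the highlight markers land, could be sketched as follows. The patch itself is not shown in this thread, so this is not its content: `compile_fn_with_deleted_name` is a hypothetical stand-in for the JIT compilation step, and its message is merely modeled on the JIT error format.

```python
import unittest


def compile_fn_with_deleted_name():
    # Hypothetical stand-in for the JIT compilation step (assumption, not a
    # PyTorch API): raises an error whose message is modeled on the JIT
    # format, i.e. message text, source excerpt, and a "~ <--- HERE" marker.
    raise RuntimeError(
        "undefined value a:\n"
        '  File "test_builtins.py", line 94\n'
        "                return a\n"
        "                       ~ <--- HERE\n"
    )


class TestDelSketch(unittest.TestCase):
    def test_del(self):
        # Match only on the message text; the check does not depend on
        # where the "~" markers end up in the rendered source excerpt.
        with self.assertRaisesRegex(RuntimeError, r"undefined value a"):
            compile_fn_with_deleted_name()


def run_sketch():
    # Run the single test case and return the unittest result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDelSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A regex check like this still verifies that the expected error is raised, but it is weaker than a highlight check in that it no longer verifies which part of the source is marked.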

@lexming
Contributor Author

lexming commented Mar 2, 2022

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on login1

PR test command 'EB_PR=15073 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_15073 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8213

Test results coming soon (I hope)...

- notification for comment with ID 1057494985 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/4a739d215a3ac59e6524e8f7447a26be for a full test report.

@casparvl
Contributor

casparvl commented Mar 7, 2022

Looks reasonable to me. Any idea why we haven't hit this with the original PR: does it just happen on certain systems and/or under certain circumstances?

Out of personal interest: do you happen to know if this is still in PyTorch-1.11 as well? I've created an EasyConfig for 1.11-rc4 that I'd like to contribute once the final release is out. Just wondering if I should include this patch right away :)

Edit: to answer my own question: I looked at 1.11.0-rc5 and at the lines of code that your patch changes. It's still relevant, also for PT 1.11.0. But I did not hit this issue myself when installing based on my 1.11-rc4 EasyConfig. So I guess that just leaves my first question: any idea when one hits this particular problem (and when not)? :)

@casparvl
Contributor

casparvl commented Mar 9, 2022

Test report by @casparvl
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.11, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 2.7.16
See https://gist.github.com/a41714cc77d8b49fa58038ad2fc3f29f for a full test report.

@lexming
Contributor Author

lexming commented Mar 10, 2022

@casparvl this is a good question, but I'm afraid I do not have a clear answer. I also wonder why nobody else hit this issue before me. On our side, these failures are systematic: the same tests fail in every build. I suspect it might be related to using a temp dir on a shared filesystem, but I have no real proof. I'll run a couple of tests to shed some more light on this.

@boegel boegel added this to the next release (4.5.4?) milestone Mar 16, 2022
@casparvl
Contributor

@lexming Any updates on this? To me, the patch seems fairly low-impact (it's "only" about a test, and my understanding is that it still checks the results). At the same time, I saw the original issue on the PyTorch issue tracker, and saw that the devs weren't too keen on making this change themselves, though I couldn't fully follow their reasoning.

@boegel boegel changed the title from "PyTorch v1.10.0: replace assertions by highlight in JIT tests to simple regex matches" to "replace assertions by highlight in JIT tests to simple regex matches in PyTorch v1.10.0 tests" Mar 28, 2022
@boegel
Member

boegel commented Mar 28, 2022

Test report by @boegel
SUCCESS
Build succeeded for 25 out of 25 (2 easyconfigs in total)
node3303.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.47.03, Python 3.6.8
See https://gist.github.com/10c53d53aa64cd5cd97c9866c78caf04 for a full test report.

@boegel
Member

boegel commented Mar 28, 2022

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 510.47.03, Python 3.6.8
See https://gist.github.com/ac5f3313d590bf093b1a3d42a6250af6 for a full test report.

@lexming
Contributor Author

lexming commented Apr 1, 2022

I ran a few more tests and can confirm that these tests only fail when the build is carried out on a shared filesystem (GPFS). So another workaround is to run eb with --tmpdir pointing to a local filesystem.

UPDATE: my previous test only worked because it was actually built with this patch applied. The installation from current develop also fails with a local temp directory. The snippet below shows an example failure where the installation directory is on a shared filesystem but --tmpdir is in /tmp:

======================================================================
ERROR: test_del (jit.test_builtins.TestBuiltins)
----------------------------------------------------------------------
RuntimeError: 
undefined value a:
  File "/theia/scratch/brussel/vo/000/bvo00005/vsc10122/easybuild/install/skylake/build/PyTorch/1.10.0/foss-2021a/pytorch/test/jit/test_builtins.py", line 94
                a = x ** 2
                del a
                return a
                       ~ <--- HERE


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/theia/scratch/brussel/vo/000/bvo00005/vsc10122/easybuild/install/skylake/build/PyTorch/1.10.0/foss-2021a/pytorch/test/jit/test_builtins.py", line 91, in test_del
    def fn(x):
  File "/tmp/eb-HcdqX4/tmp8BKhmT/lib/python3.9/site-packages/torch/testing/_internal/jit_utils.py", line 92, in __exit__
    FileCheck().check_source_highlighted(self.highlight).run(str(value))
RuntimeError: Expected to find "a"highlighted but it is not.

undefined value a:
  File "/theia/scratch/brussel/vo/000/bvo00005/vsc10122/easybuild/install/skylake/build/PyTorch/1.10.0/foss-2021a/pytorch/test/jit/test_builtins.py", line 94
                a = x ** 2

Full build log: https://gist.github.com/lexming/72be3627f389cabdd4ebc95361d1f79c

So my suspicions about the storage were not correct. I'll provide more info in the upstream issue, but this patch is still necessary for us.

@boegel boegel modified the milestones: 4.5.5, release after 4.5.5 Jun 4, 2022
@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 10 out of 12 (2 easyconfigs in total)
taurusi8011 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/ad64041f1808e9e201812dceb96bbb78 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 11 out of 12 (2 easyconfigs in total)
taurusa14 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/c8ea0b28cd860527b37ce0cea33f8d86 for a full test report.

@lexming
Contributor Author

lexming commented Jan 17, 2023

@lexming lexming closed this Jan 17, 2023
@lexming lexming deleted the 20220303000127_new_pr_PyTorch1100 branch January 17, 2023 13:39
@Flamefire
Contributor

Maybe rebase this after #15904 was merged which got the other tests working

@boegel boegel modified the milestones: next release (4.7.1?), 4.x Jan 17, 2023

5 participants