@Flamefire
Contributor

@Flamefire Flamefire commented Sep 25, 2020

Contains security fixes
Removes the superfluous keras-applications and scipy packages

edit (@boegel): OK because TensorFlow 2.3.0 easyconfigs have only been merged very recently into develop via #11040...

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in this PR)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/cfd01771b6106380b7f6f37875fdcc44 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
taurusi5196.taurus.hrsk.tu-dresden.de - Linux RHEL 7.8, x86_64, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, Python 2.7.5
See https://gist.github.com/3ac6fe27d1d954b0ad727964feffd717 for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

@Flamefire Again trouble on POWER?

@boegel boegel changed the title from "Update TensorFlow 2.3.1 and it's dependencies" to "in-place update of TensorFlow 2.3.0 easyconfigs to version 2.3.1" Sep 25, 2020
@Flamefire
Contributor Author

Yes -.- Although it looks like a flake in the filesystem. I restarted the build, but it has been running for 2 hours now.

@boegel boegel added the update label Sep 25, 2020
@boegel boegel added this to the next release (4.3.1) milestone Sep 25, 2020
@lexming
Contributor

lexming commented Sep 25, 2020

Test report by @lexming
FAILED
Build succeeded for 6 out of 9 (3 easyconfigs in this PR)
node375.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/6774d91ddf80b76e8885b23fe77443bf for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11375 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11375 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7966

Test results coming soon (I hope)...

- notification for comment with ID 699013665 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire
Contributor Author

@lexming please rebuild double-conversion

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in this PR)
taurusml21 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/ea1fa002035b7dee585a72be9bbb4986 for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

Test report by @boegel
FAILED
Build succeeded for 9 out of 12 (3 easyconfigs in this PR)
node3162.skitty.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/f721507b109546e0009afd0e44fac959 for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
node3407.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/1baa102fa06b0afc4d985b4c6148334e for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
node2706.swalot.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 2.7.5
See https://gist.github.com/d57da72dce96f15e3ee67aef7c12a3d2 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
generoso-x-2 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/209154df9926efb4b21a79815ad726d6 for a full test report.

@lexming
Contributor

lexming commented Sep 26, 2020

Test report by @lexming
FAILED
Build succeeded for 1 out of 3 (3 easyconfigs in this PR)
node157.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/8838d994f39f4eb2a319fb642b47306d for a full test report.

@Flamefire
Contributor Author

@lexming

ModuleNotFoundError: No module named 'google.protobuf'

That is strange. Can you try loading the build environment and check whether the import works there? If it doesn't, there is likely another google folder in your PYTHONPATH (=/user/brussel/101/vsc10122/.local/easybuild-ivybridge/software/TensorFlow/2.3.1-foss-2019b-Python-3.7.4/lib/python3.7/site-packages:/user/brussel/101/vsc10122/.local/easybuild-ivybridge/software/TensorFlow/2.3.1-foss-2019b-Python-3.7.4/lib/python3.7/site-packages:/apps/brussel/CO7/ivybridge-ib/software/protobuf-python/3.10.0-foss-2019b-Python-3.7.4/lib/python3.7/site-packages:/apps/brussel/CO7/ivybridge-ib/software/h5py/2.10.0-foss-2019b-Python-3.7.4/lib/python3.7/site-packages:/apps/brussel/CO7/ivybridge-ib/software/SciPy-bundle/2019.10-foss-2019b-Python-3.7.4/lib/python3.7/site-packages:/apps/brussel/CO7/ivybridge-ib/software/Python/3.7.4-GCCcore-8.3.0/easybuild/python)

Check each of those directories for a google/__init__.py, which is usually what prevents other google packages from being found.
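
A minimal sketch of that check, in case it saves some typing. The function name find_google_shadows is made up for illustration and is not part of EasyBuild or protobuf. It only looks for a google/__init__.py in each PYTHONPATH entry and skips duplicate entries:

```python
import os


def find_google_shadows(pythonpath):
    """Return the PYTHONPATH entries that contain google/__init__.py.

    Hypothetical helper for illustration only. Entries are reported in
    the order they appear; empty and repeated entries are skipped.
    """
    hits = []
    seen = set()
    for entry in pythonpath.split(os.pathsep):
        if not entry or entry in seen:
            continue
        seen.add(entry)
        # A google/__init__.py here marks a regular "google" package
        if os.path.isfile(os.path.join(entry, "google", "__init__.py")):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Run this inside the loaded build environment
    for path in find_google_shadows(os.environ.get("PYTHONPATH", "")):
        print(path)
```

Any directory it prints other than the protobuf-python one would be a candidate for the conflict.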

@boegel
Member

boegel commented Sep 30, 2020

@Flamefire Is this related to #11143?

@lexming Try re-installing protobuf-python-3.10.0-fosscuda-2019b-Python-3.7.4.eb and protobuf-python-3.10.0-foss-2019b-Python-3.7.4.eb with latest develop branch or EasyBuild v4.3.0?

@Flamefire
Contributor Author

Ah, yes exactly. That was the reason for the breaking change with the downloaded archive

@lexming
Contributor

lexming commented Sep 30, 2020

@Flamefire @boegel thanks for the feedback, testing again

Contributor

@lexming lexming left a comment

Horovod fails to detect NCCL on our system:

     -- Linking against static NCCL library
     CMake Error at /theia/home/apps/CO7/skylake/software/CMake/3.15.3-GCCcore-8.3.0/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
       Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARY)
     Call Stack (most recent call first):
       /theia/home/apps/CO7/skylake/software/CMake/3.15.3-GCCcore-8.3.0/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
       cmake/Modules/FindNCCL.cmake:42 (find_package_handle_standard_args)
       CMakeLists.txt:174 (find_package)

     -- Configuring incomplete, errors occurred!

I fixed it by adding pkg-config as a build dependency of Horovod

@Flamefire
Contributor Author

Flamefire commented Oct 1, 2020

Hm, that doesn't make sense: that module doesn't use pkg-config at all, so adding it shouldn't change anything.

Can you check with ml show NCCL/2.4.8-gcccuda-2019b that the module sets CMAKE_PREFIX_PATH? Can you also check what loading pkg-config changes? Maybe post the output of ml show for it as well, and paste the "Found NCCL" line from the succeeding build (it should include a location).
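
If you want to script the first check, here is a rough sketch. It assumes you have already captured the text printed by ml show into a string, and that the output uses either the Lmod form prepend_path("VAR","dir") or the Tcl form prepend-path VAR dir. The helper name prepended_paths is hypothetical:

```python
import re


def prepended_paths(module_show_output, var="CMAKE_PREFIX_PATH"):
    """Return the directories a module prepends to the given variable.

    Hypothetical helper for illustration only. It handles two assumed
    output styles:
      Lmod: prepend_path("CMAKE_PREFIX_PATH","/some/dir")
      Tcl:  prepend-path CMAKE_PREFIX_PATH /some/dir
    """
    pattern = re.compile(
        r"prepend[-_]path\s*\(?\s*[\"']?"
        + re.escape(var)
        + r"[\"']?\s*,?\s*[\"']?([^\"')\s]+)"
    )
    return pattern.findall(module_show_output)
```

An empty list for the NCCL module would mean CMAKE_PREFIX_PATH is not being set, which would explain why FindNCCL.cmake cannot locate the headers and library.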

PS: I just saw that there have been 2 bugfix releases of Horovod by now. @boegel Can I update that too in this PR, or in a follow-up? In particular, it includes horovod/horovod#2272, which seems to make certain use cases work again; otherwise an import of a horovod file will fail.

@lexming
Contributor

lexming commented Oct 1, 2020

Test report by @lexming
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in this PR)
node375.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/a28d152c7d3eee7e39f9a28d9ec07956 for a full test report.

@lexming
Contributor

lexming commented Oct 1, 2020

Test report by @lexming
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
node101.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/95b7aa64e77dc9b1d0f9174f0a276856 for a full test report.

@boegel
Member

boegel commented Oct 2, 2020

@Flamefire I would update Horovod in a follow-up PR, let's try and get this one merged first...

@boegel
Member

boegel commented Oct 2, 2020

@lexming The failing test report on node375 is due to a missing $CMAKE_PREFIX_PATH update in the NCCL module?

@lexming
Contributor

lexming commented Oct 2, 2020

@Flamefire @boegel the real issue is indeed the lack of $CMAKE_PREFIX_PATH in our deployment of NCCL. It should be completely fixed now. Submitting one final test to counter the failed test.

@lexming
Contributor

lexming commented Oct 2, 2020

Test report by @lexming
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in this PR)
node358.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/8a0c0dd41bbccbc594c348771d10bc71 for a full test report.

Contributor

@lexming lexming left a comment

LGTM

@lexming
Contributor

lexming commented Oct 2, 2020

Going in, thanks @Flamefire !

@lexming lexming merged commit 8a83d16 into easybuilders:develop Oct 2, 2020
@Flamefire Flamefire deleted the tf231 branch October 5, 2020 13:39
4 participants