Conversation

@akesandgren
Contributor

@akesandgren akesandgren commented Aug 11, 2018

(created using eb --new-pr)
New version of TensorFlow (1.10.0) for fosscuda 2018b. Depends on easybuilders/easybuild-easyblocks#1453

@akesandgren
Contributor Author

Test report by @akesandgren
SUCCESS
Build succeeded for 10 out of 10 (4 easyconfigs in this PR)
b-an03.hpc2n.umu.se - Linux ubuntu 16.04, Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, Python 2.7.12
See https://gist.github.com/4534e21ff4e3369778b75671c99d86fe for a full test report.

@akesandgren
Contributor Author

Looks like they fixed the AVX512 NaN problem in 1.10, so don't merge this just yet. I'm rerunning the tests without the preconfigopts setting.

description = "An open-source software library for Machine Intelligence"

toolchain = {'name': 'fosscuda', 'version': '2018b'}
toolchainopts = {'opt': True}
Member

@akesandgren opt is enabled by default, so you can drop this?

('cuDNN', '7.1.4.18'),
]

preconfigopts = 'export CC_OPT_FLAGS="$CC_OPT_FLAGS -mno-avx512f" && '
Member

@akesandgren This can be removed if the issues on AVX512 are indeed fixed

Contributor Author

I wanted to verify the AVX512 problem using an Intel-based build, but boringssl/grpc/libssl are causing problems there. I'll try a bit more to get that build to work first.
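
Whether the AVX512 problem can be reproduced at all depends on the test host actually exposing the `avx512f` CPU flag. A minimal Python sketch for checking that on Linux is given here; the helper names `cpu_flags` and `has_avx512f` are made up for illustration and are not part of EasyBuild or TensorFlow.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough: all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512f(cpuinfo_path="/proc/cpuinfo"):
    """True if the host CPU advertises AVX-512 Foundation support (Linux only)."""
    with open(cpuinfo_path) as handle:
        return "avx512f" in cpu_flags(handle.read())
```

On a host where `has_avx512f()` returns False, a build without `-mno-avx512f` says little about whether the NaN problem is really gone.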

@boegel
Member

boegel commented Aug 13, 2018

@akesandgren You'll need to resolve a conflict that was introduced by merging #6643 (Bazel) just now, I overlooked that a patch was included in your PR for that easyconfig...

sources = ['v%(version)s.tar.gz']
patches = [
'%(name)s-%(version)s_remove-msse-hardcoding.patch',
'%(name)s-%(version)s_dont_expand_cuda_cudnn_path.patch',
Member

@akesandgren Please add back TensorFlow-1.5.0_swig-env.patch; it seems to be required. Without it, the build fails for me with: swig failed: error executing command

… toolchainopts. Drop -mno-avx512f since v1.10 doesn't seem to have problems with AVX512 any longer (at least not with foss).
@boegel
Member

boegel commented Aug 15, 2018

@akesandgren Do you mind submitting another test report after your last changes?

boegel
boegel previously approved these changes Aug 15, 2018
@vanzod
Member

vanzod commented Aug 16, 2018

@akesandgren I noticed that in the bazel parameters to build TF there is --config=mkl even if this is built with foss. Is that ok?

@vanzod
Member

vanzod commented Aug 16, 2018

Test report by @vanzod
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in this PR)
cermis - Linux debian 9.4, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, Python 2.7.13
See https://gist.github.com/1df71e2e3938fb3f709b088e33460598 for a full test report.

@vanzod
Member

vanzod commented Aug 16, 2018

I may have run out of disk space. I'll test it again tomorrow.

@akesandgren
Contributor Author

Yes, it is there to enable mkldnn, not MKL as such, and with TF you want mkldnn at all times.

@akesandgren
Contributor Author

Test report by @akesandgren
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
b-cn1522.hpc2n.umu.se - Linux ubuntu 16.04, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, Python 2.7.12
See https://gist.github.com/f9100d35e10f53716ac1e599e0948753 for a full test report.

@vanzod
Member

vanzod commented Aug 16, 2018

Test report by @vanzod
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in this PR)
cermis - Linux debian 9.4, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, Python 2.7.13
See https://gist.github.com/94eed07cf15fd1fd7b244ccb3643044c for a full test report.

@verdurin
Member

Test report by @verdurin
FAILED
Build succeeded for 11 out of 13 (4 easyconfigs in this PR)
easybuild.novalocal - Linux centos linux 7.5.1804, Intel Xeon E312xx (Sandy Bridge), Python 2.7.5
See https://gist.github.com/cd6798e272e5f8da220b005d5450bba6 for a full test report.

@boegel
Member

boegel commented Aug 16, 2018

@verdurin cudnn-9.2-linux-x64-v7.1.4.18.tgz needs to be downloaded manually

@verdurin Any idea what's causing the build problem in @vanzod's test report?


# The full version of the library can be found using
# strings -a cuda/lib64/libcudnn_static.a | grep cudnn_version_
# Download and rename.
Member

@akesandgren Please make the renaming instructions explicit for the tarball
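
For reference, the `strings -a ... | grep cudnn_version_` lookup from the easyconfig comment can be approximated in Python roughly as follows. This is only a sketch: the function names are hypothetical, the exact form of the embedded version string is not confirmed here, and unlike `strings` it returns only the part of each string from the marker up to the next non-printable or whitespace byte.

```python
import re

# The marker named in the easyconfig comment, followed by any printable
# non-space ASCII characters that belong to the same embedded string.
_MARKER = re.compile(rb"cudnn_version_[\x21-\x7e]*")


def find_cudnn_version_strings(data):
    """Return the unique 'cudnn_version_...' strings embedded in `data` (bytes)."""
    return sorted({match.group(0).decode("ascii") for match in _MARKER.finditer(data)})


def scan_file(path):
    """Scan a file such as cuda/lib64/libcudnn_static.a (read fully into memory)."""
    with open(path, "rb") as handle:
        return find_cudnn_version_strings(handle.read())
```

The static library is large, so reading it in one go is acceptable for a one-off check but not something to do routinely.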

@akesandgren
Contributor Author

@verdurin Huh?? What do you mean?

@verdurin
Member

@akesandgren I mean there are Bazel processes still running from my last attempt to build TF:

[centos@easybuild easyconfigs]$ ps aufx | grep bazel
centos   15150  0.0  0.0 112708   980 pts/3    S+   09:37   0:00      \_ grep --color=auto bazel
centos    7226  2.5 10.7 10009500 1751600 ?    Ssl  07:29   3:18 bazel(tensorflow-1.10.0) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/eb-zLPYkb/tmpLBKefm-bazel-build -Xverify:none -Djava.util.logging.config.file=/tmp/eb-zLPYkb/tmpLBKefm-bazel-build/javalog.properties -Djava.library.path=/home/centos/.cache/bazel/_bazel_centos/install/1daa549c2b1a5b1e314e20c494df8276/_embedded_binaries/ -Dfile.encoding=ISO-8859-1 -jar /home/centos/.cache/bazel/_bazel_centos/install/1daa549c2b1a5b1e314e20c494df8276/_embedded_binaries/A-server.jar --max_idle_secs=10800 --connect_timeout_secs=30 --output_user_root=/home/centos/.cache/bazel/_bazel_centos --install_base=/home/centos/.cache/bazel/_bazel_centos/install/1daa549c2b1a5b1e314e20c494df8276 --install_md5=1daa549c2b1a5b1e314e20c494df8276 --output_base=/tmp/eb-zLPYkb/tmpLBKefm-bazel-build --workspace_directory=/home/centos/.local/easybuild/build/TensorFlow/1.10.0/fosscuda-2018b-Python-2.7.15/tensorflow-1.10.0 --default_system_javabase=/home/centos/.local/easybuild/software/Java/1.8.0_172 --deep_execroot --expand_configs_in_place --noexperimental_oom_more_eagerly --experimental_oom_more_eagerly_threshold=100 --write_command_log --nowatchfs --nofatal_event_bus_exceptions --client_debug=false --product_name=Bazel --option_sources=output_Ubase:

@akesandgren
Contributor Author

Ok.... don't remember seeing that lately. Although my builds haven't failed in a while, at least not hard.
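
A quick way to spot Bazel servers left over from a failed build is to filter `ps aux` output, as in the listing above. The sketch below does that in Python; `bazel_processes` is a hypothetical helper, and the `grep` exclusion is a crude heuristic that only aims to drop the grep process itself.

```python
def bazel_processes(ps_output):
    """Return (pid, command) pairs for `ps aux`-style lines that mention bazel."""
    found = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something that is not a process entry
        command = fields[10]
        if "bazel" in command and "grep" not in command:
            found.append((int(fields[1]), command))
    return found
```

It can be fed the text returned by `subprocess.check_output(["ps", "aux"], universal_newlines=True)`; whether to kill the processes it reports is left to the person running the build.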

@boegel
Member

boegel commented Aug 17, 2018

@vanzod Can you provide a full (debug) log for your last test report, so we can try and figure out this collect2: error: ld returned 1 exit status problem? It's unclear what the actual problem is now...

@vanzod
Member

vanzod commented Aug 17, 2018

Here is the full log.

easybuild-TensorFlow-1.10.0-20180816.084137.cieiw.log.gz

@akesandgren
Contributor Author

@vanzod Still no trace of the actual error. I think you need to run that link command by hand.
Since Bazel resets the environment, that should be doable.

@boegel
Member

boegel commented Aug 20, 2018

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
nic170 - Linux centos linux 7.5.1804, Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, Python 2.7.5
See https://gist.github.com/9fe214ac47f05fe67cfd1c5e35d0e675 for a full test report.

@vanzod
Member

vanzod commented Aug 20, 2018

Test report by @vanzod
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in this PR)
cermis - Linux debian 9.4, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, Python 2.7.13
See https://gist.github.com/7e1b5a5608705eb2e086028f8dba83bf for a full test report.

@vanzod
Member

vanzod commented Aug 20, 2018

Based on the log, I have manually repeated the linking of _pywrap_tensorflow_internal.so. Unfortunately the error message is even more cryptic:

$ exec env - LD_LIBRARY_PATH=/opt/easybuild/software/Core/Java/1.8.0_172/lib:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/cuDNN/7.1.4.18/lib64:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Python/2.7.15/lib/python2.7/site-packages/numpy-1.14.5-py2.7-linux-x86_64.egg/numpy/core/lib:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Python/2.7.15/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/libffi/3.2.1/lib64:/opt/easybuild/software/Compiler/GCCcore/7.3.0/libffi/3.2.1/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/GMP/6.1.2/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/SQLite/3.24.0/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/Tcl/8.6.8/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/libreadline/7.0/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/ncurses/6.1/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/bzip2/1.0.6/lib:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/ScaLAPACK/2.0.2-OpenBLAS-0.3.1/lib:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/FFTW/3.3.8/lib:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/lib:/opt/easybuild/software/Compiler/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/hwloc/1.11.10/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/numactl/2.0.11/lib:/opt/easybuild/software/Compiler/GCCcore/7.3.0/zlib/1.2.11/lib:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/CUDA/9.2.88/extras/CUPTI/lib64:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/CUDA/9.2.88/lib64:/opt/easybuild/software/Compiler/GCCcore/7.3.0/binutils/2.30/lib:/opt/easybuild/software/Core/GCCcore/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0:/opt/easybuild/software/Core/GCCcore/7.3.0/lib64:/opt/easybuild/software/Core/GCCcore/7.3.0/lib 
PATH=/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/wheel/0.31.1-Python-2.7.15/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/Bazel/0.16.0/bin:/opt/easybuild/software/Core/Java/1.8.0_172:/opt/easybuild/software/Core/Java/1.8.0_172/bin:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/Python/2.7.15/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/SQLite/3.24.0/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/Tcl/8.6.8/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/libreadline/7.0/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/ncurses/6.1/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/bzip2/1.0.6/bin:/opt/easybuild/software/MPI/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/FFTW/3.3.8/bin:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/bin:/opt/easybuild/software/Compiler/GCC-CUDA/7.3.0-2.30-9.2.88/OpenMPI/3.1.1/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/hwloc/1.11.10/sbin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/hwloc/1.11.10/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/numactl/2.0.11/bin:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/CUDA/9.2.88:/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/CUDA/9.2.88/bin:/opt/easybuild/software/Compiler/GCCcore/7.3.0/binutils/2.30/bin:/opt/easybuild/software/Core/GCCcore/7.3.0/bin:/opt/easybuild/EB-develop/easybuild-framework:/opt/ssh-ident:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games PWD=/proc/self/cwd external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -shared -o bazel-out/host/bin/tensorflow/python/_pywrap_tensorflow_internal.so '-Wl,-rpath,$ORIGIN/../../_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow' '-Wl,-rpath,$ORIGIN/../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' 
'-Wl,-rpath,$ORIGIN/../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' '-Wl,-rpath,$ORIGIN/../../_solib_local/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib' '-Wl,-rpath,$ORIGIN/../../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib' -Lbazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow -Lbazel-out/host/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/host/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Lbazel-out/host/bin/_solib_local/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib -Lbazel-out/host/bin/_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib -Wl,--version-script bazel-out/host/bin/tensorflow/python/pywrap_tensorflow_internal_versionscript.lds '-Wl,-rpath,$ORIGIN/,-rpath,$ORIGIN/..' 
-Wl,-soname,_pywrap_tensorflow_internal.so -Wl,-z,muldefs -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -pthread -Wl,-rpath,../local_config_cuda/cuda/lib64 -Wl,-rpath,../local_config_cuda/cuda/extras/CUPTI/lib64 -Wl,-S -Wl,-no-as-needed -Wl,-z,relro,-z,now '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -no-canonical-prefixes '-B/opt/easybuild/software/Compiler/GCCcore/7.3.0/binutils/2.30/bin/ -L/opt/easybuild/software/Core/GCCcore/7.3.0/lib64/ -L/opt/easybuild/software/Compiler/GCC/7.3.0-2.30/CUDA/9.2.88/lib64/' -Wl,--gc-sections -Wl,@bazel-out/host/bin/tensorflow/python/_pywrap_tensorflow_internal.so-2.params
collect2: error: ld returned 1 exit status

@wpoely86
Member

Test report by @wpoely86
FAILED
Build succeeded for 12 out of 13 (4 easyconfigs in this PR)
nic170 - Linux centos linux 7.5.1804, Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, Python 2.7.5
See https://gist.github.com/c45097596f4c46496daa4abce6445d0c for a full test report.

@akesandgren
Contributor Author

@vanzod Then I suggest adding -v and redirecting the output to a file, and then going through it step by step to see exactly what goes wrong.

@boegel
Member

boegel commented Aug 20, 2018

@vanzod Maybe the actual error is being sent to stdout for some reason, causing it to disappear into a black hole. I've seen that problem before, see bazelbuild/bazel#4053.

How about redirecting all output to stderr via >&2?
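
One way to make sure nothing gets lost when rerunning the link command by hand is to run it with an explicit environment and write stdout and stderr to the same log file. The following is only a sketch of that idea in Python 3; `run_clean` is a hypothetical helper, and the command and environment would be copied from the Bazel log.

```python
import subprocess


def run_clean(argv, env, log_path):
    """Run argv with exactly the variables in env; log stdout and stderr together."""
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT sends both streams to the same file,
        # so an error printed on either one ends up in the log.
        result = subprocess.run(argv, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Passing a small dictionary with just `PATH`, `LD_LIBRARY_PATH` and `PWD` as `env` mirrors the `env -` invocation shown earlier in this thread, and adding `-v` to the compiler driver arguments makes the log more informative.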

@vanzod
Member

vanzod commented Aug 21, 2018

I solved the issue on my system that was preventing the build from completing successfully. I can now correctly build all the software in this PR.
Unfortunately, the compute capability of the GPU in my test system is too low for this version of TF, so the mnist_with_summaries.py test fails.

LGTM

vanzod
vanzod previously approved these changes Aug 21, 2018
@boegel
Member

boegel commented Aug 21, 2018

@vanzod You mean the GPU in that system is too old to support even compute capability 3.0?

@mstud
Contributor

mstud commented Aug 22, 2018

@akesandgren: I just wanted to add that I've built this with MPI, and according to one of our users who tested it, it works just fine. So if one added

toolchainopts = {'usempi': True}

before this gets merged, I wouldn't be unhappy ;) But I can understand if more testing is needed first.

@akesandgren
Contributor Author

@mstud Yeah, I can live with stalling the merge for a while and doing some testing myself.
Do you have an example that works with MPI in the first place?

I haven't run that many cases so I'm not 100% sure I'm doing things correctly.

@boegel
Member

boegel commented Aug 22, 2018

@akesandgren How about making the change, and then issue a follow up PR in case things are broken?

I'd like to get this in ASAP, would be nice to have this included in the upcoming EasyBuild release...

@akesandgren
Contributor Author

@boegel I'm just about to run a basic test. If it works I'll update. It should be ready soon after lunch, I think.

@akesandgren akesandgren dismissed stale reviews from vanzod and verdurin via fb0438c August 22, 2018 10:35
@akesandgren
Contributor Author

Turning on MPI seems to not be a problem, since one has to change the train.Server to use grpc+mpi to activate it anyhow.
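
To illustrate why enabling MPI at build time is harmless by default: in the TF 1.x API the transport is chosen per server through the `protocol` argument of `tf.train.Server`, which stays at plain gRPC unless changed. The sketch below assumes a TF 1.x installation built with MPI support; `server_kwargs` and `start_server` are hypothetical helpers, and the cluster spec is left to the caller.

```python
def server_kwargs(job_name, task_index, use_mpi=False):
    """Keyword arguments for tf.train.Server; only `protocol` changes with MPI."""
    return {
        "job_name": job_name,
        "task_index": task_index,
        # "grpc" is the default transport; "grpc+mpi" has to be requested explicitly.
        "protocol": "grpc+mpi" if use_mpi else "grpc",
    }


def start_server(cluster_spec, job_name, task_index, use_mpi=False):
    """Create a tf.train.Server for one task of the given cluster."""
    import tensorflow as tf  # imported lazily; needs a TF 1.x installation

    return tf.train.Server(cluster_spec, **server_kwargs(job_name, task_index, use_mpi))
```

Existing scripts that never pass `protocol` keep using gRPC, which matches the observation above that the MPI path has to be activated explicitly.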

@akesandgren
Contributor Author

Test report by @akesandgren
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
b-cn1501.hpc2n.umu.se - Linux ubuntu 16.04, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, Python 2.7.12
See https://gist.github.com/f0b3073a09a507575b662cd776f1e7a4 for a full test report.

@boegel
Member

boegel commented Aug 22, 2018

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in this PR)
nic170 - Linux centos linux 7.5.1804, Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, Python 2.7.5
See https://gist.github.com/0742ef1dc2b1d33cbafcf38a7e1e139c for a full test report.

@boegel
Member

boegel commented Aug 23, 2018

Going in, thanks @akesandgren!

6 participants