-
Notifications
You must be signed in to change notification settings - Fork 772
{lib}[fosscuda/2018b] TensorFlow v1.10.0 #6677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
{lib}[fosscuda/2018b] TensorFlow v1.10.0 #6677
Conversation
|
Test report by @akesandgren |
|
Looks like they fixed the AVX512 NaN problem in 1.10, so don'y merge this just yet. I'm rerunning tests without the preconfigopts setting. |
| description = "An open-source software library for Machine Intelligence" | ||
|
|
||
| toolchain = {'name': 'fosscuda', 'version': '2018b'} | ||
| toolchainopts = {'opt': True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akesandgren opt is enabled by default, so you can drop this?
| ('cuDNN', '7.1.4.18'), | ||
| ] | ||
|
|
||
| preconfigopts = 'export CC_OPT_FLAGS="$CC_OPT_FLAGS -mno-avx512f" && ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akesandgren This can be removed if the issues on AVX512 are indeed fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to verify the AVX512 problem using an intel based build but boringssl/grpc/libssl are causing problems there. Will try a bit more to get that build to work first.
|
@akesandgren You'll need to resolve a conflict that was introduced by merging #6643 ( |
| sources = ['v%(version)s.tar.gz'] | ||
| patches = [ | ||
| '%(name)s-%(version)s_remove-msse-hardcoding.patch', | ||
| '%(name)s-%(version)s_dont_expand_cuda_cudnn_path.patch', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akesandgren Please add back TensorFlow-1.5.0_swig-env.patch, it seems to be required to fix a problem I'm running into without this, i.e. the build failing with swig failed: error executing command
… toolchainopts. Drop -mno-avx512f since v1.10 doesn't seem to have problems with AVX512 any longer (at least not with foss).
|
@akesandgren Do you mind submitting another test report after your last changes? |
|
@akesandgren I noticed that in the |
|
Test report by @vanzod |
|
I may have ran out of disk space. I'll test it again tomorrow |
|
Yes it is for using the mkldnn, not mkl as such, and with TF you want mkldnn at all times. |
|
Test report by @akesandgren |
|
Test report by @vanzod |
|
Test report by @verdurin |
|
|
||
| # The full version of the library can be found using | ||
| # strings -a cuda/lib64/libcudnn_static.a | grep cudnn_version_ | ||
| # Download and rename. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akesandgren Please make the renaming instructions explicit for the tarball
|
@verdurin Huh?? What do you mean? |
|
@akesandgren I mean there are |
|
Ok.... don't remember seeing that lately. Although my builds haven't failed in a while, at least not hard. |
|
@vanzod Can you provide a full (debug) log for your last test report, so we can try and figure out this |
|
Here is the full log. |
|
@vanzod Still no trace of the actual error. I think you need to run that link command by hand. |
|
Test report by @boegel |
|
Test report by @vanzod |
|
Based on the log, I have manually repeated the linking of |
|
Test report by @wpoely86 |
|
@vanzod then I suggest add -v and redirect to file, and then take it step by step forward to see exactly what goes wrong. |
|
@vanzod Maybe the actual error is being sent to How about redirecting all output to stderr via |
|
I solved the issue on my system that was preventing the build to complete successfully. I can now correctly build all the software in this PR. LGTM |
|
@vanzod You mean the GPU in that system is too old to support even compute capability 3.0? |
|
@akesandgren: I just wanted to add that I've built this with MPI, and according to one of our users who tested it, it works just fine. So if one added
before this gets merged, I wouldn't be unhappy ;) But I can understand if more testing is needed first. |
|
@mstud Yeah, I can live with stalling the merge a while and do some testing myself. I haven't run that many cases so I'm not 100% sure I'm doing things correctly. |
|
@akesandgren How about making the change, and then issue a follow up PR in case things are broken? I'd like to get this in ASAP, would be nice to have this included in the upcoming EasyBuild release... |
|
@boegel I'm just about to run a basic test. If it works I'll update.... Should be ready soon after lunch i think. |
|
Turning on MPI seems to not be a problem, since one has to change the train.Server to use grpc+mpi to activate it anyhow. |
|
Test report by @akesandgren |
|
Test report by @boegel |
|
Going in, thanks @akesandgren! |
(created using
eb --new-pr)New version of TensorFlow (1.10.0) for fosscuda 2018b. Depends on
easybuilders/easybuild-easyblocks#1453