Extends drjit.wrap to TensorFlow #301

Conversation
Dear @jhoydis (cc @merlinND) -- this looks amazing! I would love to review this ASAP, but I am about to leave for a two-week vacation. I will take a thorough look when I am back. Regarding the issues reported to TensorFlow, I am wondering if it makes sense for some of them to be resolved before merging? (For example, the restriction to …)

---
Hi @wjakob, I will make a first review so that hopefully it's almost ready when you're back 👍 Regarding the int32 limitation in TF, it's actually unrelated:

```python
import drjit as dr
import tensorflow as tf

with tf.device('gpu'):
    x_tf = tf.constant([0, 1, 2], tf.int32)
    x_dr = dr.detail.import_tensor(x_tf, ad=False)
    print("TF device:", x_tf.device)
    print("DrJit type:", type(x_dr))
# TF device: /job:localhost/replica:0/task:0/device:GPU:0
# DrJit type: <class 'drjit.llvm.TensorXi'>
```

(Reproducer from @jhoydis.) So I would say that we don't need to wait on a bugfix for this one, since it doesn't prevent AD interop at all and the current TF behavior is even part of a TF unit test.
drjit/interop.py

```python
if h.requires_grad:
    dr.enable_grad(r)
if source == 'tf' and enable_grad:
    dr.enable_grad(r)
```
In the PyTorch case above, we check whether the tensor has grads enabled before enabling them on the DrJit side.
If there's really no way to do an equivalent check, I think it would be good to document for which cases grads will get enabled unconditionally (maybe in the @dr.wrap docstring which already describes the other special cases).
I could not find a way to figure out programmatically if a TF tensor is watched by a GradientTape (unless it is a Variable). I slightly improved this case as follows:
```python
if source == 'tf' and enable_grad:
    # There is no TF equivalent to h.requires_grad.
    # We hence enable gradients for all trainable variables
    # and tensors of floating dtype.
    if tf_var_check(h):
        if h.trainable:
            dr.enable_grad(r)
    elif h.dtype.is_floating:
        dr.enable_grad(r)
```
I think that's correct; as far as I know there is no better way to get this information from TF.
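For illustration, here is a minimal sketch of how a `tf_var_check`-style helper could be written without importing TensorFlow eagerly. This version only consults `sys.modules`; the actual helper in this PR may be implemented differently.

```python
import sys

def tf_var_check(h):
    # Return True if `h` is a tf.Variable. TensorFlow is consulted only if
    # it has already been imported elsewhere, so the check stays cheap and
    # does not force a TF dependency. (Illustrative sketch, not the PR's code.)
    tf = sys.modules.get('tensorflow')
    return tf is not None and isinstance(h, tf.Variable)

print(tf_var_check(42))  # False: a plain int is never a TF variable
```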
drjit/interop.py
```python
# Evaluate the function using another array programming framework
self.out = func(*self.args, **self.kwargs)
tensor = flatten(self.args)[1]
```
Could this fail if there are no array-typed args?
I have rewritten this part. There is now a new helper function find_first_tf_tensor that finds the first TF tensor instance of a PyTree and returns it. TF will then use this tensor's device for all computations.
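A generic sketch of such a PyTree search is shown below. The predicate is parameterized here so the snippet runs without TensorFlow; the PR's `find_first_tf_tensor` presumably hard-codes a `tf.Tensor` check instead, and `find_first_leaf` is a made-up name.

```python
def find_first_leaf(tree, predicate):
    # Depth-first search over a PyTree of nested lists/tuples/dicts;
    # returns the first leaf satisfying `predicate`, or None if no leaf does.
    if predicate(tree):
        return tree
    if isinstance(tree, dict):
        children = tree.values()
    elif isinstance(tree, (list, tuple)):
        children = tree
    else:
        return None  # a leaf that does not satisfy the predicate
    for child in children:
        found = find_first_leaf(child, predicate)
        if found is not None:
            return found
    return None

args = ({'scale': 2.0}, ['label', 3.5])
print(find_first_leaf(args, lambda v: isinstance(v, float)))  # 2.0
```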
I have rebased this PR to the master branch and also tested with TF 2.18 locally.
What is missing?
---
The CI failure with GCC 9 is a bit mysterious. It seems to segfault after running …
---
I am not sure about the CI failure, but another high-level comment: is it correct to assume that the wrapping of Dr.Jit into a TF computation will not support …? For what it's worth, I never had much luck trying to get …
I have never tried wrapping Dr.Jit into TF via …
---
Regarding the CI:
…

---
Just to follow up on the CI:
…

---
Dear @jhoydis, let me apologize for taking a really long time to get to this. I would like to review and swiftly merge this PR. Could I ask you to force-push this PR? That should register the PR with the build system. (I will still need to approve it, but right now I can't even do that since the changes are on your branch.) Thanks,
---
I have force-pushed the last commit without any changes. It seems to have triggered the build system. Let me know if anything else is required from my side.
---
I approved those builds; future ones should run directly (assuming I set this up correctly). It looks like there are some failures -- you should be able to see the details by clicking the "guest login" button. In any case, here is the macOS failure: …

The version of TensorFlow installed on the macOS build machine is: …

On Windows (with the latest PyPI version of …
---
The macOS tensorflow install was non-functional. I fixed it and will restart the build.
---
On the amd64/GCC 9 build, the test suite runs to completion but then segfaults during interpreter shutdown when TensorFlow was used. There are several warning messages of the type "SystemError: Py_AddPendingCall: cannot add pending calls (Python shutting down)". It sounds to me like some cleanup of tensors is delayed so much that it's happening at the same time as Python shutting down. I restarted the macOS build.
---
The macOS CI machine can now properly import TensorFlow. It segfaults while running the TensorFlow tests: …

Before TensorFlow was installed, it also segfaulted after running …
---
What platform (OS, compiler) did you use for testing? Did you try running on one other OS besides that? (That's usually a good way of flushing out those kinds of issues)
---
Before TensorFlow was installed on more CI targets, the GCC9 CI build was already segfaulting. Regarding the macOS and Windows failures, I don't think these platforms had been tested yet (although maybe @jhoydis tested on macOS already?).
---
I had tested it for the initial PR on macOS but will do it again now.
---
I have just run all tests successfully on my Mac:
…

---
I just tried running this on a Linux machine and ran into a crash involving an exception raised by TensorFlow (within pybind11 code). I believe what happens here is that libstdc++ traverses the call stack containing Clang-compiled functions using libc++ exceptions. There are some ABI incompatibilities here that have tripped us up in the past. I got things to pass when compiling with GCC, which is curiously the opposite of what the CI complained about. I believe that we should skip the TF tests on Linux (only Linux, that issue isn't present on macOS), if …

I was able to run things successfully on macOS (TF ver 2.16.2) on my laptop. In contrast, the CI box was running a much older TF version (2.9.2), since that was the newest one available on Python 3.9. I suspect that this is the cause of the crashes we've been seeing. I will look into upgrading Python. But that brings a related question: what's the minimum TF version required for the AD interop to work?

While testing on Linux, I noticed two issues with memory leaks. Both …
---
Update: the reason why I can't install a more recent version on the macOS CI runner is that the version of macOS is too old. This will take some time to fix (likely not until my return to EPFL). So I've uninstalled TF from this machine.
---
What the hell. Apparently TensorFlow does not support the use of GPUs on Windows since v2.10 (only through WSL2, i.e., the Linux version). I believe that the crashes here occur from trying to run CUDA tests together with a TensorFlow build that cannot deal with GPU tensors.
---
@jhoydis, following the above observations, could you look into the following?
…

---
I spent quite some hours investigating the leaks from …

The variables that get leaked are the function inputs of the XLA config of the test. More specifically, the … If there's no eager-mode config enabled, then the variables get correctly cleaned up at the very end once the …

Stack trace: …

When there's both an Eager and an XLA config running, I've tried manually clearing various caches I could access from the XLA-compiled function, but no luck:

```python
test_fn.func._function_cache.clear()
test_fn.func._function_captures.clear()
test_fn.func._python_function = None
```

TensorFlow users have reported XLA leaking memory before: tensorflow/tensorflow#80753, tensorflow/tensorflow#73457
I've pushed a change to skip TF tests in cases where DrJit supports CUDA, but TF doesn't detect a GPU device. It should be slightly more general than detecting Windows + TF.
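The skip condition can be expressed as a small predicate. Here is a hedged sketch: in the real test suite the device list would presumably come from `tf.config.list_physical_devices('GPU')`, and the helper name here is made up.

```python
def should_skip_tf_test(drjit_has_cuda, tf_gpu_devices):
    # Skip when Dr.Jit can target CUDA but TensorFlow sees no GPU:
    # running CUDA tests against a CPU-only TF build is what crashed on
    # native Windows (TF >= 2.10 dropped native GPU support there).
    return drjit_has_cuda and len(tf_gpu_devices) == 0

print(should_skip_tf_test(True, []))         # True: e.g. native Windows
print(should_skip_tf_test(True, ['GPU:0']))  # False: TF can see the GPU
print(should_skip_tf_test(False, []))        # False: CPU-only Dr.Jit build
```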
---
I've rebased, and added:
…

---
Force-pushed from 1af146f to 7298a9f.
- Both forward and backward propagation (with some limitations), plus unit tests
- Skip TF tests when Dr.Jit supports CUDA but TensorFlow doesn't detect a GPU (specifically, e.g. on native Windows), plus explicitly call the garbage collector to avoid some potential TF-side leaks
- Minor improvements
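The explicit garbage-collector call mentioned above can be motivated with a TF-free sketch: objects caught in reference cycles are reclaimed only when Python's cyclic collector runs, and an explicit `gc.collect()` forces that instead of leaving cleanup to interpreter shutdown (where the Py_AddPendingCall warnings appeared). `FakeTensor` here is a stand-in, not anything from TF or Dr.Jit.

```python
import gc
import weakref

class FakeTensor:
    # Stand-in for a framework object held in a reference cycle.
    def __init__(self):
        self.self_ref = self  # cycle keeps it alive past refcount zero

obj = FakeTensor()
ref = weakref.ref(obj)
del obj
# Without an explicit collection, the cycle lingers until the GC's next
# opportunistic run -- possibly as late as interpreter shutdown.
gc.collect()
print(ref() is None)  # True: the cycle has been collected
```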
---
Has anyone been using this? It seems really unreliable. Out of 50 runs of …
---
@njroussel Is your tensorflow version up-to-date?
---
Yes, it is. I initially noticed this in a CI run (link).
---
I will try to debug the flaky tests on my machine. Edit: I can reproduce the test failures locally. Interestingly, tests do not fail when running with …
---
@merlinND I think you're on the right track. I only briefly tried to debug it yesterday, but noticed that some carefully placed prints made the issue impossible to reproduce. We must have some sort of a race condition right now.
The differentiable bridge between Dr.Jit and other frameworks (`drjit.wrap`) is currently limited to PyTorch and JAX. This PR extends the support to TensorFlow in both directions, i.e., `'tf->drjit'` and `'drjit->tf'`.

The PR introduces a new wrapper for `'tf->drjit'` in `drjit.wrap()` and adds new cases for `'drjit->tf'` in `drjit.WrapADop` in forward and backward mode.

For `'drjit->tf'`, everything works like for `'drjit->torch'`, with a limitation related to `tf.int32` input types (for which an issue was filed). Like for JAX, the TF code can be compiled using the `@tf.function` and `@tf.function(jit_compile=True)` decorators.

For `'tf->drjit'`, forward-mode AD is not fully supported due to limitations of TF functions with custom gradients. Such functions cannot have non-tensor input structures, and we discovered and reported a bug for functions with keyword arguments.

The existing unit tests were extended to TensorFlow (in eager, graph, and XLA modes).

The function `skip_on` in `conftest.py` was extended to allow for a custom message, which can be different from the information provided by the raised exception.

The docstring of `drjit.wrap` was modified to account for the new extensions and related limitations described above.
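As a hypothetical usage sketch: the `source='tf'` string follows this PR's convention, mirroring the existing PyTorch/JAX usage of `@dr.wrap`; exact API details may differ, and the snippet is guarded so it degrades gracefully when TensorFlow or Dr.Jit is unavailable.

```python
# Hedged sketch of the 'tf->drjit' direction added in this PR; the decorator
# arguments mirror the documented PyTorch/JAX usage of @dr.wrap.
demo_ran = False
try:
    import drjit as dr
    import tensorflow as tf

    @dr.wrap(source='tf', target='drjit')
    def square_sum(x):
        # Runs inside Dr.Jit; gradients flow back to the TF side.
        return dr.sum(x * x)

    with tf.GradientTape() as tape:
        x = tf.Variable([1.0, 2.0, 3.0])
        y = square_sum(x)
    # d/dx sum(x^2) = 2x, so the gradient should be [2, 4, 6]
    print('grad:', tape.gradient(y, x))
    demo_ran = True
except Exception as exc:  # TF/Dr.Jit missing, or a version without TF interop
    print('skipping demo:', exc)
```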