DLPack: sync CUDA stream when exporting to TensorFlow #377
Conversation
@njroussel After running the relevant unit tests in a loop on my machine for a while, this seems to resolve the synchronization issue. I wouldn't want to introduce unnecessary synchronization points, though; do you also think this is the right solution, given that TF uses separate non-default streams?
@merlinND, I'm wondering whether this change might negatively affect the conversion of other DLPack tensors. Synchronization is always bad and best avoided. For example, should we perhaps only interpret
Hi Wenzel, I agree that we should definitely avoid adding unnecessary synchronizations. Checking the documentation again, I see that:

So essentially, when passing

- In DrJit, is the main default stream always used, even when launching kernels from non-main threads?
- In DrJit, what would be a reliable way to check whether the current thread is the "main thread" (i.e. the thread for which the DrJit stream is the default stream)?

If we make the changes above, then I'm not sure which value of

Maybe
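For reference, the stream-argument convention being debated here comes from DLPack's Python-level protocol (the array API standard's `__dlpack__(stream=...)` on CUDA). Below is a minimal pure-Python sketch of how a producer might interpret those values, based on my reading of that standard; the function name and return shape are illustrative, not DrJit's actual API.

```python
# Hypothetical sketch of how a CUDA producer might interpret the
# `stream` argument of `__dlpack__`, following my reading of the
# Python array API standard (not DrJit's actual implementation).

LEGACY_DEFAULT_STREAM = 1      # the global NULL stream
PER_THREAD_DEFAULT_STREAM = 2  # the caller's per-thread default stream

def resolve_consumer_stream(stream):
    """Map a consumer-provided `stream` value to (needs_sync, target)."""
    if stream == -1:
        # Consumer signals that no synchronization is needed.
        return (False, None)
    if stream is None or stream == LEGACY_DEFAULT_STREAM:
        return (True, "legacy-default")
    if stream == PER_THREAD_DEFAULT_STREAM:
        return (True, "per-thread-default")
    if stream == 0:
        # Ambiguous between legacy and per-thread default: disallowed.
        raise ValueError("stream=0 is disallowed for CUDA producers")
    # Any other integer is a raw CUstream handle (a pointer value).
    return (True, ("custream", int(stream)))
```

Note the special value `2` (per-thread default stream), which is exactly the case the patch later in this thread handles via `cuStreamWaitEvent_ptsz`.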
Hi @wjakob, could you please check the questions above? I'll update the PR based on your answers, and it should fix the flaky TF unit tests.
(force-pushed c821068 to dcb5217)
Hi Merlin, I just had a moment to take a look at this.

That's right. I looked again at what Dr.Jit does, since this changed some time ago. Mitsuba creates a custom stream per device, which is set up with flag
```diff
diff --git a/include/drjit-core/jit.h b/include/drjit-core/jit.h
index 720ca16..200d3cc 100644
--- a/include/drjit-core/jit.h
+++ b/include/drjit-core/jit.h
@@ -202,8 +202,11 @@ extern JIT_EXPORT void *jit_cuda_lookup(const char *name);
 * \brief Add CUDA event synchronization between thread state's and external
 * CUDA stream.
 *
- * An event will be recorded into the thread's states stream and the external stream
- * will wait on the event before performing any subsequent work.
+ * An event will be recorded into the thread's states stream and the external
+ * stream will wait on the event before performing any subsequent work. The
+ * special value stream==2 denotes the caller's per-thread default stream.
+ * There is no need to ever synchronize with the global NULL stream, since
+ * Dr.Jit implicitly synchronizes with respect to it.
 *
 * \param stream The CUstream handle of the external stream
 */
diff --git a/src/cuda_api.cpp b/src/cuda_api.cpp
index 293f4de..b3d0f70 100644
--- a/src/cuda_api.cpp
+++ b/src/cuda_api.cpp
@@ -123,6 +123,7 @@ bool jitc_cuda_api_init() {
     LOAD(cuStreamDestroy, "v2");
     LOAD(cuStreamSynchronize);
     LOAD(cuStreamWaitEvent);
+    LOAD(cuStreamWaitEvent_ptsz);
     LOAD(cuPointerGetAttribute);
     LOAD(cuArrayCreate, "v2");
     LOAD(cuArray3DCreate, "v2");
@@ -174,7 +175,7 @@ void jitc_cuda_api_shutdown() {
     Z(cuModuleGetFunction); Z(cuModuleLoadData); Z(cuModuleLoadDataEx); Z(cuModuleUnload);
     Z(cuOccupancyMaxPotentialBlockSize); Z(cuCtxPushCurrent);
     Z(cuCtxPopCurrent); Z(cuStreamCreate); Z(cuStreamDestroy);
-    Z(cuStreamSynchronize); Z(cuStreamWaitEvent); Z(cuPointerGetAttribute);
+    Z(cuStreamSynchronize); Z(cuStreamWaitEvent); Z(cuStreamWaitEvent_ptsz); Z(cuPointerGetAttribute);
     Z(cuArrayCreate); Z(cuArray3DCreate); Z(cuArray3DGetDescriptor);
     Z(cuArrayDestroy); Z(cuTexObjectCreate); Z(cuTexObjectGetResourceDesc);
     Z(cuTexObjectDestroy); Z(cuMemcpy2DAsync); Z(cuMemcpy3DAsync);
diff --git a/src/cuda_api.h b/src/cuda_api.h
index a2a04c0..142d9e5 100644
--- a/src/cuda_api.h
+++ b/src/cuda_api.h
@@ -260,6 +260,7 @@ DR_CUDA_SYM(CUresult (*cuStreamCreate)(CUstream *, unsigned int));
 DR_CUDA_SYM(CUresult (*cuStreamDestroy)(CUstream));
 DR_CUDA_SYM(CUresult (*cuStreamSynchronize)(CUstream));
 DR_CUDA_SYM(CUresult (*cuStreamWaitEvent)(CUstream, CUevent, unsigned int));
+DR_CUDA_SYM(CUresult (*cuStreamWaitEvent_ptsz)(CUstream, CUevent, unsigned int));
 DR_CUDA_SYM(CUresult (*cuMemAllocAsync)(CUdeviceptr *, size_t, CUstream));
 DR_CUDA_SYM(CUresult (*cuMemFreeAsync)(CUdeviceptr, CUstream));
diff --git a/src/cuda_core.cpp b/src/cuda_core.cpp
index 1e54389..cbcbdda 100644
--- a/src/cuda_core.cpp
+++ b/src/cuda_core.cpp
@@ -109,8 +109,12 @@ std::pair<CUmodule, bool> jitc_cuda_compile(const char *buf, bool release_state_
 void jitc_cuda_sync_stream(uintptr_t stream) {
     ThreadState* ts = thread_state(JitBackend::CUDA);
     CUevent sync_event = ts->sync_stream_event;
-    cuda_check(cuEventRecord(sync_event, (CUstream)ts->stream));
-    cuda_check(cuStreamWaitEvent((CUstream)stream, sync_event, CU_EVENT_DEFAULT));
+    scoped_set_context guard(ts->context);
+    cuda_check(cuEventRecord(sync_event, (CUstream) ts->stream));
+    if (stream != 2)
+        cuda_check(cuStreamWaitEvent((CUstream)stream, sync_event, CU_EVENT_DEFAULT));
+    else
+        cuda_check(cuStreamWaitEvent_ptsz(nullptr, sync_event, CU_EVENT_DEFAULT));
}
```
(force-pushed 61c30e2 to 6eaa0c3)
Thank you @wjakob for looking into it and providing the patch! I've opened mitsuba-renderer/drjit-core#139 with your patch. @dvicini, do the changes in this PR make sense to you? In particular, calling
@merlinND Is that so? I thought we now insert an event and wait for it asynchronously. That is assuming that TF uses the special
Sorry, it was not very clear, because two things were included in this PR:

Please let me know if I missed something. As you said,
Ok, that makes sense. Let's wait for @dvicini's take on the Google™ viewpoint before merging :-)
Ping @dvicini, do you have an opinion on the above regarding TF interop?
(force-pushed bf48ccb to e3f5762)
This PR is still waiting for feedback from @dvicini.
I completely missed those pings, sorry! I don't have a ground-truth answer, unfortunately. I experimented with TensorFlow some time ago and had a version of a custom TF op (before the PR to Dr.Jit itself was made). At the time, I ran into a number of issues with TF's own threading system (entirely separate from CUDA streams) and in the end never had a good solution. I recall that I ran into issues with specific combinations of using

Doing the conservative sync call makes sense to me, and I am not surprised that TF does not expose those streams directly. We mostly use JAX now for our projects that use Mitsuba/Dr.Jit.
(force-pushed fc79b70 to bdc17f6)
This PR attempts to fix synchronization issues that come up in the unit tests of the new TF interop feature (#301): #301 (comment)
Since TensorFlow uses non-default CUDA streams for compute and data movement, I believe that we need to synchronize the stream used by DrJit before exporting a tensor to TF.
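Putting the pieces together, the producer-side export path discussed in this PR might be outlined as follows. This is a hypothetical sketch: `eval_pending` and `sync_to_stream` are placeholder callbacks standing in for kernel submission and the `jit_cuda_sync_stream()` handshake from the patch above; they are not real DrJit entry points spelled this way.

```python
# Hypothetical outline of the producer-side DLPack export path:
# flush pending kernels, then insert an event-based dependency on the
# consumer's stream instead of a blocking device synchronization.
# `eval_pending` and `sync_to_stream` are placeholder callbacks,
# not real DrJit APIs.

def export_dlpack(tensor, stream, *, eval_pending, sync_to_stream):
    eval_pending()              # ensure all pending kernels are enqueued
    if stream != -1:            # -1: consumer needs no synchronization
        sync_to_stream(stream)  # record event + make `stream` wait on it
    return tensor               # hand the (unchanged) tensor to the consumer
```

If the consumer requested, say, the per-thread default stream (`stream == 2`), the patch in this thread routes that case through `cuStreamWaitEvent_ptsz`; any other value would be treated as a raw `CUstream` handle.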