While working on the Numba CUDA GDB pretty-printer, I came across an issue with shared array data addresses. Although GPU addresses are generally 64-bit, the value of a shared address fits in 32 bits. The shared array creation code builds an array data structure that uses a 64-bit address type for the data pointer, but only the lower 32 bits appear to be initialized when the array is created. The upper bits are left uninitialized and contain random data.
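For concreteness, the data pointer observed in the debugger session below (0x7ffe00000400) can be split into its two 32-bit halves; a quick sketch of that decomposition:

```python
# The data pointer printed by cuda-gdb below is 0x7ffe00000400.
# Splitting it into 32-bit halves shows the pattern described above:
# the upper half is leftover garbage, the lower half is the real
# shared-memory offset.
raw = 0x7FFE00000400          # value of shared_array.data as seen in cuda-gdb
upper = raw >> 32             # 0x7ffe  -> uninitialized garbage
lower = raw & 0xFFFFFFFF      # 0x400   -> valid shared-memory address
print(hex(upper), hex(lower))  # prints: 0x7ffe 0x400
```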
This can be reproduced using:
- Numba CUDA from 9e0a986
- cuda-gdb from the CUDA Toolkit r13.0 release
Here is the reproduction code. Save it as shared-array.py:
#!/usr/bin/env python3

import numpy as np
from numba import cuda

# Must be a constant expression
num_entries = 8

@cuda.jit(debug=True, opt=False)
def shared_array_kernel(input):
    gid = cuda.grid(1)
    size = len(input)
    if gid >= size:
        return

    # Allocate a new array in shared memory and fill it with the
    # input array in reverse order
    shared_array = cuda.shared.array(num_entries, dtype=np.int64)
    shared_array[gid] = input[size - gid - 1]

    # Synchronize all threads in the block
    cuda.syncthreads()

    # Breakpoint here
    breakpoint()

if __name__ == '__main__':
    # Generate data
    input = cuda.to_device(np.array(range(num_entries), dtype=np.int64))
    print(f'input: {input.copy_to_host()}')

    # Launch the kernel
    shared_array_kernel.forall(len(input))(input)

    # All done
    print('All Done!')
Then run as follows to see the error: the shared memory read fails unless the upper bits are masked out, after which the read returns the expected value of 7.
$ /usr/local/cuda-13.0/bin/cuda-gdb -q python3
Reading symbols from python3...
(cuda-gdb) run shared-array.py
Starting program: /home/mmason/anaconda3/envs/Numba-GA-python3.12/bin/python3 shared-array.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 2184642]
[Detaching after fork from child process 2184650]
input: [0 1 2 3 4 5 6 7]
/home/mmason/PythonDebugging/NUMBA/numba-cuda/numba_cuda/numba/cuda/dispatcher.py:696: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
warn(errors.NumbaPerformanceWarning(msg))
Thread 1 "python3" received signal SIGTRAP, Trace/breakpoint trap.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00007ffd3f6f81d0 in shared_array_kernel<<<(1,1,1),(768,1,1)>>> (input=...) at shared-array.py:25
25 breakpoint()
(cuda-gdb) ptype shared_array
type = @local struct array(int64, 1d, C) {
i8 * @local meminfo;
i8 * @local parent;
@local int64 nitems;
@local int64 itemsize;
@shared int64 * @local data;
@local UniTuple(int64 x 1) ([1 x i64]) shape;
@local UniTuple(int64 x 1) ([1 x i64]) strides;
}
(cuda-gdb) print shared_array
$1 = {meminfo = 0x0, parent = 0x0, nitems = 8, itemsize = 8, data = 0x7ffe00000400, shape = {8}, strides = {8}}
(cuda-gdb) print *shared_array.data
Error: read_shared_memory(0, 0, 0): failed to read shared memory at address 0x7ffe00000400 size 8, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS, error message=CUDBG_ERROR_INVALID_MEMORY_ACCESS
(cuda-gdb) print *(@shared int64 *)(((long long)shared_array.data) & 0xffffffff)
$2 = 7
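Until the array-creation code initializes the upper bits, a pretty-printer could apply the same mask before dereferencing the data pointer. A hedged sketch of that workaround (the helper name is mine, and the gdb-API lines are illustrative comments only, since the `gdb` module exists only inside a running gdb):

```python
SHARED_ADDR_MASK = 0xFFFFFFFF  # keep only the initialized low 32 bits

def masked_data_pointer(raw_addr: int) -> int:
    """Workaround sketch: drop the uninitialized upper 32 bits of a
    shared-memory data pointer before reading through it."""
    return raw_addr & SHARED_ADDR_MASK

# Inside a gdb pretty-printer this might be applied roughly as follows
# (illustrative only -- requires the gdb Python API):
#   data_val = array_value["data"]                # gdb.Value, @shared int64 *
#   addr = masked_data_pointer(int(data_val))
#   fixed = gdb.Value(addr).cast(data_val.type)   # re-point at the masked address

print(hex(masked_data_pointer(0x7FFE00000400)))  # prints: 0x400
```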