[BUG] data pointer values for shared arrays have uninitialized data in the upper bits #627

@mmason-nvidia

Description

While working on the Numba CUDA GDB pretty-printer, I came across an issue with shared array data addresses. Although GPU addresses are generally 64-bit, the value of a shared address fits in 32 bits. The shared array creation code builds an array data structure that uses a 64-bit address type for the data pointer, but only the lower 32 bits appear to be initialized when the array is created; the upper bits are left uninitialized and contain random data.
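The effect can be illustrated in plain Python (a sketch with illustrative values, not actual Numba internals): if only the lower 32 bits of a 64-bit slot are written, whatever was previously in the upper bits persists, and masking with 0xFFFFFFFF recovers the real shared-memory offset.

```python
# Sketch of the failure mode (hypothetical values, not Numba internals).
garbage_upper = 0x7FFE_0000_0000   # leftover data in the uninitialized upper bits
shared_offset = 0x400              # the real 32-bit shared-memory address

stored = garbage_upper | shared_offset   # what the debugger observes
masked = stored & 0xFFFF_FFFF            # workaround: keep only the low 32 bits

print(hex(stored))   # invalid as a 64-bit shared address
print(hex(masked))   # the usable shared-memory offset
```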

This can be reproduced using:

  • Numba CUDA from 9e0a986
  • cuda-gdb from the CUDA Toolkit r13.0 release

Here is the reproduction code. Save it as shared-array.py:

#!/usr/bin/env python3

import numpy as np
from numba import cuda

# Must be a constant expression
num_entries = 8

@cuda.jit(debug=True, opt=False)
def shared_array_kernel(input):
    gid = cuda.grid(1)
    size = len(input)
    if gid >= size:
        return

    # Allocate a new array in shared memory and fill it with the
    # input array in reverse order
    shared_array = cuda.shared.array(num_entries, dtype=np.int64)
    shared_array[gid] = input[size - gid - 1]

    # Synchronize all threads in the block
    cuda.syncthreads()

    # Breakpoint here
    breakpoint()


if __name__ == '__main__':
    # Generate data
    input = cuda.to_device(np.array(range(num_entries), dtype=np.int64))
    print(f'input: {input.copy_to_host()}')

    # Launch the kernel
    shared_array_kernel.forall(len(input))(input)

    # All done
    print('All Done!')

Then run as follows to see the error. The shared memory read does not work correctly unless the upper bits are masked; masking them off yields the expected value of 7.

$ /usr/local/cuda-13.0/bin/cuda-gdb -q python3
Reading symbols from python3...

(cuda-gdb) run shared-array.py 
Starting program: /home/mmason/anaconda3/envs/Numba-GA-python3.12/bin/python3 shared-array.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 2184642]
[Detaching after fork from child process 2184650]
input: [0 1 2 3 4 5 6 7]
/home/mmason/PythonDebugging/NUMBA/numba-cuda/numba_cuda/numba/cuda/dispatcher.py:696: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(errors.NumbaPerformanceWarning(msg))

Thread 1 "python3" received signal SIGTRAP, Trace/breakpoint trap.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00007ffd3f6f81d0 in shared_array_kernel<<<(1,1,1),(768,1,1)>>> (input=...) at shared-array.py:25
25	    breakpoint()

(cuda-gdb) ptype shared_array
type = @local struct array(int64, 1d, C) {
    i8 * @local meminfo;
    i8 * @local parent;
    @local int64 nitems;
    @local int64 itemsize;
    @shared int64 * @local data;
    @local UniTuple(int64 x 1) ([1 x i64]) shape;
    @local UniTuple(int64 x 1) ([1 x i64]) strides;
}

(cuda-gdb) print shared_array
$1 = {meminfo = 0x0, parent = 0x0, nitems = 8, itemsize = 8, data = 0x7ffe00000400, shape = {8}, strides = {8}}

(cuda-gdb) print *shared_array.data
Error: read_shared_memory(0, 0, 0): failed to read shared memory at address 0x7ffe00000400 size 8, error=CUDBG_ERROR_INVALID_MEMORY_ACCESS, error message=CUDBG_ERROR_INVALID_MEMORY_ACCESS

(cuda-gdb) print *(@shared int64 *)(((long long)shared_array.data) & 0xffffffff)
$2 = 7
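Until the initialization is fixed, a pretty-printer could apply the same masking before dereferencing the data pointer. A minimal sketch (the 32-bit mask is an assumption based on the behavior above; in a real gdb pretty-printer the raw value would come from `int(value['data'])`):

```python
SHARED_PTR_MASK = 0xFFFF_FFFF  # assumption: shared-memory addresses fit in 32 bits

def masked_shared_address(raw_pointer: int) -> int:
    """Clear the uninitialized upper 32 bits of a shared-array data pointer.

    In a gdb pretty-printer, raw_pointer would be obtained via
    int(value['data']); here it is a plain integer for illustration.
    """
    return raw_pointer & SHARED_PTR_MASK

print(hex(masked_shared_address(0x7FFE00000400)))  # 0x400
```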

Labels: bug