BitwiseHamming distance for NN Descent#1101

Merged
rapids-bot[bot] merged 19 commits into rapidsai:branch-25.08 from jinsolp:nnd-hamming
Jul 18, 2025

Conversation

Contributor

@jinsolp jinsolp commented Jul 10, 2025

Adds bitwise Hamming distance for NN Descent.

@jinsolp jinsolp self-assigned this Jul 10, 2025
@jinsolp jinsolp requested a review from a team as a code owner July 10, 2025 19:11
@jinsolp jinsolp added feature request New feature or request non-breaking Introduces a non-breaking change labels Jul 10, 2025
int num_load_elems = (step == raft::ceildiv(data_dim, TILE_COL_WIDTH) - 1)
? data_dim - step * TILE_COL_WIDTH
: TILE_COL_WIDTH;
if (metric != cuvs::distance::DistanceType::BitwiseHamming) {
Contributor Author

@jinsolp jinsolp Jul 10, 2025


The only diff here is wrapping the wmma operation in if (metric != cuvs::distance::DistanceType::BitwiseHamming) so that we don't perform unnecessary computations for BitwiseHamming. (The diff detects all the added indentation as changes, which makes it confusing.)

Member

@divyegala divyegala left a comment


We should be able to allocate just half the fp16 matrix for bitwise hamming, right?

Comment on lines +1015 to +1019
res,
nrow_,
build_config.metric == cuvs::distance::DistanceType::BitwiseHamming
? (build_config.dataset_dim + 1) / 2
: build_config.dataset_dim)},
Contributor Author


Yep that's what we do here!

const uint8_t* data_n1 = reinterpret_cast<const uint8_t*>(data) + n1 * data_dim;
const uint8_t* data_n2 = reinterpret_cast<const uint8_t*>(data) + n2 * data_dim;
for (int d = 0; d < data_dim; d++) {
s_distances[i] += __popc(static_cast<uint32_t>(data_n1[d] ^ data_n2[d]) & 0xff);
Contributor


wherever you are doing this, have you ensured that the dim that you pass along is correctly scaled? If you were to convert a half to two uint8s, the dim would have to be doubled.
Furthermore, I haven't looked into the NN Descent logic entirely, but in case you are operating in the half space, I don't think you'd have to reinterpret_cast everywhere to uint8. It's more efficient to stay in the half space and do something like:
__popc(static_cast<uint32_t>(data_n1[d] ^ data_n2[d]) & 0xffffu)

Contributor Author

@jinsolp jinsolp Jul 16, 2025


wherever you are doing this, have you ensured that the dim that you pass along is correctly scaled? If you were to convert a half to two uint8s, the dim would have to be doubled.

I am allocating dim/2 dimensions for the half-type array. To keep things straightforward, dim always refers to the dimension of the given dataset (fp32, int8, etc.), and the dimension used to allocate the fp16 device array is derived from the data type (dim/2 for int8 and uint8, unchanged for other types).

I don't think you'd have to reinterpret_cast everywhere to uint8.

The issue with keeping the pointer as fp16 and looping over data_dim/2 dimensions is that the kernel would have to check whether the last byte holds a valid value (because the original int8 data could have an odd number of dimensions). I thought it would be easier to cast to int8 and loop over the original data_dim instead : )

Contributor

@tarang-jain tarang-jain Jul 16, 2025


I see. Yes, there can be an odd number of dims, for which we can fall back to int8; but when the number of dims is even, we can do it in the half space. I'd argue that we can do even more: if the dim is divisible by 4, reinterpret_cast to uint32_t so you'd only have to popcount over one fourth of the dim (I'm doing the same thing with BitwiseHamming in ivf-flat). However, considering the deadlines, we can look into these optimizations later. Can we create a GitHub issue for this and write it as a TODO here?
Regarding the dims, I just wanted to verify that the dims being used everywhere are scaled correctly, but it looks like you have already checked those things, so apart from creating that GitHub issue this PR looks okay to me.

Contributor Author


Makes sense, added an issue here #1127

@divyegala
Member

/merge

@rapids-bot rapids-bot Bot merged commit 722b7e6 into rapidsai:branch-25.08 Jul 18, 2025
231 of 249 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Jul 18, 2025
@jinsolp jinsolp deleted the nnd-hamming branch July 18, 2025 23:07

Labels

cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants