Fix UMAP outlier issue by checking for outliers and shuffling #7131
rapids-bot[bot] merged 20 commits into rapidsai:branch-25.10
Conversation
Force-pushed from ee73d15 to 4055864
viclafargue
left a comment
Thanks @jinsolp! It looks great.
```cpp
truncate_gradient(rounding, current_buffer[d * TPB_X]));
raft::myAtomicAdd<T>((T*)cur_write + d, truncate_gradient(rounding, grads_buffer[d * TPB_X]));
```
Importantly, when random_state is set, current != cur_write and other != oth_write, as updates accumulate in a separate buffer to allow high-precision deterministic accumulation of updates. It looks like we may still have outliers in this case? But I guess that is acceptable for now.
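The separate-buffer point can be illustrated with a toy NumPy sketch (not cuML's actual scheme): plain float32 accumulation depends on summation order, whereas updates quantized to a fixed-point grid and summed as integers give the same result in any order. The quantization step here is only a stand-in for the kernel's rounding/truncation.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = (rng.standard_normal(100_000) * 1e-3).astype(np.float32)

# Quantize to a fixed-point grid (a toy stand-in for truncate_gradient;
# the real kernel works differently).
SCALE = 2 ** 16
q = np.round(grads.astype(np.float64) * SCALE).astype(np.int64)

# Integer addition is associative, so every accumulation order agrees,
# which is what makes the accumulation deterministic across runs.
forward = q.sum()
reverse = q[::-1].sum()
shuffled = q[rng.permutation(q.size)].sum()
assert forward == reverse == shuffled
deterministic_sum = forward / SCALE
```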
Have to add a unit test for outlier checking. The plan is to grab a large enough dataset that originally fails (i.e. has outliers), e.g. get the min/max values of the CPU embedding and check that all values in our embedding are within a certain threshold of that range.
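A minimal sketch of that check, assuming a CPU reference embedding and a GPU embedding as 2-D arrays (the helper name and the margin are assumptions, not the PR's actual test code):

```python
import numpy as np

def check_no_outliers(ref_embedding, test_embedding, margin=2.0):
    # Hypothetical helper: flag outliers by comparing against the
    # per-dimension range of the reference (CPU) embedding.
    lo = ref_embedding.min(axis=0)
    hi = ref_embedding.max(axis=0)
    span = hi - lo
    # Allow the test embedding to exceed the reference range by
    # `margin` times its span in every dimension.
    ok = (test_embedding >= lo - margin * span) & (test_embedding <= hi + margin * span)
    return bool(ok.all())
```

An embedding with a point far outside the reference range (the failure mode this PR fixes) would make the check return False.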
viclafargue
left a comment
Thanks @jinsolp! It looks great! However, I believe that we would have to apply the shuffling before make_epochs_per_sample is called (see comment).
Force-pushed from 815a018 to 5cf16e3
```python
import pytest
import scipy.sparse as scipy_sparse
import umap
from cuvs.neighbors import all_neighbors, nn_descent
```
We should not directly import the cuvs Python API in cuML. If we do, then we need to add cuvs (not just libcuvs) to our test dependencies. CC @divyegala
We can get rid of this, but if we do, we have to run the full e2e CPU UMAP on a not-so-small dataset for comparison (because outliers don't show up with small datasets).
We can consider adding cuvs to the test dependencies, but then let's make sure to guard the cuvs import with pytest.importorskip. I'll let @divyegala chime in on this.
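The suggested guard pattern, sketched with `scipy.sparse` as a stand-in so the snippet runs without cuvs installed (in the cuML test it would be `pytest.importorskip("cuvs.neighbors")`):

```python
import pytest

# pytest.importorskip returns the module when it is importable and
# skips the calling test otherwise, instead of erroring at collection
# time when the optional dependency is absent.
scipy_sparse = pytest.importorskip("scipy.sparse")
```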
Yeah, it is fine as a testing dependency. Please point me to the commit that adds cuvs as such, and I will verify that we don't leak the dependency by mistake.
Using importorskip for now. Left an issue: #7279
It looks like you are still using a direct import here?
/merge
Merged e736d05 into rapidsai:branch-25.10
Closing #6454
The main difference between our simplicial set embedding and CPU UMAP was in negative sampling.
We should use updated values (value after adding gradients) in the negative sampling stage.
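A toy NumPy sketch of that point (names are illustrative, not the kernel's): negative sampling should repel from the point's post-update position, not from the position it had before the positive-edge gradient was applied.

```python
import numpy as np

current = np.array([0.0, 0.0])        # embedding of the point being updated
positive_grad = np.array([0.5, 0.5])  # gradient from a positive (attractive) edge
negative_sample = np.array([0.6, 0.6])

# Before the fix: the buffer held only the gradient, so negative
# sampling computed repulsion from the stale position.
stale_diff = current - negative_sample

# After the fix: negative sampling reads the updated value
# (current position plus the accumulated gradient).
updated = current + positive_grad
updated_diff = updated - negative_sample

# The repulsion direction and magnitude differ between the two reads.
assert not np.allclose(stale_diff, updated_diff)
```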
Dispatched to two kernels (and three usages) based on `n_components`. Fixed as below.
- `optimize_batch_kernel_reg` (`n_components=2`): update the `current_reg` register value (used later in the negative sampling stage) along with `grads`.
- `optimize_batch_kernel` (with shared memory): distinguish `current_buffer` (which used to JUST hold the gradient) from the `grad_buffer`. Now `current_buffer` and `grad_buffer` correspond to the `current_reg` and `grads` registers in the register-approach kernel.
- `optimize_batch_kernel` (without shared memory): untouched, because the grads are applied directly to global memory. This updated value in global memory is read directly for negative sampling later on.

Visualizations 2D
50K samples randomly selected for plotting.
From the left:
Using dataset 639K x 384

Using dataset 1.8M x 384

Visualizations 3D
50K samples randomly selected for plotting.
Plotting the same dataset with `n_components=3` (which uses the second kernel).
From the left:
Using dataset 639K x 384 (was already doing pretty well without outliers, still doing well)

Using dataset 1.8M x 384

The "before fix" plot had outliers.