Fix overflow in coordinate descent #7399
Conversation
I didn't add a test since that would require a non-negligible amount of GPU memory (~10 GiB for the inputs, plus whatever the solver overhead is). The following quick check runs fine now, but used to result in cublas errors:

```python
import cupy as cp
from cuml.datasets import make_regression
from cuml.linear_model import ElasticNet

N = 1_000_000
M = 2200  # N * M exceeds max int32
X, y = make_regression(n_samples=N, n_features=M, random_state=42)
X = cp.asfortranarray(X)

weights = cp.random.uniform(0, 1, size=y.shape, dtype=y.dtype)
for sample_weight in [None, weights]:
    model = ElasticNet().fit(X, y, sample_weight=sample_weight)
    print(model.score(X, y))
```

Output:
Thanks for spotting this! I would strongly suggest updating the function signature to make sure that `n_rows` is an `int64_t`, and to make `ci` an `int64_t` too. This would prevent issues with future code updates that use these as multiplicative operands. More importantly, the RAFT operations are templated and may be vulnerable to integer overflows (especially the ones involving both rows and columns, see here). Using `int64_t` would solve this too. Additionally, there might be similar patterns in the multi-GPU version of CD (see here or here).
I'm not sure I follow. Even with this PR,
Yes, that's what the problem in
I don't want to touch the multi-GPU versions here, please keep this PR limited to just the single-GPU code. Are you suggesting we do something like then use the 64-bit versions everywhere within
Yes, we won't support a very large number of rows, but we want to make sure the issue you are fixing here won't reappear if we multiply the number of rows by something else.
Could work yes.
If there isn't any issue with
Previously our coordinate descent solver would fail on problems where `n_cols * n_rows > INT_MAX` due to an `int` overflow. There were two locations where this was happening:

- Calculating the offset into the input matrix
- Within the computation of the `L2Norm`

The former is a quick local fix. The latter I also fixed locally by switching from an `int` to an `int64_t` in the template call. However, I'm not sure if that's the best fix, or if it'd be better to handle this upstream within the template itself to avoid overflow of the index types. This was easy to do, so I did it here for now.

I've checked, and with this we can solve very large coordinate descent problems, with the dimension limitation now being `INT_MAX` in both rows and columns. Going larger than that would require using the 64-bit cublas API, but I have no need for that now.

Additionally, on the Python side, if a user tries to pass a larger value they'll get a nicer Python-side error, rather than a cublas error code (and a potentially corrupted handle).
Force-pushed from 1866fc9 to 5678dd6
I updated
Feels to me like the solution I have here is the best option for now. If you believe otherwise, I'd love to hear some specific suggestions for improvements.
/merge
Fixes #6736.