Skip to content

Conversation

@mariosasko
Copy link
Collaborator

Fixes #2335

More info about this error can be found here.

@mariosasko
Copy link
Collaborator Author

mariosasko commented May 8, 2021

Seems like the CI failure is unrelated to this PR (fixed with the merge).

@lhoestq Can you please verify that everything is OK in terms of speed? Another solution is to change the offsets array dtype to np.int64 (but this doesn't scale in theory compared to Python integer which is unbound). I'm not sure why on my 64-bit machine the default numpy dtype is np.int32 tho.

Copy link

@mtala3t mtala3t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is OK.

@lhoestq
Copy link
Member

lhoestq commented May 10, 2021

Hi ! Thanks for the fix.
Unfortunately in terms of speed this is not acceptable :/
The get_batch_of_1024_random_rows metric or the benchmark_getitem_100B benchmark is almost at 1sec instead of a few milliseconds.

Would it be possible to avoid the overflow by simply passing dtype=np.int64 to np.cumsum ?
On windows machines the default is int32 unfortunately so we have to force the dtype to be int64

@mariosasko
Copy link
Collaborator Author

Yes, casting the array to np.int64 should work as well. Another option would be to cast the array elements (arr[i], arr[j]) in interpolation search to Python integers (bound only with memory) before multiplication (the error stems from this part: (j - i) * (x - arr[i])) when working with big values. But for now, the first option is OK for the sake of simplicity.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :)

@lhoestq lhoestq merged commit 4047443 into huggingface:master May 10, 2021
@mariosasko mariosasko deleted the fix-interpolation-search-overflow branch May 10, 2021 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Index error in Dataset.map

3 participants