
[Caching] Deterministic hashing of torch tensors #5170

@lhoestq


Currently this fails:

import torch
from datasets.fingerprint import Hasher

t = torch.tensor([1.])

def func(x):
    return t + x

hash1 = Hasher.hash(func)
t = torch.tensor([1.])  # re-create an identical tensor
hash2 = Hasher.hash(func)
assert hash1 == hash2  # AssertionError: the hash changed

The hash changes because hashing func pickles the captured tensor t, and pickling a torch tensor is not deterministic: two tensors with identical content can serialize to different bytes.

Also, as noticed in https://discuss.huggingface.co/t/dataset-cant-cache-models-outputs/24945, using a model in a map function doesn't work well with caching: the bert-base-uncased model gets a different hash every time you reload it. Since a model's parameters are torch tensors, supporting torch tensors may also help in this case.
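
For illustration, the symptom from that thread looks roughly like this (a minimal sketch, assuming transformers is installed; the exact hash values don't matter, only that they differ):

from datasets.fingerprint import Hasher
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
hash1 = Hasher.hash(model)

# reload the exact same pretrained weights
model = AutoModel.from_pretrained("bert-base-uncased")
hash2 = Hasher.hash(model)

# hash1 != hash2, so a map() function that closes over the model gets a
# new fingerprint on every run and the cache is never reused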

This can be fixed by registering a custom pickling function for torch tensors, as we did for other objects such as CodeType, FunctionType and Regex in py_utils.py.
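
A minimal sketch of what such a registration could look like, assuming the dill-based Pickler and the pklregister decorator in datasets/utils/py_utils.py work for torch.Tensor the same way they do for CodeType (the _rebuild_tensor helper is hypothetical, and device, requires_grad, and numpy-incompatible dtypes are ignored for brevity):

import torch
from datasets.utils.py_utils import pklregister

def _rebuild_tensor(arr):
    # hypothetical helper: rebuild the tensor from its content
    return torch.from_numpy(arr)

@pklregister(torch.Tensor)
def _save_tensor(pickler, obj):
    # reduce the tensor to its content (dtype, shape, values) instead of
    # letting dill serialize the underlying storage, whose bytes are not
    # deterministic across otherwise-identical tensors
    pickler.save_reduce(_rebuild_tensor, (obj.detach().cpu().numpy(),), obj=obj)

Note that the dispatch table is keyed by exact type, so subclasses like torch.nn.Parameter would need their own entry. With a reducer along these lines registered, the snippet above produces the same hash before and after re-creating t, since the pickled bytes now depend only on the tensor's content.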
