Describe the bug
Update: I dug into this to try to reproduce the underlying issue, and I believe the cause is that `set_seed()` from the transformers library makes the "random" fingerprint identical on every run. This still seems like a bug, because `datasets` is used exactly this way in transformers after `set_seed()` has been called, and calling `set_seed()` is a standard procedure to aid reproducibility. I've added more details to reproduce this below.
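For context, here is a minimal illustration of the mechanism I suspect (assuming, as I understand it, that `set_seed()` seeds Python's global `random` module, which the fingerprinting code draws from):

```python
import random

from transformers import set_seed

# set_seed() seeds Python's global `random` module (among others), so any
# subsequent call that draws from it becomes deterministic across runs.
set_seed(42)

# datasets' generate_random_fingerprint() uses random.getrandbits() under the
# hood (see the snippet further down), so this value is the same on every run.
print(f"{random.getrandbits(64):016x}")
```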
Hi there! I'm using my own local dataset and custom preprocessing function. My preprocessing function seems to be unpickle-able, perhaps because it is from a closure (will debug this separately). I get this warning, which is expected:
datasets/src/datasets/fingerprint.py, lines 260 to 265 (at 450b917):

```python
logger.warning(
    f"Transform {transform} couldn't be hashed properly, a random hash was used instead. "
    "Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. "
    "If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. "
    "This warning is only showed once. Subsequent hashing failures won't be showed."
)
```
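(For anyone hitting the same warning: a rough way to check whether a transform is serializable, which per the message above is what fingerprinting relies on, is to try dumping it with pickle/dill directly. This is just a sketch; `make_preprocess` is a hypothetical stand-in for my real closure-based preprocessing function.)

```python
import pickle

import dill  # the warning above says the transform must be serializable with pickle or dill


def make_preprocess(prefix):
    # Hypothetical closure-based transform, standing in for my real one.
    def preprocess(example):
        example["text"] = prefix + example["text"]
        return example
    return preprocess


def is_serializable(fn):
    """Return True if either pickle or dill can dump `fn`."""
    for dumps in (pickle.dumps, dill.dumps):
        try:
            dumps(fn)
            return True
        except Exception:
            continue
    return False


print(is_serializable(make_preprocess("swag: ")))  # True for this toy closure; my real function fails
```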
However, what's not expected is that datasets actually does seem to cache and reuse this dataset between runs! Right after that warning, the next thing that's logged looks like:

```
Loading cached processed dataset at /home/xxx/.cache/huggingface/datasets/csv/default-xxx/0.0.0/xxx/cache-xxx.arrow
```

The path is exactly the same on every run (it has been identical for my last 26 runs).
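As far as I can tell (an assumption on my part, not verified against the datasets source), the processed-file name is derived from the transform's new fingerprint, so a repeated fingerprint reproduces the same cache path, roughly:

```python
# Sketch of my understanding only -- the real path construction lives inside datasets.
fingerprint = "1c80317fa3b1799d"                  # identical every run once the seed is fixed
cache_file_name = f"cache-{fingerprint}.arrow"    # matches the "cache-xxx.arrow" pattern in the log
print(cache_file_name)                            # -> cache-1c80317fa3b1799d.arrow
```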
This becomes a problem when I pass the `--max_eval_samples` flag to the HuggingFace example script I'm running off of (`run_swag.py`): because the cached dataset is reused, the flag is effectively ignored. I ask for 100 examples and get the full cached 1,000,000 back.
I think that this line:

datasets/src/datasets/fingerprint.py, line 248 (at 450b917):

```python
return f"{random.getrandbits(nbits):0{nbits//4}x}"
```

... actually produces the same value across runs, because randomness is being controlled by HuggingFace/Transformers for reproducibility. I've added a demo of this below.
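One possible mitigation (just a sketch of my own, not something datasets does today) would be to draw the fingerprint bits from a source that the global seed doesn't control, e.g. the OS entropy pool via `random.SystemRandom`:

```python
import random

from transformers import set_seed

# Hypothetical variant of generate_random_fingerprint(): SystemRandom reads from
# os.urandom, so set_seed()/random.seed() have no effect on it.
_unseeded_rng = random.SystemRandom()


def generate_random_fingerprint_unseeded(nbits: int = 64) -> str:
    return f"{_unseeded_rng.getrandbits(nbits):0{nbits // 4}x}"


set_seed(42)
print(generate_random_fingerprint_unseeded())  # different on every run despite the fixed seed
```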
Steps to reproduce the bug
```python
# Contents of print_fingerprint.py
from transformers import set_seed
from datasets.fingerprint import generate_random_fingerprint

set_seed(42)
print(generate_random_fingerprint())
```

```bash
for i in {0..10}; do
    python print_fingerprint.py
done
```
```
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
```

Expected results
After the "random hash" warning is emitted, a random hash is generated, and no outdated cached datasets are reused.
Actual results
After the "random hash" warning is emitted, an identical hash is generated each time, and an outdated cached dataset is reused each run.
Environment info
- `datasets` version: 1.9.0
- Platform: Linux-5.8.0-1038-gcp-x86_64-with-glibc2.31
- Python version: 3.9.6
- PyArrow version: 4.0.1