Describe the bug
Update: I dug into this to try to reproduce the underlying issue, and I believe the cause is that `set_seed()` from the transformers library makes the "random" fingerprint identical on every run. This still seems like a bug, because `datasets` is used exactly this way in transformers after `set_seed()` has been called, and calling `set_seed()` is a standard procedure to aid reproducibility. I've added more details to reproduce this below.
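For context, here is a minimal illustration of the mechanism I suspect (assuming, as I understand it, that `set_seed()` seeds Python's global `random` module, which the fingerprinting code draws from):

```python
import random

from transformers import set_seed

# set_seed() seeds Python's global `random` module (among others), so any
# subsequent call that draws from it becomes deterministic across runs.
set_seed(42)

# datasets' generate_random_fingerprint() uses random.getrandbits() under the
# hood (see the snippet further down), so this value is the same on every run.
print(f"{random.getrandbits(64):016x}")
```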
Hi there! I'm using my own local dataset and custom preprocessing function. My preprocessing function seems to be unpickle-able, perhaps because it is from a closure (will debug this separately). I get this warning, which is expected:
datasets/src/datasets/fingerprint.py, lines 260 to 265 (at 450b917):

```python
logger.warning(
    f"Transform {transform} couldn't be hashed properly, a random hash was used instead. "
    "Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. "
    "If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. "
    "This warning is only showed once. Subsequent hashing failures won't be showed."
)
```
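(For anyone hitting the same warning: a rough way to check whether a transform is serializable, which per the message above is what fingerprinting relies on, is to try dumping it with pickle/dill directly. This is just a sketch; `make_preprocess` is a hypothetical stand-in for my real closure-based preprocessing function.)

```python
import pickle

import dill  # the warning above says the transform must be serializable with pickle or dill


def make_preprocess(prefix):
    # Hypothetical closure-based transform, standing in for my real one.
    def preprocess(example):
        example["text"] = prefix + example["text"]
        return example
    return preprocess


def is_serializable(fn):
    """Return True if either pickle or dill can dump `fn`."""
    for dumps in (pickle.dumps, dill.dumps):
        try:
            dumps(fn)
            return True
        except Exception:
            continue
    return False


print(is_serializable(make_preprocess("swag: ")))  # True for this toy closure; my real function fails
```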
However, what's not expected is that datasets actually does seem to cache and reuse this dataset between runs! Right after that warning, the next thing that's logged looks like:

```
Loading cached processed dataset at /home/xxx/.cache/huggingface/datasets/csv/default-xxx/0.0.0/xxx/cache-xxx.arrow
```

The path is exactly the same on every run (it has been identical for my last 26 runs).
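As far as I can tell (an assumption on my part, not verified against the datasets source), the processed-file name is derived from the transform's new fingerprint, so a repeated fingerprint reproduces the same cache path, roughly:

```python
# Sketch of my understanding only -- the real path construction lives inside datasets.
fingerprint = "1c80317fa3b1799d"                  # identical every run once the seed is fixed
cache_file_name = f"cache-{fingerprint}.arrow"    # matches the "cache-xxx.arrow" pattern in the log
print(cache_file_name)                            # -> cache-1c80317fa3b1799d.arrow
```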
This becomes a problem when I pass the `--max_eval_samples` flag to the HuggingFace example script I'm running off of (`run_swag.py`): because the cached dataset is reused, the flag is effectively ignored. I ask for 100 examples and get the full cached 1,000,000 back.
I think that this line:

datasets/src/datasets/fingerprint.py, line 248 (at 450b917):

```python
return f"{random.getrandbits(nbits):0{nbits//4}x}"
```

... actually produces the same value across runs, because randomness is being controlled by HuggingFace/Transformers for reproducibility. I've added a demo of this below.
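One possible mitigation (just a sketch of my own, not something datasets does today) would be to draw the fingerprint bits from a source that the global seed doesn't control, e.g. the OS entropy pool via `random.SystemRandom`:

```python
import random

from transformers import set_seed

# Hypothetical variant of generate_random_fingerprint(): SystemRandom reads from
# os.urandom, so set_seed()/random.seed() have no effect on it.
_unseeded_rng = random.SystemRandom()


def generate_random_fingerprint_unseeded(nbits: int = 64) -> str:
    return f"{_unseeded_rng.getrandbits(nbits):0{nbits // 4}x}"


set_seed(42)
print(generate_random_fingerprint_unseeded())  # different on every run despite the fixed seed
```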
Steps to reproduce the bug
```python
# Contents of print_fingerprint.py
from transformers import set_seed
from datasets.fingerprint import generate_random_fingerprint

set_seed(42)
print(generate_random_fingerprint())
```

```bash
for i in {0..10}; do
    python print_fingerprint.py
done
```
```
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
```

Expected results
After the "random hash" warning is emitted, a random hash is generated, and no outdated cached datasets are reused.
Actual results
After the "random hash" warning is emitted, an identical hash is generated each time, and an outdated cached dataset is reused each run.
Environment info
- `datasets` version: 1.9.0
- Platform: Linux-5.8.0-1038-gcp-x86_64-with-glibc2.31
- Python version: 3.9.6
- PyArrow version: 4.0.1