
generate_random_fingerprint() deterministic with 🤗Transformers' set_seed() #2775


Description

@mbforbes

Describe the bug

Update: I dug into this to try to reproduce the underlying issue, and I believe it's that set_seed() from the transformers library makes the "random" fingerprint identical each time. I believe this is still a bug, because datasets is used exactly this way in transformers after set_seed() has been called, and I think that using set_seed() is a standard procedure to aid reproducibility. I've added more details to reproduce this below.
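
For reference, here is roughly the pattern that triggers it in the transformers example scripts (a simplified sketch, not the exact code; names are placeholders):

# Simplified sketch of how the example scripts combine set_seed() and datasets
from transformers import set_seed
from datasets import load_dataset

set_seed(42)  # called once at startup for reproducibility

raw_datasets = load_dataset("csv", data_files={"train": "train.csv"})

def preprocess_function(examples):
    ...  # tokenization etc.

# map() fingerprints the transform for caching; if the function can't be
# hashed, a "random" fingerprint is drawn from Python's (now seeded) RNG
processed = raw_datasets["train"].map(preprocess_function, batched=True)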

Hi there! I'm using my own local dataset and a custom preprocessing function. My preprocessing function doesn't seem to be picklable, perhaps because it comes from a closure (I'll debug that separately; a rough sketch of what it looks like follows the warning below). I get this warning, which is expected:

logger.warning(
f"Transform {transform} couldn't be hashed properly, a random hash was used instead. "
"Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. "
"If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. "
"This warning is only showed once. Subsequent hashing failures won't be showed."
)
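
For context, the preprocessing function is built by a closure, roughly like this (a hypothetical sketch; the actual cause of the hashing failure is what I still need to debug):

def make_preprocess(tokenizer, max_length):
    # The returned inner function closes over `tokenizer` and `max_length`;
    # if anything it captures can't be pickled/dilled, hashing fails and
    # datasets falls back to generate_random_fingerprint()
    def preprocess(examples):
        return tokenizer(examples["sentence"], truncation=True, max_length=max_length)
    return preprocess

# dataset = dataset.map(make_preprocess(tokenizer, 128), batched=True)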

However, what's not expected is that datasets actually does seem to cache and reuse this dataset between runs! After that line, the next thing that's logged looks like:

 Loading cached processed dataset at /home/xxx/.cache/huggingface/datasets/csv/default-xxx/0.0.0/xxx/cache-xxx.arrow

The path is exactly the same on each run (identical across my last 26 runs, for example).

This becomes a problem because I pass the --max_eval_samples flag to the HuggingFace example script I'm running (run_swag.py). Because the cached dataset is reused, this flag gets ignored: I ask for 100 examples, and it loads the full cached 1,000,000 instead.

I think that

return f"{random.getrandbits(nbits):0{nbits//4}x}"

... is actually deterministic because transformers' set_seed() seeds Python's global random module (along with NumPy and PyTorch) for reproducibility, so random.getrandbits() returns the same bits on every run. I've added a demo of this below.

Steps to reproduce the bug

# Contents of print_fingerprint.py
from transformers import set_seed
from datasets.fingerprint import generate_random_fingerprint

set_seed(42)
print(generate_random_fingerprint())

# Shell loop to run the script repeatedly
for i in {0..10}; do
    python print_fingerprint.py
done

1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d
1c80317fa3b1799d

Expected results

After the "random hash" warning is emitted, a random hash is generated, and no outdated cached datasets are reused.

Actual results

After the "random hash" warning is emitted, an identical hash is generated each time, and an outdated cached dataset is reused each run.

Environment info

  • datasets version: 1.9.0
  • Platform: Linux-5.8.0-1038-gcp-x86_64-with-glibc2.31
  • Python version: 3.9.6
  • PyArrow version: 4.0.1
