Member

lhoestq commented Aug 31, 2021

When the result of a map function is cached, the computed hash depends on many properties of that function: the Python objects it uses, its code, and the location of that code.

Because the location was the full path of the Python script, the hash changed whenever a script such as run_mlm.py was moved.

I changed this by simply using the base name of the script instead of the full path.

Note that this change also affects the hash of code used from imported modules, but I think that's fine: the code of the imported modules is hashed anyway, so the location of their Python files doesn't matter when computing the hash.
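
To illustrate the idea, here is a minimal sketch of why hashing the base name instead of the full path keeps the hash stable when a script moves. It is not the actual code from this PR or from the datasets library: `location_key` and `fingerprint` are hypothetical helpers, the paths are made up, and SHA-256 is just a stand-in for whatever hasher the library uses.

```python
import hashlib
import os


def location_key(script_path: str, use_basename: bool = True) -> str:
    # Hypothetical helper: choose which part of the path takes part in the hash.
    return os.path.basename(script_path) if use_basename else script_path


def fingerprint(source_code: str, script_path: str, use_basename: bool = True) -> str:
    # Hypothetical helper: combine the function's code with its location.
    hasher = hashlib.sha256()
    hasher.update(source_code.encode("utf-8"))
    hasher.update(b"\0")  # separator between the two hashed fields
    hasher.update(location_key(script_path, use_basename).encode("utf-8"))
    return hasher.hexdigest()


code = "def tokenize(batch):\n    return batch\n"
old_path = "/home/user/project/run_mlm.py"  # made-up locations
new_path = "/tmp/experiments/run_mlm.py"

# With the base name, moving the script leaves the hash unchanged.
same_after_move = fingerprint(code, old_path) == fingerprint(code, new_path)

# With the full path, the same code hashes differently after the move.
changed_with_full_path = (
    fingerprint(code, old_path, use_basename=False)
    != fingerprint(code, new_path, use_basename=False)
)

print(same_after_move, changed_with_full_path)  # True True
```

The trade-off in this sketch is that two different scripts with the same file name and identical code would share a hash, which matches the reasoning above that the code itself is what matters.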

Close #2825

Member Author

lhoestq commented Aug 31, 2021

Merging since the CI failure is unrelated to this PR

lhoestq merged commit f0b29b8 into master Aug 31, 2021
lhoestq deleted the fix-caching-when-moving-script branch August 31, 2021 13:13

Development

Successfully merging this pull request may close these issues.

The datasets.map function does not load cached dataset after moving python script
