Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Jun 16, 2021

The fingerprint of a dataset changes if the cache directory is moved.
I fixed that by setting the fingerprint to be the hash of:

  • the relative cache dir (dataset_name/version/config_id)
  • the requested split

Close #2496

I had to fix an issue with the filelock filename that was too long (>255). It prevented the tests to run on my machine. I just added hash_filename_if_too_long in case this happens, to not get filenames longer than 255.
We usually have long filenames for filelocks because they are named after the path that is being locked. In case the path is a cache directory that has long directory names, then the filelock filename could en up being very long.

@lhoestq lhoestq requested a review from albertvillanova June 16, 2021 16:45
@lhoestq
Copy link
Member Author

lhoestq commented Jun 16, 2021

Windows, why are you doing this to me ?

@albertvillanova
Copy link
Member

Thanks @lhoestq, I'm starting reviewing this PR.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below.

Comment on lines 848 to 851
hasher = Hasher()
hasher.update(self._relative_data_dir().replace(os.sep, "/"))
hasher.update(split.to_spec() if isinstance(split, ReadInstruction) else str(split))
fingerprint = hasher.hexdigest()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be extracted into a funcion/method.

# ------------------------------------------------


MAX_LOCK_FILENAME_LENGTH = 255
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a parameter that the user should be allowed to customize? If yes, I would suggest moving it to config; if not (I think this is the case), then I would suggest to remove the variable and pass the value directly in the function signature of hash_filename_if_too_long.

MAX_LOCK_FILENAME_LENGTH = 255


def hash_filename_if_too_long(path: str, max_length=MAX_LOCK_FILENAME_LENGTH) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this function is only used by FileLock, maybe it could become a method of that class: it is the class responsibility to check whether the filename is too long and hash it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I would recommend adding some unit tests for this function (even if they seem quite trivial).

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping hash_filename_if_too_long would fix issues with FileLock on Windows if long paths are not enabled (#2443), but sadly it doesn't.

@lhoestq
Copy link
Member Author

lhoestq commented Jun 21, 2021

Yea issues on windows are about long paths, not long filenames.
We can make sure the lock filenames are not too long, but not for the paths

@lhoestq
Copy link
Member Author

lhoestq commented Jun 21, 2021

Took your suggestions into account @albertvillanova :)

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the code quality check is fixed, it is OK for me.
Thanks!

@lhoestq lhoestq merged commit 0f44a7a into master Jun 21, 2021
@lhoestq lhoestq deleted the fix-fingerprint-when-moving-cache-dir branch June 21, 2021 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dataset fingerprint changes after moving the cache directory, which prevent cache reload when using map

4 participants