Replies: 2 comments
Very closely related (if not even a duplicate) to #9 🤦. Well, at least you can see it's clearly still an issue...
Hey, Git LFS (as we have configured it) stores each dataset as an object in S3 storage. The object name, however, is not the human-readable name of the dataset but its shasum. If I am not mistaken, the file size is taken into account as well, which would make this even more robust. The repository itself only stores a pointer file containing the shasum and the file size.

If another file with the same size and shasum is about to be uploaded to the S3 storage, the files are assumed to be identical and only a pointer file is created. So we can assume there are no duplicates in the S3 backend, even if we fork repos or upload the same dataset multiple times.

If a reference is removed from a repo, the object is not removed automatically. Some kind of garbage collection has to run for that to happen, but I do not know the details here.

There is also a solution under discussion that uses RO-Crates for metadata together with directly accessible S3 references for the datasets, but I do not know more details about this either.

@Thyra Thanks for moving this here :)
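To make the deduplication idea concrete, here is a minimal sketch of content-addressed storage, not the actual Git LFS or S3 implementation. A plain Python dict stands in for the S3 bucket, and `upload` and `collect_garbage` are hypothetical helpers invented for illustration. Only the three-line pointer layout (`version`, `oid sha256:...`, `size`) follows the Git LFS pointer file format. How garbage collection really works in this setup is unknown, as said above; the sweep here is just one possible approach.

```python
import hashlib

LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def make_pointer(data: bytes) -> str:
    """Build the text of a Git LFS style pointer file for `data`."""
    oid = hashlib.sha256(data).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(data)}\n"


def upload(store: dict, data: bytes) -> str:
    """Hypothetical upload: keep one object per checksum, return a pointer.

    `store` is an in-memory stand-in for the S3 backend (an assumption).
    """
    oid = hashlib.sha256(data).hexdigest()
    if oid not in store:
        # Only content that has not been seen before is actually stored.
        store[oid] = data
    return make_pointer(data)


def collect_garbage(store: dict, referenced_oids: set) -> int:
    """Hypothetical sweep: drop objects that no pointer file references."""
    orphans = [oid for oid in store if oid not in referenced_oids]
    for oid in orphans:
        del store[oid]
    return len(orphans)


if __name__ == "__main__":
    bucket = {}
    pointer_a = upload(bucket, b"dataset contents")
    pointer_b = upload(bucket, b"dataset contents")  # e.g. from a fork
    print(pointer_a == pointer_b, len(bucket))
```

Uploading the same bytes twice yields identical pointer files and a single stored object, which is why forks do not multiply the data. The object only disappears once something like `collect_garbage` is run with a set of still-referenced checksums that no longer includes it.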
While exploring the ARChive today, I found a "published dataset" (https://doi.org/10.60534/qsh62-7p088) where the link to the actual ARC/data gave me a 404 error because the ARC is private. After I asked about it in the Data Steward Matrix chat, it turned out that there is a somewhat systemic problem underneath:
Solutions
Why I think this is important
People are going to write things in their manuscripts like "All data and code used in this manuscript are freely accessible at DOI foobar". That is not true if the DOI publication is in reality just the metadata. I was actually a bit irritated when I learned that today; it almost feels intentionally misleading when it is always framed as a "data publication". If I were an editor or reviewer for that paper, I would not be content unless the accessibility and integrity of the data were ensured in some way.