Uploading large amounts of data with ARCs remains a challenge for users (and for data stewards trying to support / clean up).
Use-case
- April 2025: ARC project storage: 2 TB in datahub
- Files in the `**/dataset**` folder / `runs` (not larger than 150 MB) -> piling up to some 4 GB of repository, slowing down git interactions per se
- July 2025: local ARC size: 7 TB
Challenge
- `arc sync` (i.e. add + commit + push 5 TB of data)
- `git lfs migrate`, but afraid to touch and rewrite the history of a 7 TB repo

This is an increasingly frequent scenario. Unfortunately, it especially affects coders, who may produce large amounts of data but are not trained in / used enough to git, and it leaves putative ARC super users / multipliers frustrated. (I'm not saying it's not also a client-side duty, but I understand the frustration.) So I'm trying to deduce lessons learned for prevention in the future, but also to understand what DataPLANT could improve to avoid this.
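For the second point, here is a minimal sketch of what a `git lfs migrate` run could look like. The file extensions and the repository URL placeholder are assumptions, not taken from the thread, and since `git lfs migrate import` rewrites history, it should only be tried on a throwaway clone and coordinated with everyone who has a checkout:

```bash
# Sketch only: move large file types of an *existing* history into LFS.
# The extensions are illustrative assumptions -- adjust to the actual data.
# NOTE: `git lfs migrate import` rewrites history; try it on a throwaway
# clone first and coordinate with everyone who has the repo checked out.

git clone <datahub-repo-url> arc-migration-test
cd arc-migration-test

# Show which file types dominate the history before deciding what to migrate
git lfs migrate info --everything

# Rewrite all branches and tags so matching files become LFS pointers
git lfs migrate import --everything --include="*.bam,*.sam,*.fastq.gz"

# Only after checking the result: force-push the rewritten history
# git push --force --all && git push --force --tags
```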
Lessons learned
More frequent and more modular RDM habits
Possible rules of thumb, not hard requirements:
- Set up LFS correctly, from the beginning (`.gitattributes`): make sure to track all large files (by path, e.g. `runs/my-run01/results/**`) or file types (by file type extension, e.g. `*.bam`, `*.sam`) before you `git add` and `git commit` (or `arc sync`) them.
- Data selectivity (`.gitignore`): consider which files MUST be pushed vs. which files are fine not to push at all (and handle them via `.gitignore`). Again, a rule of thumb, not a hard requirement. (A sketch of both steps follows this list.)
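A minimal sketch of both steps, to be run before the first `arc sync` / `git add`. The paths and extensions are the examples from the text plus assumed placeholders (the `.gitignore` entries are purely illustrative):

```bash
# Sketch only: LFS tracking and .gitignore selectivity *before* the first
# arc sync / git add. Adapt paths and extensions to the actual ARC layout.

git lfs install                              # once per machine

# Track large files by path or by file type (writes rules to .gitattributes)
git lfs track "runs/my-run01/results/**"
git lfs track "*.bam" "*.sam"
git add .gitattributes                       # commit the tracking rules first

# Data selectivity: keep files that never need to be pushed out of git entirely
cat >> .gitignore <<'EOF'
# assumed examples of intermediate outputs that need not go to the datahub
runs/**/tmp/
*.tmp
EOF
git add .gitignore
```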
What can DataPLANT do?
- `**/dataset/**`, `**/runs/**`?
- `isa.*.xlsx`, `README.md`, `run.yml`, `validation_packages.yml`, `*.cwl`?
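Reading these as a question about which paths ARC tooling could LFS-track by default and which metadata files should stay in plain git (my interpretation, not stated explicitly above), a purely hypothetical default `.gitattributes` that such tooling could generate might look like this; it is not an existing DataPLANT feature:

```bash
# Hypothetical default .gitattributes that ARC tooling could generate --
# not an existing DataPLANT feature. Everything under dataset/ and runs/
# goes to LFS; the small ARC metadata files stay in plain git.
cat > .gitattributes <<'EOF'
**/dataset/** filter=lfs diff=lfs merge=lfs -text
**/runs/**    filter=lfs diff=lfs merge=lfs -text

# keep the ARC metadata files out of LFS (later rules win for matching files)
isa.*.xlsx              -filter -diff -merge
README.md               -filter -diff -merge
run.yml                 -filter -diff -merge
validation_packages.yml -filter -diff -merge
*.cwl                   -filter -diff -merge
EOF
```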