Uploading large amounts of data with ARCs remains a challenge for users (and for data stewards trying to support / clean up).
Use-case
- April 2025: ARC project storage: 2 TB in datahub
- Files in the `**/dataset**` folder / `runs` (not larger than 150 MB) -> piling up to some 4 GB of repository, slowing down git interactions per se
- July 2025: local ARC size: 7 TB
Challenge
- `arc sync` (i.e. add + commit + push 5 TB of data)
- `git lfs migrate`, but afraid to touch and rewrite the history of a 7 TB repo

This is an increasingly frequent scenario. Unfortunately, it especially affects coders, who may produce large amounts of data but are not trained in / used enough to git, and it leaves putative ARC super users / multipliers frustrated. (I'm not saying it's not also a client-side duty, but I understand the frustration.) So I'm trying to deduce lessons learned for prevention in the future, but also to understand what DataPLANT could improve to avoid this.
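For the second point, here is a minimal sketch of what a `git lfs migrate` run could look like. The file extensions and the repository URL placeholder are assumptions, not taken from the thread, and since `git lfs migrate import` rewrites history, it should only be tried on a throwaway clone and coordinated with everyone who has a checkout:

```bash
# Sketch only: move large file types of an *existing* history into LFS.
# The extensions are illustrative assumptions -- adjust to the actual data.
# NOTE: `git lfs migrate import` rewrites history; try it on a throwaway
# clone first and coordinate with everyone who has the repo checked out.

git clone <datahub-repo-url> arc-migration-test
cd arc-migration-test

# Show which file types dominate the history before deciding what to migrate
git lfs migrate info --everything

# Rewrite all branches and tags so matching files become LFS pointers
git lfs migrate import --everything --include="*.bam,*.sam,*.fastq.gz"

# Only after checking the result: force-push the rewritten history
# git push --force --all && git push --force --tags
```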
Lessons learned
More frequent and more modular RDM habits
Possible rules of thumb, not hard requirements:
- Set up LFS correctly, from the beginning (`.gitattributes`): make sure to track all large files (by path, e.g. `runs/my-run01/results/**`) or file types (by file type extension, e.g. `*.bam`, `*.sam`) before you `git add` and `git commit` (or `arc sync`) them.
- Data selectivity (`.gitignore`): consider which files MUST be pushed vs. which files are fine not to push at all (and handle them via `.gitignore`). Again, a rule of thumb, not a hard requirement. (A sketch of both steps follows this list.)
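A minimal sketch of both steps, to be run before the first `arc sync` / `git add`. The paths and extensions are the examples from the text plus assumed placeholders (the `.gitignore` entries are purely illustrative):

```bash
# Sketch only: LFS tracking and .gitignore selectivity *before* the first
# arc sync / git add. Adapt paths and extensions to the actual ARC layout.

git lfs install                              # once per machine

# Track large files by path or by file type (writes rules to .gitattributes)
git lfs track "runs/my-run01/results/**"
git lfs track "*.bam" "*.sam"
git add .gitattributes                       # commit the tracking rules first

# Data selectivity: keep files that never need to be pushed out of git entirely
cat >> .gitignore <<'EOF'
# assumed examples of intermediate outputs that need not go to the datahub
runs/**/tmp/
*.tmp
EOF
git add .gitignore
```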
What can DataPLANT do?
- `**/dataset/**`, `**/runs/**`?
- `isa.*.xlsx`, `README.md`, `run.yml`, `validation_packages.yml`, `*.cwl`?
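Reading these as a question about which paths ARC tooling could LFS-track by default and which metadata files should stay in plain git (my interpretation, not stated explicitly above), a purely hypothetical default `.gitattributes` that such tooling could generate might look like this; it is not an existing DataPLANT feature:

```bash
# Hypothetical default .gitattributes that ARC tooling could generate --
# not an existing DataPLANT feature. Everything under dataset/ and runs/
# goes to LFS; the small ARC metadata files stay in plain git.
cat > .gitattributes <<'EOF'
**/dataset/** filter=lfs diff=lfs merge=lfs -text
**/runs/**    filter=lfs diff=lfs merge=lfs -text

# keep the ARC metadata files out of LFS (later rules win for matching files)
isa.*.xlsx              -filter -diff -merge
README.md               -filter -diff -merge
run.yml                 -filter -diff -merge
validation_packages.yml -filter -diff -merge
*.cwl                   -filter -diff -merge
EOF
```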