Preupload lfs files before commiting #1699
Changes from 8 commits
@@ -431,6 +431,50 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al

For more detailed information, take a look at the [`HfApi`] reference.

### Preupload LFS files before commit

In some cases, you might want to upload huge files to S3 **before** making the commit call. For example, if you are
committing a dataset in several shards that are generated in-memory, you would need to upload the shards one by one
to avoid an out-of-memory issue. A solution is to upload each shard as a separate commit on the repo. While
perfectly valid, this solution has the drawback of potentially cluttering the git history by generating tens of commits.
To overcome this issue, you can upload your files one by one to S3 and then create a single commit at the end. This
is possible using [`preupload_lfs_files`] in combination with [`create_commit`].

<Tip warning={true}>

This is a power-user method. Directly using [`upload_file`], [`upload_folder`] or [`create_commit`] instead of handling
the low-level logic of pre-uploading files is the way to go in the vast majority of cases. The main caveat of
[`preupload_lfs_files`] is that until the commit is actually made, the uploaded files are not accessible on the repo on
the Hub. If you have a question, feel free to ping us on our Discord or in a GitHub issue.
</Tip>

Here is a simple example illustrating how to pre-upload files:

```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ...  # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```

**Member:** What happens if there is a failure on one of the operations before the commit is created?

**Contributor (Author):** If there is a failure in one of the uploads, the script will crash (unless wrapped in a try/except). If the script is restarted, already uploaded files will not need to be re-uploaded (…). Also, there was a question at some point about garbage-collecting the untracked files on S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but that would mean that if the user waits too long before creating the commit, it's lost.

**Member:** Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash? That would help implementing a fast (…)

**Contributor (Author):** It requires the hash, unfortunately. Filenames are just convenient aliases saved in git; what matters on S3 is the uniqueness of the files (i.e. based on hash). For context, until the (…)

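The hash-based deduplication described above can be sketched locally: git LFS identifies an object by the SHA-256 of its content (its "OID"), so two identical shards map to the same stored object regardless of the filename they are committed under. A minimal illustration (no Hub calls involved; the `lfs_oid` helper is hypothetical, for demonstration only):

```python
import hashlib

# Illustrative stand-in: LFS stores objects keyed by the SHA-256 of their
# content, so dedup/resume decisions are hash-based, not name-based.
def lfs_oid(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

same_a = lfs_oid(b"shard data")
same_b = lfs_oid(b"shard data")  # same bytes, even under a different filename
other = lfs_oid(b"other shard")

assert same_a == same_b  # identical content -> same object, upload skipped
assert same_a != other   # different content -> different object
```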
**Member:** Ok I see. I guess we can do a commit every N files to allow resuming from there.

**Contributor (Author):** @lhoestq Yes indeed.

**Contributor:** Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check if they are already present in the repo (mimics hashing), so "resuming from a failed (…)

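The "commit every N files" pattern discussed in this thread can be sketched as follows. This is a hypothetical helper, not part of huggingface_hub: `upload` and `commit` stand in for `preupload_lfs_files` and `create_commit` so the batching logic itself can run (and be checked) without any Hub access.

```python
# Hypothetical sketch of committing every `batch_size` shards so that a
# crashed run can resume from the last successful commit.
def upload_in_batches(shards, upload, commit, batch_size=3):
    batch = []
    for name, content in shards:
        upload(name, content)      # preupload_lfs_files in the real code
        batch.append(name)
        if len(batch) == batch_size:
            commit(batch)          # create_commit in the real code
            batch = []
    if batch:
        commit(batch)              # commit any trailing partial batch

uploaded, commits = [], []
upload_in_batches(
    [(f"shard_{i}.bin", b"...") for i in range(7)],
    upload=lambda name, content: uploaded.append(name),
    commit=lambda names: commits.append(list(names)),
)
assert len(uploaded) == 7
assert commits == [["shard_0.bin", "shard_1.bin", "shard_2.bin"],
                   ["shard_3.bin", "shard_4.bin", "shard_5.bin"],
                   ["shard_6.bin"]]
```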
First, we create the [`CommitOperationAdd`] objects one by one. In a real-world example, those would contain the
generated shards. Each file is uploaded before generating the next one. During the [`preupload_lfs_files`] step, **the
`CommitOperationAdd` object is mutated**. You should only use it to pass it directly to [`create_commit`]. The main
update of the object is that **the binary content is removed** from it, meaning that it will be garbage-collected if
you don't store another reference to it. This is expected as we don't want to keep in memory the content that is
already uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.

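The "keep a reference" point can be made concrete with a small local sketch. `FakeOperation` and `fake_preupload` below are hypothetical stand-ins, not part of huggingface_hub: they only mimic the mutation described above, where pre-uploading drops the binary content from the operation object.

```python
# Hypothetical stand-ins mimicking the mutation performed by preupload_lfs_files.
class FakeOperation:
    def __init__(self, path, content):
        self.path = path
        self.content = content

def fake_preupload(op):
    # the real call uploads op's content to S3, then clears it from the
    # object so the memory can be reclaimed
    op.content = None

operations = []
for i in range(3):
    op = FakeOperation(f"shard_{i}.bin", b"...")
    fake_preupload(op)
    operations.append(op)  # keep a reference, or the object is garbage-collected

assert all(op.content is None for op in operations)  # content freed after upload
assert [op.path for op in operations] == ["shard_0.bin", "shard_1.bin", "shard_2.bin"]
```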
## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,