Commit 61f0637
doc
1 parent 3c2d504

1 file changed: 43 additions, 0 deletions

docs/source/en/guides/upload.md
@@ -431,6 +431,49 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al
For more detailed information, take a look at the [`HfApi`] reference.

### Preupload LFS files before commit

In some cases, you might want to upload huge files to S3 **before** making the commit call. For example, if you are
committing a dataset in several shards that are generated in-memory, you would need to upload the shards one by one
to avoid an out-of-memory issue. A solution is to upload each shard as a separate commit on the repo. While perfectly
valid, this solution has the drawback of potentially cluttering the git history by generating tens of commits. To
overcome this issue, you can upload your files one by one to S3 and then create a single commit at the end. This is
possible using [`preupload_lfs_files`] in combination with [`create_commit`].

<Tip warning={true}>

This is a power-user method. Directly using [`upload_file`], [`upload_folder`] or [`create_commit`] instead of handling
the low-level logic of pre-uploading files is the way to go in the vast majority of cases. If you have a question,
feel free to ping us on our Discord or in a GitHub issue.

</Tip>

Here is a simple example illustrating how to pre-upload files:

```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ...  # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])  # upload + free memory
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```

First, we create the [`CommitOperationAdd`] objects one by one. In a real-world example, those would contain the
generated shards. Each file is uploaded before generating the next one. During the [`preupload_lfs_files`] step, **the
`CommitOperationAdd` object is mutated**. You should only use it to pass it directly to [`create_commit`]. The main
update of the object is that **the binary content is removed** from it, meaning that it will be garbage-collected if
you don't store another reference to it. This is expected as we don't want to keep in memory content that has already
been uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly, as
illustrated in the sketch below.
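For instance, here is a minimal sketch of mixing a pre-uploaded addition with a not-yet-processed delete operation in
a single commit. It assumes the `repo_id` from the snippet above; the file names and contents are hypothetical:

```py
>>> from huggingface_hub import CommitOperationAdd, CommitOperationDelete, create_commit, preupload_lfs_files

# Pre-upload a new shard; its binary content is freed once uploaded
>>> addition = CommitOperationAdd(path_in_repo="shard_new.bin", path_or_fileobj=b"...")
>>> preupload_lfs_files(repo_id, additions=[addition])

# The delete operation has not been pre-processed; `create_commit` handles it on the fly
>>> create_commit(
...     repo_id,
...     operations=[addition, CommitOperationDelete(path_in_repo="shard_old.bin")],
...     commit_message="Replace old shard",
... )
```
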
## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
