Commit 61f0637
doc
1 parent 3c2d504

1 file changed: 43 additions, 0 deletions

docs/source/en/guides/upload.md
@@ -431,6 +431,49 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al
For more detailed information, take a look at the [`HfApi`] reference.

### Preupload LFS files before commit

In some cases, you might want to upload huge files to S3 **before** making the commit call. For example, if you are
committing a dataset in several shards that are generated in-memory, you would need to upload the shards one by one
to avoid an out-of-memory issue. A solution is to upload each shard as a separate commit on the repo. While perfectly
valid, this solution has the drawback of potentially cluttering the git history by generating tens of commits. To
overcome this issue, you can upload your files one by one to S3 and then create a single commit at the end. This is
possible using [`preupload_lfs_files`] in combination with [`create_commit`].

<Tip warning={true}>

This is a power-user method. Directly using [`upload_file`], [`upload_folder`] or [`create_commit`] instead of handling
the low-level logic of pre-uploading files is the way to go in the vast majority of cases. If you have a question,
feel free to ping us on our Discord or in a GitHub issue.

</Tip>

Here is a simple example illustrating how to pre-upload files:

```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ...  # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])  # upload + free memory
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```

First, we create the [`CommitOperationAdd`] objects one by one. In a real-world example, those would contain the
generated shards. Each file is uploaded before generating the next one. During the [`preupload_lfs_files`] step, **the
`CommitOperationAdd` object is mutated**. You should only use it to pass it directly to [`create_commit`]. The main
update of the object is that **the binary content is removed** from it, meaning that it will be garbage-collected if
you don't store another reference to it. This is expected as we don't want to keep in memory content that has already
been uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly, as
illustrated in the sketch below.
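For instance, here is a minimal sketch of mixing a pre-uploaded addition with a not-yet-processed delete operation in
a single commit. It assumes the `repo_id` from the snippet above; the file names and contents are hypothetical:

```py
>>> from huggingface_hub import CommitOperationAdd, CommitOperationDelete, create_commit, preupload_lfs_files

# Pre-upload a new shard; its binary content is freed once uploaded
>>> addition = CommitOperationAdd(path_in_repo="shard_new.bin", path_or_fileobj=b"...")
>>> preupload_lfs_files(repo_id, additions=[addition])

# The delete operation has not been pre-processed; `create_commit` handles it on the fly
>>> create_commit(
...     repo_id,
...     operations=[addition, CommitOperationDelete(path_in_repo="shard_old.bin")],
...     commit_message="Replace old shard",
... )
```
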
## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
