Describe the bug
Hey!
I'm trying to optimize some code that updates an HF dataset.
In short, here's the pseudo-code that best describes what I'm doing:
```python
import os

import pandas as pd
from huggingface_hub import HfFileSystem
from tqdm import tqdm

fs = HfFileSystem(token=os.environ["HF_TOKEN"])

...

def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
    # Overwrite an existing parquet shard in place
    chunks = self._get_chunk_names()
    with fs.open(f"{self.fs_path}/train-{chunk_num:08d}-of-{len(chunks):08d}.parquet", "wb") as f:
        df.to_parquet(f)

def _new_chunk(self, df: pd.DataFrame) -> None:
    # Rename every existing shard so the "-of-NNNNNNNN" total in its name
    # is one higher, then write the new shard as the last one
    chunks = self._get_chunk_names()
    for chunk in chunks:
        key = int(chunk.split("-")[1])  # shard index parsed from the filename
        fs.mv(f"{self.fs_path}/{chunk}", f"{self.fs_path}/train-{key:08d}-of-{len(chunks)+1:08d}.parquet")
    with fs.open(f"{self.fs_path}/train-{len(chunks):08d}-of-{len(chunks)+1:08d}.parquet", "wb") as f:
        df.to_parquet(f)

...

with fs.transaction:
    for index, row in tqdm(messages.iterrows()):
        # ... append row to latest_chunk ...
        if len(latest_chunk) >= DATASET_CHUNK_SIZE:
            self._update_chunk(latest_chunk, latest_chunk_num)
            latest_chunk = pd.DataFrame(columns=schema)
            latest_chunk_num += 1
            self._new_chunk(latest_chunk)
```
What I expect `fs.transaction` to do is group the rename and write operations until the transaction ends, and then commit everything to the HF dataset in a single commit.
The issue is that, currently, the changes are not grouped: each operation is committed separately, so we hit API rate limits very quickly. We do these chunked updates because the GitHub runner we're using can't handle downloading the whole HF dataset and updating it in memory; this is the workaround we came up with. We could use larger machines, but that isn't sustainable in the long run - the dataset will eventually outgrow them.
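For illustration, here's roughly the single-commit behavior I'm after, sketched with `HfApi.create_commit` and explicit commit operations. The repo id and shard names below are placeholders, and a rename is expressed as copy + delete, which I believe only works for LFS-tracked files - as parquet shards on the Hub typically are:

```python
import io
import os

import pandas as pd
from huggingface_hub import (
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
    HfApi,
)

api = HfApi(token=os.environ["HF_TOKEN"])

# Serialize the new shard in memory instead of streaming through fsspec
buf = io.BytesIO()
new_chunk.to_parquet(buf)  # `new_chunk` is a placeholder DataFrame

operations = [
    # "Rename" an existing shard: server-side copy, then delete the old path
    CommitOperationCopy(
        src_path_in_repo="data/train-00000000-of-00000002.parquet",
        path_in_repo="data/train-00000000-of-00000003.parquet",
    ),
    CommitOperationDelete(path_in_repo="data/train-00000000-of-00000002.parquet"),
    # Add the freshly written shard
    CommitOperationAdd(
        path_in_repo="data/train-00000002-of-00000003.parquet",
        path_or_fileobj=buf.getvalue(),
    ),
]

# All operations land atomically as one commit -> one API call
api.create_commit(
    repo_id="user/my-dataset",  # placeholder
    repo_type="dataset",
    operations=operations,
    commit_message="Append a chunk in a single commit",
)
```

This is effectively what I'd want `fs.transaction` to accumulate and flush on exit.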
I've looked at the implementations in hf_api and hf_file_system, and I couldn't find a way to make the file system batch operations like this - I guess it needs server-side support.
Is this something that's possible to do? Am I missing anything?
Or maybe someone can propose some other method to push new rows to an HF dataset?
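One alternative I've been considering is to skip the renames entirely and upload each batch as its own new shard, so every update is exactly one commit. A rough, untested sketch: the `append_rows` helper and timestamp-based shard naming are hypothetical, and it assumes the dataset loader picks up every `train-*.parquet` file regardless of the "-of-N" total in the name:

```python
import io
import os
import time

import pandas as pd
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])

def append_rows(df: pd.DataFrame, repo_id: str) -> None:
    """Hypothetical helper: push a batch of rows as a brand-new shard."""
    buf = io.BytesIO()
    df.to_parquet(buf)
    # One upload == one commit; no existing shards are touched
    api.upload_file(
        path_or_fileobj=buf.getvalue(),
        path_in_repo=f"data/train-{int(time.time())}.parquet",  # hypothetical naming
        repo_id=repo_id,
        repo_type="dataset",
        commit_message="Append new rows",
    )
```

The downside is that shard counts and sizes drift over time, but it avoids the O(n) renames on every append.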
Thanks!
P.S.
Here's the PR I proposed in our repo to solve the OOM issue we were seeing: https://github.com/LAION-AI/Discord-Scrapers/pull/2/files
There are also some related discussions on the dataset itself:
https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3
https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4
Reproduction
No response
Logs
No response
System info
- huggingface_hub version: 0.18.0
- Platform: Linux-6.5.3-x64v3-xanmod1-x86_64-with-glibc2.37
- Python version: 3.11.4
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/twoabove/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.1
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.0
- pydantic: N/A
- aiohttp: 3.8.6
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /home/twoabove/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /home/twoabove/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/twoabove/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False