
HfFileSystem's transaction is working counterintuitively #1733

@TwoAbove

Description


Describe the bug

Hey!

I'm trying to optimize some code that updates an HF dataset.

In short, here's the pseudo-code that best describes what I'm doing:

import os

from huggingface_hub import HfFileSystem

fs = HfFileSystem(token=os.environ['HF_TOKEN'])

...

	def _update_chunk(self, df: pd.DataFrame, chunk_num: int) -> None:
		# Overwrite an existing chunk in place; the total chunk count stays the same
		chunks = self._get_chunk_names()
		with fs.open(f"{self.fs_path}/train-{chunk_num:08d}-of-{len(chunks):08d}.parquet", "wb") as f:
			df.to_parquet(f)

	def _new_chunk(self, df: pd.DataFrame) -> None:
		# Rename all existing chunks so their "-of-" suffix reflects the new total
		chunks = self._get_chunk_names()
		for chunk in chunks:
			key = int(chunk.split("-")[1])  # cast to int so the :08d format below works
			fs.mv(f"{self.fs_path}/{chunk}", f"{self.fs_path}/train-{key:08d}-of-{len(chunks)+1:08d}.parquet")

		# Write the new chunk as the last one
		with fs.open(f"{self.fs_path}/train-{len(chunks):08d}-of-{len(chunks)+1:08d}.parquet", "wb") as f:
			df.to_parquet(f)


...


with fs.transaction:
	for index, row in tqdm(messages.iterrows()):
		if len(latest_chunk) >= DATASET_CHUNK_SIZE:
			self._update_chunk(latest_chunk, latest_chunk_num)
			latest_chunk = pd.DataFrame(columns=schema)
			latest_chunk_num += 1
			self._new_chunk(latest_chunk)
		# ... (appending `row` to `latest_chunk` elided) ...

What I expect fs.transaction to do is group the rename and write operations until the transaction ends, then commit everything to the HF dataset in a single commit.
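
Roughly, here's the behavior I assumed (a hypothetical sketch of the semantics I expected, not of what actually happens; path and df are placeholders):

with fs.transaction:
	fs.mv(f"{path}/a.parquet", f"{path}/b.parquet")  # expected: staged, not executed yet
	with fs.open(f"{path}/c.parquet", "wb") as f:
		df.to_parquet(f)  # expected: staged, not executed yet
# expected: everything lands on the Hub as a single commit here, on exit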

The issue is that, currently, it does not group the changes - they are committed separately, and we very quickly hit API rate limits because of this. We do these chunked updates because the GitHub runner we're using can't download the HF dataset and update it in memory; this was the workaround we came up with. We could use larger machines, but that's not sustainable in the long run - as the dataset grows, we'd eventually hit the memory ceiling again.

I've looked at the implementations in hf_api and hf_file_system, and I couldn't find a way to make the transaction itself batch the changes. From what I can tell, the generic fsspec transaction only defers files opened for writing, while operations like mv go through immediately - I guess proper grouping would need server-side support.

Is this something that's possible to do? Am I missing anything?
Or maybe someone can propose another method to push new rows to an HF dataset? (The sketch below shows the kind of batching I have in mind.)
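
For what it's worth, the closest thing I've found in hf_api is HfApi.create_commit, which can group multiple add/delete/copy operations into a single commit. Here's a rough sketch of how one chunk update could be expressed that way (the repo id, paths, and dataframe are placeholders, and CommitOperationCopy assumes the parquet files are LFS-tracked):

import io
import os

import pandas as pd
from huggingface_hub import (
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
    HfApi,
)

api = HfApi(token=os.environ['HF_TOKEN'])

# Serialize the new chunk into an in-memory buffer (placeholder dataframe)
df = pd.DataFrame({"text": ["example row"]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

operations = [
    # "Rename" an existing chunk: server-side copy plus delete of the old path
    CommitOperationCopy(
        src_path_in_repo="data/train-00000000-of-00000001.parquet",
        path_in_repo="data/train-00000000-of-00000002.parquet",
    ),
    CommitOperationDelete(path_in_repo="data/train-00000000-of-00000001.parquet"),
    # Upload the new chunk
    CommitOperationAdd(
        path_in_repo="data/train-00000001-of-00000002.parquet",
        path_or_fileobj=buf,
    ),
]

# One API call -> one commit for the whole batch
api.create_commit(
    repo_id="user/my-dataset",  # placeholder
    repo_type="dataset",
    operations=operations,
    commit_message="Append a chunk and renumber existing ones",
)

The downside is that this bypasses HfFileSystem entirely and we'd have to do the chunk bookkeeping ourselves, but it would be one commit per batch instead of one per file operation.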

Thanks!

P.S.
Here's the PR I proposed in our repo to solve the OOM issue we were seeing: https://github.com/LAION-AI/Discord-Scrapers/pull/2/files
And some related discussions in the dataset itself:
https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3
https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4

Reproduction

No response

Logs

No response

System info

- huggingface_hub version: 0.18.0
- Platform: Linux-6.5.3-x64v3-xanmod1-x86_64-with-glibc2.37
- Python version: 3.11.4
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/twoabove/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.1
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.0
- pydantic: N/A
- aiohttp: 3.8.6
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /home/twoabove/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /home/twoabove/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/twoabove/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False

Metadata


    Labels

    bug (Something isn't working), enhancement (New feature or request)
