@albertvillanova
Member

@albertvillanova albertvillanova commented Jul 9, 2021

Add support for (streaming) remote data files:

from datasets import load_dataset
data_files = f"https://huggingface.co/datasets/{repo_id}/resolve/main/{relative_file_path}"
ds = load_dataset("json", split="train", data_files=data_files, streaming=True)
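
For illustration, here is a minimal sketch of how the snippet above might be used end to end. The helper names `hub_file_url` and `preview_remote_json` are hypothetical and not part of the `datasets` API; the sketch only assumes that the streaming dataset returned by `load_dataset(..., streaming=True)` can be consumed by iterating over it.

```python
from itertools import islice


def hub_file_url(repo_id: str, relative_file_path: str, revision: str = "main") -> str:
    """Build the 'resolve' URL of a file stored in a Hub dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{relative_file_path}"


def preview_remote_json(url: str, n: int = 3) -> list:
    """Return the first n examples of a remote JSON file by streaming it."""
    # Imported lazily so the URL helper above works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset("json", split="train", data_files=url, streaming=True)
    # A streaming dataset is consumed by iteration; islice stops after n examples.
    return list(islice(ds, n))
```

Calling `preview_remote_json(hub_file_url(repo_id, relative_file_path))` with a real repository and file path should return the first few examples as a list, reading only as much of the remote file as needed.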

cc: @thomwolf

@albertvillanova albertvillanova requested a review from lhoestq July 9, 2021 14:07
@albertvillanova
Member Author

@lhoestq maybe we could also use (if available) the ETag of the remote file in create_config_id?

@albertvillanova albertvillanova added the enhancement New feature or request label Jul 9, 2021
@albertvillanova albertvillanova added this to the 1.10 milestone Jul 9, 2021
@lhoestq
Member

lhoestq commented Jul 9, 2021

> @lhoestq maybe we could also use (if available) the ETag of the remote file in create_config_id?

Sure! We can get the ETag with:

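from datasets.utils.file_utils import get_authentication_headers_for_url, http_head  # helpers available at the time of this PR
headers = get_authentication_headers_for_url(url, use_auth_token=use_auth_token)  # auth for private repos
etag = http_head(url, headers=headers).headers.get("ETag")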

Since the config_id is computed in DatasetBuilder.__init__, we would need to add a new use_auth_token parameter to DatasetBuilder.__init__.

Does that sound good? We can add this in a follow-up PR.
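
To make the idea discussed above concrete, here is a rough sketch of how an ETag could be folded into a configuration identifier, so that the identifier changes whenever the remote file changes. This is not the actual `create_config_id` implementation: `fetch_etag` and `config_id_suffix` are hypothetical names, the hashing scheme is an assumption, and the standard library is used in place of the `datasets` helpers so that the example is self-contained.

```python
import hashlib
import json
from typing import Iterable, Mapping, Optional
from urllib.request import Request, urlopen


def fetch_etag(url: str, headers: Optional[Mapping[str, str]] = None, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the ETag header, or None if the server sends none."""
    # `headers` may carry an authorization header for private repositories.
    request = Request(url, headers=dict(headers or {}), method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag")


def config_id_suffix(data_files: Iterable[str], etags: Mapping[str, Optional[str]]) -> str:
    """Derive a short, deterministic suffix from the data file URLs and their ETags."""
    urls = sorted(data_files)
    # Files without a known ETag contribute None, so only the URL affects the hash.
    payload = {"data_files": urls, "etags": {url: etags.get(url) for url in urls}}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]
```

With this scheme, two calls with the same URLs and ETags yield the same suffix, while a changed ETag for any file yields a different one, which is the property that would let a cache detect an updated remote file.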

@lhoestq lhoestq merged commit 2fad5e1 into huggingface:master Jul 9, 2021