Description
Describe the bug
When Dataset.from_generator is used with streaming=False, the internal logic calls download_and_prepare, which attempts to download from HF GCS. This is redundant because the user has already provided the generator from which the data should be drawn.
If someone calls Dataset.from_generator from an environment without external internet access (for example, an internal production machine) and does not set HF_DATASETS_OFFLINE=1, the process gets stuck while trying to establish a connection.
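Until this is fixed, the workaround implied above is to set HF_DATASETS_OFFLINE=1. The following is a minimal sketch, not an official recipe. It assumes the variable is read when datasets is first imported, so it has to be set before that import. The helper names enable_offline_mode and build_dataset are hypothetical and exist only for illustration.

```python
import os
import sys


def enable_offline_mode() -> bool:
    """Set HF_DATASETS_OFFLINE=1 for this process.

    Returns True if `datasets` had not been imported yet, i.e. the flag
    was set early enough to be picked up at import time (assumption:
    the library reads the variable when it is first imported).
    """
    already_imported = "datasets" in sys.modules
    os.environ["HF_DATASETS_OFFLINE"] = "1"
    return not already_imported


def gen():
    # Same generator as in the reproduction steps of this issue.
    for _ in range(100):
        yield {"text": "dummy text"}


def build_dataset():
    # Import lazily so the environment variable is already in place.
    import datasets

    return datasets.Dataset.from_generator(gen)


flag_set_early = enable_offline_mode()
```

Calling build_dataset() afterwards should then build the dataset from the generator without waiting on a network connection. If flag_set_early is False, datasets was imported before the flag was set, and exporting the variable in the shell before starting Python is the safer option.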
Steps to reproduce the bug
import datasets

def gen():
    for _ in range(100):
        yield {"text": "dummy text"}

dataset = datasets.Dataset.from_generator(gen)

Running this minimal example in any environment that doesn't have access to HF GCS can trigger the error.
Expected behavior
try_from_hf_gcs should be set to False here
datasets/src/datasets/io/generator.py, line 51 in c9c1166:
# try_from_hf_gcs=try_from_hf_gcs,
Environment info
- datasets version: 2.14.4
- Platform: Linux-3.10.0-1160.90.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.12
- Huggingface_hub version: 0.17.1
- PyArrow version: 12.0.1
- Pandas version: 2.0.3