Closed
### Describe the bug
According to the docs of `Dataset.from_generator`:

> gen_kwargs(`dict`, *optional*):
> Keyword arguments to be passed to the `generator` callable.
> You can define a sharded dataset by passing the list of shards in `gen_kwargs`.
So I'd expect that if `gen_kwargs` were a list of dicts, my generator would be called once per element, with that element's dict passed as its keyword arguments.
It doesn't work that way, though.
### Steps to reproduce the bug
```python
#!/usr/bin/python
from pathlib import Path
import datasets


def process_yaml(file):
    yield dict(example=42)


if __name__ == '__main__':
    import sys
    dir = Path(sys.argv[0]).parent
    ds = datasets.Dataset.from_generator(process_yaml, gen_kwargs=[{'file':f} for f in dir.glob('*.yml')],
                                         )
    ds.to_json('training.jsonl')
```
```
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/tmp/dataset_bug.py", line 13, in <module>
    ds = datasets.Dataset.from_generator(process_yaml, gen_kwargs=[{'file':f} for f in dir.glob('*.yml')],
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 1072, in from_generator
    ).read()
    ^^^^^^
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/io/generator.py", line 47, in read
    self.builder.download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1656, in _prepare_split_single
    generator = self._generate_examples(**gen_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: datasets.packaged_modules.generator.generator.Generator._generate_examples() argument after ** must be a mapping, not list
```
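
The final `TypeError` can be reproduced without `datasets` at all. The following is only a minimal sketch, assuming the builder does nothing more than unpack `gen_kwargs` with `**` as the last traceback frame suggests; `call_with_kwargs` is a hypothetical stand-in, not a function from the library:

```python
def process_yaml(file):
    # Same toy generator as in the reproduction script.
    yield dict(example=42)


def call_with_kwargs(generator, gen_kwargs):
    # Hypothetical stand-in for the `**gen_kwargs` unpacking seen in the
    # last traceback frame; not the library's actual code.
    return generator(**gen_kwargs)


# A dict unpacks into keyword arguments, so the generator runs normally.
rows = list(call_with_kwargs(process_yaml, {"file": "a.yml"}))

# A list cannot be unpacked with **, so Python raises TypeError at the
# call itself, before the generator body ever runs.
try:
    call_with_kwargs(process_yaml, [{"file": "a.yml"}])
    error = None
except TypeError as exc:
    error = str(exc)
```

This suggests that a list passed as `gen_kwargs` is never iterated element by element; it fails as soon as it is unpacked.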
### Expected behavior
I would expect `process_yaml` to be called once for each YAML file in the directory containing the script.
I also tried putting the list inside `gen_kwargs`, but in that case `process_yaml` gets called with the whole list.
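
For comparison, the sharding sentence in the docs seems to describe keeping `gen_kwargs` a dict whose values are lists, with the generator looping over the list itself. The following is a sketch of that pattern for this case, not a confirmed fix; `process_yaml_files` and `build_dataset` are hypothetical names, and the per-file parsing is stubbed out just as in the reproduction script:

```python
from pathlib import Path


def process_yaml_files(files):
    # The generator receives a list of files and iterates over it itself,
    # instead of being called once per file.
    for file in files:
        # Real YAML parsing would go here; stubbed as in the script above.
        yield dict(example=42, file=str(file))


def build_dataset(directory):
    # Imported lazily so the generator can be exercised without `datasets`.
    import datasets

    # Plain strings keep the kwargs simple to serialize.
    files = [str(p) for p in sorted(Path(directory).glob("*.yml"))]
    # gen_kwargs stays a dict; the list of files is a value inside it.
    return datasets.Dataset.from_generator(
        process_yaml_files, gen_kwargs={"files": files}
    )
```

With this shape, being "called with a list" is the expected behavior rather than a failure: the loop over files lives inside the generator.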
### Environment info
- `datasets` version: 2.14.6.dev0 (git commit 0cc77d7f45c7369; also tested with 2.14.0)
- Platform: Linux-6.1.0-10-amd64-x86_64-with-glibc2.36
- Python version: 3.11.2
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 2.0.3