Dataset.from_generator raises with sharded gen_args #6270

@hartmans

Describe the bug

According to the docs of Dataset.from_generator:

        gen_kwargs(`dict`, *optional*):
            Keyword arguments to be passed to the `generator` callable.
            You can define a sharded dataset by passing the list of shards in `gen_kwargs`.

So I'd expect that if gen_kwargs were a list of dicts, my generator would be called once per element, with that element's dict as its keyword arguments.
It doesn't work that way, though: passing a list raises a TypeError.

Steps to reproduce the bug

#!/usr/bin/python

from pathlib import Path
import datasets

def process_yaml(file):
    yield dict(example=42)


if __name__ == '__main__':
    import sys
    dir = Path(sys.argv[0]).parent
    ds = datasets.Dataset.from_generator(process_yaml, gen_kwargs=[{'file':f} for f in dir.glob('*.yml')],
        )
    ds.to_json('training.jsonl')
    
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/tmp/dataset_bug.py", line 13, in <module>
    ds = datasets.Dataset.from_generator(process_yaml, gen_kwargs=[{'file':f} for f in dir.glob('*.yml')],
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 1072, in from_generator
    ).read()
      ^^^^^^
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/io/generator.py", line 47, in read
    self.builder.download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/hartmans/ai/venv/lib/python3.11/site-packages/datasets/builder.py", line 1656, in _prepare_split_single
    generator = self._generate_examples(**gen_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: datasets.packaged_modules.generator.generator.Generator._generate_examples() argument after ** must be a mapping, not list
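The last frame of the traceback shows gen_kwargs being unpacked with `**`, which only accepts a mapping. The error can be reproduced without `datasets` at all. This is a minimal sketch: `generate_examples` and `try_unpack` are hypothetical stand-ins for illustration, not the library's code.

```python
def generate_examples(**gen_kwargs):
    # Hypothetical stand-in for the builder method named in the traceback.
    return gen_kwargs


def try_unpack(gen_kwargs):
    # Unpack with ** as the traceback shows; report the error text on failure.
    try:
        return generate_examples(**gen_kwargs)
    except TypeError as err:
        return str(err)


# A list of dicts cannot be unpacked with **.
list_result = try_unpack([{"file": "a.yml"}, {"file": "b.yml"}])
# A dict whose value is a list unpacks fine; the callee receives the whole list.
dict_result = try_unpack({"file": ["a.yml", "b.yml"]})

print(list_result)
print(dict_result)
```

So whatever is passed as gen_kwargs has to be a dict at the top level; any list of shards has to sit inside it as a value.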


Expected behavior

I would expect process_yaml to be called once for each YAML file in the directory containing the script.
I also tried keeping gen_kwargs a dict and passing the list as a value inside it, but in that case process_yaml is called once with the whole list.
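Since the generator is called with the whole list in that case, one workaround is to loop over the list inside the generator. The sketch below assumes this is what the docs mean by "passing the list of shards in `gen_kwargs`". The parameter name `files` and the helper `build_dataset` are made up for illustration, and the note about `num_proc` is an assumption rather than something verified here.

```python
from pathlib import Path


def process_yaml(files):
    # `files` is a list of paths. Assumption: with num_proc > 1, each worker
    # would receive a sub-list of it rather than the full list.
    for file in files:
        yield {"file": str(file), "example": 42}


def build_dataset(directory):
    # Imported here so the generator above can be tried without `datasets` installed.
    import datasets

    files = sorted(str(p) for p in Path(directory).glob("*.yml"))
    # gen_kwargs stays a dict; the list of shards is one of its values.
    return datasets.Dataset.from_generator(process_yaml, gen_kwargs={"files": files})
```

With this shape, `build_dataset(".").to_json("training.jsonl")` would replace the last two statements of the script above, yielding one example per file.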


Environment info

- `datasets` version: 2.14.6.dev0 (git commit 0cc77d7f45c7369; also tested with 2.14.0)
- Platform: Linux-6.1.0-10-amd64-x86_64-with-glibc2.36
- Python version: 3.11.2
- Huggingface_hub version: 0.16.4
- PyArrow version: 12.0.1
- Pandas version: 2.0.3
