DatasetBuilder._split_generators incomplete type annotation #6798

@JonasLoos

Describe the bug

The DatasetBuilder._split_generators function currently has the following signature:

class DatasetBuilder:
    def _split_generators(self, dl_manager: DownloadManager):
        ...

However, the dl_manager argument can also be of type StreamingDownloadManager, which behaves differently. For example, its download function doesn't download anything; it just returns the given URL(s).

I suggest changing the function signature to:

class DatasetBuilder:
    def _split_generators(self, dl_manager: Union[DownloadManager, StreamingDownloadManager]):
        ...

and also adjusting the docstring accordingly.
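
To make the proposal concrete, here is a minimal sketch of what the widened annotation and docstring might look like. It is self-contained and does not import datasets: the two manager classes below are stand-ins written only for illustration, the cache path and URL are hypothetical, and the method returns plain strings instead of SplitGenerator objects to keep the example short.

```python
from typing import List, Union, get_type_hints


# Stand-ins for illustration only; the real classes live in the datasets library.
class DownloadManager:
    def download(self, url: str) -> str:
        # Placeholder behavior: pretend the file was fetched to a local cache path.
        return "/hypothetical/cache/" + url.rsplit("/", 1)[-1]


class StreamingDownloadManager:
    def download(self, url: str) -> str:
        # As described in this issue: nothing is downloaded, the URL is returned as-is.
        return url


class DatasetBuilder:
    def _split_generators(
        self, dl_manager: Union[DownloadManager, StreamingDownloadManager]
    ) -> List[str]:
        """Simplified stand-in for the real method.

        Args:
            dl_manager: Either a `DownloadManager` or, in streaming mode,
                a `StreamingDownloadManager`.
        """
        # Both manager types expose `download`, but they return different things.
        return [dl_manager.download("https://example.com/data.csv")]


# Inspect the annotation to confirm both manager types are covered.
hints = get_type_hints(DatasetBuilder._split_generators)
print(hints["dl_manager"])
```

With the Union in place, a type checker accepts either manager being passed in, and the docstring tells builder authors that the value returned by download may be a local path or a URL.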

I would like to create a Pull Request to fix this, and have the following questions:

  • Are there other options besides DownloadManager and StreamingDownloadManager?
  • Should this also be changed in other functions?

Steps to reproduce the bug

Minimal example to print the different class names:

import tempfile
from datasets import load_dataset

example = b'''
from datasets import GeneratorBasedBuilder, DatasetInfo, Features, Value, SplitGenerator

class Test(GeneratorBasedBuilder):
    def _info(self):
        return DatasetInfo(features=Features({"x": Value("int64")}))
    def _split_generators(self, dl_manager):
        print(type(dl_manager))
        return [SplitGenerator('test')]
    def _generate_examples(self):
        yield 0, {'x': 42}
'''

with tempfile.NamedTemporaryFile(suffix='.py') as f:
    f.write(example)
    f.flush()
    load_dataset(f.name, streaming=False)
    load_dataset(f.name, streaming=True)

Expected behavior

Complete type annotations: the dl_manager annotation should cover both DownloadManager and StreamingDownloadManager.

Environment info

/
