Raise a proper exception when trying to stream a dataset that requires manually downloading files #2749

@severo

Describe the bug

At least for 'reclor', 'telugu_books', 'turkish_movie_sentiment', 'ubuntu_dialogs_corpus' and 'wikihow', calling load_dataset with streaming=True raises a TypeError whose message gives no indication of why it fails.

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("reclor", streaming=True)
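
To check all of the datasets named above in one run, a small helper along these lines may be handy. It is only a sketch: `AFFECTED` and `collect_streaming_errors` are names made up here for illustration, and the loader is passed in as an argument so the helper can be tried without network access.

```python
from typing import Callable, Dict, Iterable

# Dataset names taken from the bug description above.
AFFECTED = [
    "reclor",
    "telugu_books",
    "turkish_movie_sentiment",
    "ubuntu_dialogs_corpus",
    "wikihow",
]


def collect_streaming_errors(names: Iterable[str], loader: Callable) -> Dict[str, str]:
    """Call loader(name, streaming=True) for each name and record the outcome.

    Hypothetical helper, not part of the datasets library.
    """
    results = {}
    for name in names:
        try:
            loader(name, streaming=True)
        except Exception as err:  # broad on purpose: we want to see whatever is raised
            results[name] = f"{type(err).__name__}: {err}"
        else:
            results[name] = "ok"
    return results


# Usage (needs the `datasets` package and network access):
# from datasets import load_dataset
# print(collect_streaming_errors(AFFECTED, load_dataset))
```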

Expected results

Ideally: raise a specific exception, something like ManualDownloadError.

Or at least give the reason in the message, as when we load in normal mode:

from datasets import load_dataset
dataset = load_dataset("reclor")
AssertionError: The dataset reclor with config default requires manual data.
 Please follow the manual download instructions:   to use ReClor you need to download it manually. Please go to its homepage (http://whyu.me/reclor/) fill the google
  form and you will receive a download link and a password to extract it.Please extract all files in one folder and use the path folder in datasets.load_dataset('reclor', data_dir='path/to/folder/folder_name')
  .
 Manual data can be loaded with `datasets.load_dataset(reclor, data_dir='<path/to/manual/data>')
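
As a rough illustration of the first option, the check could look something like the sketch below. This is not the library's actual code: `ManualDownloadError` is the name suggested above, while `check_streamable` and its parameters are hypothetical and only stand in for wherever the streaming code path learns that a dataset needs manual files.

```python
from typing import Optional


class ManualDownloadError(Exception):
    """Hypothetical exception: the dataset needs files the user must download by hand."""

    def __init__(self, dataset_name: str, instructions: Optional[str] = None):
        self.dataset_name = dataset_name
        self.instructions = instructions
        message = f"The dataset {dataset_name} requires manual data and no data_dir was given."
        if instructions:
            message += f" Please follow the manual download instructions: {instructions}"
        super().__init__(message)


def check_streamable(
    dataset_name: str,
    manual_download_instructions: Optional[str],
    data_dir: Optional[str],
) -> None:
    """Hypothetical guard, meant to run before any path handling.

    Raises ManualDownloadError when the dataset declares manual download
    instructions but the caller did not pass a data_dir.
    """
    if manual_download_instructions is not None and data_dir is None:
        raise ManualDownloadError(dataset_name, manual_download_instructions)
```

With a guard like this, the failure would name the dataset and repeat the download instructions instead of surfacing as a TypeError about NoneType.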

Actual results

TypeError: expected str, bytes or os.PathLike object, not NoneType

Environment info

  • datasets version: 1.11.0
  • Platform: macOS-11.5-x86_64-i386-64bit
  • Python version: 3.8.11
  • PyArrow version: 4.0.1

Labels

bug
