Better error message when using the wrong load_from_disk #2437

lhoestq · 2021-06-01T09:43:22Z

As mentioned in #2424, the error message when one tries to use Dataset.load_from_disk to load a DatasetDict object (or vice versa) can be improved. I added a suggestion in the error message to let users know that they should use the other one.
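A minimal sketch of this kind of check, for illustration only. It is not the code from this PR: the marker file names and the helper load_single_dataset are assumptions, and the loader only reads a small JSON file instead of real data.

```python
import json
import tempfile
from pathlib import Path

# Assumed marker file names, used here for illustration only.
DICT_MARKER = "dataset_dict.json"
DATASET_MARKER = "state.json"


def load_single_dataset(path):
    """Hypothetical loader for a directory that holds a single dataset."""
    root = Path(path)
    if (root / DATASET_MARKER).is_file():
        return json.loads((root / DATASET_MARKER).read_text())
    if (root / DICT_MARKER).is_file():
        # The directory holds several splits: point the user to the other loader.
        raise FileNotFoundError(
            f"No single dataset found in '{root}', but a dictionary of splits was. "
            "Please use the dictionary loader instead."
        )
    raise FileNotFoundError(f"'{root}' does not look like a saved dataset.")


def demo_error_message():
    """Build a fake split-dictionary directory and return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / DICT_MARKER).write_text(json.dumps({"splits": ["train", "test"]}))
        try:
            load_single_dataset(tmp)
        except FileNotFoundError as err:
            return str(err)
    return None
```

The point of the sketch is only that the loader can recognize the "other" layout and say so, rather than failing with a generic missing-file error.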

thomwolf · 2021-06-01T09:47:00Z

We also have other cases where people are lost between Dataset and DatasetDict, maybe let's gather and solve them all here?

For instance, I remember that some people thought they were requesting a single element of a split when they were actually calling this on a DatasetDict. Maybe we could also add a better error message here when the requested split is not in the dict, pointing to the list of splits and to the fact that this is a DatasetDict containing several datasets?
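The kind of message suggested above could look like the following sketch. SplitDict is a toy stand-in written for this example, not the library's DatasetDict, and the wording of the messages is an assumption.

```python
class SplitDict(dict):
    """Toy stand-in for a dictionary of splits (not the library class)."""

    def __getitem__(self, key):
        if isinstance(key, str) and key in self:
            return super().__getitem__(key)
        splits = sorted(self)
        if isinstance(key, str):
            raise KeyError(f"Unknown split '{key}'. Available splits: {splits}")
        # Integers and slices only make sense once a split has been selected.
        hint = f"my_splits['{splits[0]}'][{key!r}]" if splits else "my_splits['train'][0]"
        raise KeyError(
            f"Invalid key: {key!r}. This object contains several datasets; "
            f"please select a split first, for example {hint}. "
            f"Available splits: {splits}"
        )


splits = SplitDict(train=[{"text": "a"}, {"text": "b"}], test=[{"text": "c"}])
first_train_example = splits["train"][0]  # works: split first, then index

try:
    splits[0]  # common mistake: indexing the dictionary of splits directly
except KeyError as err:
    message = err.args[0]
```

Listing the available splits in the message lets the user fix the call without having to inspect the object first.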

lhoestq · 2021-06-01T10:12:17Z

Good idea, let me add a better error message for this case too

albertvillanova · 2021-06-01T10:49:15Z

As a digression from the topic of this PR, IMHO the difference between Dataset and DatasetDict is an additional layer of abstraction that confuses "typical" end users. I think a user expects a "Dataset" (whether it contains multiple splits or a single one), and it could be interesting to simplify the user-facing API as much as possible to hide this complexity from the end user.

I don't know your opinion about this, but it might be worth discussing...

For example, I really like the approach of using the function load_from_disk, which hides the previously mentioned complexity and handles under the hood whether a Dataset or a DatasetDict instance should be created...

lhoestq · 2021-06-01T17:09:53Z

I totally agree, I just haven't found a solution that doesn't imply major breaking changes x)

thomwolf · 2021-06-02T12:18:32Z

Yes I would also like to find a better solution. Do we have any solution actually? (even implying breaking changes)

Here is a proposal for discussion and refinement (and potential abandonment if it's not good enough):

let's consider that a DatasetDict is also a Dataset with the various splits concatenated one after the other
let's disallow the use of integers in split names (probably not a very big breaking change)
when you index with integers, you access the examples progressively, one split after the other (in a deterministic order)
when you index with strings/split name you have the same behavior as now (full backward compat)
let's then also have all the methods of a Dataset on the DatasetDict
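To make the proposal easier to discuss, here is a toy sketch of the indexing rules listed above. MergedDataset is a hypothetical class written for this example; splits are plain lists of dicts, and the deterministic order is assumed to be the insertion order of the splits.

```python
from bisect import bisect_right
from itertools import accumulate


class MergedDataset:
    """Toy sketch of the proposal; not an existing class in the library."""

    def __init__(self, splits):
        for name in splits:
            # Split names made only of digits are rejected to avoid ambiguity.
            if name.isdigit():
                raise ValueError(f"Split names cannot be integers: '{name}'")
        self._splits = dict(splits)  # insertion order = assumed deterministic order
        # Cumulative end offsets, e.g. split lengths [2, 1] give [2, 3].
        self._ends = list(accumulate(len(rows) for rows in self._splits.values()))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._splits[key]  # same behavior as today: select a split
        if not isinstance(key, int):
            raise TypeError(f"Unsupported key type: {type(key).__name__}")
        index = key + len(self) if key < 0 else key
        if not 0 <= index < len(self):
            raise IndexError(f"Index {key} out of range for {len(self)} examples")
        # Find the split whose range contains the global index.
        split_id = bisect_right(self._ends, index)
        start = self._ends[split_id - 1] if split_id else 0
        return list(self._splits.values())[split_id][index - start]


merged = MergedDataset({"train": [{"x": 1}, {"x": 2}], "test": [{"x": 3}]})
train_split = merged["train"]  # string key: the split, as before
third_example = merged[2]      # integer key: first example of "test"
```

The sketch only covers indexing; exposing all Dataset methods on the merged object, as in the last point, is left out.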

thomwolf · 2021-06-02T12:26:58Z

The end goal would be to merge the Dataset and DatasetDict objects into a single object that would be (pretty much totally) backward compatible with both.

lhoestq · 2021-06-02T12:39:39Z

I like the direction :) I think it can make sense to concatenate them.

There are a few things that we could discuss if we want to merge Dataset and DatasetDict:

what happens if you index by a string? Does it return the column or the split? We could disallow conflicts between column names and split names to avoid ambiguities. Still, it can be surprising to be able to get either a column or a split using the same indexing feature

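from datasets import load_dataset

dataset = load_dataset(...)
dataset["train"]  # string key used as a split name (DatasetDict behavior)
dataset["input_ids"]  # string key used as a column name (Dataset behavior)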

what happens when you iterate over the object? I guess it should iterate over the examples, as a Dataset object does, but a DatasetDict currently iterates over the splits, since they are the dictionary keys. This would be a breaking change that we can discuss.
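Both questions can be illustrated with plain Python containers. In this sketch the helpers find_name_conflicts and iter_examples are hypothetical names, and the splits are ordinary lists of dicts rather than real Dataset objects.

```python
def find_name_conflicts(split_names, column_names):
    """Return the sorted names that are used both as a split and as a column."""
    return sorted(set(split_names) & set(column_names))


def iter_examples(splits):
    """Yield examples one split after the other (merged-object style iteration)."""
    for rows in splits.values():
        yield from rows


# Toy data: two splits sharing a single column.
splits = {
    "train": [{"input_ids": [1, 2]}, {"input_ids": [3]}],
    "test": [{"input_ids": [4]}],
}
column_names = ["input_ids"]

# No name is both a split and a column here, so string indexing is unambiguous.
conflicts = find_name_conflicts(splits, column_names)

# Dictionary-style iteration yields the keys, i.e. the split names.
keys_seen = list(splits)

# Dataset-style iteration would yield the examples instead.
examples_seen = list(iter_examples(splits))
```

Code that currently loops over a DatasetDict expecting split names would receive examples instead, which is the breaking change in question.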

Moreover regarding your points:

integers are not allowed as split names already
it's definitely doable to have all the methods. Some of them, like train_test_split, which is currently only available for Dataset, may need to be tweaked to work on a dataset that already has splits

lhoestq · 2021-06-07T08:51:54Z

Instead of suggesting the use of Dataset.load_from_disk and DatasetDict.load_from_disk, the error message now suggests using datasets.load_from_disk directly
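A single entry point can work because the saved directory itself tells which kind of object it holds. The sketch below shows that dispatch idea under stated assumptions: the marker file names and the helper detect_saved_kind are illustrative, and this is not the library's implementation.

```python
import tempfile
from pathlib import Path

# Assumed marker file names, used here for illustration only.
DICT_MARKER = "dataset_dict.json"
DATASET_MARKER = "state.json"


def detect_saved_kind(path):
    """Return which kind of object a saved directory appears to contain."""
    root = Path(path)
    if (root / DICT_MARKER).is_file():
        return "DatasetDict"
    if (root / DATASET_MARKER).is_file():
        return "Dataset"
    raise FileNotFoundError(
        f"'{root}' contains neither {DATASET_MARKER} nor {DICT_MARKER}."
    )


def demo_detection():
    """Create one fake directory per kind and report what is detected."""
    kinds = []
    for marker in (DATASET_MARKER, DICT_MARKER):
        with tempfile.TemporaryDirectory() as tmp:
            (Path(tmp) / marker).write_text("{}")
            kinds.append(detect_saved_kind(tmp))
    return kinds
```

With this kind of dispatch the user never has to know in advance which class wrote the directory.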

lhoestq · 2021-06-08T18:03:45Z

Merging the error message improvement; feel free to continue the discussion here or in a GitHub issue

lhoestq added 3 commits June 1, 2021 11:26

better error message when using the wrong load_from_disk

6358dbd

better message

d5ba284

fix

5361e4d

lhoestq mentioned this pull request Jun 1, 2021

load_from_disk and save_to_disk are not compatible with each other #2424

Closed

lhoestq mentioned this pull request Jun 1, 2021

Better error message when trying to access elements of a DatasetDict without specifying the split #2439

Merged

lhoestq added 2 commits June 7, 2021 10:49

Update arrow_dataset.py

609051a

Update dataset_dict.py

9c33081

lhoestq merged commit 1ea2239 into master Jun 8, 2021

lhoestq deleted the error-message-when-using-wrong-load_from_disk branch June 8, 2021 18:03

albertvillanova mentioned this pull request Jun 8, 2021

Merge DatasetDict and Dataset #2462

Open


4 participants
