Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Jun 1, 2021

As mentioned in #2424, the error message when one tries to use Dataset.load_from_disk to load a DatasetDict object (or vice versa) can be improved. I added a suggestion in the error message to let users know that they should use the other one.

@thomwolf
Copy link
Member

thomwolf commented Jun 1, 2021

We also have other cases where people are lost between Dataset and DatasetDict, maybe let's gather and solve them all here?

For instance, I remember that some people thought they would request a single element of a split but are calling this on a DatasetDict. Maybe here also a better error message when the split requested in not in the dict? pointing to the list of split and the fact that this is a datasetdict containing several datasets?

@lhoestq
Copy link
Member Author

lhoestq commented Jun 1, 2021

Good idea, let me add a better error message for this case too

@albertvillanova
Copy link
Member

As a digression from the topic of this PR, IMHO I think that the difference between Dataset and DatasetDict is an additional abstraction complexity that confuses "typical" end users. I think a user expects a "Dataset" (whatever it contains multiple or a single split) and maybe it could be interesting to try to simplify the user-facing API as much as possible to hide this complexity from the end user.

I don't know your opinion about this, but it might be worth discussing...

For example, I really like the line of the solution of using the function load_from_disk, which hides the previous mentioned complexity and handles under the hood whether Dataset/DatasetDict instances should be created...

@lhoestq
Copy link
Member Author

lhoestq commented Jun 1, 2021

I totally agree, I just haven't found a solution that doesn't imply major breaking changes x)

@thomwolf
Copy link
Member

thomwolf commented Jun 2, 2021

Yes I would also like to find a better solution. Do we have any solution actually? (even implying breaking changes)

Here is a proposal for discussion and refined (and potential abandon if it's not good enough):

  • let's consider that a DatasetDict is also a Dataset with the various split concatenated one after the other
  • let's disallow the use of integers in split names (probably not a very big breaking change)
  • when you index with integers you access the examples progressively in split after the other is finished (in a deterministic order)
  • when you index with strings/split name you have the same behavior as now (full backward compat)
  • let's then also have all the methods of a Dataset on the DatasetDict

@thomwolf
Copy link
Member

thomwolf commented Jun 2, 2021

The end goal would be to merge both Dataset and DatasetDict object in a single object that would be (pretty much totally) backward compatible with both.

@lhoestq
Copy link
Member Author

lhoestq commented Jun 2, 2021

I like the direction :) I think it can make sense to concatenate them.

There are a few things that I we could discuss if we want to merge Dataset and DatasetDict:

  1. what happens if you index by a string ? Does it return the column or the split ? We could disallow conflicts between column names and split names to avoid ambiguities. It can be surprising to be able to get a column or a split using the same indexing feature
from datasets import load_dataset

dataset = load_dataset(...)
dataset["train"]
dataset["input_ids"]
  1. what happens when you iterate over the object ? I guess it should iterate over the examples as a Dataset object, but a DatasetDict used to iterate over the splits as they are the dictionary keys. This is a breaking change that we can discuss.

Moreover regarding your points:

  • integers are not allowed as split names already
  • it's definitely doable to have all the methods. Maybe some of them like train_test_split that is currently only available for Dataset can be tweaked to work for a split dataset

@lhoestq
Copy link
Member Author

lhoestq commented Jun 7, 2021

Instead of suggesting the use of Dataset.load_from_disk and DatasetDict.load_from_disk, the error message now suggests to use datasets.load_from_disk directly

@lhoestq
Copy link
Member Author

lhoestq commented Jun 8, 2021

Merging the error message improvement, feel free to continue the discussion here or in a github issue

@lhoestq lhoestq merged commit 1ea2239 into master Jun 8, 2021
@lhoestq lhoestq deleted the error-message-when-using-wrong-load_from_disk branch June 8, 2021 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants