Better error message when using the wrong load_from_disk #2437
|
We also have other cases where people are lost between Dataset and DatasetDict, maybe let's gather and solve them all here? For instance, I remember that some people thought they were requesting a single element of a split but were actually calling this on a DatasetDict. Maybe we could also give a better error message here when the requested split is not in the dict, pointing to the list of splits and to the fact that this is a DatasetDict containing several datasets? |
|
Good idea, let me add a better error message for this case too |
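
The kind of message discussed above could look roughly like the following. This is only a sketch, not the code merged in this PR: `SplitDict` is a hypothetical stand-in for DatasetDict, built on a plain `dict`, and the wording of the message is an assumption.

```python
class SplitDict(dict):
    """Hypothetical stand-in for a DatasetDict: maps split names to datasets."""

    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            # List the available splits so the user sees that this object
            # holds several datasets rather than a single one.
            available = sorted(self)
            raise KeyError(
                f"Invalid key: {key!r}. This object contains several datasets, "
                f"one per split. Select a split first. Available splits: {available}"
            ) from None


splits = SplitDict(train=["a", "b"], test=["c"])
first_train_example = splits["train"][0]  # select the split, then the element
```

Indexing with `splits[0]` raises a `KeyError` whose message names the available splits instead of only echoing the bad key.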
|
As a digression from the topic of this PR: IMHO, the difference between Dataset and DatasetDict is an additional layer of abstraction that confuses "typical" end users. I think a user expects a "Dataset" (whether it contains multiple splits or a single one), and it could be interesting to simplify the user-facing API as much as possible to hide this complexity from the end user. I don't know your opinion about this, but it might be worth discussing... For example, I really like the line of the solution of using the function |
|
I totally agree, I just haven't found a solution that doesn't imply major breaking changes x) |
|
Yes, I would also like to find a better solution. Do we have any solution actually (even one implying breaking changes)? Here is a proposal for discussion and refinement (and potential abandonment if it's not good enough):
|
|
The end goal would be to merge both |
|
I like the direction :) I think it can make sense to concatenate them. There are a few things that we could discuss if we want to merge Dataset and DatasetDict:
from datasets import load_dataset
dataset = load_dataset(...)
dataset["train"]
dataset["input_ids"]
Moreover regarding your points:
|
|
Instead of suggesting the use of |
|
Merging the error message improvement. Feel free to continue the discussion here or in a GitHub issue |
As mentioned in #2424, the error message when one tries to use Dataset.load_from_disk to load a DatasetDict object (or vice versa) can be improved. I added a suggestion in the error message to let users know that they should use the other one.
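
As an illustration of the idea, here is a minimal sketch of such a check. It is not the actual diff of this PR: the helper name `check_is_single_dataset` is hypothetical, and the marker file names (`state.json` for a saved Dataset, `dataset_dict.json` for a saved DatasetDict) are assumptions about the on-disk layout.

```python
import tempfile
from pathlib import Path

# Assumed marker files; treat these names as illustrative only.
DATASET_MARKER = "state.json"
DATASET_DICT_MARKER = "dataset_dict.json"


def check_is_single_dataset(path):
    """Raise a helpful error if `path` does not hold a single saved Dataset."""
    path = Path(path)
    if (path / DATASET_MARKER).is_file():
        return
    if (path / DATASET_DICT_MARKER).is_file():
        # The directory holds a DatasetDict: point the user to the other loader.
        raise FileNotFoundError(
            f"No Dataset found at '{path}'. The directory looks like a saved "
            "DatasetDict; please use DatasetDict.load_from_disk instead."
        )
    raise FileNotFoundError(f"No Dataset or DatasetDict found at '{path}'.")


# Usage: a directory that only contains the DatasetDict marker file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / DATASET_DICT_MARKER).write_text('{"splits": ["train"]}')
    try:
        check_is_single_dataset(tmp)
        message = ""
    except FileNotFoundError as err:
        message = str(err)
```

The symmetric check (a DatasetDict loader pointing to Dataset.load_from_disk) would follow the same pattern with the two markers swapped.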