task casting via load_dataset #2143
Conversation
lhoestq left a comment
That's a good start, thanks :)
I left a few comments (most of them we've already discussed, but I added them anyway in case people want to share their thoughts).
```python
return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))


class RottenTomatoesMovieReviewClassificationSingleLabelDataset(datasets.tasks.ClassificationSingleLabelDataset):
```
Having a dedicated class here just for the label2id might be overkill.
I think you can use ClassificationSingleLabelDataset directly instead of making it an ABC: the label2id can be passed as a parameter when initializing the ClassificationSingleLabelDataset.
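A minimal sketch of that suggestion, assuming the PR's proposed ClassificationSingleLabelDataset.from_dataset API (the label2id keyword shown here is the suggested addition, not existing code):

```python
# Hypothetical: pass label2id directly instead of subclassing
# ClassificationSingleLabelDataset once per dataset.
label2id = {"neg": 0, "pos": 1}
casted = datasets.tasks.ClassificationSingleLabelDataset.from_dataset(
    dataset.map(mapper), label2id=label2id
)
```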
```python
def cast_as_classification_single_label(
    self, dataset: Union[Dataset, DatasetDict]
) -> datasets.tasks.ClassificationSingleLabelDataset:
    def mapper(example):
        return dict(text=example["text"], logits=[(1, 0), (0, 1)][example["label"]])

    return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))
```
This will be useful to allow more fine-grained transformations for casting!
In many cases, though, it will just be boilerplate code. Maybe there should also be an alternative, simpler way?
For example, maybe we could just provide supervised_keys=ClassificationKeys("text", "label")?
What do you think?
In that case we could even infer label2id by taking the ClassLabel names of the "label" field.
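For the inference part, a hedged sketch: if the "label" column is a ClassLabel feature, label2id can be derived from its names (supervised_keys and ClassificationKeys above are proposals from this thread, not existing API):

```python
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
class_label = dataset.features["label"]  # a datasets.ClassLabel feature
label2id = {name: idx for idx, name in enumerate(class_label.names)}
print(label2id)  # {'neg': 0, 'pos': 1}
```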
```python
# pydantic lets us express validation patterns for dataset metadata simply
# and serialize object schemas for task descriptions.
"pydantic",
```
Serialization is also possible with dataclasses.
Validation is a good reason to have an extra dependency, but we usually avoid adding dependencies to prevent future issues (breaking changes, bugs, etc.).
Maybe we can have a simple function for validation, without requiring an extra dependency.
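A minimal sketch of that alternative using only the standard library (the TaskTemplate name and its fields are illustrative, not from the PR):

```python
from dataclasses import dataclass, asdict


@dataclass
class TaskTemplate:
    task: str
    text_column: str
    label_column: str

    def validate(self):
        # a simple validation function standing in for pydantic's validators
        for name, value in asdict(self).items():
            if not isinstance(value, str):
                raise TypeError(f"{name} must be a string, got {type(value).__name__}")


template = TaskTemplate(task="text-classification", text_column="text", label_column="label")
template.validate()
print(asdict(template))  # a plain dict, ready to serialize as JSON/YAML
```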
```python
T = TypeVar("T")


class DatasetDict(dict, MutableMapping[str, T]):
```
This is actually a MutableMapping[str, Dataset]
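For reference, a sketch of the suggested signature (the quoted class with the TypeVar replaced; not the final implementation):

```python
from typing import MutableMapping

from datasets import Dataset


# Values are always Dataset objects (keyed by split name), so no TypeVar is needed.
class DatasetDict(dict, MutableMapping[str, Dataset]):
    ...
```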
```python
def test_task_implementations(self, dataset_name):
    if dataset_name != "rotten_tomatoes":
        return

    # this is failing at the moment since the new code for "rotten_tomatoes" is not uploaded.
    load_dataset(dataset_name, as_task=tasks.ClassificationSingleLabelDataset.id)
```
It would be good to have a more general test than checking dataset names.
Maybe by checking the available tasks on the builder instance?
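A hedged sketch of what that could look like; load_dataset_builder is a real datasets helper, but the supported_tasks attribute and the as_task parameter are assumptions tied to this PR's proposal:

```python
import datasets


def test_task_implementations(self, dataset_name):
    builder = datasets.load_dataset_builder(dataset_name)
    # iterate over whatever tasks the builder declares instead of hard-coding names
    for task in getattr(builder, "supported_tasks", []):  # hypothetical attribute
        datasets.load_dataset(dataset_name, as_task=task.id)
```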
WIP
Not satisfied with the API yet: it means that, as a dataset implementer, I need to write a boilerplate function and a class for each <dataset>/<task> "facet".