task casting via load_dataset #2143
Conversation
lhoestq left a comment
That's a good start, thanks :)
I left a few comments (most of them we've already discussed, but I added them anyway in case people want to share their thoughts).
```python
return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))


class RottenTomatoesMovieReviewClassificationSingleLabelDataset(datasets.tasks.ClassificationSingleLabelDataset):
```
Having a dedicated class here just for the label2id might be overkill.
I think you can use ClassificationSingleLabelDataset directly instead of making it an ABC: the label2id can be passed as a parameter when initializing the ClassificationSingleLabelDataset.
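A minimal sketch of that suggestion, assuming the PR's proposed ClassificationSingleLabelDataset.from_dataset API (the label2id keyword shown here is the suggested addition, not existing code):

```python
# Hypothetical: pass label2id directly instead of subclassing
# ClassificationSingleLabelDataset once per dataset.
label2id = {"neg": 0, "pos": 1}
casted = datasets.tasks.ClassificationSingleLabelDataset.from_dataset(
    dataset.map(mapper), label2id=label2id
)
```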
```python
def cast_as_classification_single_label(
    self, dataset: Union[Dataset, DatasetDict]
) -> datasets.tasks.ClassificationSingleLabelDataset:
    def mapper(example):
        return dict(text=example["text"], logits=[(1, 0), (0, 1)][example["label"]])

    return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))
```
This will be useful to allow more fine-grained transformations for casting!
In many cases, though, it will just be boilerplate code. Maybe there should also be an alternative, simpler way?
For example, maybe we could just provide supervised_keys=ClassificationKeys("text", "label")?
What do you think?
In that case we could even infer label2id by taking the ClassLabel names of the "label" field.
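For the inference part, a hedged sketch: if the "label" column is a ClassLabel feature, label2id can be derived from its names (supervised_keys and ClassificationKeys above are proposals from this thread, not existing API):

```python
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
class_label = dataset.features["label"]  # a datasets.ClassLabel feature
label2id = {name: idx for idx, name in enumerate(class_label.names)}
print(label2id)  # {'neg': 0, 'pos': 1}
```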
```python
# pydantic lets us express validation patterns for dataset metadata simply
# and serialize object schemas for task descriptions.
"pydantic",
```
Serialization is also possible with dataclasses.
Validation is a good reason to have an extra dependency, but we usually avoid adding dependencies to prevent future issues (breaking changes, bugs, etc.).
Maybe we can have a simple function for validation, without requiring an extra dependency.
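A minimal sketch of that alternative using only the standard library (the TaskTemplate name and its fields are illustrative, not from the PR):

```python
from dataclasses import dataclass, asdict


@dataclass
class TaskTemplate:
    task: str
    text_column: str
    label_column: str

    def validate(self):
        # a simple validation function standing in for pydantic's validators
        for name, value in asdict(self).items():
            if not isinstance(value, str):
                raise TypeError(f"{name} must be a string, got {type(value).__name__}")


template = TaskTemplate(task="text-classification", text_column="text", label_column="label")
template.validate()
print(asdict(template))  # a plain dict, ready to serialize as JSON/YAML
```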
```python
T = TypeVar("T")


class DatasetDict(dict, MutableMapping[str, T]):
```
This is actually a MutableMapping[str, Dataset]
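For reference, a sketch of the suggested signature (the quoted class with the TypeVar replaced; not the final implementation):

```python
from typing import MutableMapping

from datasets import Dataset


# Values are always Dataset objects (keyed by split name), so no TypeVar is needed.
class DatasetDict(dict, MutableMapping[str, Dataset]):
    ...
```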
```python
def test_task_implementations(self, dataset_name):
    if dataset_name != "rotten_tomatoes":
        return

    # this is failing at the moment since the new code for "rotten_tomatoes" is not uploaded.
    load_dataset(dataset_name, as_task=tasks.ClassificationSingleLabelDataset.id)
```
It would be good to have a more general test than checking dataset names.
Maybe by checking the available tasks on the builder instance?
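A hedged sketch of what that could look like; load_dataset_builder is a real datasets helper, but the supported_tasks attribute and the as_task parameter are assumptions tied to this PR's proposal:

```python
import datasets


def test_task_implementations(self, dataset_name):
    builder = datasets.load_dataset_builder(dataset_name)
    # iterate over whatever tasks the builder declares instead of hard-coding names
    for task in getattr(builder, "supported_tasks", []):  # hypothetical attribute
        datasets.load_dataset(dataset_name, as_task=task.id)
```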
WIP
Not satisfied with the API yet: it means that, as a dataset implementer, I need to write a boilerplate function and a class for each <dataset>/<task> "facet".