Conversation

@theo-m (Contributor) commented on Mar 30, 2021

wip
Not satisfied with the API yet: it means that, as a dataset implementer, I need to write a boilerplate function and a class for each <dataset><task> "facet".

@theo-m requested a review from @lhoestq on Mar 30, 2021 11:02
@theo-m self-assigned this on Mar 30, 2021
@lhoestq (Member) left a comment

That's a good start thanks :)

I left a few comments
(most of them we've already discussed, but I added them anyway in case people want to share their thoughts)

return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))


class RottenTomatoesMovieReviewClassificationSingleLabelDataset(datasets.tasks.ClassificationSingleLabelDataset):

It might be overkill to have one class here just for the label2id.
I think you can use ClassificationSingleLabelDataset directly and not make it an ABC: just pass the label2id as a parameter when initializing the ClassificationSingleLabelDataset.
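For illustration, a minimal sketch of what that could look like, assuming a plain (non-abstract) task class whose constructor takes the label2id; the exact signature here is an assumption, not the PR's actual API:

class ClassificationSingleLabelDataset:
    """Sketch: a concrete task wrapper, so no per-dataset subclass is needed."""

    def __init__(self, dataset, label2id):
        # label2id is passed at init time instead of being hard-coded in a subclass
        self.dataset = dataset
        self.label2id = label2id
        self.id2label = {i: label for label, i in label2id.items()}

    @classmethod
    def from_dataset(cls, dataset, label2id):
        return cls(dataset, label2id)

The rotten_tomatoes script could then call ClassificationSingleLabelDataset.from_dataset(dataset.map(mapper), label2id={"neg": 0, "pos": 1}) directly, without defining its own subclass.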

Comment on lines +133 to +139
def cast_as_classification_single_label(
    self, dataset: Union[Dataset, DatasetDict]
) -> datasets.tasks.ClassificationSingleLabelDataset:
    def mapper(example):
        return dict(text=example["text"], logits=[(1, 0), (0, 1)][example["label"]])

    return RottenTomatoesMovieReviewClassificationSingleLabelDataset.from_dataset(dataset.map(mapper))

This will be useful for allowing more fine-grained transformations when casting!

In many cases, though, it will just be boilerplate code. Maybe there should also be an alternative/simpler way?
For example, maybe we could just provide supervised_keys=ClassificationKeys("text", "label")?
What do you think?

In that case we could even infer label2id by taking the ClassLabel names of the "label" field.
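A rough sketch of that simpler path; ClassificationKeys and cast_from_supervised_keys are hypothetical names used only to illustrate the idea, while Dataset.features, ClassLabel.names and rename_column are existing datasets APIs:

from dataclasses import dataclass


@dataclass
class ClassificationKeys:
    text_column: str
    label_column: str


def cast_from_supervised_keys(dataset, keys):
    # infer label2id from the ClassLabel feature of the label column
    class_label = dataset.features[keys.label_column]
    label2id = {name: i for i, name in enumerate(class_label.names)}
    # normalize the text column name expected by the task
    if keys.text_column != "text":
        dataset = dataset.rename_column(keys.text_column, "text")
    return dataset, label2id


# a dataset script would then only need to declare:
# supervised_keys = ClassificationKeys("text", "label")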

Comment on lines +92 to +94
# pydantic allows us to express simply validation patterns for dataset metadata and serialize object schemas
# for tasks descriptions.
"pydantic",

Serialization is also possible with dataclasses.
Validation is a good reason to have an extra dependency, but we usually avoid adding extra dependencies to prevent issues down the road (breaking changes, bugs, etc.).
Maybe we can have a simple function for validation that doesn't require an extra dependency.
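For example, a minimal sketch of dependency-free validation with a plain dataclass; the TaskTemplate name and its fields are made up for illustration:

from dataclasses import dataclass, fields, asdict


@dataclass
class TaskTemplate:
    task: str
    text_column: str
    label_column: str

    def validate(self):
        # hand-rolled check instead of pydantic's automatic validation
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, str):
                raise TypeError(f"{f.name} must be a string, got {type(value).__name__}")

    def to_dict(self):
        # serialization works out of the box with dataclasses
        return asdict(self)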

T = TypeVar("T")


class DatasetDict(dict, MutableMapping[str, T]):

This is actually a MutableMapping[str, Dataset], so it should be parametrized with Dataset directly rather than a TypeVar.
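i.e. roughly (sketch, using the public import path rather than the library-internal relative import):

from typing import MutableMapping

from datasets import Dataset


class DatasetDict(dict, MutableMapping[str, Dataset]):
    """A mapping from split name to Dataset."""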

Comment on lines +235 to +240
def test_task_implementations(self, dataset_name):
    if dataset_name != "rotten_tomatoes":
        return

    # this is failing at the moment since the new code for "rotten_tomatoes" is not uploaded.
    load_dataset(dataset_name, as_task=tasks.ClassificationSingleLabelDataset.id)

It would be good to have a more general test than checking for specific dataset names.
Maybe by checking the available tasks on the builder instance?
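Something along these lines, for instance; get_builder and supported_tasks are hypothetical placeholders for however the test harness exposes the builder and its task facets, and as_task is the argument introduced in this PR:

def test_task_implementations(self, dataset_name):
    # hypothetical helper: however the builder instance is obtained in this test suite
    builder = self.get_builder(dataset_name)
    # hypothetical attribute listing the task facets the builder implements
    supported_tasks = getattr(builder, "supported_tasks", [])
    for task in supported_tasks:
        dset = load_dataset(dataset_name, as_task=task.id)
        self.assertIsInstance(dset, task)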

@SBrandeis self-assigned this on Apr 23, 2021
@SBrandeis closed this on Jun 11, 2021
@SBrandeis deleted the theo/classif-mixin branch on June 11, 2021 13:20