-
Notifications
You must be signed in to change notification settings - Fork 3k
Update text classification template labels in DatasetInfo __post_init__ #2392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update text classification template labels in DatasetInfo __post_init__ #2392
Conversation
|
If I'm not mistaken, one way to fix this would be to drop the task templates when copying the info by inserting dset = load_dataset("some_dataset") # let's say 'some_dataset' supports text classification and question answering
dset_tc = dset.prepare_for_task("text-classification")
dset_tc.preprare_for_task("question-answering") # this should raise an error because the schema is no longer valid for this task; currently this fails on 'rename_columns'I see 2 options:
|
thanks for the great idea @mariosasko and for spotting the problem with sequential task preparation! i am in favour of your option (1) since it is simple and saves us from having to keep track of the column mappings across multiple steps. i've implemented the change and refactored the tests to account for the new approach (including a new test that the templates are flushed after we call on the other hand, dropping in any case, i think it would be a good idea to merge #2376 soon as the current PR is touching a lot of the same places in the codebase 😄 |
|
cc @SBrandeis who might also be interested in this feature :) |
|
Tests are failing only because the |
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great ! I added 2 comments:
|
@lhoestq @SBrandeis i've fixed the tests and think this is now in a good state for another review :) |
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good ! Thanks for the fix :)
| task_template = TextClassification(text_column="text", label_column="labels", labels=labels) | ||
| info = DatasetInfo( | ||
| features=Features({"text": Value("string"), "labels": ClassLabel(names=labels)}), | ||
| task_templates=task_template, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a test to make sure that the DatasetInfo post init does pass the labels to the task templates, so that we can instantiate the task_template without labels as in the demo ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! i've now set up the unit test to cover the following cases:
DatasetInfodefined with aTextClassificationtemplate withoutlabelsDatasetInfodefined with aTextClassificationtemplate withlabels(this one might be overkill, but best to be safe)- Passing a
TextClassificationtemplate with labels directly toprepare_for_task(the no labels case is covered intest_task_with_incompatible_templates)
|
Maybe @SBrandeis you can also take a look to make sure you're fine with it ? |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few suggestions.
src/datasets/info.py
Outdated
| for idx, template in enumerate(self.task_templates): | ||
| if isinstance(template, TextClassification) and self.features is not None: | ||
| labels = self.features[template.label_column].names | ||
| self.task_templates[idx] = TextClassification( | ||
| text_column=template.text_column, label_column=template.label_column, labels=labels | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small nit. It's cleaner to have the self.features is not None check outside the loop:
| for idx, template in enumerate(self.task_templates): | |
| if isinstance(template, TextClassification) and self.features is not None: | |
| labels = self.features[template.label_column].names | |
| self.task_templates[idx] = TextClassification( | |
| text_column=template.text_column, label_column=template.label_column, labels=labels | |
| ) | |
| if self.features is not None: | |
| for idx, template in enumerate(self.task_templates): | |
| if isinstance(template, TextClassification): | |
| labels = self.features[template.label_column].names | |
| self.task_templates[idx] = TextClassification( | |
| text_column=template.text_column, label_column=template.label_column, labels=labels | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! done.
| assert len(self.labels) == len(set(self.labels)), "Labels must be unique" | ||
| # Cast labels to tuple to allow hashing | ||
| self.__dict__["labels"] = tuple(sorted(self.labels)) | ||
| self.__dict__["label_schema"] = copy.deepcopy(self.label_schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be safe to remove the FeaturesWithLazyClassLabel descriptor after this to reduce the code complexity (and then the deepcopy call can be replaced with self.label_schema.copy()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i agree, thanks for the suggestion! since we now create a new instance of TextClassification in DatasetInfo.__post_init__ I think we should be safe from the side-effects you spotted earlier
No longer needed since we create a new instance of the task template during the `DatasetInfo.__post_init__`
SBrandeis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @lewtun !
* Insert task templates * Fix style * Insert templates for datasets with off configs * Fix style * Update labels in DatasetInfo __post_init__ * Add emotion example * Flush task templates before casting * Add labels to TextClassification __post_init__ * Add comment about casting to tuple * Fix capitalisation * Refactor tests to account for label update in `DatasetInfo`, add test * Update label schema in post_init * Use __dict__ instead of __setattr__ to update task template labels * Raise ValueError if TextClassification template has None or incompatible labels * Remove task templates from emotion demo * Add decorator to share docstrings across multiple functions * Update docstring for prepare_for_task * Reorder TextClassification args for better intuition * fix missing "task" field in json + edit copy of objects instead of modifying in-place * style * Fix failing tests due to new DatasetInfo.__post_init__ * Refactor TextClassification test to cover templates w / w-out labels * Refactor use of label names in task template concatenation test * Add separate test for template with labels in DatasetInfo * Fix log message * Fix comments * Remove custom feature with lazy classlabel No longer needed since we create a new instance of the task template during the `DatasetInfo.__post_init__` * Move conditional check of features to outer if statement * Move feature is not None check to inner if-statement * Revert task template insertion to account for API changes in PR #2392 * Insert task template to allocine dataset * Revert "Insert task template to allocine dataset" This reverts commit c577149. * Simplify args for text classification template insertion * Add datasets with text classification templates * Fix style * Exclude caner dataset from injection Co-authored-by: Quentin Lhoest <[email protected]>
This PR implements the idea discussed in #2389 to update the
labelsof theTextClassificationtemplate in theDatasetInfo.__post_init__. The main reason for doing so is so avoid duplicating the label definitions in bothDatasetInfo.featuresandDatasetInfo.task_templates.To avoid storing state in
DatasetInfo.__post_init__, the current implementation flushesDatasetInfo.task_templatesbefore the features are cast inDataset.prepare_for_task(thanks to @mariosasko for this idea!).Here is an example of the current workflow:
Note that if users want to pass a
TextClassificationtemplate toprepare_for_task, we require them to setTextClassification.labelsto match the dataset's features corresponding tolabel_column:This PR also adds:
assertRaisesproperlyDatasetDict.prepare_for_taskandDataset.prepare_for_taskin one place.DatasetInfo.task_templatesinDatasetInfo.__post_init__. Thanks to @lhoestq for figuring this out!FeaturesWithLazyClassLabelsince we now create a new instance ofTextClassificationinDatasetInfo.__post_init__and avoid the side-effects first pointed out by @mariosaskoPR Description from original WIP
Hi @yjernite and @lhoestq, here's a first stab at the suggestion discussed in #2389 to update the
labelsof theTextClassificationtemplate in theDatasetInfo.__post_init__.One problem I've spotted is that my current implementation introduces state into the
__post_init__:load_dataset,DatasetInfo.featuresare the "raw" features without any casting so we can access the column names by thelabel_columnspecified inTextClassificationDataset.prepare_for_taskwe run into a problem because theDatasetInfo.featuresare first cast into the new schema which triggers aKeyErrorwhen we update the infos here.Here's an explicit example of what I mean with the stack trace appended below:
What do you think? I did this a bit quickly, so maybe I'm overlooking something obvious :) One thing would be to only update the labels of the task template on load, but this seems a bit hacky IMO