Conversation

@slowwavesleep (Contributor)

Hi,

This adds the Russian SuperGLUE dataset. For the most part I reused the code for the original SuperGLUE, although there are some relatively minor differences in the structure that I accounted for.

@albertvillanova (Member) left a comment:

Hi @slowwavesleep, thanks a lot for adding this dataset! You did a great job.

The code is excellent. I'm just leaving some comments about the dataset card (we are trying to improve dataset cards in order to facilitate dataset discoverability and usage).

## Dataset Structure

### Data Instances

@albertvillanova (Member) commented on Jul 23, 2021:

Could you please add, under Data Instances, a dataset example for each of the tasks?

For example, for the LiDiRus task, add the example from https://russiansuperglue.com/tasks/task_info/LiDiRus#Example:

{
     'sentence1': "Кошка сидела на коврике.",
     'sentence2': "Кошка не сидела на коврике.",
     'label': 'not_entailment',
     'knowledge': '',
     'lexical-semantics': '',
     'logic': 'Negation',
     'predicate-argument-structure': ''
}
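
For reference, a minimal sketch of how an end user might load one of these configs once the dataset is merged; the script name russian_super_glue and the lowercase config name lidirus are assumptions based on the super_glue conventions:

from datasets import load_dataset

# Assumed names: the dataset script "russian_super_glue" with a lowercase
# "lidirus" config, mirroring how the super_glue script registers its tasks.
lidirus = load_dataset("russian_super_glue", "lidirus")

# LiDiRus is a diagnostic set, so it presumably ships only a "test" split.
print(lidirus["test"][0])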

@slowwavesleep (Contributor, Author):

I've added separate examples for train/dev and test, because the differences aren't always obvious. I also decided to sacrifice authenticity for the sake of readability and inserted line breaks into the exceedingly long text fragments in the examples, although I'm still on the fence about this.

On another note, the examples show the data after the transformations, so the demonstrated format isn't completely identical to what's actually downloaded (as is the case with the original SuperGLUE). In my opinion, this is the least confusing option, since that's the format the end user is (presumably) going to work with, after all.

Member:

Thanks! Could you also write explicitly, at the beginning of the Data Instances section, that the test sets are missing labels?

@albertvillanova (Member) left a comment:

Thank you! It is awesome!

@lhoestq (Member) left a comment:

Awesome, thank you! This is really nice :)

I just have a few comments on the README and the Python code, though both are already in really good shape.

citation="",
url="https://russiansuperglue.com/tasks/task_info/TERRa",
),
RussianSuperGlueConfig(
Member:

This configuration should have label_classes, no? I can see the label field in the examples in the README.

@slowwavesleep (Contributor, Author):

Do you mean the TERRa dataset? The label classes are there.

Member:

What I mean is that we could name the label classes of RUSSE, RWSD and DANETQA ("same sense"/"not same sense", "coreference"/"not coreference", "yes"/"no") via the label_classes parameter that you can pass to the builder config.

It's useful to add this info to know what the label represents programmatically, since the label names are used to define the ClassLabel feature type of the label column of your data.
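
For illustration, a minimal sketch of how label_classes feeds into the label column's feature type; the class names and their ordering here are assumptions, not the script's actual values:

from datasets import ClassLabel

# The label_classes passed to the builder config become the names of a
# ClassLabel feature, so integer labels can be decoded programmatically.
russe_label = ClassLabel(names=["not same sense", "same sense"])  # assumed order

print(russe_label.int2str(1))                 # "same sense"
print(russe_label.str2int("not same sense"))  # 0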

@slowwavesleep (Contributor, Author) commented on Jul 27, 2021:

Are the label names supposed to be used for inference? Conceptually, I mean. Submissions to the leaderboard are expected to be in a particular format (such as idx: 1, label: "true"), so I was thinking that it would be useful to be able to reuse the label names for that. Here, and in the original SuperGLUE, the datasets with binary answers are generally labeled with boolean values. However, I've now noticed that these values are used inconsistently in the sample submission: they appear as "true"/"false", "True"/"False", and 0/1.

Anyway, my question is: are label names meant to be just supplementary info?

Meanwhile, I'll have to check that the code written so far works as intended with the inconsistent label names.
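
As a side note, a hedged sketch of the kind of reuse being discussed here: mapping integer predictions back to the strings a leaderboard submission expects. The "false"/"true" names and the one-JSON-object-per-line shape are assumptions based on the sample submission mentioned above:

import json

from datasets import ClassLabel

# Hypothetical label names; the sample submission is inconsistent about
# whether these should be "true"/"false", "True"/"False", or 0/1.
label_feature = ClassLabel(names=["false", "true"])

predictions = [1, 0, 1]  # dummy model outputs

# Write {"idx": ..., "label": ...} records, one per line.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for idx, pred in enumerate(predictions):
        f.write(json.dumps({"idx": idx, "label": label_feature.int2str(pred)}) + "\n")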

@slowwavesleep (Contributor, Author):

So the choice is between using whatever label names are expected in the submission and making the names meaningful. I'm not so sure it's an obvious one in this case.

@lhoestq (Member) commented on Jul 27, 2021:

I didn't know about the expected names in the submission. We could keep the names expected in the submission, then. Though if they're not consistent, I'm not sure what would be best.

Member:

When in doubt, maybe we can just stay consistent with the super_glue dataset script, i.e. not have label_classes for those tasks. Does that sound good to you?

If yes, I guess we're done, since it doesn't have label_classes for those tasks already.

citation=_RUSSE_CITATION,
url="https://russiansuperglue.com/tasks/task_info/RUSSE",
),
RussianSuperGlueConfig(
Member:

same here

citation="",
url="https://russiansuperglue.com/tasks/task_info/RWSD",
),
RussianSuperGlueConfig(
Member:

same here

@slowwavesleep (Contributor, Author):

Added the missing label classes and their explanations (to the best of my understanding).

@lhoestq (Member) commented on Jul 27, 2021:

Thanks a lot! Once the last comment about the label names is addressed, we can merge :)

@lhoestq (Member) left a comment:

Let's keep the labels as they are, then, for consistency with super_glue.

Thanks a lot for adding this one!

@lhoestq merged commit e253feb into huggingface:master on Jul 29, 2021.