Conversation

@slowwavesleep (Contributor)

Hi,

This adds the Russian SuperGLUE dataset. For the most part I reused the code for the original SuperGLUE, although there are some relatively minor differences in the structure that I accounted for.

@albertvillanova (Member) left a comment:

Hi @slowwavesleep, thanks a lot for adding this dataset! You did a great job.

The code is excellent. I'm just leaving some comments about the dataset card (we are trying to improve dataset cards in order to facilitate dataset discoverability and usage).

## Dataset Structure

### Data Instances

@albertvillanova (Member) commented on Jul 23, 2021:

Could you please add, under Data Instances, a dataset example for each of the tasks?

For example, for the LiDiRus task, add the example from https://russiansuperglue.com/tasks/task_info/LiDiRus#Example:

{
     'sentence1': "Кошка сидела на коврике.",
     'sentence2': "Кошка не сидела на коврике.",
     'label': 'not_entailment',
     'knowledge': '',
     'lexical-semantics': '',
     'logic': 'Negation',
     'predicate-argument-structure': ''
}
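
For reference, a minimal sketch of how an end user might load one of these configs once the dataset is merged; the script name russian_super_glue and the lowercase config name lidirus are assumptions based on the super_glue conventions:

from datasets import load_dataset

# Assumed names: the dataset script "russian_super_glue" with a lowercase
# "lidirus" config, mirroring how the super_glue script registers its tasks.
lidirus = load_dataset("russian_super_glue", "lidirus")

# LiDiRus is a diagnostic set, so it presumably ships only a "test" split.
print(lidirus["test"][0])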

@slowwavesleep (Contributor, Author):

I've added separate examples for train/dev and test, because the differences aren't always obvious. I also decided to sacrifice authenticity for the sake of readability and inserted line breaks into the exceedingly long text fragments in the examples, although I'm still on the fence about this.

On another note, the examples show the data after the transformations, so the demonstrated format isn't completely identical to what's actually downloaded (as is the case with the original SuperGLUE). In my opinion, this is the least confusing option, since that's the format the end user is (presumably) going to work with, after all.

Member:

Thanks! Could you also write explicitly, at the beginning of the Data Instances section, that the test sets are missing labels?

@albertvillanova (Member) left a comment:

Thank you! It is awesome!

@lhoestq (Member) left a comment:

Awesome, thank you! This is really nice :)

I just have a few comments on the README and the Python code, though both are already in really good shape.

citation="",
url="https://russiansuperglue.com/tasks/task_info/TERRa",
),
RussianSuperGlueConfig(
Member:

This configuration should have label_classes, no? I can see the label field in the examples in the README.

@slowwavesleep (Contributor, Author):

Do you mean the TERRa dataset? The label classes are there.

Member:

What I mean is that we could name the label classes of RUSSE, RWSD and DANETQA ("same sense"/"not same sense", "coreference"/"not coreference", "yes"/"no") via the label_classes parameter that you can pass to the builder config.

It's useful to add this info to know what the label represents programmatically, since the label names are used to define the ClassLabel feature type of the label column of your data.
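
For illustration, a minimal sketch of how label_classes feeds into the label column's feature type; the class names and their ordering here are assumptions, not the script's actual values:

from datasets import ClassLabel

# The label_classes passed to the builder config become the names of a
# ClassLabel feature, so integer labels can be decoded programmatically.
russe_label = ClassLabel(names=["not same sense", "same sense"])  # assumed order

print(russe_label.int2str(1))                 # "same sense"
print(russe_label.str2int("not same sense"))  # 0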

@slowwavesleep (Contributor, Author) commented on Jul 27, 2021:

Are the label names supposed to be used for inference? Conceptually, I mean. Submissions to the leaderboard are expected to be in a particular format (such as idx: 1, label: "true"), so I was thinking that it would be useful to be able to reuse the label names for that. Here, and in the original SuperGLUE, the datasets with binary answers are generally labeled with boolean values. However, I've now noticed that these values are used inconsistently in the sample submission: they appear as "true"/"false", "True"/"False", and 0/1.

Anyway, my question is: are label names meant to be just supplementary info?

Meanwhile, I'll have to check that the code written so far works as intended with the inconsistent label names.
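
As a side note, a hedged sketch of the kind of reuse being discussed here: mapping integer predictions back to the strings a leaderboard submission expects. The "false"/"true" names and the one-JSON-object-per-line shape are assumptions based on the sample submission mentioned above:

import json

from datasets import ClassLabel

# Hypothetical label names; the sample submission is inconsistent about
# whether these should be "true"/"false", "True"/"False", or 0/1.
label_feature = ClassLabel(names=["false", "true"])

predictions = [1, 0, 1]  # dummy model outputs

# Write {"idx": ..., "label": ...} records, one per line.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for idx, pred in enumerate(predictions):
        f.write(json.dumps({"idx": idx, "label": label_feature.int2str(pred)}) + "\n")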

@slowwavesleep (Contributor, Author):

So the choice is between using whatever label names are expected in the submission and making the names meaningful. I'm not so sure it's an obvious one in this case.

@lhoestq (Member) commented on Jul 27, 2021:

I didn't know about the expected names in the submission. We could keep the names expected in the submission, then. Though if they're not consistent, I'm not sure what would be best.

Member:

When in doubt, maybe we can just stay consistent with the super_glue dataset script, i.e. not have label_classes for those tasks. Does that sound good to you?

If yes, I guess we're done, since it doesn't have label_classes for those tasks already.

citation=_RUSSE_CITATION,
url="https://russiansuperglue.com/tasks/task_info/RUSSE",
),
RussianSuperGlueConfig(
Member:

same here

citation="",
url="https://russiansuperglue.com/tasks/task_info/RWSD",
),
RussianSuperGlueConfig(
Member:

same here

@slowwavesleep (Contributor, Author):

Added the missing label classes and their explanations (to the best of my understanding).

@lhoestq (Member) commented on Jul 27, 2021:

Thanks a lot! Once the last comment about the label names is addressed, we can merge :)

@lhoestq (Member) left a comment:

Let's keep the labels as they are, then, for consistency with super_glue.

Thanks a lot for adding this one!

@lhoestq merged commit e253feb into huggingface:master on Jul 29, 2021.