Commit cbbda53

xed_en_fi dataset Cleanup (#1668)
* xed_en_fi cleanup
* style
1 parent 628911e commit cbbda53

3 files changed: +60 -68 lines changed

datasets/xed_en_fi/README.md

Lines changed: 25 additions & 21 deletions
@@ -50,19 +50,18 @@ task_ids:
5050

5151
## Dataset Description
5252

53-
- **Homepage:** [Needs More Information]
54-
- **Repository:** https://github.com/Helsinki-NLP/XED
55-
- **Paper:** https://arxiv.org/pdf/2011.01612.pdf
56-
- **Leaderboard:** [Needs More Information]
57-
- **Point of Contact:** [Needs More Information]
53+
- **Homepage:**
54+
- **Repository:** [Github](https://github.com/Helsinki-NLP/XED)
55+
- **Paper:** [Arxiv](https://arxiv.org/abs/2011.01612)
56+
- **Leaderboard:**
57+
- **Point of Contact:**
5858

5959
### Dataset Summary
6060

6161
This is the XED dataset. The dataset consists of emotion annotated movie subtitles from OPUS. We use Plutchik's 8 core emotions to annotate. The data is multilabel. The original annotations have been sourced for mainly English and Finnish.
6262
For the English data we used Stanford NER (named entity recognition) (Finkel et al., 2005) to replace names and locations with the tags: [PERSON] and [LOCATION] respectively.
6363
For the Finnish data, we replaced names and locations using the Turku NER corpus (Luoma et al., 2020).
6464

65-
6665
### Supported Tasks and Leaderboards
6766

6867
Sentiment Classification, Multilabel Classification, Multilabel Classification, Intent Classification
@@ -74,74 +73,79 @@ English, Finnish
7473
## Dataset Structure
7574

7675
### Data Instances
76+
7777
```
7878
{ "sentence": "A confession that you hired [PERSON] ... and are responsible for my father's murder."
79-
"labels": [1, 6]
79+
"labels": [1, 6] # anger, sadness
8080
}
8181
```
8282

8383
### Data Fields
8484

8585
- sentence: a line from the dataset
86-
- labels: labels corresponding to the emotion
86+
- labels: labels corresponding to the emotion as an integer
8787

88-
Where the number indicates the emotion in ascending alphabetical order: anger:1, anticipation:2, disgust:3, fear:4, joy:5, sadness:6, surprise:7, trust:8, with neutral:0 where applicable.
88+
Where the number indicates the emotion in ascending alphabetical order: anger:1, anticipation:2, disgust:3, fear:4, joy:5, sadness:6, surprise:7, trust:8, with neutral:0 where applicable.
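
The integer-to-emotion mapping described above can be decoded with a small lookup table. The following is a minimal sketch and not part of the dataset's loading script: the names `EMOTIONS` and `decode_labels` are hypothetical, and it assumes each instance carries its labels as a list of integers, as in the example under Data Instances.

```python
# Label ids as listed in the README: emotions in ascending alphabetical
# order starting at 1, with 0 reserved for neutral.
EMOTIONS = {
    0: "neutral",
    1: "anger",
    2: "anticipation",
    3: "disgust",
    4: "fear",
    5: "joy",
    6: "sadness",
    7: "surprise",
    8: "trust",
}


def decode_labels(labels):
    """Map a list of integer label ids to emotion names (hypothetical helper)."""
    return [EMOTIONS[label] for label in labels]


# The example instance is annotated with [1, 6].
print(decode_labels([1, 6]))  # ['anger', 'sadness']
```

Because the data is multilabel, the helper returns a list rather than a single name; an unknown id raises `KeyError` instead of being silently dropped.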
8989

9090
### Data Splits
9191

9292
For English:
93-
Number of unique data points: 17530 + 6420 (neutral)
94-
Number of emotions: 8 (+pos, neg, neu)
93+
Number of unique data points: 17528 ('en_annotated' config) + 9675 ('en_neutral' config)
94+
Number of emotions: 8 (+neutral)
95+
96+
For Finnish:
97+
Number of unique data points: 14449 ('fi_annotated' config) + 10794 ('fi_neutral' config)
98+
Number of emotions: 8 (+neutral)
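
As a quick sanity check, the per-config counts quoted above can be summed per language. This is only a sketch: `CONFIG_SIZES` and `total_for_language` are hypothetical names, the numbers are copied from the lines above, and the sum assumes the annotated and neutral configs do not share data points.

```python
# Counts copied from the Data Splits section; config names as given there.
CONFIG_SIZES = {
    "en_annotated": 17528,
    "en_neutral": 9675,
    "fi_annotated": 14449,
    "fi_neutral": 10794,
}


def total_for_language(lang):
    """Sum the sizes of all configs whose name starts with the language code."""
    prefix = lang + "_"
    return sum(size for name, size in CONFIG_SIZES.items() if name.startswith(prefix))


for lang in ("en", "fi"):
    print(lang, total_for_language(lang))
# en 27203
# fi 25243
```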
9599

96100
## Dataset Creation
97101

98102
### Curation Rationale
99103

100-
[Needs More Information]
104+
[More Information Needed]
101105

102106
### Source Data
103107

104108
#### Initial Data Collection and Normalization
105109

106-
[Needs More Information]
110+
[More Information Needed]
107111

108112
#### Who are the source language producers?
109113

110-
[Needs More Information]
114+
[More Information Needed]
111115

112116
### Annotations
113117

114118
#### Annotation process
115119

116-
[Needs More Information]
120+
[More Information Needed]
117121

118122
#### Who are the annotators?
119123

120-
[Needs More Information]
124+
[More Information Needed]
121125

122126
### Personal and Sensitive Information
123127

124-
[Needs More Information]
128+
[More Information Needed]
125129

126130
## Considerations for Using the Data
127131

128132
### Social Impact of Dataset
129133

130-
[Needs More Information]
134+
[More Information Needed]
131135

132136
### Discussion of Biases
133137

134-
[Needs More Information]
138+
[More Information Needed]
135139

136140
### Other Known Limitations
137141

138-
[Needs More Information]
142+
[More Information Needed]
139143

140144
## Additional Information
141145

142146
### Dataset Curators
143147

144-
[Needs More Information]
148+
[More Information Needed]
145149

146150
### Licensing Information
147151
