
Commit c447fc8

Revert "Refactor kwargs and configs" (#299)
Revert "Refactor kwargs and configs (#188)". This reverts commit e4a2724.
1 parent e4a2724 commit c447fc8

74 files changed: 450 additions and 1,358 deletions

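Every file diff below follows the same pattern: the `Config` dataclasses introduced by the kwargs-and-configs refactor are deleted, `_info(self, config)` goes back to `_info(self)` without a `config=` field, and per-call options return to being keyword arguments of `compute()`. As orientation, here is a minimal sketch of the post-revert module shape; the class name, option, and returned key are hypothetical placeholders, not code from this repository.

```python
import datasets
import evaluate


class SketchMeasurement(evaluate.Measurement):
    """Hypothetical module illustrating the kwargs-based shape restored by this revert."""

    def _info(self):
        # After the revert: no `config` parameter and no `config=` field.
        return evaluate.MeasurementInfo(
            module_type="measurement",
            description="Toy placeholder measurement.",
            citation="",
            inputs_description="",
            features=datasets.Features({"data": datasets.Value("string")}),
        )

    def _compute(self, data, lowercase=False):
        # Options arrive as plain keyword arguments of compute(), not via self.config.
        if lowercase:
            data = [d.lower() for d in data]
        return {"unique_ratio": len(set(data)) / len(data)}
```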

comparisons/exact_match/exact_match.py

Lines changed: 1 addition & 2 deletions

@@ -46,13 +46,12 @@

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class ExactMatch(evaluate.Comparison):
-    def _info(self, config):
+    def _info(self):
         return evaluate.ComparisonInfo(
             module_type="comparison",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=datasets.Features(
                 {
                     "predictions1": datasets.Value("int64"),

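For the comparison modules (`exact_match` above, `mcnemar` and `wilcoxon` below), only the `_info` signature changes; the call pattern stays keyword-based. A hedged usage sketch, assuming the two-prediction-list interface suggested by the `predictions1` feature in the diff:

```python
import evaluate

# Assumed interface: two lists of integer predictions compared element-wise.
exact_match = evaluate.load("exact_match", module_type="comparison")
results = exact_match.compute(predictions1=[1, 1, 1], predictions2=[1, 1, 0])
print(results)  # expected to report the share of positions where the two runs agree
```
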
comparisons/mcnemar/mcnemar.py

Lines changed: 1 addition & 2 deletions

@@ -62,13 +62,12 @@

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class McNemar(evaluate.Comparison):
-    def _info(self, config):
+    def _info(self):
         return evaluate.ComparisonInfo(
             module_type="comparison",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=datasets.Features(
                 {
                     "predictions1": datasets.Value("int64"),

comparisons/wilcoxon/wilcoxon.py

Lines changed: 1 addition & 2 deletions

@@ -55,13 +55,12 @@

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class Wilcoxon(evaluate.Comparison):
-    def _info(self, config):
+    def _info(self):
         return evaluate.ComparisonInfo(
             module_type="comparison",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=datasets.Features(
                 {
                     "predictions1": datasets.Value("float"),

docs/source/a_quick_tour.mdx

Lines changed: 0 additions & 18 deletions

@@ -65,7 +65,6 @@ All evalution modules come with a range of useful attributes that help to use a
 |---|---|
 |`description`|A short description of the evaluation module.|
 |`citation`|A BibTex string for citation when available.|
-|`config` | A `dataclass` containing the settings of the module. |
 |`features`|A `Features` object defining the input format.|
 |`inputs_description`|This is equivalent to the modules docstring.|
 |`homepage`|The homepage of the module.|

@@ -179,23 +178,6 @@ A common way to overcome this issue is to fallback on single process evaluation.

 This solution allows 🤗 Evaluate to perform distributed predictions, which is important for evaluation speed in distributed settings. At the same time, you can also use complex non-additive metrics without wasting valuable GPU or CPU memory.

-## Configuration
-
-Some metrics can be configured with additional settings. For example, `accuracy` has an extra `normalize` setting which returns the fraction of correctly classified samples and is set to `True` by default. To change it you have two options: pass it as a keyword argument with `load()` or during `compute()`. With `load()`, the setting is changed permanently for the module, while passing it to `compute()` only changes it for the duration of the `compute()` call.
-
-```python
-
->>> metric = evaluate.load("accuracy", normalize=False)
->>> refs, preds = [1, 1], [1, 0]
->>> acc_1 = metric.compute(references=refs, predictions=preds)["accuracy"]
->>> acc_2 = metric.compute(references=refs, predictions=preds, normalize=True)["accuracy"]
->>> acc_3 = metric.compute(references=refs, predictions=preds)["accuracy"]
->>> print((acc_1, acc_2, acc_3))
-(1.0, 0.5, 1.0)
-```
-
-This is also useful for the following `combine()` method since it allows to load modules with specific settings before combining them.
-
 ## Combining several evaluations

 Often one wants to not only evaluate a single metric but a range of different metrics capturing different aspects of a model. E.g. for classification it is usually a good idea to compute F1-score, recall, and precision in addition to accuracy to get a better picture of model performance. Naturally, you can load a bunch of metrics and call them sequentially. However, a more convenient way is to use the [`~evaluate.combine`] function to bundle them together:
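
With the Configuration section removed from the quick tour, the documented way to change a setting such as `accuracy`'s `normalize` is again to pass it to `compute()` on each call. A short sketch of that post-revert usage; the expected values mirror the example the deleted section reported:

```python
>>> import evaluate
>>> metric = evaluate.load("accuracy")
>>> refs, preds = [1, 1], [1, 0]
>>> metric.compute(references=refs, predictions=preds)["accuracy"]
0.5
>>> metric.compute(references=refs, predictions=preds, normalize=False)["accuracy"]
1.0
```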

measurements/label_distribution/label_distribution.py

Lines changed: 1 addition & 2 deletions

@@ -70,13 +70,12 @@

 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class LabelDistribution(evaluate.Measurement):
-    def _info(self, config):
+    def _info(self):
         return evaluate.MeasurementInfo(
             module_type="measurement",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=[
                 datasets.Features({"data": datasets.Value("int32")}),
                 datasets.Features({"data": datasets.Value("string")}),

measurements/perplexity/perplexity.py

Lines changed: 10 additions & 29 deletions

@@ -13,9 +13,6 @@
 # limitations under the License.
 """Perplexity Metric."""

-from dataclasses import dataclass
-from typing import Optional
-
 import datasets
 import numpy as np
 import torch

@@ -87,29 +84,14 @@
 """


-@dataclass
-class PerplexityConfig(evaluate.info.Config):
-
-    name: str = "default"
-
-    batch_size: int = 16
-    model_id: str = "gpt2"
-    add_start_token: bool = True
-    device: Optional[str] = None
-
-
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class Perplexity(evaluate.Measurement):
-    CONFIG_CLASS = PerplexityConfig
-    ALLOWED_CONFIG_NAMES = ["default"]
-
-    def _info(self, config):
+    def _info(self):
         return evaluate.MeasurementInfo(
             module_type="measurement",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=datasets.Features(
                 {
                     "data": datasets.Value("string"),

@@ -118,25 +100,24 @@ def _info(self, config):
             reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
         )

-    def _compute(self, data):
+    def _compute(self, data, model_id, batch_size: int = 16, add_start_token: bool = True, device=None):

-        device = self.config.device
         if device is not None:
             assert device in ["gpu", "cpu", "cuda"], "device should be either gpu or cpu."
             if device == "gpu":
                 device = "cuda"
         else:
             device = "cuda" if torch.cuda.is_available() else "cpu"

-        model = AutoModelForCausalLM.from_pretrained(self.config.model_id)
+        model = AutoModelForCausalLM.from_pretrained(model_id)
         model = model.to(device)

-        tokenizer = AutoTokenizer.from_pretrained(self.config.model_id)
+        tokenizer = AutoTokenizer.from_pretrained(model_id)

         # if batch_size > 1 (which generally leads to padding being required), and
         # if there is not an already assigned pad_token, assign an existing
         # special token to also be the padding token
-        if tokenizer.pad_token is None and self.config.batch_size > 1:
+        if tokenizer.pad_token is None and batch_size > 1:
             existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())
             # check that the model already has at least one special token defined
             assert (

@@ -145,7 +126,7 @@ def _compute(self, data):
             # assign one of the special tokens to also be the pad token
             tokenizer.add_special_tokens({"pad_token": existing_special_tokens[0]})

-        if self.config.add_start_token:
+        if add_start_token:
             # leave room for <BOS> token to be added:
             assert (
                 tokenizer.bos_token is not None

@@ -168,7 +149,7 @@
         attn_masks = encodings["attention_mask"]

         # check that each input is long enough:
-        if self.config.add_start_token:
+        if add_start_token:
             assert torch.all(torch.ge(attn_masks.sum(1), 1)), "Each input text must be at least one token long."
         else:
             assert torch.all(

@@ -178,12 +159,12 @@
         ppls = []
         loss_fct = CrossEntropyLoss(reduction="none")

-        for start_index in logging.tqdm(range(0, len(encoded_texts), self.config.batch_size)):
-            end_index = min(start_index + self.config.batch_size, len(encoded_texts))
+        for start_index in logging.tqdm(range(0, len(encoded_texts), batch_size)):
+            end_index = min(start_index + batch_size, len(encoded_texts))
             encoded_batch = encoded_texts[start_index:end_index]
             attn_mask = attn_masks[start_index:end_index]

-            if self.config.add_start_token:
+            if add_start_token:
                 bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
                 encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)
                 attn_mask = torch.cat(
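
Given the reverted `_compute(self, data, model_id, batch_size=16, add_start_token=True, device=None)` signature above, a hedged sketch of the kwargs-based call (requires `torch` and a model download; the exact output keys are not asserted here):

```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="measurement")
results = perplexity.compute(
    data=["lorem ipsum", "Happy Birthday!"],
    model_id="gpt2",          # passed per call instead of via a PerplexityConfig
    batch_size=2,
    add_start_token=True,
)
print(results)  # expected to contain per-text perplexities and an aggregate
```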

measurements/regard/regard.py

Lines changed: 6 additions & 19 deletions

@@ -15,10 +15,8 @@
 """ Regard measurement. """

 from collections import defaultdict
-from dataclasses import dataclass
 from operator import itemgetter
 from statistics import mean
-from typing import Optional

 import datasets
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

@@ -117,28 +115,16 @@ def regard(group, regard_classifier):
     return group_regard, dict(group_scores)


-@dataclass
-class RegardConfig(evaluate.info.Config):
-
-    name: str = "default"
-
-    aggregation: Optional[str] = None
-
-
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class Regard(evaluate.Measurement):
-    CONFIG_CLASS = RegardConfig
-    ALLOWED_CONFIG_NAMES = ["default", "compare"]
-
-    def _info(self, config):
+    def _info(self):
         if self.config_name not in ["compare", "default"]:
             raise KeyError("You should supply a configuration name selected in " '["config", "default"]')
         return evaluate.MeasurementInfo(
             module_type="measurement",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             features=datasets.Features(
                 {
                     "data": datasets.Value("string", id="sequence"),

@@ -164,6 +150,7 @@ def _compute(
         self,
         data,
         references=None,
+        aggregation=None,
     ):
         if self.config_name == "compare":
             pred_scores, pred_regard = regard(data, self.regard_classifier)

@@ -172,22 +159,22 @@ def _compute(
             pred_max = {k: max(v) for k, v in pred_regard.items()}
             ref_mean = {k: mean(v) for k, v in ref_regard.items()}
             ref_max = {k: max(v) for k, v in ref_regard.items()}
-            if self.config.aggregation == "maximum":
+            if aggregation == "maximum":
                 return {
                     "max_data_regard": pred_max,
                     "max_references_regard": ref_max,
                 }
-            elif self.config.aggregation == "average":
+            elif aggregation == "average":
                 return {"average_data_regard": pred_mean, "average_references_regard": ref_mean}
             else:
                 return {"regard_difference": {key: pred_mean[key] - ref_mean.get(key, 0) for key in pred_mean}}
         else:
             pred_scores, pred_regard = regard(data, self.regard_classifier)
             pred_mean = {k: mean(v) for k, v in pred_regard.items()}
             pred_max = {k: max(v) for k, v in pred_regard.items()}
-            if self.config.aggregation == "maximum":
+            if aggregation == "maximum":
                 return {"max_regard": pred_max}
-            elif self.config.aggregation == "average":
+            elif aggregation == "average":
                 return {"average_regard": pred_mean}
             else:
                 return {"regard": pred_scores}
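
Matching the reverted `_compute` above, `aggregation` is once more a `compute()` keyword argument, and the output keys (`max_regard`, `average_regard`, `regard`) come straight from the return statements in the diff. A hedged usage sketch:

```python
import evaluate

regard = evaluate.load("regard", module_type="measurement")
results = regard.compute(
    data=["she was a doctor", "he was a nurse"],
    aggregation="maximum",  # or "average"; omit it to get the raw per-text scores
)
print(results["max_regard"])
```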

measurements/text_duplicates/text_duplicates.py

Lines changed: 4 additions & 16 deletions

@@ -14,7 +14,6 @@

 import hashlib
 from collections import Counter
-from dataclasses import dataclass

 import datasets


@@ -58,29 +57,18 @@ def get_hash(example):
     return hashlib.md5(example.strip().encode("utf-8")).hexdigest()


-@dataclass
-class TextDuplicatesConfig(evaluate.info.Config):
-
-    name: str = "default"
-
-    list_duplicates: bool = False
-
-
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 class TextDuplicates(evaluate.Measurement):
     """This measurement returns the duplicate strings contained in the input(s)."""

-    CONFIG_CLASS = TextDuplicatesConfig
-    ALLOWED_CONFIG_NAMES = ["default"]
-
-    def _info(self, config):
+    def _info(self):
+        # TODO: Specifies the evaluate.MeasurementInfo object
         return evaluate.MeasurementInfo(
             # This is the description that will appear on the modules page.
             module_type="measurement",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
-            config=config,
             # This defines the format of each prediction and reference
             features=datasets.Features(
                 {

@@ -89,9 +77,9 @@ def _info(self, config):
             ),
         )

-    def _compute(self, data):
+    def _compute(self, data, list_duplicates=False):
         """Returns the duplicates contained in the input data and the number of times they are repeated."""
-        if self.config.list_duplicates == True:
+        if list_duplicates == True:
             logger.warning("This functionality can be memory-intensive for large datasets!")
             n_dedup = len(set([get_hash(d) for d in data]))
             c = Counter(data)
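
Likewise for `text_duplicates`: per the reverted `_compute(self, data, list_duplicates=False)` signature, the flag is passed at call time. A hedged usage sketch:

```python
import evaluate

duplicates = evaluate.load("text_duplicates", module_type="measurement")
results = duplicates.compute(
    data=["hello", "hello", "general kenobi"],
    list_duplicates=True,  # the module warns this can be memory-intensive on large datasets
)
print(results)
```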

measurements/toxicity/README.md

Lines changed: 7 additions & 7 deletions

@@ -30,7 +30,7 @@ The model should be compatible with the AutoModelForSequenceClassification class
 For more information, see [the AutoModelForSequenceClassification documentation]( https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForSequenceClassification).

 Args:
-    `data` (list of str): prediction/candidate sentences
+    `predictions` (list of str): prediction/candidate sentences
     `toxic_label` (str) (optional): the toxic label that you want to detect, depending on the labels that the model has been trained on.
     This can be found using the `id2label` function, e.g.:
     ```python

@@ -47,7 +47,7 @@ Args:

 ## Output values

-`toxicity`: a list of toxicity scores, one for each sentence in `data` (default behavior)
+`toxicity`: a list of toxicity scores, one for each sentence in `predictions` (default behavior)

 `max_toxicity`: the maximum toxicity over all scores (if `aggregation` = `maximum`)


@@ -62,31 +62,31 @@ Args:
 ```python
 >>> toxicity = evaluate.load("toxicity", module_type="measurement")
 >>> input_texts = ["she went to the library", "he is a douchebag"]
->>> results = toxicity.compute(data=input_texts)
+>>> results = toxicity.compute(predictions=input_texts)
 >>> print([round(s, 4) for s in results["toxicity"]])
 [0.0002, 0.8564]
 ```
 Example 2 (returns ratio of toxic sentences):
 ```python
 >>> toxicity = evaluate.load("toxicity", module_type="measurement")
 >>> input_texts = ["she went to the library", "he is a douchebag"]
->>> results = toxicity.compute(data=input_texts, aggregation="ratio")
+>>> results = toxicity.compute(predictions=input_texts, aggregation="ratio")
 >>> print(results['toxicity_ratio'])
 0.5
 ```
 Example 3 (returns the maximum toxicity score):
 ```python
 >>> toxicity = evaluate.load("toxicity", module_type="measurement")
 >>> input_texts = ["she went to the library", "he is a douchebag"]
->>> results = toxicity.compute(data=input_texts, aggregation="maximum")
+>>> results = toxicity.compute(predictions=input_texts, aggregation="maximum")
 >>> print(round(results['max_toxicity'], 4))
 0.8564
 ```
 Example 4 (uses a custom model):
 ```python
->>> toxicity = evaluate.load("toxicity", model_name='DaNLP/da-electra-hatespeech-detection')
+>>> toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection')
 >>> input_texts = ["she went to the library", "he is a douchebag"]
->>> results = toxicity.compute(data=input_texts, toxic_label='offensive')
+>>> results = toxicity.compute(predictions=input_texts, toxic_label='offensive')
 >>> print([round(s, 4) for s in results["toxicity"]])
 [0.0176, 0.0203]
 ```
