docs/source/a_quick_tour.mdx (60 additions, 41 deletions)
## Save and push to the Hub
Saving and sharing evaluation results is an important step. We provide the [`evaluate.save`] function to easily save metrics results. You can either pass a specific filename or a directory. In the latter case, the results are saved in a file with an automatically created file name. Besides the directory or file name, the function takes any key-value pairs as inputs and stores them in a JSON file.
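
As a rough sketch of how this could look (the metric, the `"./results/"` directory, and the `experiment` value are illustrative placeholders rather than values from this guide):

```python
import evaluate

# Compute some metric result first; accuracy is just an illustrative choice.
accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])

# Passing a directory lets evaluate.save pick a timestamped JSON file name;
# extra keyword arguments are stored alongside the metric values in the file.
evaluate.save("./results/", experiment="run 42", **result)
```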
The evaluator expects a `"text"` and `"label"` column for the data input. If your dataset differs, you can provide the columns with the keywords `input_column="text"` and `label_column="label"`. Currently only `"text-classification"` is supported, with more tasks being added in the future.
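
For instance, a minimal sketch of remapping the columns for a dataset that stores its text under `"sentence"` (the model checkpoint and split below are illustrative assumptions, not choices made in this guide):

```python
from evaluate import evaluator

task_evaluator = evaluator("text-classification")

# sst2 stores its inputs under "sentence" rather than "text", so input_column
# and label_column point the evaluator at the right dataset columns.
results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative model choice
    data="sst2",
    split="validation[:10]",
    metric="accuracy",
    input_column="sentence",
    label_column="label",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
```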
## Running evaluation on a suite of tasks
It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

`EvaluationSuite` scripts can be defined as follows, and support Python code for data preprocessing.
```python
import evaluate
from evaluate.evaluation_suite import SubTask


class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
```
Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("gpt2")
```
It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along some axis. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks which test for general language capabilities like natural language entailment or question-answering, or tasks designed to probe the model along fairness and bias dimensions.

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples as a `SubTask` to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script.

Some datasets require additional preprocessing before passing them to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask` which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.
```python
import evaluate
from evaluate.evaluation_suite import SubTask


class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="sst2",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="rte",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence1",
                    "second_input_column": "sentence2",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0,
                        "LABEL_1": 1
                    }
                }
            )
        ]
```
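
None of the `SubTask`s above use a `data_preprocessor`; as a minimal sketch of the hook described earlier (the lower-casing step and the variable name are purely illustrative assumptions), such a task might look like:

```python
from evaluate.evaluation_suite import SubTask

# The preprocessing function is applied to every example via datasets' map()
# before the evaluator runs; lower-casing the input is only an illustration.
preprocessed_task = SubTask(
    task_type="text-classification",
    data="glue",
    subset="sst2",
    split="validation[:10]",
    data_preprocessor=lambda x: {"sentence": x["sentence"].lower()},
    args_for_task={
        "metric": "accuracy",
        "input_column": "sentence",
        "label_column": "label",
        "label_mapping": {
            "LABEL_0": 0.0,
            "LABEL_1": 1.0
        }
    }
)
```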
An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`.
```python
>>> import pandas as pd
>>> from evaluate import EvaluationSuite

>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
>>> pd.DataFrame(results)
```
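
Since a suite definition is just a Python script, the same suite could presumably also be loaded from a local file path instead of a Hub Space name; the file name below is a hypothetical placeholder:

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load("path/to/glue_suite.py")  # hypothetical local script path
```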