Commit 7da77ed

Code review

1 parent fb89329

File tree

2 files changed: +90 / -53 lines

docs/source/a_quick_tour.mdx

Lines changed: 60 additions & 41 deletions
````diff
@@ -199,47 +199,6 @@ The `combine` function accepts both the list of names of the metrics as well as
 }
 ```
 
-## Running evaluation on a suite of tasks
-
-It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.
-
-`EvaluationSuite` scripts can be defined as follows, and supports Python code for data preprocessing.
-
-```python
-import evaluate
-from evaluate.evaluation_suite import SubTask
-
-class Suite(evaluate.EvaluationSuite):
-
-    def __init__(self, name):
-        super().__init__(name)
-        self.preprocessor = lambda x: {"text": x["text"].lower()}
-        self.suite = [
-            SubTask(
-                task_type="text-classification",
-                data="glue",
-                subset="cola",
-                split="test[:10]",
-                args_for_task={
-                    "metric": "accuracy",
-                    "input_column": "sentence",
-                    "label_column": "label",
-                    "label_mapping": {
-                        "LABEL_0": 0.0,
-                        "LABEL_1": 1.0
-                    }
-                }
-            )]
-```
-
-Evaluation can be run by loading the `EvaluationSuite` and calling `run()` method with a model or pipeline.
-
-```python
-from evaluate import EvaluationSuite
-suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
-results = suite.run("gpt2")
-```
-
 ## Save and push to the Hub
 
 Saving and sharing evaluation results is an important step. We provide the [`evaluate.save`] function to easily save metrics results. You can either pass a specific filename or a directory. In the latter case, the results are saved in a file with an automatically created file name. Besides the directory or file name, the function takes any key-value pairs as inputs and stores them in a JSON file.
````
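For readers skimming the diff, a minimal sketch of the `evaluate.save` usage the paragraph above describes; the directory name, metric value, and hyperparameter keys are illustrative placeholders, not part of this commit.

```python
import evaluate

# Illustrative values: any key-value pairs can be passed and are stored as JSON.
result = {"accuracy": 0.75}                   # e.g. the output of metric.compute(...)
hyperparams = {"model": "bert-base-uncased"}  # arbitrary extra metadata

# Passing a directory saves to an automatically named JSON file inside it;
# a specific filename works as well.
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)
```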
````diff
@@ -332,3 +291,63 @@ Calculating the value of the metric alone is often not enough to know if a model
 ```
 
 The evaluator expects a `"text"` and `"label"` column for the data input. If your dataset differs you can provide the columns with the keywords `input_column="text"` and `label_column="label"`. Currently only `"text-classification"` is supported with more tasks being added in the future.
+
+## Running evaluation on a suite of tasks
+
+It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.
+
+`EvaluationSuite` scripts can be defined as follows, and supports Python code for data preprocessing.
+
+```python
+import evaluate
+from evaluate.evaluation_suite import SubTask
+
+class Suite(evaluate.EvaluationSuite):
+
+    def __init__(self, name):
+        super().__init__(name)
+
+        self.suite = [
+            SubTask(
+                task_type="text-classification",
+                data="imdb",
+                split="test[:1]",
+                args_for_task={
+                    "metric": "accuracy",
+                    "input_column": "text",
+                    "label_column": "label",
+                    "label_mapping": {
+                        "LABEL_0": 0.0,
+                        "LABEL_1": 1.0
+                    }
+                }
+            ),
+            SubTask(
+                task_type="text-classification",
+                data="sst2",
+                split="test[:1]",
+                args_for_task={
+                    "metric": "accuracy",
+                    "input_column": "sentence",
+                    "label_column": "label",
+                    "label_mapping": {
+                        "LABEL_0": 0.0,
+                        "LABEL_1": 1.0
+                    }
+                }
+            )
+        ]
+```
+
+Evaluation can be run by loading the `EvaluationSuite` and calling `run()` method with a model or pipeline.
+
+```
+>>> from evaluate import EvaluationSuite
+>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
+>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
+
+| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
+|---------:|----------------------:|-------------------:|-------------------:|:----------|
+|      0.3 |               4.62804 |            2.16074 |           0.462804 | imdb      |
+|        0 |              0.686388 |             14.569 |          0.0686388 | sst2      |
+```
````
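Because `run()` accepts either a model name or a pipeline, a brief sketch of the pipeline variant may be useful, reusing the checkpoint and suite names from the example above; this is an illustration, not part of the diff.

```python
from transformers import pipeline
from evaluate import EvaluationSuite

# Build an explicit text-classification pipeline for the same checkpoint
# used in the docs example, then hand it to the suite.
pipe = pipeline(
    "text-classification",
    model="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
)

suite = EvaluationSuite.load("mathemakitten/sentiment-evaluation-suite")
results = suite.run(pipe)
```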

docs/source/evaluation_suite.mdx

Lines changed: 30 additions & 12 deletions
````diff
@@ -1,8 +1,10 @@
 # Creating an EvaluationSuite
 
-The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.
+It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along some axis. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks which test for general language capabilities like natural language entailment or question-answering, or tasks designed to probe the model along fairness and bias dimensions.
 
-A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally.
+The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples as a SubTask to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.
+
+A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script.
 
 Some datasets require additional preprocessing before passing them to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask` which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.
 
````
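The last context line above mentions a per-`SubTask` `data_preprocessor` applied via `map`. A minimal sketch of how that might look, assuming `SubTask` accepts such a callable as described; the lower-casing preprocessor (borrowed from the removed quick-tour example) and the `imdb` task are illustrative only.

```python
import evaluate
from evaluate.evaluation_suite import SubTask


class Suite(evaluate.EvaluationSuite):
    def __init__(self, name):
        super().__init__(name)
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:10]",
                # Illustrative preprocessor: lower-case the text column before evaluation.
                data_preprocessor=lambda x: {"text": x["text"].lower()},
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
                },
            )
        ]
```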

````diff
@@ -19,8 +21,8 @@ class Suite(evaluate.EvaluationSuite):
             SubTask(
                 task_type="text-classification",
                 data="glue",
-                subset="cola",
-                split="test[:10]",
+                subset="sst2",
+                split="validation[:10]",
                 args_for_task={
                     "metric": "accuracy",
                     "input_column": "sentence",
````
````diff
@@ -30,19 +32,35 @@ class Suite(evaluate.EvaluationSuite):
                         "LABEL_1": 1.0
                     }
                 }
+            ),
+            SubTask(
+                task_type="text-classification",
+                data="glue",
+                subset="rte",
+                split="validation[:10]",
+                args_for_task={
+                    "metric": "accuracy",
+                    "input_column": "sentence1",
+                    "second_input_column": "sentence2",
+                    "label_column": "label",
+                    "label_mapping": {
+                        "LABEL_0": 0,
+                        "LABEL_1": 1
+                    }
+                }
             )
         ]
 ```
 
 An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`.
 
-```python
-import pandas as pd
-from evaluate import EvaluationSuite
-
-suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
-results = suite.run("gpt2")
+```
+>>> from evaluate import EvaluationSuite
+>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
+>>> results = suite.run("gpt2")
 
-results = [{'accuracy': 0.0, 'total_time_in_seconds': 0.6330130019999842, 'samples_per_second': 15.797463825237905, 'latency_in_seconds': 0.06330130019999843, 'task_name': 'glue/cola', 'data_preprocessor': None}, {'accuracy': 0.5, 'total_time_in_seconds': 0.7627554609999834, 'samples_per_second': 13.110361723126644, 'latency_in_seconds': 0.07627554609999834, 'task_name': 'glue/sst2', 'data_preprocessor': None}]
-print(pd.DataFrame(results))
+| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
+|---------:|----------------------:|-------------------:|-------------------:|:----------|
+|      0.5 |              0.740811 |            13.4987 |          0.0740811 | glue/sst2 |
+|      0.4 |               1.67552 |             5.9683 |           0.167552 | glue/rte  |
 ```
````
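The surrounding prose still says the results "can be easily displayed with a `pandas.DataFrame`", while the explicit pandas call now only appears in the removed lines. A minimal sketch of that step, assuming `run()` returns one result dict per `SubTask` as the old example showed:

```python
import pandas as pd
from evaluate import EvaluationSuite

suite = EvaluationSuite.load("mathemakitten/glue-evaluation-suite")
results = suite.run("gpt2")  # list with one result dict per SubTask

# Tabulate the per-task metrics and timing information.
print(pd.DataFrame(results))
```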
