Built-in metrics include `ragas_context_precision`, `ragas_faithfulness`, and other RAGAS metrics.
Alternatively, you can pass a list of answers directly to the `evaluate` function instead of `get_answer_fn`; in that case, you can pass the retrieved documents as the optional `retrieved_documents` argument to compute the RAGAS metrics.
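As a minimal sketch (assuming a `testset` and `knowledge_base` have already been built as shown earlier, and that the answers are listed in the same order as the test set questions):

```python
from giskard.rag import evaluate

# Precomputed answers, one per question in the test set (same order)
answers = ["Paris is the capital of France.", "The Eiffel Tower is 330 m tall."]
# Documents retrieved by the RAG pipeline for each question (needed for the RAGAS metrics)
retrieved_docs = [["doc about France"], ["doc about the Eiffel Tower"]]

report = evaluate(
    answers,                             # list of answers instead of get_answer_fn
    testset=testset,                     # assumed: the test set built previously
    knowledge_base=knowledge_base,       # assumed: the same knowledge base
    retrieved_documents=retrieved_docs,  # optional, enables the RAGAS metrics
)
```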
### Creating Custom Metrics
**You can easily create your own metrics by extending the base Metric class from Giskard.**
To illustrate how simple it is to build a custom metric, we'll implement the `CorrectnessScoreMetric`, a metric that uses a language model as a judge to score the agent's answers.
While `Giskard` provides a default boolean correctness metric that checks whether a RAG (Retrieval-Augmented Generation) answer is correct, here we want a numerical score that gives deeper insight into the model's performance.
#### Step 1: Building the Prompts
Before we can evaluate the model's output, we need to craft a system prompt and input template for the language model. The system prompt will instruct the LLM on how to evaluate the Q&A system’s response, and the input template will format the conversation history, agent’s response, and reference answer.
##### System Prompt
The system prompt provides the LLM with clear instructions on how to rate the agent's response. You’ll be guiding the LLM to give a score between 1 and 5 based on the correctness, accuracy, and factuality of the answer.
```python
SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""
```
##### Input Template
The input template formats the actual content you will send to the model. This includes the conversation history, the agent's response, and the reference answer.
```python
INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""
```
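For a single evaluation call, the template is filled in with `str.format`; the values below are purely illustrative:

```python
# Illustrative values only; in practice these come from the test set and the agent
user_message = INPUT_TEMPLATE.format(
    conversation="USER: What is the capital of France?",
    answer="Paris is the capital of France.",
    reference_answer="The capital of France is Paris.",
)
```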
#### Step 2: Subclassing the Metric Class
We implement the custom metric by subclassing the `Metric` class provided by `giskard.rag.metrics.base`.
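As a starting point, the skeleton is just a subclass with an empty `__call__` method (a minimal sketch; the exact signature is detailed below):

```python
from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        # Evaluation logic will go here
        ...
```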
Now all we need to do is fill in the `__call__` method with our logic.
Metrics are meant to be used with the `evaluate` function, which is where we assume the inputs come from.
Note the input types:
- `question_sample` is a dictionary that must have the following keys: `conversation_history`, `question` and `reference_answer`.
- `answer` is of type `AgentAnswer`, which is a dataclass representing the output of the RAG system being evaluated. It includes two attributes: `message`, which contains the generated response, and `documents`, which holds the retrieved documents.
The core of our new metric is a single LLM call, made through Giskard's default LLM client (which relies on LiteLLM), using the prompts specified earlier.
```python
from giskard.rag.metrics.base import Metric
from giskard.rag import AgentAnswer
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation
from giskard.llm.client import get_default_client

from llama_index.core.base.llms.types import ChatMessage

# ... (the class definition and the LLM call producing `out` are omitted in this
#      excerpt; see the sketch below) ...

        # Parse the output string representation of a JSON object into a dictionary
        json_output = parse_json_output(
            out.content,
            llm_client=llm_client,
            keys=["correctness_score"],
            caller_id=self.__class__.__name__,
        )

        return json_output
```
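The excerpt above leaves out the part of the class that produces `llm_client` and `out`. Here is a minimal sketch of how the whole `__call__` method could fit together; the `self._llm_client` attribute, the shape of the `complete` call and `temperature=0` are assumptions about Giskard's LLM client API rather than a verbatim copy of the official implementation.

```python
class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        # Use the LLM client attached to the metric, or fall back to the default one
        llm_client = self._llm_client or get_default_client()

        # Render the conversation history plus the current question as plain text
        conversation = format_conversation(
            question_sample["conversation_history"]
            + [{"role": "user", "content": question_sample["question"]}]
        )

        # Ask the judge LLM to rate the agent's answer against the reference answer
        out = llm_client.complete(
            messages=[
                ChatMessage(role="system", content=SYSTEM_PROMPT),
                ChatMessage(
                    role="user",
                    content=INPUT_TEMPLATE.format(
                        conversation=conversation,
                        answer=answer.message,
                        reference_answer=question_sample["reference_answer"],
                    ),
                ),
            ],
            temperature=0,
        )

        # Parse the LLM output (a JSON string) into a dictionary
        json_output = parse_json_output(
            out.content,
            llm_client=llm_client,
            keys=["correctness_score"],
            caller_id=self.__class__.__name__,
        )

        return json_output
```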
Note how `keys=["correctness_score"]` must match the key you asked the LLM for in your `SYSTEM_PROMPT`.
As you can see, it is as simple as calling the LLM and parsing its output into a dictionary with a single key, `correctness_score`, containing the score given by the LLM.
<details>
<summary>Putting everything together, here is our final implementation:</summary>
```python
from giskard.llm.client import get_default_client
from giskard.llm.errors import LLMGenerationError

from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation

from llama_index.core.base.llms.types import ChatMessage

SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

# ... (the INPUT_TEMPLATE, the beginning of the CorrectnessScoreMetric class and the
#      LLM call are omitted in this excerpt; see the sketch above) ...

            # We asked the LLM to output a JSON object, so we must parse the output into a dict
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err
```
</details>
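Before plugging the metric into a full evaluation, you can sanity-check it on a single sample. The values below are purely illustrative, and the `name` constructor argument is assumed to come from the `Metric` base class:

```python
metric = CorrectnessScoreMetric(name="correctness_score")

sample = {
    "conversation_history": [],
    "question": "What is the capital of France?",
    "reference_answer": "The capital of France is Paris.",
}
agent_answer = AgentAnswer(message="Paris.", documents=[])

print(metric(sample, agent_answer))  # e.g. {"correctness_score": 5}
```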
#### Step 3: Use Your New Metric in the Evaluation Function
Integrating our new custom metric into the evaluation process is straightforward. Simply instantiate the metric and pass it to the `evaluate` function.
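For example, assuming the `testset`, `knowledge_base` and `get_answer_fn` from the previous sections are still available, the custom metric is passed through the same `metrics` argument used for the built-in ones:

```python
from giskard.rag import evaluate

correctness_score = CorrectnessScoreMetric(name="correctness_score")

report = evaluate(
    get_answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[correctness_score],
)
```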