
Commit c13fbce

Merge pull request #2150 from GTimothee/custom_metrics_example
chore(docs): update how to create custom metrics
2 parents dd7ab67 + 9c1386b commit c13fbce

File tree

1 file changed, +276 −0 lines changed
  • docs/open_source/testset_generation/rag_evaluation


docs/open_source/testset_generation/rag_evaluation/index.md

Lines changed: 276 additions & 0 deletions
@@ -141,6 +141,282 @@ Built-in metrics include `ragas_context_precision`, `ragas_faithfulness`, `ragas

Alternatively, you can pass a list of answers to the `evaluate` function instead of `get_answer_fn`. In that case, you can also pass the retrieved documents as the optional `retrieved_documents` argument so that the RAGAS metrics can be computed.
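
For instance, here is a minimal sketch of that calling convention, assuming `testset` and `knowledge_base` are built as described earlier on this page and that the answers and retrieved documents are listed in the same order as the testset questions (the values below are purely illustrative):

```python
from giskard.rag import evaluate

# Pre-computed outputs of your RAG agent, one entry per question in the testset
answers = ["Paris is the capital of France.", "The Eiffel Tower is 330 meters tall."]
retrieved_documents = [
    ["France is a country in Western Europe. Its capital is Paris."],
    ["The Eiffel Tower, completed in 1889, stands 330 meters tall."],
]

report = evaluate(
    answers,                                  # list of answers instead of get_answer_fn
    testset=testset,                          # a QATestset instance
    knowledge_base=knowledge_base,            # the knowledge base used to build the testset
    retrieved_documents=retrieved_documents,  # optional, needed for the RAGAS metrics
)
```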
### Creating Custom Metrics

**You can easily create your own metrics by extending the base `Metric` class from Giskard.**

To illustrate how simple it is to build a custom metric, we'll implement the `CorrectnessScoreMetric`, a custom metric that uses a language model as a judge to score the agent's answers.

While `Giskard` provides a default boolean correctness metric to check whether a RAG (Retrieval-Augmented Generation) answer is correct, here we want a numerical score to gain deeper insight into the model's performance.

#### Step 1: Building the Prompts

Before we can evaluate the model's output, we need to craft a system prompt and an input template for the language model. The system prompt instructs the LLM on how to evaluate the Q&A system's response, and the input template formats the conversation history, the agent's response, and the reference answer.

##### System Prompt

The system prompt gives the LLM clear instructions on how to rate the agent's response: it asks for a score between 1 and 5 based on the correctness, accuracy, and factuality of the answer.
```python
SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""
```

##### Input Template

The input template formats the actual content you will send to the model. This includes the conversation history, the agent's response, and the reference answer.
```python
INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""
```

#### Step 2: Subclassing the Metric Class

We implement the custom metric by subclassing the `Metric` class provided by `giskard.rag.metrics.base`.
```python
from giskard.rag.metrics.base import Metric
from giskard.rag import AgentAnswer


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        pass
```

Now all we need to do is fill in this `__call__` method with our logic.

Metrics are meant to be used with the `evaluate` function, which is where their inputs come from.

Note the input types:

- `question_sample` is a dictionary that must have the following keys: `conversation_history`, `question` and `reference_answer`.
- `answer` is of type `AgentAnswer`, a dataclass representing the output of the RAG system being evaluated. It has two attributes: `message`, which contains the generated response, and `documents`, which holds the retrieved documents (see the sketch below).
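
To make the shapes concrete, here is a purely illustrative sketch of what these two arguments carry; the values are made up, and in practice the `evaluate` function supplies both objects for each question of your `QATestset`:

```python
from giskard.rag import AgentAnswer

# Illustrative only: the fields a question sample exposes
# (in practice it is supplied by the evaluate function from your QATestset)
question_sample = {
    "question": "What is the capital of France?",
    "reference_answer": "The capital of France is Paris.",
    "conversation_history": [],  # previous {"role": ..., "content": ...} messages, if any
}

# The output of the RAG agent under evaluation: the generated message
# plus the documents retrieved to produce it
answer = AgentAnswer(
    message="Paris is the capital of France.",
    documents=["France is a country in Western Europe. Its capital is Paris."],
)
```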

The core of our new metric consists of a `litellm` call, through Giskard's LLM client, with the prompts specified earlier.
```python
from giskard.rag.metrics.base import Metric
from giskard.rag import AgentAnswer
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation
from giskard.llm.client import get_default_client

from llama_index.core.base.llms.types import ChatMessage


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        # Retrieve the LLM client
        llm_client = self._llm_client or get_default_client()

        # Call the LLM with the system prompt and the formatted input
        out = llm_client.complete(
            messages=[
                ChatMessage(
                    role="system",
                    content=SYSTEM_PROMPT,
                ),
                ChatMessage(
                    role="user",
                    content=INPUT_TEMPLATE.format(
                        conversation=format_conversation(
                            question_sample.conversation_history
                            + [{"role": "user", "content": question_sample.question}]
                        ),
                        answer=answer.message,
                        reference_answer=question_sample.reference_answer,
                    ),
                ),
            ],
            temperature=0,
            format="json_object",
        )

        # Parse the output string representation of a JSON object into a dictionary
        json_output = parse_json_output(
            out.content,
            llm_client=llm_client,
            keys=["correctness_score"],
            caller_id=self.__class__.__name__,
        )

        return json_output
```

Note how `keys=["correctness_score"]` must match the key you asked the LLM for in your `SYSTEM_PROMPT`.

As you can see, it is as simple as calling the LLM and parsing its output into a dictionary with a single key, `correctness_score`, containing the score given by the LLM.
<details>
<summary>Putting everything together, here is our final implementation:</summary>

```python
from giskard.llm.client import get_default_client
from giskard.llm.errors import LLMGenerationError

from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation

from llama_index.core.base.llms.types import ChatMessage

SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        """
        Compute the correctness *as a number from 1 to 5* between the agent answer and the reference answer.

        Parameters
        ----------
        question_sample : dict
            A question sample from a QATestset.
        answer : AgentAnswer
            The answer of the agent on the question.

        Returns
        -------
        dict
            The result of the correctness scoring. It contains the key 'correctness_score'.

        Raises
        ------
        LLMGenerationError
            If there is an issue during the LLM evaluation process.
        """
        # Implement your LLM call with litellm
        llm_client = self._llm_client or get_default_client()
        try:
            out = llm_client.complete(
                messages=[
                    ChatMessage(
                        role="system",
                        content=SYSTEM_PROMPT,
                    ),
                    ChatMessage(
                        role="user",
                        content=INPUT_TEMPLATE.format(
                            conversation=format_conversation(
                                question_sample.conversation_history
                                + [{"role": "user", "content": question_sample.question}]
                            ),
                            answer=answer.message,
                            reference_answer=question_sample.reference_answer,
                        ),
                    ),
                ],
                temperature=0,
                format="json_object",
            )

            # We asked the LLM to output a JSON object, so we must parse the output into a dict
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err
```
</details>

#### Step 3: Use Your New Metric in the Evaluation Function

Integrating our new custom metric into the evaluation process is straightforward. Simply instantiate the metric and pass it to the `evaluate` function.
```python
# Instantiate the custom correctness metric
correctness_score = CorrectnessScoreMetric(name="correctness_score")

# Run the evaluation with the custom metric
report = evaluate(
    answer_fn,
    testset=testset,  # a QATestset instance
    knowledge_base=knowledge_base,  # the knowledge base used for building the QATestset
    metrics=[correctness_score],
)
```

Again, make sure the `name` argument of the `CorrectnessScoreMetric` matches the key you return in the output dictionary.
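
Custom metrics can also be combined with the built-in ones in the same run, since `metrics` accepts a list. Here is a sketch, assuming `ragas_context_precision` is imported as described in the built-in metrics section above (the exact import path is an assumption and may differ across Giskard versions):

```python
# Assumption: import path based on the built-in RAGAS metrics section of this guide;
# adjust it if your Giskard version exposes the metric elsewhere.
from giskard.rag.metrics.ragas_metrics import ragas_context_precision

report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[correctness_score, ragas_context_precision],  # custom and built-in metrics together
)
```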

#### Wrap-up

In this tutorial, we covered the steps to create and use a custom evaluation metric.

Here's a quick summary of the key points:

1. *Define the prompts*: Create a system prompt and input template for the LLM to evaluate answers.
2. *Implement the new metric*: Subclass the `Metric` class and implement the `__call__` method to create the custom `CorrectnessScoreMetric`.
3. *Use the new metric*: Instantiate and integrate the custom metric into the evaluation function.
## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
