
Commit c13fbce

Merge pull request #2150 from GTimothee/custom_metrics_example
chore(docs): update how to create custom metrics
2 parents dd7ab67 + 9c1386b commit c13fbce

File tree

1 file changed, +276 −0 lines changed
  • docs/open_source/testset_generation/rag_evaluation


docs/open_source/testset_generation/rag_evaluation/index.md

Lines changed: 276 additions & 0 deletions
@@ -141,6 +141,282 @@ Built-in metrics include `ragas_context_precision`, `ragas_faithfulness`, `ragas

Alternatively, you can pass a list of answers to the `evaluate` function instead of `get_answer_fn`. In that case, you can also pass the retrieved documents as the optional `retrieved_documents` argument so that the RAGAS metrics can be computed.
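
For instance, here is a minimal sketch of that calling convention, assuming `testset` and `knowledge_base` are built as described earlier on this page and that the answers and retrieved documents are listed in the same order as the testset questions (the values below are purely illustrative):

```python
from giskard.rag import evaluate

# Pre-computed outputs of your RAG agent, one entry per question in the testset
answers = ["Paris is the capital of France.", "The Eiffel Tower is 330 meters tall."]
retrieved_documents = [
    ["France is a country in Western Europe. Its capital is Paris."],
    ["The Eiffel Tower, completed in 1889, stands 330 meters tall."],
]

report = evaluate(
    answers,                                  # list of answers instead of get_answer_fn
    testset=testset,                          # a QATestset instance
    knowledge_base=knowledge_base,            # the knowledge base used to build the testset
    retrieved_documents=retrieved_documents,  # optional, needed for the RAGAS metrics
)
```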
### Creating Custom Metrics

**You can easily create your own metrics by extending the base `Metric` class from Giskard.**

To illustrate how simple it is to build a custom metric, we'll implement the `CorrectnessScoreMetric`, a custom metric that uses a language model as a judge to score the agent's answers.

While `Giskard` provides a default boolean correctness metric to check whether a RAG (Retrieval-Augmented Generation) answer is correct, here we want a numerical score to gain deeper insight into the model's performance.

#### Step 1: Building the Prompts

Before we can evaluate the model's output, we need to craft a system prompt and an input template for the language model. The system prompt instructs the LLM on how to evaluate the Q&A system's response, and the input template formats the conversation history, the agent's response, and the reference answer.

##### System Prompt

The system prompt gives the LLM clear instructions on how to rate the agent's response: it asks for a score between 1 and 5 based on the correctness, accuracy, and factuality of the answer.
```python
SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""
```

##### Input Template

The input template formats the actual content you will send to the model. This includes the conversation history, the agent's response, and the reference answer.
```python
INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""
```

#### Step 2: Subclassing the Metric Class

We implement the custom metric by subclassing the `Metric` class provided by `giskard.rag.metrics.base`.
```python
from giskard.rag.metrics.base import Metric
from giskard.rag import AgentAnswer


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        pass
```

Now all we need to do is fill in this `__call__` method with our logic.

Metrics are meant to be used with the `evaluate` function, which is where their inputs come from.

Note the input types:

- `question_sample` is a dictionary that must have the following keys: `conversation_history`, `question` and `reference_answer`.
- `answer` is of type `AgentAnswer`, a dataclass representing the output of the RAG system being evaluated. It has two attributes: `message`, which contains the generated response, and `documents`, which holds the retrieved documents (see the sketch below).
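
To make the shapes concrete, here is a purely illustrative sketch of what these two arguments carry; the values are made up, and in practice the `evaluate` function supplies both objects for each question of your `QATestset`:

```python
from giskard.rag import AgentAnswer

# Illustrative only: the fields a question sample exposes
# (in practice it is supplied by the evaluate function from your QATestset)
question_sample = {
    "question": "What is the capital of France?",
    "reference_answer": "The capital of France is Paris.",
    "conversation_history": [],  # previous {"role": ..., "content": ...} messages, if any
}

# The output of the RAG agent under evaluation: the generated message
# plus the documents retrieved to produce it
answer = AgentAnswer(
    message="Paris is the capital of France.",
    documents=["France is a country in Western Europe. Its capital is Paris."],
)
```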

The core of our new metric consists of a `litellm` call, through Giskard's LLM client, with the prompts specified earlier.
```python
from giskard.rag.metrics.base import Metric
from giskard.rag import AgentAnswer
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation
from giskard.llm.client import get_default_client

from llama_index.core.base.llms.types import ChatMessage


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        # Retrieve the LLM client
        llm_client = self._llm_client or get_default_client()

        # Call the LLM with the system prompt and the formatted input
        out = llm_client.complete(
            messages=[
                ChatMessage(
                    role="system",
                    content=SYSTEM_PROMPT,
                ),
                ChatMessage(
                    role="user",
                    content=INPUT_TEMPLATE.format(
                        conversation=format_conversation(
                            question_sample.conversation_history
                            + [{"role": "user", "content": question_sample.question}]
                        ),
                        answer=answer.message,
                        reference_answer=question_sample.reference_answer,
                    ),
                ),
            ],
            temperature=0,
            format="json_object",
        )

        # Parse the output string representation of a JSON object into a dictionary
        json_output = parse_json_output(
            out.content,
            llm_client=llm_client,
            keys=["correctness_score"],
            caller_id=self.__class__.__name__,
        )

        return json_output
```

Note how `keys=["correctness_score"]` must match the key you asked the LLM for in your `SYSTEM_PROMPT`.

As you can see, it is as simple as calling the LLM and parsing its output into a dictionary with a single key, `correctness_score`, containing the score given by the LLM.
<details>
<summary>Putting everything together, here is our final implementation:</summary>

```python
from giskard.llm.client import get_default_client
from giskard.llm.errors import LLMGenerationError

from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation

from llama_index.core.base.llms.types import ChatMessage

SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        """
        Compute the correctness *as a number from 1 to 5* between the agent answer and the reference answer.

        Parameters
        ----------
        question_sample : dict
            A question sample from a QATestset.
        answer : AgentAnswer
            The answer of the agent on the question.

        Returns
        -------
        dict
            The result of the correctness scoring. It contains the key 'correctness_score'.

        Raises
        ------
        LLMGenerationError
            If there is an issue during the LLM evaluation process.
        """
        # Implement your LLM call with litellm
        llm_client = self._llm_client or get_default_client()
        try:
            out = llm_client.complete(
                messages=[
                    ChatMessage(
                        role="system",
                        content=SYSTEM_PROMPT,
                    ),
                    ChatMessage(
                        role="user",
                        content=INPUT_TEMPLATE.format(
                            conversation=format_conversation(
                                question_sample.conversation_history
                                + [{"role": "user", "content": question_sample.question}]
                            ),
                            answer=answer.message,
                            reference_answer=question_sample.reference_answer,
                        ),
                    ),
                ],
                temperature=0,
                format="json_object",
            )

            # We asked the LLM to output a JSON object, so we must parse the output into a dict
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err
```
</details>

#### Step 3: Use Your New Metric in the Evaluation Function

Integrating our new custom metric into the evaluation process is straightforward. Simply instantiate the metric and pass it to the `evaluate` function.
```python
# Instantiate the custom correctness metric
correctness_score = CorrectnessScoreMetric(name="correctness_score")

# Run the evaluation with the custom metric
report = evaluate(
    answer_fn,
    testset=testset,  # a QATestset instance
    knowledge_base=knowledge_base,  # the knowledge base used for building the QATestset
    metrics=[correctness_score],
)
```

Again, make sure the `name` argument of the `CorrectnessScoreMetric` matches the key you return in the output dictionary.
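
Custom metrics can also be combined with the built-in ones in the same run, since `metrics` accepts a list. Here is a sketch, assuming `ragas_context_precision` is imported as described in the built-in metrics section above (the exact import path is an assumption and may differ across Giskard versions):

```python
# Assumption: import path based on the built-in RAGAS metrics section of this guide;
# adjust it if your Giskard version exposes the metric elsewhere.
from giskard.rag.metrics.ragas_metrics import ragas_context_precision

report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[correctness_score, ragas_context_precision],  # custom and built-in metrics together
)
```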

#### Wrap-up

In this tutorial, we covered the steps to create and use a custom evaluation metric.

Here's a quick summary of the key points:

1. *Define the prompts*: Create a system prompt and input template for the LLM to evaluate answers.
2. *Implement the new metric*: Subclass the `Metric` class and implement the `__call__` method to create the custom `CorrectnessScoreMetric`.
3. *Use the new metric*: Instantiate and integrate the custom metric into the evaluation function.
## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
