68 changes: 48 additions & 20 deletions docs/open_source/testset_generation/rag_evaluation/index.md
@@ -1,6 +1,6 @@
# 🥇 RAGET Evaluation

After automatically generating a test set for your RAG agent using RAGET, you can then evaluate the **correctness
of the agent's answers** compared to the reference answers (using an LLM-as-a-judge approach). The main purpose
of this evaluation is to help you **identify the weakest components in your RAG agent**.
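
The overall loop is sketched below. This is a minimal outline only: it assumes `generate_testset` and `evaluate` are exported from `giskard.rag` as in current releases, that `get_answer_fn` wraps your agent as described further down this page, and that `knowledge_base` is an already-built `KnowledgeBase` object.

```python
from giskard.rag import evaluate, generate_testset

# Build a test set from the knowledge base, then grade the agent's answers
# against the reference answers with the LLM-as-a-judge correctness metric.
testset = generate_testset(knowledge_base)
report = evaluate(get_answer_fn, testset=testset, knowledge_base=knowledge_base)

# Export the evaluation report for inspection.
report.to_html("rag_eval_report.html")
```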

@@ -50,25 +50,24 @@ report.to_html("rag_eval_report.html")
This report is what you'll obtain:
![image](../../../_static/raget.png)


### RAG Components Scores

RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness
of the agent's answers on different question types (see question type to component mapping [here](q_types)).
Each score grades a component on a scale from 0 to 100, 100 being a perfect score. **Low scores can help you identify
weaknesses of your RAG agent and which components need improvement.**

Here is the list of components evaluated with RAGET:

- **`Generator`**: the LLM used inside the RAG to generate the answers
- **`Retriever`**: fetches relevant documents from the knowledge base according to the user query
- **`Rewriter`** (optional): rewrites the user query to make it more relevant to the knowledge base or to account for chat history
- **`Router`** (optional): filters the user query based on their intent (intent detection)
- **`Knowledge Base`**: the set of documents given to the RAG to generate the answers
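
To read these scores programmatically, a short sketch is given below. It assumes the report object exposes a `component_scores()` helper, as in current Giskard releases; treat the method name as an assumption if your version differs.

```python
# After running `report = evaluate(...)` as shown above:
scores = report.component_scores()  # assumed helper: one score per component, on a 0-100 scale
print(scores)
```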


### Analyze Correctness and Failures

You can access the correctness of the agent aggregated in various ways or analyze only its failures:

```python
# Correctness on each topic of the Knowledge Base
@@ -92,19 +91,45 @@ results = report.to_pandas()

### RAGAS Metrics

**You can pass additional evaluation metrics to the `evaluate` function**. They will be computed during the evaluation.
We currently provide [RAGAS metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html) as additional metrics.

The results of your metrics will be displayed in the report object as histograms and will be available inside the report's main `DataFrame`.
![image](../../../_static/ragas_metrics.png)

To include RAGAS metrics in evaluation, make sure to have installed the `ragas>=0.1.5` library, then use the following code:
To include RAGAS metrics in evaluation, make sure to have installed the `ragas>=0.1.5` library. Some of the RAGAS metrics need access to the contexts retrieved by the RAG agent for each question. These can be returned by the `get_answer_fn` function along with the answer to the question:

```python
from giskard.rag import AgentAnswer

def get_answer_fn(question: str, history=None) -> AgentAnswer:
    """A function representing your RAG agent."""
    # Format the history appropriately for your RAG agent
    messages = history if history else []
    messages.append({"role": "user", "content": question})

    # Get the answer and the documents
    agent_output = get_answer_from_agent(messages)

    # Following llama_index syntax, you can get the answer and the retrieved documents
    answer = agent_output.text
    documents = agent_output.source_nodes

    # Instead of returning a simple string, we return the AgentAnswer object which
    # allows us to specify the retrieved context which is used by RAGAS metrics
    return AgentAnswer(
        message=answer,
        documents=documents,
    )
```

Then, you can include the RAGAS metrics in the evaluation:

```python
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_faithfulness

report = evaluate(
    answer_fn,
    get_answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_faithfulness]
@@ -114,14 +139,15 @@ report = evaluate(
Built-in metrics include `ragas_context_precision`, `ragas_faithfulness`, `ragas_answer_relevancy`,
`ragas_context_recall`. Note that including these metrics can significantly increase the evaluation time and LLM usage.
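
For reference, here is a sketch that enables all four built-in RAGAS metrics at once (expect a noticeably longer evaluation and higher LLM usage, as noted above):

```python
from giskard.rag.metrics.ragas_metrics import (
    ragas_answer_relevancy,
    ragas_context_precision,
    ragas_context_recall,
    ragas_faithfulness,
)

report = evaluate(
    get_answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_precision, ragas_context_recall, ragas_faithfulness, ragas_answer_relevancy],
)
```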

Alternatively, you can directly pass a list of precomputed answers to the `evaluate` function instead of `get_answer_fn`; in that case, you can pass the retrieved documents as an optional `retrieved_documents` argument so that the RAGAS metrics can be computed.
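
For instance, here is a minimal sketch of passing precomputed answers. It uses the `AgentAnswer` wrapper (the list form accepted by `evaluate` in this change) to carry the retrieved contexts; the sample answers and documents are purely illustrative.

```python
from giskard.rag import AgentAnswer, evaluate

# One entry per testset question, in the same order as the testset.
precomputed_answers = [
    AgentAnswer(
        message="The refund policy lasts 30 days.",          # illustrative answer
        documents=["Refunds are accepted within 30 days."],  # illustrative retrieved context
    ),
    # ...
]

report = evaluate(
    precomputed_answers,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_faithfulness],
)
```

Plain strings are also accepted in the list; they are wrapped into `AgentAnswer` objects internally (see `_cast_to_agent_answer` in the `evaluate.py` changes below), but context-based RAGAS metrics then have no retrieved documents to work with.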

## Going Further: Giskard's Visual Interface

The tests generated by RAGET integrate directly with the **Giskard Hub** to allow for collaboration on the curation,
review and execution of tests.

### Step 1: Convert the test set into a test suite

Let's convert our test set into an actionable test suite ({class}`giskard.Suite`) that we can save and reuse in further iterations.

```python
@@ -130,7 +156,7 @@ test_suite = testset.to_test_suite("My first test suite")
test_suite.run(model=giskard_model)
```

Note that you can split the test suite on the question metadata values, for instance on each question type.

```python
test_suite_by_question_types = testset.to_test_suite("Split test suite", slicing_metadata=["question_type"])
@@ -141,6 +167,7 @@ and [test integration](https://docs.giskard.ai/en/stable/open_source/integrate_t
everything you can do with test suites.

### Step 2: Wrap your model

Before evaluating your model with a test suite, you must wrap it as a `giskard.Model`. This step is necessary to ensure a common format for your model and its metadata. You can wrap anything as long as you can represent it in a Python function (for example an API call to Azure, OpenAI, Mistral, Ollama, etc.). We also have pre-built wrappers for LangChain objects, or you can create your own wrapper by extending the `giskard.Model` class if you need to wrap a complex object such as a custom-made RAG communicating with a vectorstore.
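
As a rough sketch of that wrapping step (the prediction function, name and description below are illustrative, `get_answer_from_agent` is the hypothetical agent call used earlier on this page, and you should check the linked instructions for the exact `giskard.Model` arguments in your version):

```python
import giskard
import pandas as pd

def batch_predict(df: pd.DataFrame) -> list:
    """Run the RAG agent on every question of a dataset and return the answers."""
    return [
        get_answer_from_agent([{"role": "user", "content": question}]).text
        for question in df["question"]
    ]

giskard_model = giskard.Model(
    model=batch_predict,                # any callable mapping a DataFrame to predictions
    model_type="text_generation",
    name="My RAG agent",                # illustrative metadata
    description="Agent answering questions about the internal knowledge base",
    feature_names=["question"],         # assumed to match the test suite's question column
)
```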

To do so, you can follow the instructions from
@@ -158,13 +185,15 @@ test_suite.run(model=giskard_model)
### Step 3: Upload your test suite to the Giskard Hub

Uploading a test suite to the hub allows you to:

- Compare the quality of different models and prompts to decide which one to promote
- Create more tests relevant to your use case, combining input prompts that make your model fail and custom evaluation criteria
- Share results, and collaborate with your team to integrate business feedback

To upload your test suite, you must have created a project on Giskard Hub and instantiated a Giskard Python client.
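
For instance, a minimal sketch of that setup (the URL, API key and project identifiers are placeholders, and the exact `GiskardClient` parameter names are an assumption based on recent giskard releases):

```python
from giskard import GiskardClient

giskard_client = GiskardClient(
    url="https://your-giskard-hub.example.com",  # placeholder Hub URL
    key="YOUR_API_KEY",                          # placeholder API key
)

# Create (or reuse) the project that will hold the test suite and the model.
project = giskard_client.create_project("my_rag_project", "My RAG project", "RAGET evaluation of my RAG agent")
```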

Then, upload your test suite and model like this:

```python
test_suite.upload(giskard_client, project_id) # project_id should be the id of the Giskard project in which you want to upload your suite
giskard_model.upload(giskard_client, project_id)
@@ -179,7 +208,6 @@ giskard_model.upload(giskard_client, project_id)

[Here's a demo](https://huggingface.co/spaces/giskardai/giskard) of the Giskard Hub in action.



## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
2 changes: 2 additions & 0 deletions giskard/rag/__init__.py
@@ -1,3 +1,4 @@
from .base import AgentAnswer
from .evaluate import evaluate
from .knowledge_base import KnowledgeBase
from .report import RAGReport
@@ -11,4 +12,5 @@
"KnowledgeBase",
"evaluate",
"RAGReport",
"AgentAnswer",
]
9 changes: 9 additions & 0 deletions giskard/rag/base.py
@@ -0,0 +1,9 @@
from typing import Optional, Sequence

from dataclasses import dataclass


@dataclass
class AgentAnswer:
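    """Answer returned by the evaluated agent: the message shown to the user and, optionally,
    the retrieved documents that context-based metrics (e.g. RAGAS) can use."""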
    message: str
    documents: Optional[Sequence[str]] = None
44 changes: 31 additions & 13 deletions giskard/rag/evaluate.py
@@ -6,8 +6,9 @@

from ..llm.client import LLMClient, get_default_client
from ..utils.analytics_collector import analytics
from .base import AgentAnswer
from .knowledge_base import KnowledgeBase
from .metrics import CorrectnessMetric, Metric
from .metrics import CorrectnessMetric
from .question_generators.utils import maybe_tqdm
from .recommendation import get_rag_recommendation
from .report import RAGReport
@@ -20,7 +21,7 @@


def evaluate(
answer_fn: Union[Callable, Sequence[str]],
answer_fn: Union[Callable, Sequence[Union[AgentAnswer, str]]],
testset: Optional[QATestset] = None,
knowledge_base: Optional[KnowledgeBase] = None,
llm_client: Optional[LLMClient] = None,
@@ -31,7 +32,7 @@ def evaluate(

Parameters
----------
answers_fn : Union[Callable, Sequence[str]]
answer_fn : Union[Callable, Sequence[Union[AgentAnswer, str]]]
The prediction function of the agent to evaluate or a list of precalculated answers on the testset.
testset : QATestset, optional
The test set to evaluate the agent on. If not provided, a knowledge base must be provided and a default testset will be created from the knowledge base.
@@ -72,7 +73,11 @@
if testset is None:
testset = generate_testset(knowledge_base)

answers = answer_fn if isinstance(answer_fn, Sequence) else _compute_answers(answer_fn, testset)
model_outputs = (
[_cast_to_agent_answer(ans) for ans in answer_fn]
if isinstance(answer_fn, Sequence)
else _compute_answers(answer_fn, testset)
)

llm_client = llm_client or get_default_client()

@@ -87,18 +92,19 @@
metrics_results = defaultdict(dict)

for metric in metrics:
metric_name = getattr(
metric, "name", metric.__class__.__name__ if isinstance(metric, Metric) else metric.__name__
)
try:
metric_name = metric.__class__.__name__
except AttributeError:
metric_name = metric.__name__

for sample, answer in maybe_tqdm(
zip(testset.to_pandas().to_records(index=True), answers),
zip(testset.to_pandas().to_records(index=True), model_outputs),
desc=f"{metric_name} evaluation",
total=len(answers),
total=len(model_outputs),
):
metrics_results[sample["id"]].update(metric(sample, answer))

report = RAGReport(testset, answers, metrics_results, knowledge_base)
report = RAGReport(testset, model_outputs, metrics_results, knowledge_base)
recommendation = get_rag_recommendation(
report.topics,
report.correctness_by_question_type().to_dict()[metrics[0].name],
@@ -121,7 +127,7 @@


def _compute_answers(answer_fn, testset):
answers = []
model_outputs = []
needs_history = (
len(signature(answer_fn).parameters) > 1 and ANSWER_FN_HISTORY_PARAM in signature(answer_fn).parameters
)
@@ -132,5 +138,17 @@
if needs_history:
kwargs[ANSWER_FN_HISTORY_PARAM] = sample.conversation_history

answers.append(answer_fn(sample.question, **kwargs))
return answers
answer = answer_fn(sample.question, **kwargs)
model_outputs.append(_cast_to_agent_answer(answer))

return model_outputs


def _cast_to_agent_answer(answer) -> AgentAnswer:
if isinstance(answer, AgentAnswer):
return answer

if isinstance(answer, str):
return AgentAnswer(message=answer)

raise ValueError(f"The answer function must return a string or an AgentAnswer object. Got {type(answer)} instead.")
3 changes: 2 additions & 1 deletion giskard/rag/metrics/__init__.py
@@ -1,4 +1,5 @@
from ..base import AgentAnswer
from .base import Metric
from .correctness import CorrectnessMetric, correctness_metric

__all__ = ["Metric", "correctness_metric", "CorrectnessMetric"]
__all__ = ["Metric", "correctness_metric", "CorrectnessMetric", "AgentAnswer"]
8 changes: 5 additions & 3 deletions giskard/rag/metrics/base.py
@@ -1,5 +1,7 @@
from abc import ABC, abstractmethod

from giskard.rag.base import AgentAnswer

from ...llm.client.base import LLMClient


@@ -14,20 +16,20 @@ def __init__(self, name: str, llm_client: LLMClient = None) -> None:
self._llm_client = llm_client

@abstractmethod
def __call__(self, question_sample: dict, answer: str):
def __call__(self, question_sample: dict, answer: AgentAnswer):
"""
Compute the metric on a single question and its associated answer.

Parameters
----------
question_sample : dict
A question sample from a QATestset.
answer : Sequence[str]
answer : AgentAnswer
The agent answer on that question.

Returns
-------
dict
The result of the metric. The keys should be the names of the metrics computed.
The result of the metric computation. The keys should be the names of the metrics computed.
"""
pass
20 changes: 18 additions & 2 deletions giskard/rag/metrics/correctness.py
Expand Up @@ -2,6 +2,7 @@

from ...llm.client import ChatMessage, LLMClient, get_default_client
from ...llm.errors import LLMGenerationError
from ..base import AgentAnswer
from ..question_generators.utils import parse_json_output
from .base import Metric

@@ -55,7 +56,22 @@ def __init__(self, name: str, llm_client: LLMClient = None, agent_description: O
self._llm_client = llm_client
self.agent_description = agent_description or "This agent is a chatbot that answers question from users."

def __call__(self, question_sample: dict, answer: str) -> dict:
def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
"""
Compute the correctness between the agent answer and the reference answer from QATestset.

Parameters
----------
question_sample : dict
A question sample from a QATestset.
answer : AgentAnswer
The answer of the agent on the question.

Returns
-------
dict
The result of the correctness evaluation. It contains the keys 'correctness' and 'correctness_reason'.
"""
llm_client = self._llm_client or get_default_client()
try:
out = llm_client.complete(
@@ -72,7 +88,7 @@ def __call__(self, question_sample: dict, answer: str) -> dict:
role="user",
content=CORRECTNESS_INPUT_TEMPLATE.format(
question=question_sample.question,
agent_answer=answer,
agent_answer=answer.message,
ground_truth=question_sample.reference_answer,
),
),