68 changes: 48 additions & 20 deletions docs/open_source/testset_generation/rag_evaluation/index.md
@@ -1,6 +1,6 @@
# 🥇 RAGET Evaluation

After automatically generating a test set for your RAG agent using RAGET, you can then evaluate the **correctness
of the agent's answers** compared to the reference answers (using an LLM-as-a-judge approach). The main purpose
of this evaluation is to help you **identify the weakest components in your RAG agent**.
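
The overall loop is sketched below. This is a minimal outline only: it assumes `generate_testset` and `evaluate` are exported from `giskard.rag` as in current releases, that `get_answer_fn` wraps your agent as described further down this page, and that `knowledge_base` is an already-built `KnowledgeBase` object.

```python
from giskard.rag import evaluate, generate_testset

# Build a test set from the knowledge base, then grade the agent's answers
# against the reference answers with the LLM-as-a-judge correctness metric.
testset = generate_testset(knowledge_base)
report = evaluate(get_answer_fn, testset=testset, knowledge_base=knowledge_base)

# Export the evaluation report for inspection.
report.to_html("rag_eval_report.html")
```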

@@ -50,25 +50,24 @@ report.to_html("rag_eval_report.html")
This report is what you'll obtain:
![image](../../../_static/raget.png)


### RAG Components Scores

RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness
of the agent's answers on different question types (see question type to component mapping [here](q_types)).
Each score grades a component on a scale from 0 to 100, 100 being a perfect score. **Low scores can help you identify
weaknesses of your RAG agent and which components need improvement.**

Here is the list of components evaluated with RAGET:

- **`Generator`**: the LLM used inside the RAG to generate the answers
- **`Retriever`**: fetches relevant documents from the knowledge base according to the user query
- **`Rewriter`** (optional): rewrites the user query to make it more relevant to the knowledge base or to account for chat history
- **`Router`** (optional): filters the user query based on their intent (intent detection)
- **`Knowledge Base`**: the set of documents given to the RAG to generate the answers
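
To read these scores programmatically, a short sketch is given below. It assumes the report object exposes a `component_scores()` helper, as in current Giskard releases; treat the method name as an assumption if your version differs.

```python
# After running `report = evaluate(...)` as shown above:
scores = report.component_scores()  # assumed helper: one score per component, on a 0-100 scale
print(scores)
```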


### Analyze Correctness and Failures

You can access the correctness of the agent aggregated in various ways or analyze only its failures:

```python
# Correctness on each topic of the Knowledge Base
@@ -92,19 +91,45 @@ results = report.to_pandas()

### RAGAS Metrics

**You can pass additional evaluation metrics to the `evaluate` function**. They will be computed during the evaluation.
We currently provide [RAGAS metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html) as additional metrics.

The results of your metrics will be displayed in the report object as histograms and will be available inside the report's main `DataFrame`.
![image](../../../_static/ragas_metrics.png)

To include RAGAS metrics in evaluation, make sure to have installed the `ragas>=0.1.5` library, then use the following code:
To include RAGAS metrics in evaluation, make sure to have installed the `ragas>=0.1.5` library. Some of the RAGAS metrics need access to the contexts retrieved by the RAG agent for each question. These can be returned by the `get_answer_fn` function along with the answer to the question:

```python
from giskard.rag import AgentAnswer

def get_answer_fn(question: str, history=None) -> AgentAnswer:
    """A function representing your RAG agent."""
    # Format the history appropriately for your RAG agent
    messages = history if history else []
    messages.append({"role": "user", "content": question})

    # Get the answer and the documents
    agent_output = get_answer_from_agent(messages)

    # Following llama_index syntax, you can get the answer and the retrieved documents
    answer = agent_output.text
    documents = agent_output.source_nodes

    # Instead of returning a simple string, we return the AgentAnswer object which
    # allows us to specify the retrieved context which is used by RAGAS metrics
    return AgentAnswer(
        message=answer,
        documents=documents,
    )
```

Then, you can include the RAGAS metrics in the evaluation:

```python
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_faithfulness

report = evaluate(
    answer_fn,
    get_answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_faithfulness]
@@ -114,14 +139,15 @@ report = evaluate(
Built-in metrics include `ragas_context_precision`, `ragas_faithfulness`, `ragas_answer_relevancy`,
`ragas_context_recall`. Note that including these metrics can significantly increase the evaluation time and LLM usage.
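
For reference, here is a sketch that enables all four built-in RAGAS metrics at once (expect a noticeably longer evaluation and higher LLM usage, as noted above):

```python
from giskard.rag.metrics.ragas_metrics import (
    ragas_answer_relevancy,
    ragas_context_precision,
    ragas_context_recall,
    ragas_faithfulness,
)

report = evaluate(
    get_answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_precision, ragas_context_recall, ragas_faithfulness, ragas_answer_relevancy],
)
```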

Alternatively, you can directly pass a list of precomputed answers to the `evaluate` function instead of `get_answer_fn`; in that case, you can pass the retrieved documents as an optional `retrieved_documents` argument so that the RAGAS metrics can be computed.
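
For instance, here is a minimal sketch of passing precomputed answers. It uses the `AgentAnswer` wrapper (the list form accepted by `evaluate` in this change) to carry the retrieved contexts; the sample answers and documents are purely illustrative.

```python
from giskard.rag import AgentAnswer, evaluate

# One entry per testset question, in the same order as the testset.
precomputed_answers = [
    AgentAnswer(
        message="The refund policy lasts 30 days.",          # illustrative answer
        documents=["Refunds are accepted within 30 days."],  # illustrative retrieved context
    ),
    # ...
]

report = evaluate(
    precomputed_answers,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_faithfulness],
)
```

Plain strings are also accepted in the list; they are wrapped into `AgentAnswer` objects internally (see `_cast_to_agent_answer` in the `evaluate.py` changes below), but context-based RAGAS metrics then have no retrieved documents to work with.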

## Going Further: Giskard's Visual Interface

The tests generated by RAGET integrate directly with the **Giskard Hub** to allow for collaboration on the curation,
review and execution of tests.

### Step 1: Convert the test set into a test suite

Let's convert our test set into an actionable test suite ({class}`giskard.Suite`) that we can save and reuse in further iterations.

```python
@@ -130,7 +156,7 @@ test_suite = testset.to_test_suite("My first test suite")
test_suite.run(model=giskard_model)
```

Note that you can split the test suite on the question metadata values, for instance on each question type.

```python
test_suite_by_question_types = testset.to_test_suite("Split test suite", slicing_metadata=["question_type"])
@@ -141,6 +167,7 @@ and [test integration](https://docs.giskard.ai/en/stable/open_source/integrate_t
everything you can do with test suites.

### Step 2: Wrap your model

Before evaluating your model with a test suite, you must wrap it as a `giskard.Model`. This step is necessary to ensure a common format for your model and its metadata. You can wrap anything as long as you can represent it in a Python function (for example an API call to Azure, OpenAI, Mistral, Ollama, etc.). We also have pre-built wrappers for LangChain objects, or you can create your own wrapper by extending the `giskard.Model` class if you need to wrap a complex object such as a custom-made RAG communicating with a vectorstore.
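
As a rough sketch of that wrapping step (the prediction function, name and description below are illustrative, `get_answer_from_agent` is the hypothetical agent call used earlier on this page, and you should check the linked instructions for the exact `giskard.Model` arguments in your version):

```python
import giskard
import pandas as pd

def batch_predict(df: pd.DataFrame) -> list:
    """Run the RAG agent on every question of a dataset and return the answers."""
    return [
        get_answer_from_agent([{"role": "user", "content": question}]).text
        for question in df["question"]
    ]

giskard_model = giskard.Model(
    model=batch_predict,                # any callable mapping a DataFrame to predictions
    model_type="text_generation",
    name="My RAG agent",                # illustrative metadata
    description="Agent answering questions about the internal knowledge base",
    feature_names=["question"],         # assumed to match the test suite's question column
)
```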

To do so, you can follow the instructions from
@@ -158,13 +185,15 @@ test_suite.run(model=giskard_model)
### Step 3: Upload your test suite to the Giskard Hub

Uploading a test suite to the hub allows you to:

- Compare the quality of different models and prompts to decide which one to promote
- Create more tests relevant to your use case, combining input prompts that make your model fail and custom evaluation criteria
- Share results, and collaborate with your team to integrate business feedback

To upload your test suite, you must have created a project on Giskard Hub and instantiated a Giskard Python client.
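
For instance, a minimal sketch of that setup (the URL, API key and project identifiers are placeholders, and the exact `GiskardClient` parameter names are an assumption based on recent giskard releases):

```python
from giskard import GiskardClient

giskard_client = GiskardClient(
    url="https://your-giskard-hub.example.com",  # placeholder Hub URL
    key="YOUR_API_KEY",                          # placeholder API key
)

# Create (or reuse) the project that will hold the test suite and the model.
project = giskard_client.create_project("my_rag_project", "My RAG project", "RAGET evaluation of my RAG agent")
```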

Then, upload your test suite and model like this:

```python
test_suite.upload(giskard_client, project_id) # project_id should be the id of the Giskard project in which you want to upload your suite
giskard_model.upload(giskard_client, project_id)
@@ -179,7 +208,6 @@ giskard_model.upload(giskard_client, project_id)

[Here's a demo](https://huggingface.co/spaces/giskardai/giskard) of the Giskard Hub in action.



## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.
2 changes: 2 additions & 0 deletions giskard/rag/__init__.py
@@ -1,3 +1,4 @@
from .base import AgentAnswer
from .evaluate import evaluate
from .knowledge_base import KnowledgeBase
from .report import RAGReport
@@ -11,4 +12,5 @@
"KnowledgeBase",
"evaluate",
"RAGReport",
"AgentAnswer",
]
9 changes: 9 additions & 0 deletions giskard/rag/base.py
@@ -0,0 +1,9 @@
from typing import Optional, Sequence

from dataclasses import dataclass


@dataclass
class AgentAnswer:
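    """Answer returned by the evaluated agent: the message shown to the user and, optionally,
    the retrieved documents that context-based metrics (e.g. RAGAS) can use."""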
    message: str
    documents: Optional[Sequence[str]] = None
44 changes: 31 additions & 13 deletions giskard/rag/evaluate.py
@@ -6,8 +6,9 @@

from ..llm.client import LLMClient, get_default_client
from ..utils.analytics_collector import analytics
from .base import AgentAnswer
from .knowledge_base import KnowledgeBase
from .metrics import CorrectnessMetric, Metric
from .metrics import CorrectnessMetric
from .question_generators.utils import maybe_tqdm
from .recommendation import get_rag_recommendation
from .report import RAGReport
@@ -20,7 +21,7 @@


def evaluate(
answer_fn: Union[Callable, Sequence[str]],
answer_fn: Union[Callable, Sequence[Union[AgentAnswer, str]]],
testset: Optional[QATestset] = None,
knowledge_base: Optional[KnowledgeBase] = None,
llm_client: Optional[LLMClient] = None,
@@ -31,7 +32,7 @@ def evaluate(

Parameters
----------
answers_fn : Union[Callable, Sequence[str]]
answer_fn : Union[Callable, Sequence[Union[AgentAnswer, str]]]
The prediction function of the agent to evaluate or a list of precalculated answers on the testset.
testset : QATestset, optional
The test set to evaluate the agent on. If not provided, a knowledge base must be provided and a default testset will be created from the knowledge base.
@@ -72,7 +73,11 @@
if testset is None:
testset = generate_testset(knowledge_base)

answers = answer_fn if isinstance(answer_fn, Sequence) else _compute_answers(answer_fn, testset)
model_outputs = (
[_cast_to_agent_answer(ans) for ans in answer_fn]
if isinstance(answer_fn, Sequence)
else _compute_answers(answer_fn, testset)
)

llm_client = llm_client or get_default_client()

@@ -87,18 +92,19 @@
metrics_results = defaultdict(dict)

for metric in metrics:
metric_name = getattr(
metric, "name", metric.__class__.__name__ if isinstance(metric, Metric) else metric.__name__
)
try:
metric_name = metric.__class__.__name__
except AttributeError:
metric_name = metric.__name__

for sample, answer in maybe_tqdm(
zip(testset.to_pandas().to_records(index=True), answers),
zip(testset.to_pandas().to_records(index=True), model_outputs),
desc=f"{metric_name} evaluation",
total=len(answers),
total=len(model_outputs),
):
metrics_results[sample["id"]].update(metric(sample, answer))

report = RAGReport(testset, answers, metrics_results, knowledge_base)
report = RAGReport(testset, model_outputs, metrics_results, knowledge_base)
recommendation = get_rag_recommendation(
report.topics,
report.correctness_by_question_type().to_dict()[metrics[0].name],
@@ -121,7 +127,7 @@


def _compute_answers(answer_fn, testset):
answers = []
model_outputs = []
needs_history = (
len(signature(answer_fn).parameters) > 1 and ANSWER_FN_HISTORY_PARAM in signature(answer_fn).parameters
)
@@ -132,5 +138,17 @@
if needs_history:
kwargs[ANSWER_FN_HISTORY_PARAM] = sample.conversation_history

answers.append(answer_fn(sample.question, **kwargs))
return answers
answer = answer_fn(sample.question, **kwargs)
model_outputs.append(_cast_to_agent_answer(answer))

return model_outputs


def _cast_to_agent_answer(answer) -> AgentAnswer:
if isinstance(answer, AgentAnswer):
return answer

if isinstance(answer, str):
return AgentAnswer(message=answer)

raise ValueError(f"The answer function must return a string or an AgentAnswer object. Got {type(answer)} instead.")
3 changes: 2 additions & 1 deletion giskard/rag/metrics/__init__.py
@@ -1,4 +1,5 @@
from ..base import AgentAnswer
from .base import Metric
from .correctness import CorrectnessMetric, correctness_metric

__all__ = ["Metric", "correctness_metric", "CorrectnessMetric"]
__all__ = ["Metric", "correctness_metric", "CorrectnessMetric", "AgentAnswer"]
8 changes: 5 additions & 3 deletions giskard/rag/metrics/base.py
@@ -1,5 +1,7 @@
from abc import ABC, abstractmethod

from giskard.rag.base import AgentAnswer

from ...llm.client.base import LLMClient


@@ -14,20 +16,20 @@ def __init__(self, name: str, llm_client: LLMClient = None) -> None:
self._llm_client = llm_client

@abstractmethod
def __call__(self, question_sample: dict, answer: str):
def __call__(self, question_sample: dict, answer: AgentAnswer):
"""
Compute the metric on a single question and its associated answer.

Parameters
----------
question_sample : dict
A question sample from a QATestset.
answer : Sequence[str]
answer : AgentAnswer
The agent answer on that question.

Returns
-------
dict
The result of the metric. The keys should be the names of the metrics computed.
The result of the metric computation. The keys should be the names of the metrics computed.
"""
pass
20 changes: 18 additions & 2 deletions giskard/rag/metrics/correctness.py
Expand Up @@ -2,6 +2,7 @@

from ...llm.client import ChatMessage, LLMClient, get_default_client
from ...llm.errors import LLMGenerationError
from ..base import AgentAnswer
from ..question_generators.utils import parse_json_output
from .base import Metric

@@ -55,7 +56,22 @@ def __init__(self, name: str, llm_client: LLMClient = None, agent_description: O
self._llm_client = llm_client
self.agent_description = agent_description or "This agent is a chatbot that answers question from users."

def __call__(self, question_sample: dict, answer: str) -> dict:
def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
"""
Compute the correctness between the agent answer and the reference answer from QATestset.

Parameters
----------
question_sample : dict
A question sample from a QATestset.
answer : AgentAnswer
The answer of the agent on the question.

Returns
-------
dict
The result of the correctness evaluation. It contains the keys 'correctness' and 'correctness_reason'.
"""
llm_client = self._llm_client or get_default_client()
try:
out = llm_client.complete(
@@ -72,7 +88,7 @@ def __call__(self, question_sample: dict, answer: str) -> dict:
role="user",
content=CORRECTNESS_INPUT_TEMPLATE.format(
question=question_sample.question,
agent_answer=answer,
agent_answer=answer.message,
ground_truth=question_sample.reference_answer,
),
),