2 changes: 1 addition & 1 deletion .github/workflows/create-release.yml
@@ -52,7 +52,7 @@ jobs:

- name: Adding file
run: |
- git add pyproject.toml
+ git add pyproject.toml uv.lock
git fetch --quiet --tags
git commit -m "v${{ inputs.version }}" --allow-empty
git tag v${{ inputs.version }}
31 changes: 13 additions & 18 deletions README.md
@@ -69,7 +69,7 @@ Either use `hub.projects.list()` to get a list of all projects, or use

### Import a dataset

- Let's now create a dataset and add a conversation example.
+ Let's now create a dataset and add a chat test case example.

```python
# Let's create a dataset
@@ -80,12 +80,12 @@ dataset = hub.datasets.create(
)
```

- We can now add a conversation example to the dataset. This will be used
+ We can now add a chat test case example to the dataset. This will be used
for the model evaluation.

```python
- # Add a conversation example
- hub.conversations.create(
+ # Add a chat test case example
+ hub.chat_test_cases.create(
dataset_id=dataset.id,
messages=[
dict(role="user", content="What is the capital of France?"),
@@ -107,21 +107,21 @@ hub.conversations.create(
)
```

- These are the attributes you can set for a conversation (the only
+ These are the attributes you can set for a chat test case (the only
required attribute is `messages`):

- - `messages`: A list of messages in the conversation. Each message is a dictionary with the following keys:
+ - `messages`: A list of messages in the chat. Each message is a dictionary with the following keys:

- `role`: The role of the message, either "user" or "assistant".
- `content`: The content of the message.

- `demo_output`: A demonstration of a (possibly wrong) output from the
model with an optional metadata. This is just for demonstration purposes.

- - `checks`: A list of checks that the conversation should pass. This is used for evaluation. Each check is a dictionary with the following keys:
+ - `checks`: A list of checks that the chat test case should pass. This is used for evaluation. Each check is a dictionary with the following keys:
- `identifier`: The identifier of the check. If it's a built-in check, you will also need to provide the `params` dictionary. The built-in checks are:
- `correctness`: The output of the model should match the reference.
- - `conformity`: The conversation should follow a set of rules.
+ - `conformity`: The chat should follow a set of rules.
- `groundedness`: The output of the model should be grounded in the conversation.
- `string_match`: The output of the model should contain a specific string (keyword or sentence).
- `metadata`: The metadata output of the model should match a list of JSON path rules.
@@ -137,15 +137,13 @@ required attribute is `messages`):
- `expected_value_type`: The expected type of the value at the JSON path, one of `string`, `number`, `boolean`.
- For the `semantic_similarity` check, the parameters are `reference` (type: `str`) and `threshold` (type: `float`), where `reference` is the expected output and `threshold` is the similarity score below which the check will fail.

- You can add as many conversations as you want to the dataset.
+ You can add as many chat test cases as you want to the dataset.
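
For instance, a chat test case that combines the checks documented above could look like the following sketch. The `semantic_similarity` parameters (`reference`, `threshold`) are the ones listed above; the `conformity` parameter name `rules` is an assumption, since its exact schema isn't shown here.

```python
# Sketch of a chat test case with two checks (the `conformity` parameter
# name `rules` is an assumption, not taken from the README).
hub.chat_test_cases.create(
    dataset_id=dataset.id,
    messages=[
        dict(role="user", content="What is the capital of France?"),
    ],
    checks=[
        # Documented above: `reference` is the expected output, `threshold` is
        # the similarity score below which the check fails.
        dict(
            identifier="semantic_similarity",
            params=dict(reference="The capital of France is Paris.", threshold=0.8),
        ),
        # Assumed parameter name: the docs only say conformity checks a set of rules.
        dict(
            identifier="conformity",
            params=dict(rules=["The agent must answer politely."]),
        ),
    ],
)
```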

Again, you'll find your newly created dataset in the Hub UI.

### Configure a model/agent

- Before running our first evaluation, we'll need to set up a model.
- You'll need an API endpoint ready to serve the model. Then, you can
- configure the model API in the Hub:
+ Before running our first evaluation, we'll need to set up a model. You'll need an API endpoint ready to serve the model. Then, you can configure the model API in the Hub:

```python
model = hub.models.create(
@@ -159,8 +157,7 @@ model = hub.models.create(
)
```

- We can test that everything is working well by running a chat with the
- model:
+ We can test that everything is working well by running a chat with the model:

```python
response = model.chat(
@@ -198,8 +195,7 @@ eval_run = client.evaluate(
)
```

- The evaluation will run asynchronously on the Hub. To retrieve the
- results once the run is complete, you can use the following:
+ The evaluation will run asynchronously on the Hub. To retrieve the results once the run is complete, you can use the following:

```python

@@ -213,5 +209,4 @@ eval_run.print_metrics()
**Tip**

You can directly pass IDs to the evaluate function, e.g.
- `model=model_id` and `dataset=dataset_id`, without having to retrieve
- the objects first.
+ `model=model_id` and `dataset=dataset_id`, without having to retrieve the objects first.
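
A minimal sketch of that shortcut, assuming `client` is the Hub client used in the evaluation snippet above and that `model_id` and `dataset_id` refer to existing objects:

```python
# Pass IDs directly instead of the retrieved objects
eval_run = client.evaluate(
    model=model_id,      # ID of an existing model
    dataset=dataset_id,  # ID of an existing dataset
)

# Once the run is complete, inspect the results as shown above
eval_run.print_metrics()
```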
23 changes: 0 additions & 23 deletions examples/example.sh

This file was deleted.

53 changes: 0 additions & 53 deletions examples/example_python.py

This file was deleted.

14 changes: 7 additions & 7 deletions script-docs/hub/sdk/checks.rst
@@ -11,7 +11,7 @@ The Giskard Hub provides a set of built-in checks that cover common use cases, s

* **Correctness**: Verifies if the agent's response matches the expected output (reference answer).
* **Conformity**: Ensures the agent's response adheres to the rules, such as "The agent must be polite."
- * **Groundedness**: Ensures the agent's response is grounded in the conversation.
+ * **Groundedness**: Ensures the agent's response is grounded in a specific context.
* **String matching**: Checks if the agent's response contains a specific string, keyword, or sentence.
* **Metadata**: Verifies the presence of specific (tool calls, user information, etc.) metadata in the agent's response.
* **Semantic Similarity**: Verifies that the agent's response is semantically similar to the expected output.
@@ -46,7 +46,7 @@ Custom checks are reusable evaluation criteria that you can define for your proj

Custom checks can be used in the following ways:

- - Applied to conversations in your datasets
+ - Applied to chat test cases (conversations) in your datasets
- Used during agent evaluations
- Shared across your team **within the same project**
- Modified or updated as your requirements evolve
@@ -243,7 +243,7 @@ You can delete a check using the ``hub.checks.delete()`` method. Here's a basic

.. warning::

- Deleting a check is permanent and cannot be undone. Make sure you're not using the check in any active conversations or evaluations before deleting it.
+ Deleting a check is permanent and cannot be undone. Make sure you're not using the check in any active chat test cases or evaluations before deleting it.

List checks
___________
@@ -263,15 +263,15 @@ You can list all checks for a project using the ``hub.checks.list()`` method. He

.. _add-checks-to-conversations:

- Add checks to conversations
+ Add checks to chat test cases
---------------------------

- Once you've created a check, you can use it in your conversations by referencing its identifier:
+ Once you've created a check, you can use it in your chat test cases by referencing its identifier:

.. code-block:: python

- # Add a conversation that uses your check
- hub.conversations.create(
+ # Add a chat test case that uses your check
+ hub.chat_test_cases.create(
dataset_id=dataset.id,
messages=[
{"role": "user", "content": "What's the formula for compound interest?"},
8 changes: 4 additions & 4 deletions script-docs/hub/sdk/datasets/business.rst
@@ -5,7 +5,7 @@
Detect business failures by generating synthetic tests
======================================================

- Generative AI agents can face an endless variety of real-world scenarios, making it impossible to manually enumerate all possible test cases. Automated, synthetic test case generation is therefore essential—especially when you lack real user conversations to import as tests. However, a major challenge is to ensure that these synthetic cases are tailored to your business context, rather than being overly generic.
+ Generative AI agents can face an endless variety of real-world scenarios, making it impossible to enumerate them all manually. Automated, synthetic test case generation is therefore essential—especially when you lack real user chats to import as tests. However, a major challenge is to ensure that these synthetic cases are tailored to your business context, rather than being overly generic.

By generating domain-specific synthetic tests, you can proactively identify and address these types of failures before they impact your users or business operations.

@@ -31,9 +31,9 @@ Before generating test cases, you need to `create a knowledge base </hub/sdk/pro
# Wait for the dataset to be created
business_dataset.wait_for_completion()

- # List the conversations in the dataset
- for conversation in business_dataset.conversations:
-     print(conversation.messages[0].content)
+ # List the chat test cases in the dataset
+ for chat_test_case in business_dataset.chat_test_cases:
+     print(chat_test_case.messages[0].content)

.. note::

18 changes: 9 additions & 9 deletions script-docs/hub/sdk/datasets/import.rst
@@ -1,5 +1,5 @@
:og:title: Giskard Hub - Enterprise Agent Testing - Import Datasets
- :og:description: Import existing test data programmatically into Giskard Hub. Support conversations, CSV files, and other formats through our Python SDK.
+ :og:description: Import your existing test data into Giskard Hub. Bring chat test cases, CSV files, and other data formats to build comprehensive test datasets.

=============================
Import existing datasets
@@ -20,7 +20,7 @@ Let's start by initializing the Hub client or take a look at the :doc:`/hub/sdk/

hub = HubClient()

- You can now use the ``hub.datasets`` and ``hub.conversations`` clients to import datasets and conversations!
+ You can now use the ``hub.datasets`` and ``hub.chat_test_cases`` clients to import datasets and chat test cases!

Create a dataset
________________
@@ -32,20 +32,20 @@ As we have seen in the :doc:`/hub/sdk/datasets/index` section, we can create a d
dataset = hub.datasets.create(
project_id="<PROJECT_ID>",
name="Production Data",
description="This dataset contains conversations that " \
description="This dataset contains chats that " \
"are automatically sampled from the production environment.",
)

- After having created the dataset, we can import conversations into it.
+ After having created the dataset, we can import chat test cases (conversations) into it.

- Import conversations
+ Import chat test cases
____________________

- We can import conversations into the dataset using the ``hub.conversations.create()`` method.
+ We can import chat test cases into the dataset using the ``hub.chat_test_cases.create()`` method.

.. code-block:: python

- hub.conversations.create(
+ hub.chat_test_cases.create(
dataset_id=dataset.id,

# A list of messages, without the last assistant answer
@@ -98,7 +98,7 @@ We can then format the testset to the correct format and create the dataset usin
dataset = hub.datasets.create(
project_id="<PROJECT_ID>",
name="RAGET Dataset",
description="This dataset contains conversations that are used to evaluate the RAGET model.",
description="This dataset contains chats that are used to evaluate the RAGET model.",
)

for sample in testset.samples:
@@ -155,7 +155,7 @@ We can then format the testset to the correct format and create the dataset usin
}
)

- hub.conversations.create(
+ hub.chat_test_cases.create(
dataset_id=dataset.id,
messages=messages,
checks=checks,