Support for custom LLM clients #1839
Conversation
The idea is to remove function_call and tool_calls and instead ask the model to return JSON-formatted answers?
I guess this is necessary to allow any model to work, but how does it impact the reliability of the current scan?
Also, I know it might increase complexity, but couldn't we have something like this instead, so we keep the feature (not necessary if the switch won't hurt scan reliability too much):
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Tool:
    tool_spec: Dict
    examples: Optional[List[Dict]] = None

    @property
    def to_format_instruction(self) -> str:
        pass

    def use_tool(self, chat_response) -> bool:
        pass

    def parse_response(self, chat_response) -> List[Dict]:  # List because the tool might be called more than once
        pass
class LLMClient(ABC):
    @abstractmethod
    def complete(self, messages, tools: List[Tool] = [], tool_choice: Optional[Tool] = None):
        ...


class MistralClient(LLMClient):
    def complete(self, messages, tools: List[Tool] = [], tool_choice: Optional[Tool] = None):
        # This logic should be moved into an adapter so we can reuse it for other LLMClients easily (and have cleaner code)
        if len(tools) > 0:
            tools_instruction = "\n".join(tool.to_format_instruction for tool in tools)
            if tool_choice is not None:
                tools_instruction += f"\nYou must call {tool_choice.tool_spec['name']}"
            # TODO: inject '{tools_instruction}' into `messages`
        ...

# OpenAIClient would use the native tool_calls API (if the model supports it)

I agree that we should keep the LLMClient as simple as possible. But I do believe the client should be the one responsible for serialising the message to the API format, rather than the feature having to be adapted to work for all LLMs (ideally we should have something easy to understand in the scan/evaluator/..., and the client should handle the translation based on the LLM's features/needs). Feel free to disregard this comment if it's deemed not worth implementing.
@kevinmessiaen please check the test details
Yes, basically to be as flexible as possible in what we can support. By now, most models support some kind of JSON-formatted output (through …)
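For context, a minimal sketch of what the JSON-output approach looks like in practice. This assumes a client exposing the complete() and ChatMessage interface discussed in this thread; the format parameter mirrors the one visible in the diffs below, and the helper name is purely illustrative.

import json
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ChatMessage:
    role: str
    content: str


def complete_as_json(client, messages: Sequence[ChatMessage]) -> dict:
    # Ask the model for a JSON object instead of relying on provider-specific
    # tool calls, then parse it; any model able to follow formatting
    # instructions can be used this way.
    out = client.complete(messages=messages, format="json")
    try:
        return json.loads(out.content)
    except json.JSONDecodeError as err:
        raise ValueError("The model did not return valid JSON") from err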
Yeah, it would be nice to abstract tools, but then we would end up reimplementing langchain. I think it's wise to keep our code to a minimum for what is not strictly required. I realized that my first implementation with function calls was over-engineered. I feel it's worth waiting for now; maybe at some point there will be some lightweight open-source library replacing langchain ;)
if seed is not None:
    extra_params["random_seed"] = seed

if format not in (None, "json", "json_object") and "large" not in self.model:
format could be of Literal type
also pydantic's validate_call could do the validation job
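For illustration, that suggestion could look roughly like this. It is a sketch only: the helper name _make_completion_params is hypothetical, and the mapping of format to the provider's request parameters is assumed.

from typing import Literal, Optional

from pydantic import validate_call


@validate_call
def _make_completion_params(
    format: Optional[Literal["json", "json_object"]] = None,
    seed: Optional[int] = None,
) -> dict:
    # validate_call rejects unsupported `format` values with a ValidationError,
    # which replaces the manual membership check above.
    extra_params: dict = {}
    if seed is not None:
        extra_params["random_seed"] = seed
    if format is not None:
        extra_params["response_format"] = {"type": "json_object"}  # assumed provider parameter
    return extra_params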
giskard/llm/embeddings/__init__.py
Outdated
    return _default_embedding

# Try with OpenAI/AzureOpenAI if available
try:
nitpick: the two try blocks could be extracted into dedicated methods
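One way to implement this, shown here as a single generic helper rather than two dedicated methods (a sketch, not the actual giskard code): each former try block becomes a factory callable that raises ImportError when its backend is unavailable.

from typing import Callable, Optional, Sequence


def _first_available_embedding(factories: Sequence[Callable[[], object]]) -> Optional[object]:
    # Try each embedding backend factory in order and return the first one
    # that can actually be constructed.
    for factory in factories:
        try:
            return factory()
        except ImportError:
            continue
    return None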
giskard/llm/embeddings/base.py
Outdated
...


def batched(iterable, batch_size):
should be moved to utils.py
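For reference, a typical implementation of such a helper (itertools.batched only exists from Python 3.12, so an islice-based version is the usual fallback; this is a sketch, not necessarily the exact code in the PR):

from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], batch_size: int) -> Iterator[Tuple[T, ...]]:
    # Yield successive tuples of at most `batch_size` items from `iterable`.
    if batch_size < 1:
        raise ValueError("batch_size must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, batch_size)):
        yield batch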
giskard/llm/evaluators/base.py
Outdated
success_examples: Sequence[dict]
errors: Sequence[dict]
details: Optional[TestResultDetails] = None
results: Sequence[EvaluationResultExample] = field(default_factory=list)
since we're calling .append below this should rather be a List, not a Sequence
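In other words, roughly the following (a minimal sketch: only the fields from the snippet above are shown, and the enclosing dataclass name is assumed for illustration):

from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class EvaluationResult:  # name assumed for illustration
    success_examples: Sequence[dict]
    errors: Sequence[dict]
    details: Optional["TestResultDetails"] = None
    # Sequence has no append(); a mutable List matches how the field is used.
    results: List["EvaluationResultExample"] = field(default_factory=list)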
giskard/llm/evaluators/base.py
Outdated
        return result

    def _evaluate_sample(self, model: BaseModel, sample: Dict) -> Tuple[bool, str, Dict]:
return type should be Tuple[bool, str]
giskard/llm/evaluators/coherency.py
Outdated
        if len(out.tool_calls) != 1 or "passed_test" not in out.tool_calls[0].function.arguments:
            raise LLMGenerationError("Invalid function call arguments received")

    def _format_messages(self, model: BaseModel, sample: Dict) -> Sequence[ChatMessage]:
missing meta argument
giskard/llm/evaluators/coherency.py
Outdated
outputs_2 = model.predict(dataset_2).prediction

inputs_1 = dataset_1.df.to_dict("records")
inputs_2 = dataset_2.df.loc[dataset_1.df.index].to_dict("records")
dataset_2 can be None
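i.e. the parameter is optional, so this access needs a guard, for example (sketch only; raising may not be the intended behaviour if a None dataset_2 is meant to be a supported case):

# Hypothetical guard: fail with a clear message rather than an AttributeError on dataset_2.df.
if dataset_2 is None:
    raise ValueError("The coherency evaluation requires a second (perturbed) dataset")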
    debug_description=debug_description_prefix + "that are <b>failing the evaluation criteria</b>.",
)
def test_llm_as_a_judge_ground_truth_similarity(
    model: BaseModel, dataset: Dataset, prefix: str = "The requirement should be similar to: ", rng_seed: int = 1729
prefix isn't used
# Conflicts:
#   pdm.lock
#   tests/utils.py
Generally looks good to me except for minor comments, failing tests and a bunch of typing issues (we should add mypy or something similar)
I tested the new test_llm_ground_truth with the Hub, it worked fine
Should be good now. I'm just not sure about the first feedback regarding …



Support for custom LLM clients (Mistral, OpenAI, Bedrock, etc.) in Giskard:
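Based on the interfaces discussed in this thread (LLMClient, ChatMessage, and a complete() method taking seed and format parameters), a custom client would look roughly like the sketch below. Treat the import path and the set_default_client registration helper as assumptions to check against the released API; my_backend_generate is a placeholder for whatever model you actually wrap.

from typing import Optional, Sequence

import giskard
from giskard.llm.client.base import ChatMessage, LLMClient  # assumed import path


def my_backend_generate(prompt_messages, temperature: float = 1.0, seed: Optional[int] = None) -> str:
    # Placeholder for your actual model call (local model, REST API, Bedrock, etc.).
    return "dummy answer"


class MyLLMClient(LLMClient):
    def complete(
        self,
        messages: Sequence[ChatMessage],
        temperature: float = 1.0,
        max_tokens: Optional[int] = None,
        seed: Optional[int] = None,
        format: Optional[str] = None,
    ) -> ChatMessage:
        # Call your own backend and wrap the answer in a ChatMessage.
        answer = my_backend_generate(
            [{"role": m.role, "content": m.content} for m in messages],
            temperature=temperature,
            seed=seed,
        )
        return ChatMessage(role="assistant", content=answer)


giskard.llm.set_default_client(MyLLMClient())  # assumed registration helper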