
feat(giskard-checks): add Correctness LLM judge check #2350

Open

kevinmessiaen wants to merge 3 commits into main from feature/eng-1496-giskard-checks-correctness-check

Conversation

@kevinmessiaen (Member)

Adds a Correctness LLM judge check that validates the agent's answer against a reference (ground-truth) answer, with a Jinja prompt template, public exports, and unit tests. Also syncs uv.lock for editable package versions.

Made-with: Cursor
@linear linear Bot commented Mar 31, 2026

@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request introduces a new Correctness check that uses an LLM to validate agent answers against reference ground-truth answers. The implementation includes the Correctness class, a corresponding Jinja2 prompt template, and unit tests. Feedback was provided regarding the handling of missing values in get_inputs, specifically to avoid passing the literal string "None" to the LLM and to ensure a reference answer is present before execution.

Comment on lines +116 to +139
return {
"description": str(
provided_or_resolve(
trace,
key=self.description_key,
value=provide_not_none(self.description),
)
),
"conversation": trace,
"answer": str(
provided_or_resolve(
trace,
key=self.answer_key,
value=provide_not_none(self.answer),
)
),
"reference_answer": str(
provided_or_resolve(
trace,
key=self.reference_answer_key,
value=provide_not_none(self.reference_answer),
)
),
}
Contributor

medium

The current implementation of get_inputs wraps the results of provided_or_resolve in str(). If a value (such as description or reference_answer) is missing from both the check instance and the trace, provided_or_resolve returns None, and str(None) converts it to the literal string "None". This string is then passed to the LLM prompt, which can lead to confusing or incorrect evaluations (e.g., the LLM comparing the agent's answer against the word "None" instead of a real reference answer).

Additionally, since Correctness is a reference-based check, it is highly recommended to validate that a reference_answer has been successfully resolved before proceeding, as the check cannot function correctly without it.

        description = provided_or_resolve(
            trace,
            key=self.description_key,
            value=provide_not_none(self.description),
        )
        answer = provided_or_resolve(
            trace,
            key=self.answer_key,
            value=provide_not_none(self.answer),
        )
        reference_answer = provided_or_resolve(
            trace,
            key=self.reference_answer_key,
            value=provide_not_none(self.reference_answer),
        )

        if reference_answer is None:
            raise ValueError(
                "Correctness check failed: No reference answer provided or found in trace. "
                "Please provide 'reference_answer' or ensure it exists in the trace metadata."
            )

        return {
            "description": str(description) if description is not None else "",
            "conversation": trace,
            "answer": str(answer) if answer is not None else "",
            "reference_answer": str(reference_answer),
        }
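
If the suggestion is adopted, the new failure mode could be pinned down with a small unit test. A hedged sketch (the empty_trace fixture and the no-argument constructor are assumptions; only the error message comes from the suggestion above):

# Illustrative test sketch; assumes Correctness is importable from the package's
# public exports and that all judge inputs default to None when not provided.
import pytest

def test_get_inputs_requires_reference_answer(empty_trace):
    # ``empty_trace`` is a hypothetical fixture: a trace carrying no reference
    # answer in its annotations.
    check = Correctness()
    with pytest.raises(ValueError, match="No reference answer"):
        check.get_inputs(empty_trace)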

@davidberenstein1957 davidberenstein1957 (Member) left a comment

some minor remarks

"""LLM-based check that validates an answer against a reference (ground-truth).

Uses an LLM to decide whether the agent's answer is correct relative to a
reference answer. The judge prompt receives the same ``Trace`` instance that is
Member

is this a reference answer or a reference context?

Comment on lines +24 to +30
**How the trace appears in the prompt**

Template rendering (via the agents Jinja environment) formats values for the LLM.
If the trace type implements ``_repr_prompt_()`` (see
``giskard.agents.templates.LLMFormattable``), that method supplies the
conversation text. Otherwise the trace
is serialized as indented JSON from ``model_dump()`` (Pydantic).
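
To make the two rendering paths concrete, here is a minimal sketch. ToyTrace is illustrative rather than the library's Trace type; only the _repr_prompt_ / model_dump() contract comes from the docstring above:

from pydantic import BaseModel

class ToyTrace(BaseModel):
    # Illustrative fields; the real Trace type is generic over its input/output types.
    question: str
    answer: str

    def _repr_prompt_(self) -> str:
        # Preferred path: the trace itself decides how the conversation is
        # presented to the judge LLM.
        return f"User: {self.question}\nAgent: {self.answer}"

# Without _repr_prompt_, the template falls back to rendering
# ToyTrace(...).model_dump() as indented JSON.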
Member

do we add this everywhere?

description_key : str
JSONPath expression to extract the description from the trace
(default: ``trace.annotations.description``).

Member

Suggested change


## Markers

Markers <AGENT DESCRIPTION>...</AGENT DESCRIPTION>, <CONVERSATION>...</CONVERSATION>, <AGENT ANSWER>...</AGENT ANSWER>, and <REFERENCE ANSWER>...</REFERENCE ANSWER> indicate where each input is. Everything inside a marker belongs to that category.
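
For context, a rough illustration of how a rendered prompt section might look with these markers. The wording is invented; only the marker names come from the documentation above, and the actual text is produced by the Jinja template added in this PR:

# Invented example values; marker names taken from the prompt documentation above.
EXAMPLE_SECTION = """\
<AGENT DESCRIPTION>
A support agent for the ACME store.
</AGENT DESCRIPTION>
<CONVERSATION>
User: Can I return my order?
Agent: Yes, within 60 days.
</CONVERSATION>
<AGENT ANSWER>
Yes, within 60 days.
</AGENT ANSWER>
<REFERENCE ANSWER>
Orders can be returned within 30 days of delivery.
</REFERENCE ANSWER>
"""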
Member

Is there a reason we don't add a _ between the words of markers that consist of multiple words, like AGENT_DESCRIPTION?

from pydantic import Field


class LLMTrace(Trace[str, str], frozen=True):
Member

can we create a shared version of this, perhaps?

Member Author

Yes, it's planned.

It's not yet defined what the input and output types should be (str, OpenAI format, ...) or how we want to represent tool calls, thinking, ...

)


class MockGenerator(BaseGenerator):
Member

also this could perhaps be shared?

@davidberenstein1957 davidberenstein1957 enabled auto-merge (squash) May 13, 2026 09:15
@davidberenstein1957 davidberenstein1957 self-requested a review May 13, 2026 09:15