Fix and refactor final answer checks #1448

aymeric-roucher · 2025-06-17T16:23:52Z

This PR fixes #1440 and refactors the final answer logic to address underlying issues such as "agent cannot return None in final_answer".
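To illustrate the kind of underlying issue mentioned above, here is a minimal sketch (not the smolagents implementation) of why a loop that treats `None` as "no final answer yet" cannot accept `None` as a genuine answer. `StepResult`, `run_with_sentinel`, and `run_with_flag` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StepResult:
    """Hypothetical step result; not a smolagents class."""
    output: Any
    is_final_answer: bool


def run_with_sentinel(step: Callable[[int], Any], max_steps: int) -> tuple[Any, int]:
    """Stop as soon as a step returns something other than None."""
    for n in range(1, max_steps + 1):
        output = step(n)
        # A real answer of None is indistinguishable from "not done yet".
        if output is not None:
            return output, n
    return None, max_steps


def run_with_flag(step: Callable[[int], StepResult], max_steps: int) -> tuple[Any, int]:
    """Stop as soon as a step marks its result as final."""
    for n in range(1, max_steps + 1):
        result = step(n)
        if result.is_final_answer:
            return result.output, n
    return None, max_steps


# A toy agent whose correct final answer, on its first step, is None.
sentinel_result = run_with_sentinel(lambda n: None, max_steps=5)
flag_result = run_with_flag(lambda n: StepResult(None, True), max_steps=5)
print(sentinel_result)  # (None, 5): the loop never recognises the answer
print(flag_result)      # (None, 1): the flag ends the run on the first step
```

With an explicit flag, the payload and the "done" signal are independent, so `None` stops being a special value.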

aymeric-roucher · 2025-06-17T16:52:01Z

src/smolagents/agents.py

            yield action_step
        yield FinalAnswerStep(handle_agent_output_types(final_answer))

-    def _execute_step(self, memory_step: ActionStep) -> Generator[ChatMessageStreamDelta | FinalOutput]:


I think porting this logic into a separate function obfuscates the logic more than it clarifies.

aymeric-roucher · 2025-06-17T16:53:50Z

src/smolagents/agents.py

        tools: list[Tool],
        model: Model,
        prompt_templates: PromptTemplates | None = None,
+        instructions: str | None = None,


This change comes from #1442 and will no longer appear in this diff once #1442 is merged.

Copilot

Pull Request Overview

This PR refactors the final answer handling in agents and updates related tests to use PIL.Image instances and the renamed step_number field.

Refactored ActionStep.dict() to serialize images, add is_final_answer, and replace "step" with "step_number".
Replaced the old FinalOutput type with ActionOutput and updated streaming logic in agents.py.
Updated tests to use Image.new(...), renamed keys in dict assertions, and added scenarios for final answer checks.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_memory.py | Switched from string paths to `PIL.Image` objects, renamed `"step"` to `"step_number"`, added checks for `observations_images`. |
| tests/test_agents.py | Added a test case for `final_answer_checks`, updated `max_steps` usage, and renamed planning step variables. |
| src/smolagents/memory.py | Updated `ActionStep.dict()` to output `step_number`, serialize `observations_images`, and include `is_final_answer`. |
| src/smolagents/agents.py | Removed `FinalOutput`, introduced `ActionOutput`, and refactored the streaming loop to validate and flag final answers. |

Comments suppressed due to low confidence (2)

tests/test_memory.py:101

Consider adding an assertion to verify that action_step_dict["observations_images"] contains the expected byte data (e.g., matching Image.new(...).tobytes()).

    assert "observations_images" in action_step_dict

tests/test_memory.py:203

Add assertions in test_task_step_to_messages to confirm that the generated messages include an image entry (e.g., check for a content dict with "type": "image").

    task_step = TaskStep(task="This is a task.", task_images=[Image.new("RGB", (100, 100))])

Copilot · 2025-06-18T19:07:20Z

src/smolagents/agents.py

 @dataclass
-class FinalOutput:
-    output: Any | None
+class ActionOutput:


The ActionOutput class is missing the @dataclass decorator, so instances won’t accept constructor arguments. Add @dataclass above this class definition.

No it isn't, go home copilot you're drunk

albertvillanova

Thanks for addressing these issues.

I'm not sure about the need to replace FinalOutput with ActionOutput, or about the need to add is_final_answer...

albertvillanova · 2025-06-19T09:46:03Z

tests/test_agents.py

+        agent = CodeAgent(
+            model=FakeCodeModel(),
+            tools=[],
+            final_answer_checks=[lambda x, y: x == 7.2904],
+            verbosity_level=1000,
+        )
+        output = agent.run("Dummy task.")
+        assert output == 7.2904  # Check that output is correct
+        assert len([step for step in agent.memory.steps if isinstance(step, ActionStep)]) == 2
+        assert "Error raised in check" not in str(agent.write_memory_to_messages())


I think this test passes before the fixes introduced in this PR.

You're right, I just fixed it so that it no longer passes without the changes in this PR!
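For readers unfamiliar with `final_answer_checks`, the following is a rough sketch of how a list of such checks could be applied to a candidate answer. Only the two-argument shape `(final_answer, memory)` is taken from the lambda in the test above; `failed_checks` and its message format are hypothetical and are not the smolagents code.

```python
from typing import Any, Callable

# Shape taken from the test above: (final_answer, memory) -> bool.
FinalAnswerCheck = Callable[[Any, Any], bool]


def failed_checks(answer: Any, memory: Any, checks: list[FinalAnswerCheck]) -> list[str]:
    """Return one message per check that rejects the answer or raises."""
    failures = []
    for index, check in enumerate(checks):
        try:
            accepted = check(answer, memory)
        except Exception as exc:  # a crashing check counts as a failure
            failures.append(f"check {index} raised {type(exc).__name__}: {exc}")
            continue
        if not accepted:
            failures.append(f"check {index} rejected {answer!r}")
    return failures


checks = [lambda x, y: x == 7.2904]
print(failed_checks(7.2904, None, checks))  # []
print(failed_checks(1.0, None, checks))     # ['check 0 rejected 1.0']
```

An empty list means the candidate can be accepted as the final answer; anything else gives the caller a reason to keep going.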

albertvillanova

If I understand correctly, the ActionOutput class will be:

Either ActionOutput(output: Any, is_final_answer=True) for final answer
Or ActionOutput(output=None, is_final_answer=False) for non final answer

IMHO it would be simpler to return:

Either FinalOutput(output: Any) for final answer
Or None for non final answer

I find ActionOutput suboptimal because it adds complexity without adding relevant information, but feel free to ignore this!
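The two shapes compared above can be sketched side by side. This is a simplified illustration, assuming that a non-final `ActionOutput` always carries `output=None` as described in the comment; the dataclasses below are stand-ins with the fields named in this thread, and `to_action_output` and `to_optional` are hypothetical helpers, not part of smolagents.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FinalOutput:
    output: Optional[Any]


@dataclass
class ActionOutput:
    output: Any
    is_final_answer: bool


def to_action_output(result: Optional[FinalOutput]) -> ActionOutput:
    """Map the `FinalOutput | None` shape onto the flagged shape."""
    if result is None:
        return ActionOutput(output=None, is_final_answer=False)
    return ActionOutput(output=result.output, is_final_answer=True)


def to_optional(result: ActionOutput) -> Optional[FinalOutput]:
    """Map the flagged shape back onto `FinalOutput | None`."""
    return FinalOutput(result.output) if result.is_final_answer else None


# Both shapes keep "final answer is None" distinct from "no final answer".
print(to_action_output(FinalOutput(None)))  # ActionOutput(output=None, is_final_answer=True)
print(to_action_output(None))               # ActionOutput(output=None, is_final_answer=False)
```

Under that assumption the conversion round-trips in both directions, which is one way to read the point that the flag carries no extra information.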

aymeric-roucher · 2025-06-19T13:04:47Z

@Albert pasting my Slack message here for visibility:

Thinking in terms of clients that will need to consume streaming events, I think it's simpler for frontends to consume a general object like ActionOutput and use its attributes rather than consuming FinalOutput | None. But I might still rework this, I'm still making lots of changes! The idea is to serve the same kind of streaming agents as other packages (like copilotkit or openai agents) do.

albertvillanova · 2025-06-19T17:34:50Z

@aymeric-roucher I think when streaming, users are already consuming Generator[ChatMessageStreamDelta | FinalOutput].

The alternative approach just adds a None: Generator[ChatMessageStreamDelta | FinalOutput | None].
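To make the two streaming options in this exchange concrete, here is a hedged sketch of a client loop for each. `Delta` is a hypothetical stand-in for `ChatMessageStreamDelta` (its `content` field is an assumption), the other two dataclasses are simplified stand-ins, and the consumer functions are illustrative only.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Union


@dataclass
class Delta:
    """Hypothetical stand-in for ChatMessageStreamDelta."""
    content: str


@dataclass
class FinalOutput:
    output: Any


@dataclass
class ActionOutput:
    output: Any
    is_final_answer: bool


def consume_optional(events: Iterable[Union[Delta, FinalOutput, None]]) -> tuple[str, Any]:
    """Client for a stream of `Delta | FinalOutput | None` events."""
    text, answer = [], None
    for event in events:
        if event is None:  # step finished without a final answer
            continue
        if isinstance(event, FinalOutput):
            answer = event.output
        else:
            text.append(event.content)
    return "".join(text), answer


def consume_flagged(events: Iterable[Union[Delta, ActionOutput]]) -> tuple[str, Any]:
    """Client for a stream of `Delta | ActionOutput` events."""
    text, answer = [], None
    for event in events:
        if isinstance(event, ActionOutput):
            if event.is_final_answer:
                answer = event.output
        else:
            text.append(event.content)
    return "".join(text), answer


optional_stream = [Delta("4"), None, Delta("2"), FinalOutput(42)]
flagged_stream = [Delta("4"), ActionOutput(None, False), Delta("2"), ActionOutput(42, True)]
print(consume_optional(optional_stream))  # ('42', 42)
print(consume_flagged(flagged_stream))    # ('42', 42)
```

Both clients need two branches; the difference is whether they branch on `None` and the event type, or on an attribute of a single object.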

aymeric-roucher added 6 commits June 17, 2025 18:23

Refacto StepOutputs for clearer final answer checks

3a92725

Yield streaming deltas

10a0764

Fix test

cd143f8

Improve variable names

03f17e8

Format

3092d62

Fix ActionStep.to_dict()

4441aec

aymeric-roucher changed the title ~~Refacto StepOutputs for clearer final answer checks~~ Fix and refactor final answer checks Jun 17, 2025

aymeric-roucher commented Jun 17, 2025

aymeric-roucher mentioned this pull request Jun 17, 2025

Fix final answer checks #1446

Closed

Fix test which seems to have been faulty before

fc7cc89

aymeric-roucher force-pushed the fix-final-answer-checks branch from a2e464e to fc7cc89 Compare June 17, 2025 18:17

Zoe14 reviewed Jun 17, 2025

src/smolagents/agents.py (outdated, resolved)

Zoe14 reviewed Jun 17, 2025

src/smolagents/agents.py (outdated, resolved)

Fixes

54dbde6

aymeric-roucher force-pushed the fix-final-answer-checks branch from b5d3d18 to 54dbde6 Compare June 17, 2025 20:17

Merge branch 'main' into fix-final-answer-checks

7b39442

albertvillanova requested a review from Copilot June 18, 2025 19:04

Copilot AI reviewed Jun 18, 2025

albertvillanova reviewed Jun 19, 2025

Make sure test didn't pass before

ca5a539

aymeric-roucher force-pushed the fix-final-answer-checks branch from e0b2e39 to ca5a539 Compare June 19, 2025 12:07

albertvillanova approved these changes Jun 19, 2025

aymeric-roucher merged commit 76ecb9b into main Jun 19, 2025
4 checks passed

albertvillanova linked an issue Jun 19, 2025 that may be closed by this pull request

[BUG] None check is not there for FinalOutput.output #1443

Closed

albertvillanova mentioned this pull request Jun 19, 2025

[BUG] None check is not there for FinalOutput.output #1443

Closed

4 participants
