
feat: Add MVP Demo for LLM UI Action Planning #14

Merged
abrichr merged 11 commits into main from feat/demo
Mar 29, 2025

Conversation


@abrichr abrichr commented Mar 28, 2025

Description:

This PR introduces the initial Minimum Viable Product (MVP) demo for planning UI actions using an LLM.

Key Features:

  • Synthetic UI: Generates a sample UI (synthetic_ui.py) to mock visual parser output for rapid prototyping.
  • LLM Planning: Takes the synthetic UI elements and a user goal, prompts an LLM (Anthropic via completions.py), and gets a structured action plan (core.py using Pydantic). The plan includes reasoning, action type, target element ID, and text-to-type.
  • Visualization: Highlights the LLM-chosen target element on the synthetic UI image.
  • Runnable Demo: Includes a top-level demo.py script to execute the flow.
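A rough sketch of what the structured plan might look like; the `LLMActionPlan` name and `text_to_type` field appear elsewhere in this PR, but the remaining field names and action types here are assumptions, not the actual `core.py` schema:

```python
from typing import Literal, Optional
from pydantic import BaseModel

class LLMActionPlan(BaseModel):
    reasoning: str                      # the LLM's justification for the chosen action
    action: Literal["click", "type"]    # action type to perform
    element_id: int                     # ID of the target UIElement
    text_to_type: Optional[str] = None  # populated only for "type" actions

# Example instance, as the planner might parse one from the LLM response:
plan = LLMActionPlan(
    reasoning="The username field must be filled before clicking Login.",
    action="type",
    element_id=0,
    text_to_type="admin",
)
```

Pydantic validates the fields on construction, so a malformed LLM response (e.g. an unknown action type) raises a validation error instead of silently producing a bad plan.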

Purpose:

Provides a working baseline demonstrating the core concept of LLM-driven UI action planning based on visual context.

To Run:

  1. Ensure ANTHROPIC_API_KEY is set in your environment or .env file.
  2. Run python demo.py.
  3. Check the demo_output/ directory for the generated images (login_screen and login_screen_highlighted).

abrichr added 11 commits March 27, 2025 22:37
Introduces a runnable demo (`demo.py`) proving the concept of using an LLM
to plan a UI action based on a user goal and mocked visual elements
(generated by `synthetic_ui.py`).

Includes:
- Core planning logic and prompting (`core.py`)
- Anthropic API integration (`completions.py`)
- Pydantic structured output for LLM response
- Visualization of the target UI element
Updates the demo script to loop through planning and simulating actions
on the synthetic UI (type, click). Includes goal completion check via LLM.
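The plan/simulate/check loop described in this commit could be sketched as follows; the callback names are placeholders for the real planning, simulation, and completion-check functions, not the actual API:

```python
def run_demo_loop(goal, elements, plan_action, simulate_action, is_goal_complete,
                  max_steps=5):
    """Plan -> simulate -> check, repeating until the goal is met or steps run out."""
    for _ in range(max_steps):
        plan = plan_action(goal, elements)          # LLM produces a structured plan
        elements = simulate_action(plan, elements)  # apply click/type to synthetic UI
        if is_goal_complete(goal, elements):        # LLM-based goal completion check
            return True
    return False
```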
@abrichr abrichr merged commit bc1acbc into main Mar 29, 2025
1 check passed
@abrichr abrichr deleted the feat/demo branch March 29, 2025 19:29

abrichr commented Mar 29, 2025

This gives us a solid baseline with the multi-step demo, tests, and CI.

The next logical step is moving toward interacting with real UIs. The main paths forward seem to be:

  1. Integrate Real OmniParser:

    • Replace synthetic_ui.py with actual screen capture (utils.take_screenshot).
    • Get the omniparser/client.py working to call a running OmniParser instance (needs URL config).
    • Map the OmniParser JSON response back to our List[UIElement].
  2. Implement Real Action Execution:

    • Use the MouseController/KeyboardController to actually perform the planned click/type.
    • Handle coordinate conversion (normalized bounds -> absolute screen pixels).
    • Map the LLMActionPlan (action, text_to_type) to controller calls.
    • Replace the simulate_action call in the loop.
  3. Refine LLM Interaction / Error Handling:

    • Improve prompt robustness for real UIs.
    • Handle LLM errors (bad JSON, invalid element IDs).
    • Handle action execution errors.
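A minimal sketch of the perception and execution pieces from #1 and #2, assuming a hypothetical OmniParser-style JSON payload and normalized (x, y, w, h) bounds; the payload field names here are guesses, not the real OmniParser schema:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    id: int
    type: str        # e.g. "button", "text_field"
    content: str     # visible text or description
    bounds: tuple    # normalized (x, y, w, h), each in [0, 1]

def parse_omniparser_response(payload: dict) -> list:
    """Map a hypothetical OmniParser JSON payload onto our List[UIElement]."""
    return [
        UIElement(id=i, type=item["type"], content=item["content"],
                  bounds=tuple(item["bbox"]))
        for i, item in enumerate(payload.get("elements", []))
    ]

def to_absolute(bounds: tuple, screen_w: int, screen_h: int) -> tuple:
    """Normalized (x, y, w, h) -> absolute pixel center, for the mouse controller."""
    x, y, w, h = bounds
    return int((x + w / 2) * screen_w), int((y + h / 2) * screen_h)
```

Targeting the element's center (rather than its top-left corner) makes clicks robust to small bounding-box inaccuracies from the parser.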

Suggest we tackle #1 (Real OmniParser integration) first. We need the system to perceive the real UI state before we can reliably plan and execute actions on it.
