
Commit 1e41334

[recipe] feat: Add InfiGUI-G1 recipe for MLLM GUI grounding (#3242)
### What does this PR do?

This PR introduces a new recipe, `infigui-g1`, for training Multimodal Large Language Models (MLLMs) on GUI grounding tasks. The recipe implements a reinforcement learning approach that substantially improves the model's ability to understand and interact with graphical user interfaces.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: https://github.com/search?q=repo%3Avolcengine%2Fverl+gui&type=pullrequests
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

The effectiveness of this recipe has been validated through experiments. Key results are as follows:

- The training curves for reward, validation accuracy, and exploration success rate all show an upward trend.
- After 156 steps of training on the sample data, the 3B model achieves a score of **41.2** on the `screenspot-pro` benchmark, a substantial improvement over the base model's score of **18.2**.

<img width="345" height="291" alt="Screenshot 2025-08-27 172010" src="https://github.com/user-attachments/assets/9ecd93d5-4f9b-4c40-831c-79a50fd197c4" /> <img width="347" height="292" alt="Screenshot 2025-08-27 171902" src="https://github.com/user-attachments/assets/2e437c1f-9eb0-4106-a6c3-b22125026a79" /> <img width="346" height="293" alt="Screenshot 2025-08-27 171928" src="https://github.com/user-attachments/assets/9c94515d-1501-40f4-979c-95e2f819dc62" />

### API and Usage Example

The recipe is self-contained and can be run using the provided scripts. For example, to run training with the 3B parameter model:

```bash
# In the verl root directory
bash recipe/infigui-g1/run_3b.sh
```

### Design & Code Changes

This PR adds a new, independent recipe located in `recipe/infigui-g1/`. The changes are fully encapsulated within this directory and do not affect any other part of the codebase. The new files include:

- `recipe/infigui-g1/README.md`: An introduction to the recipe.
- `recipe/infigui-g1/run_3b.sh`, `run_7b.sh`: Scripts to launch training.
- `recipe/infigui-g1/reward_fn.py`: Custom reward function implementation.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent 53b68c6 commit 1e41334

File tree

4 files changed: +554 −0 lines changed


recipe/infigui-g1/README.md

Lines changed: 56 additions & 0 deletions
# Recipe for InfiGUI-G1

This directory contains the official implementation for the paper [InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731).

This work introduces Adaptive Exploration Policy Optimization (AEPO), a policy optimization framework designed to enhance GUI grounding in Multimodal Large Language Models (MLLMs). AEPO improves exploration efficiency by combining a multi-answer generation strategy with a theoretically grounded Adaptive Exploration Reward (AER) function, addressing the challenge of semantic alignment in complex GUI grounding tasks.
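The AER function itself is defined in the paper and implemented in `reward_fn.py`; purely as an illustration of the multi-answer idea, a toy reward might score a list of candidate click points by the rank of the first one that lands in the target box (the function name and formula below are hypothetical, not taken from the recipe):

```python
def multi_answer_reward(candidates, bbox):
    """Toy multi-answer grounding reward (illustration only, not the
    paper's actual AER formula): the model proposes several click points
    for one instruction, and the reward decays with the rank of the first
    point that falls inside the target bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    for rank, (px, py) in enumerate(candidates, start=1):
        if x1 <= px <= x2 and y1 <= py <= y2:
            # earlier correct candidates earn a larger reward
            return 1.0 / rank
    return 0.0  # no candidate hit the target element
```

Rewarding any hit, while paying more for early hits, is what lets a policy keep exploring alternative UI elements without being penalized as harshly as a single-answer 0/1 reward would.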
We provide training scripts for both 3B and 7B models, configured by default for a single machine with 8 GPUs.

## Environment Setup

Please follow the main environment setup guide for `verl`.

The provided scripts use the following Docker image: `verlai/verl:app-verl0.5-transformers4.55.4-sglang0.4.10.post2-mcore0.13.0-te2.2`

## Data Preparation

Before starting training, download the example dataset. It is a filtered version of [omniact](https://huggingface.co/datasets/Writer/omniact) that contains only grounding tasks and excludes easy samples.

The data is hosted on Hugging Face and can be downloaded with `huggingface-cli`:

```bash
huggingface-cli download --repo-type dataset --resume-download InfiX-ai/omniact_grounding_filtered --local-dir data/omniact_grounding_filtered
```

This command downloads the training and validation parquet files into the `data/omniact_grounding_filtered` directory, which is the default path used by the scripts.
## Training

We provide scripts to train the 3B and 7B models. Please run them from the root directory of `verl`.

- **Train the 3B model:**

  ```bash
  bash recipe/infigui-g1/run_3b.sh
  ```

- **Train the 7B model:**

  ```bash
  bash recipe/infigui-g1/run_7b.sh
  ```
## Using Custom Data

If you wish to train on your own dataset, format your data to match the structure of the example files located in `data/omniact_grounding_filtered`.

Once your data is ready, update the data path arguments in the training script. In `run_3b.sh` or `run_7b.sh`, modify the following lines:

```bash
data.train_files=./path/to/your/train_data.parquet \
data.val_files=./path/to/your/val_data.parquet \
```

Replace the paths with the location of your custom data files.
