Motivation
Reinforcement learning for autoregressive image generation with textual chain-of-thought is attracting growing interest. A recent project, ReasonGen-R1, employs GRPO via Verl to enhance an image generation model's ability to follow instructions. We plan to contribute this training pipeline as a Verl recipe to benefit the wider community.
Overall Structure
/recipe/image_generation
    config/
        image_generation_rl.yaml
        image_generation_sft.yaml
    main.py
    ray_trainer.py
    sft_trainer.py
    fsdp_worker.py
    dp_actor.py
    hf_rollout.py
    datasets/
        rl_datasets.py
        sft_datasets.py
    Janus/
Proposed Major Changes
New Functions or Classes
ImageRewardModelWorker in fsdp_worker.py: A reward model worker that takes the generated images and their prompts as input and scores image quality and instruction following. Its overall structure is similar to the existing RewardModelWorker.
Janus model: An example model class for RL training. It must be implemented inside the recipe because official generate and forward functions for interleaved image-text generation have not been released by DeepSeek.
AdaptiveEntropyCoefficient in dp_actor.py: An adaptive entropy-loss coefficient for stable text-image interleaved RL training. It is updated from the target entropy and the entropy of the output logits.
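A minimal sketch of how such an adaptive coefficient could be updated (the class layout, method names, and the SAC-style log-coefficient parameterization below are assumptions, not ReasonGen-R1's actual implementation):

```python
import math
import torch


class AdaptiveEntropyCoefficient:
    """Sketch: raise the entropy-loss coefficient when the policy's entropy
    falls below a target, lower it when entropy rises above the target,
    analogous to automatic temperature tuning in SAC."""

    def __init__(self, target_entropy: float, init_coef: float = 1e-2, lr: float = 1e-3):
        self.target_entropy = target_entropy
        # optimize the log of the coefficient so the coefficient stays positive
        self.log_coef = torch.tensor(math.log(init_coef), requires_grad=True)
        self.optimizer = torch.optim.Adam([self.log_coef], lr=lr)

    @property
    def coef(self) -> float:
        return self.log_coef.exp().item()

    def update(self, current_entropy: float) -> float:
        # gradient w.r.t. log_coef is (current_entropy - target_entropy):
        # entropy below target -> coefficient increases, and vice versa
        loss = self.log_coef * (current_entropy - self.target_entropy)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return self.coef
```

In an interleaved setting, one instance could be kept per modality so that text and image token entropies are regularized toward separate targets.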
Functions or Classes that need modification:
FSDPSFTTrainer in sft_trainer.py: support SFT training for image generation models.
HFSFTDataset in sft_datasets.py and RLDataset in rl_datasets.py: support data loading and formatting for image generation.
_build_model_optimizer in ActorRolloutRefWorker: modify to support Janus model loading.
update_actor in DataParallelPPOActor: handle the adaptive entropy coefficient and compute entropy separately for text and image tokens.
hf_rollout: return generated images and generated texts as separate outputs.
fit in ray_trainer.py: support DAPO-style group filtering.
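The group-filtering step above can be sketched as a keep-mask over rollouts grouped by prompt: under GRPO, a prompt group whose rollouts all receive the same reward has zero advantage and contributes no learning signal, so DAPO drops such groups from the batch. The function name and flat tensor layout below are assumptions for illustration:

```python
import numpy as np


def filter_groups(rewards: np.ndarray, group_size: int) -> np.ndarray:
    """Sketch of DAPO-style group filtering.

    rewards: flat array of shape (num_groups * group_size,), where rollouts
    for the same prompt are contiguous. Returns a boolean keep-mask per
    sample; groups whose rewards are all identical are masked out.
    """
    grouped = rewards.reshape(-1, group_size)
    # keep a group only if its rewards vary (nonzero std -> nonzero advantage)
    keep_group = grouped.std(axis=1) > 0
    return np.repeat(keep_group, group_size)
```

For example, with group_size=4, a group scored [1, 1, 1, 1] is dropped while [0, 1, 0, 1] is kept.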
CC
@eric-haibin-lin