Motivation
Reinforcement learning for autoregressive image generation with textual chain-of-thought is attracting growing interest. A recent project, ReasonGen-R1, employs GRPO via Verl to enhance an image generation model's ability to follow instructions. We plan to contribute this training pipeline as a Verl recipe to benefit the wider community.
Overall Structure
/recipe/image_generation
    config/
        image_generation_rl.yaml
        image_generation_sft.yaml
    main.py
    ray_trainer.py
    sft_trainer.py
    fsdp_worker.py
    dp_actor.py
    hf_rollout.py
    datasets/
        rl_datasets.py
        sft_datasets.py
    Janus/
Proposed Major Changes
New Functions or Classes
ImageRewardModelWorker in fsdp_worker.py: A reward model worker that takes the generated images and their prompts as input and scores image quality and instruction following. Its overall structure is similar to the existing RewardModelWorker.
Janus model: An example model class for RL training. It must be implemented inside the recipe because official generate and forward functions for interleaved image-text generation have not been released by DeepSeek.
AdaptiveEntropyCoefficient in dp_actor.py: An adaptive entropy-loss coefficient for stable text-image interleaved RL training. It is updated from the target entropy and the entropy of the output logits.
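A minimal sketch of how such an adaptive coefficient could be updated (the class layout, method names, and the SAC-style log-coefficient parameterization below are assumptions, not ReasonGen-R1's actual implementation):

```python
import math
import torch


class AdaptiveEntropyCoefficient:
    """Sketch: raise the entropy-loss coefficient when the policy's entropy
    falls below a target, lower it when entropy rises above the target,
    analogous to automatic temperature tuning in SAC."""

    def __init__(self, target_entropy: float, init_coef: float = 1e-2, lr: float = 1e-3):
        self.target_entropy = target_entropy
        # optimize the log of the coefficient so the coefficient stays positive
        self.log_coef = torch.tensor(math.log(init_coef), requires_grad=True)
        self.optimizer = torch.optim.Adam([self.log_coef], lr=lr)

    @property
    def coef(self) -> float:
        return self.log_coef.exp().item()

    def update(self, current_entropy: float) -> float:
        # gradient w.r.t. log_coef is (current_entropy - target_entropy):
        # entropy below target -> coefficient increases, and vice versa
        loss = self.log_coef * (current_entropy - self.target_entropy)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return self.coef
```

In an interleaved setting, one instance could be kept per modality so that text and image token entropies are regularized toward separate targets.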
Functions or Classes that need modification:
FSDPSFTTrainer in sft_trainer.py: support SFT training for image generation models.
HFSFTDataset in sft_datasets.py and RLDataset in rl_datasets.py: support data loading and formatting for image generation.
_build_model_optimizer in ActorRolloutRefWorker: modify to support Janus model loading.
update_actor in DataParallelPPOActor: handle the adaptive entropy coefficient and compute entropy separately for text and image tokens.
hf_rollout: return generated images and generated texts as separate outputs.
fit in ray_trainer.py: support DAPO-style group filtering.
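The group-filtering step above can be sketched as a keep-mask over rollouts grouped by prompt: under GRPO, a prompt group whose rollouts all receive the same reward has zero advantage and contributes no learning signal, so DAPO drops such groups from the batch. The function name and flat tensor layout below are assumptions for illustration:

```python
import numpy as np


def filter_groups(rewards: np.ndarray, group_size: int) -> np.ndarray:
    """Sketch of DAPO-style group filtering.

    rewards: flat array of shape (num_groups * group_size,), where rollouts
    for the same prompt are contiguous. Returns a boolean keep-mask per
    sample; groups whose rewards are all identical are masked out.
    """
    grouped = rewards.reshape(-1, group_size)
    # keep a group only if its rewards vary (nonzero std -> nonzero advantage)
    keep_group = grouped.std(axis=1) > 0
    return np.repeat(keep_group, group_size)
```

For example, with group_size=4, a group scored [1, 1, 1, 1] is dropped while [0, 1, 0, 1] is kept.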
CC
@eric-haibin-lin