@none0663
Contributor

Add the Dual-Clip PPO algorithm to the current PPO implementation. Dual-Clip PPO adds a lower bound to the clipped objective when the advantage is less than zero, so that a negative advantage multiplied by a very large ratio cannot fall below a specified bound. The concept is illustrated in the figure below:
[Image: Clipboard_Screenshot_1743047374]
So the final PPO loss is:
[Image: Clipboard_Screenshot_1743047410]
This changes the final PPO loss calculation, which could improve training stability and performance in certain scenarios. I believe integrating this feature could provide significant benefits, and I look forward to feedback on this suggestion.
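
For readers without access to the figures, here is a minimal per-sample sketch of the objective as formulated in the Dual-Clip PPO paper (https://arxiv.org/pdf/1912.09729): the usual clipped surrogate, plus a floor of `c * A` when the advantage `A` is negative. The function name `dual_clip_objective` is hypothetical, and the defaults (0.2 and 3.0) are assumptions chosen to mirror the `clip_ratio` and `clip_ratio_c` config keys; this is not the code from this PR.

```python
def dual_clip_objective(ratio, advantage, clip_ratio=0.2, clip_ratio_c=3.0):
    """Per-sample Dual-Clip PPO surrogate objective (a quantity to maximize).

    ratio:        pi_new(a|s) / pi_old(a|s)
    advantage:    estimated advantage A
    clip_ratio:   the usual PPO epsilon
    clip_ratio_c: the dual-clip constant c, assumed to be > 1
    """
    # Standard PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    ppo_objective = min(ratio * advantage, clipped_ratio * advantage)

    if advantage < 0:
        # Dual clip: with A < 0 and a very large ratio, r * A is unbounded
        # below, so the objective is floored at c * A.
        return max(ppo_objective, clip_ratio_c * advantage)
    return ppo_objective


# A ratio of 10 with A = -1 would give -10 under plain PPO; dual clip floors it.
print(dual_clip_objective(10.0, -1.0))
# A moderate ratio is unaffected by the extra bound.
print(dual_clip_objective(2.0, -1.0))
```

Positive advantages go through the standard clipped objective unchanged, so the extra bound only matters for samples with a negative advantage and a ratio larger than `clip_ratio_c`.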

@CLAassistant

CLAassistant commented Mar 27, 2025

CLA assistant check
All committers have signed the CLA.

grad_clip: 1.0
clip_ratio: 0.2
use_dual_clip: False # add Dual-clip PPO from https://arxiv.org/pdf/1912.09729
clip_ratio_c: 3i # lower bound of the value for Dual-clip PPO from https://arxiv.org/pdf/1912.09729
Collaborator

typo here?

Contributor Author

Fixed the typo in 73c3349.

@vermouth1992
Collaborator

I think the lower clip should be applied by default. Could you remove the `use_dual_clip` option? Thanks.

@none0663
Contributor Author

none0663 commented Mar 28, 2025

> I think the lower clip should be applied by default. Could you remove the `use_dual_clip` option? Thanks.

Fixed in this commit.

vermouth1992 previously approved these changes Mar 28, 2025
@none0663
Contributor Author

none0663 commented Mar 28, 2025

Fixed a bug in the Dual-Clip implementation after referring to the code in PPO dual clip and PPOxFamily: the dual-clip bound is now applied only when advantages < 0.
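
To illustrate why the restriction to advantages < 0 matters, here is a hedged pure-Python sketch of a masked batch loss. In loss form (the negated objective), the extra bound is a cap of `-A * c`; taking that minimum for a positive advantage would replace the standard clipped loss `-A * (1 + eps)` with the more negative `-A * c`, which is not what the paper describes. The function name `dual_clip_policy_loss`, its signature, and the returned clip fraction are assumptions for illustration, not the merged code.

```python
import math


def dual_clip_policy_loss(old_log_probs, log_probs, advantages, mask,
                          clip_ratio=0.2, clip_ratio_c=3.0):
    """Masked mean Dual-Clip PPO loss over a flat list of tokens.

    Returns (mean_loss, fraction_of_tokens_hit_by_the_dual_clip_bound).
    """
    if clip_ratio_c <= 1.0:
        raise ValueError("clip_ratio_c must be greater than 1.0")

    total, count, lower_clipped = 0.0, 0, 0
    for old_lp, lp, adv, m in zip(old_log_probs, log_probs, advantages, mask):
        if not m:
            continue  # skip padded / masked-out tokens
        ratio = math.exp(lp - old_lp)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Standard PPO loss: the pessimistic (larger) of the two candidates.
        loss = max(-adv * ratio, -adv * clipped)
        if adv < 0:
            # Dual clip, applied only for negative advantages: cap at -A * c.
            bound = -adv * clip_ratio_c
            if loss > bound:
                loss = bound
                lower_clipped += 1
        total += loss
        count += 1

    if count == 0:
        return 0.0, 0.0
    return total / count, lower_clipped / count


# Three tokens with ratio 10; the third is masked out.
lp = math.log(10.0)
loss, frac = dual_clip_policy_loss(
    old_log_probs=[0.0, 0.0, 0.0],
    log_probs=[lp, lp, lp],
    advantages=[-1.0, 1.0, -1.0],
    mask=[1, 1, 0],
)
print(loss, frac)
```

In this example the first token (A = -1) is capped at 3.0 instead of 10.0, while the second token (A = 1) keeps the standard clipped loss of about -1.2.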

@vermouth1992
Collaborator

Could you please fix the formatting?

@none0663
Contributor Author

none0663 commented Mar 30, 2025

> Could you please fix the formatting?

I already ran `sh scripts/format.sh` (with yapf version 0.43.0) to fix the formatting before pushing the code in this commit.
Is there any other formatting that needs fixing?

@vermouth1992
Collaborator

> > Could you please fix the formatting?
>
> I already ran `sh scripts/format.sh` (with yapf version 0.43.0) to fix the formatting before pushing the code in this commit. Is there any other formatting that needs fixing?

Could you rebase onto main? The formatting issue was fixed on the main branch.

none0663 force-pushed the add_dual_clip_ppo branch from 203ef0a to 7fa176d on March 31, 2025 02:57
@none0663
Contributor Author

> > > Could you please fix the formatting?
> >
> > I already ran `sh scripts/format.sh` (with yapf version 0.43.0) to fix the formatting before pushing the code in this commit. Is there any other formatting that needs fixing?
>
> Could you rebase onto main? The formatting issue was fixed on the main branch.

Done. Please rerun the CI tests.

@none0663
Contributor Author

none0663 commented Apr 1, 2025

How can I get the other checks to pass?
Clipboard_Screenshot_1743520062

@vermouth1992 vermouth1992 merged commit 6272b8c into volcengine:main Apr 2, 2025
22 of 29 checks passed
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
Add the [Dual-Clip PPO](https://arxiv.org/pdf/1912.09729) algorithm to
the current PPO implementation. Dual-Clip PPO adds a lower bound to the
clipped objective when the advantage is less than zero, so that a
negative advantage multiplied by a very large ratio cannot fall below a
specified bound. The concept is illustrated in the figure below:
<img width="626" alt="Clipboard_Screenshot_1743047374"
src="https://github.com/user-attachments/assets/93952edc-30c8-477e-bc3d-4770fabe55b8"
/>
So the final PPO loss is:
<img width="624" alt="Clipboard_Screenshot_1743047410"
src="https://github.com/user-attachments/assets/5900490b-f64a-4bde-87d6-8359615b3337"
/>
This changes the final PPO loss calculation, which could improve
training stability and performance in certain scenarios. I believe
integrating this feature could provide significant benefits, and I look
forward to feedback on this suggestion.
histmeisah pushed a commit to SJTU-IAAR/verl that referenced this pull request Apr 27, 2025