@yueyang130
Contributor

Currently, the reward function is meant to give a +0.1 reward to encourage the model to output its answer in the \boxed{} format.

However, the CustomRewardManager class passes sequences = torch.cat((valid_prompt_ids, valid_response_ids)) as the input to self.compute_score. As a result, the variable sequences includes the system prompt "Please reason step by step, and put your final answer within \boxed{}.".

Note that the system prompt itself already contains \boxed{}. Thus the reward function gives +0.1 even when the model's actual output does not contain \boxed{}. An example is below:

This PR fixes the bug by passing only the response, without the system prompt, into self.compute_score.

image
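
To make the failure mode concrete, here is a minimal sketch of the issue in plain Python strings rather than token ids. The `format_reward` helper and its regex are assumptions for illustration only; they are not the repository's actual `compute_score` implementation.

```python
import re

# System prompt quoted in the description above.
SYSTEM_PROMPT = r"Please reason step by step, and put your final answer within \boxed{}."


def format_reward(text: str) -> float:
    # Hypothetical format check (an assumption, not the repo's code):
    # returns 0.1 if the text contains a \boxed{...} span, else 0.0.
    return 0.1 if re.search(r"\\boxed\{.*?\}", text, re.DOTALL) else 0.0


prompt = SYSTEM_PROMPT + "\nWhat is 2 + 3?"
response_without_box = "2 + 3 = 5, so the answer is 5."

# Buggy behaviour: scoring prompt + response matches the \boxed{} in the prompt.
buggy_score = format_reward(prompt + response_without_box)

# Fixed behaviour: scoring only the response yields no format reward.
fixed_score = format_reward(response_without_box)

print(buggy_score, fixed_score)
```

Scoring only the response keeps the format bonus tied to what the model actually generated, which is the intent of the fix.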

@yueyang130
Contributor Author

Besides training, the bug also affects the test score, making it ~0.1 higher than the actual one at the initial stage of training.

@hiyouga hiyouga self-requested a review February 25, 2025 10:29
Owner

@hiyouga hiyouga left a comment

Nice catch!

@hiyouga hiyouga merged commit 5b6b628 into hiyouga:main Feb 25, 2025
hiyouga pushed a commit that referenced this pull request Oct 4, 2025
malhajar17 pushed a commit to malhajar17/EasyR1_ex that referenced this pull request Oct 21, 2025
malhajar17 pushed a commit to malhajar17/EasyR1_ex that referenced this pull request Oct 21, 2025