[Feature Request] Support Dr. GRPO for Unbiased Optimization in RL Training

## Paper
[Dr. GRPO Paper](https://github.com/sail-sg/understand-r1-zero)

## Motivation/Benefits

- Fixes GRPO's optimization bias (caused by per-response length normalization and per-group std normalization) while maintaining reasoning performance
- Reduces the average length of incorrect responses by 38% (Fig. 5 in the paper)
- Backward-compatible with existing GRPO workflows
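
To make the request concrete, here is a minimal sketch of the two changes as I understand them from the paper: the advantage drops the division by the group std, and the loss drops the per-response `1/|o_i|` averaging in favor of a constant normalizer. All function names and the `max_len` parameter are hypothetical, chosen for illustration only. They are not taken from this repo or from the reference implementation, and the use of the population std in the GRPO baseline is an assumption.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO baseline: center by the group mean, then divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO: center by the group mean only, with no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def grpo_loss(token_losses):
    """GRPO aggregation: mean over tokens within each response, then mean over responses."""
    per_response = [sum(tokens) / len(tokens) for tokens in token_losses]
    return mean(per_response)


def dr_grpo_loss(token_losses, max_len):
    """Dr. GRPO aggregation: sum over tokens, divided by a constant (hypothetical max_len)."""
    per_response = [sum(tokens) / max_len for tokens in token_losses]
    return mean(per_response)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(grpo_advantages(rewards))
    print(dr_grpo_advantages(rewards))

    # Two responses with identical per-token loss but different lengths.
    token_losses = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
    print(grpo_loss(token_losses))
    print(dr_grpo_loss(token_losses, max_len=4))
```

With the constant normalizer, every token carries the same weight regardless of how long its response is, so a long incorrect response is no longer penalized less per token than a short one. Backward compatibility could be kept by putting both changes behind an opt-in setting that defaults to the current GRPO behavior.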