fix: concat metrics in DataProto.concat #602
Conversation
verl/workers/megatron_workers.py (Outdated)

```diff
  # TODO: here, we should return all metrics
- output = DataProto(meta_info={'metrics': metrics})
+ output = DataProto(non_tensor_batch={metric: np.array(value) for metric, value in metrics.items()})
  output = output.to('cpu')
```
I guess it's better to put the metrics under a unified namespace, in case we later want to add non-metric data into `non_tensor_batch`.
Also, I guess it's problematic because entries in `non_tensor_batch` are required to have the same batch size as `batch`.
Yes, actually there are two places where tensors and metrics are returned together, so the metrics can't be put into `non_tensor_batch`.
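A minimal, self-contained sketch of that constraint (shapes and metric names are illustrative, not taken from this PR): entries in `non_tensor_batch` must share the batch dimension with `.batch`, while per-worker metrics are scalars.

```python
import numpy as np
import torch

batch_size = 4
# Per-sample data that belongs in .batch / .non_tensor_batch:
log_probs = torch.randn(batch_size, 128)
sample_ids = np.arange(batch_size)            # shape (4,), matches batch_size

# Per-worker metrics are scalars, not per-sample values:
metrics = {'actor/kl': 0.01, 'actor/entropy': 2.3}

# np.array(0.01) has shape (), so it cannot satisfy the "same batch size"
# invariant that non_tensor_batch entries must obey.
assert sample_ids.shape == (batch_size,)
assert np.array(metrics['actor/kl']).shape != (batch_size,)
```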
Metrics are so general that I suggest adding a new `.metrics` field to `DataProto`. The reasons are:
- metrics can't be set in `.non_tensor_batch` along with `.batch`, since these two fields need the same batch size
- metrics can't be set in `.meta_info`, since this field should not be concatenated

The new `.metrics` field should behave much like `.meta_info`, except that it will be concatenated in `DataProto.concat`.
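A rough sketch of that proposal, assuming a `DataProto`-like container (names and structure here are hypothetical, not the actual verl implementation): `.metrics` holds `dict[str, list]` and is merged during concat, while `.meta_info` is taken from the first chunk only.

```python
from dataclasses import dataclass, field

@dataclass
class ProtoSketch:
    meta_info: dict = field(default_factory=dict)  # not concatenated
    metrics: dict = field(default_factory=dict)    # dict[str, list], concatenated

    @staticmethod
    def concat(chunks: list['ProtoSketch']) -> 'ProtoSketch':
        merged = ProtoSketch(meta_info=dict(chunks[0].meta_info))
        for chunk in chunks:
            for name, values in chunk.metrics.items():
                merged.metrics.setdefault(name, []).extend(values)
        return merged

out = ProtoSketch.concat([
    ProtoSketch(meta_info={'temperature': 1.0}, metrics={'loss': [0.5]}),
    ProtoSketch(meta_info={'temperature': 1.0}, metrics={'loss': [0.4]}),
])
assert out.metrics == {'loss': [0.5, 0.4]}
assert out.meta_info == {'temperature': 1.0}
```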
Furthermore, should we remove `.non_tensor_batch`? It is designed as a dictionary of numpy arrays with the same batch size as `.batch`. For now it's mainly used to store `multi_modal_inputs`; can we also store those in `.batch`? cc @hiyouga
We cannot pad the multi-modal features because the padded tensor would consume too much VRAM/RAM.
I guess we can either create a new field in `DataProto` that is concatenated during collection, or modify the dispatch function to aggregate `meta_info`. However, it's hard to generalize what should be aggregated in `meta_info`.
It's not a good idea to aggregate `meta_info`; it may contain arbitrary Python objects.
Yeah. We expect the input and output of DataProto dispatch to be identical; simply concatenating the `meta_info` breaks that rule.
It seems that the best approach might be to perform allgather/allreduce inside the workers, as if there were no single controller :)
I don't want to allreduce metrics inside workers, because it introduces a strong barrier across all workers and makes async execution from the driver impossible.
@wuxibin89 is this PR ready for merge?
When multiple workers return metrics, DataProto.concat() now flattens the list
of metric dicts into a dict of lists using list_of_dict_to_dict_of_list(). This
ensures metrics have a consistent structure regardless of whether they come from
1 or N workers, and allows reduce_metrics() to work without modification.
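A standalone sketch of the flattening step described above; the helper name `list_of_dict_to_dict_of_list` mirrors the one referenced in this PR, but this version is illustrative rather than the verl implementation:

```python
def list_of_dict_to_dict_of_list(list_of_dict):
    """Turn [{'k': v1}, {'k': v2}] into {'k': [v1, v2]}."""
    if len(list_of_dict) == 0:
        return {}
    keys = list_of_dict[0].keys()
    return {key: [d[key] for d in list_of_dict] for key in keys}

# Metrics collected from N workers arrive as a list of dicts during concat ...
worker_metrics = [{'critic/loss': 0.31}, {'critic/loss': 0.28}]

# ... and are flattened to a single dict[str, list], so meta_info['metrics']
# has the same structure whether it came from 1 or N workers.
assert list_of_dict_to_dict_of_list(worker_metrics) == {'critic/loss': [0.31, 0.28]}
```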
This is a cleaner solution than handling list input in reduce_metrics(), as it:
- Keeps metrics aggregation logic in the data layer (single responsibility)
- Maintains a consistent API where meta_info["metrics"] is always dict[str, list]
- Avoids leaking DataProto's concat behavior to all metrics consumers
Changes:
- DataProto.concat(): Flatten list of metric dicts to dict of lists
- Update tests to expect flattened metrics format
Fixes the error:
File "verl/trainer/ppo/ray_trainer.py", line 1129, in fit
critic_output_metrics = reduce_metrics(critic_output.meta_info["metrics"])
AttributeError: 'list' object has no attribute 'items'
Related: volcengine#602
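With the flattening in place, `meta_info["metrics"]` is always a `dict[str, list]`, so a `reduce_metrics`-style mean over each key works unchanged. A minimal sketch of that reduction (the real verl helper may differ in details):

```python
import numpy as np

def reduce_metrics(metrics: dict) -> dict:
    # Average each metric's per-worker values into a single scalar.
    return {key: float(np.mean(values)) for key, values in metrics.items()}

flattened = {'critic/loss': [0.31, 0.28]}
assert abs(reduce_metrics(flattened)['critic/loss'] - 0.295) < 1e-9
```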
Metrics should be in `non_tensor_batch` instead of `meta_info`, as `DataProto` does not concat `meta_info`. If set in `meta_info`, PPOTrainer only gets the metrics of DP rank 0.
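A sketch of the failure mode described in that comment, assuming concat keeps only the first chunk's `meta_info` (structure is illustrative, not the actual `DataProto.concat` code):

```python
def concat_keep_first_meta_info(chunks):
    # Mimics a concat that ignores meta_info from every chunk but the first.
    return {'meta_info': chunks[0]['meta_info']}

rank_outputs = [
    {'meta_info': {'metrics': {'loss': 0.5}}},  # DP rank 0
    {'meta_info': {'metrics': {'loss': 0.9}}},  # DP rank 1, silently dropped
]
merged = concat_keep_first_meta_info(rank_outputs)
assert merged['meta_info']['metrics'] == {'loss': 0.5}  # rank 1's metrics are gone
```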