fix: concat metrics in DataProto.concat #602
Conversation
verl/workers/megatron_workers.py (Outdated)

```diff
  # TODO: here, we should return all metrics
- output = DataProto(meta_info={'metrics': metrics})
+ output = DataProto(non_tensor_batch={metric: np.array(value) for metric, value in metrics.items()})
  output = output.to('cpu')
```
I guess it's better to put the metrics under a unified namespace, in case we later want to add non-metric data into `non_tensor_batch`.
Also, I guess it's problematic because entries in `non_tensor_batch` are required to have the same batch size as `batch`.
Yes, actually there are two places where tensors and metrics are returned together, so the metrics can't be put into `non_tensor_batch`.
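A minimal, self-contained sketch of that constraint (shapes and metric names are illustrative, not taken from this PR): entries in `non_tensor_batch` must share the batch dimension with `.batch`, while per-worker metrics are scalars.

```python
import numpy as np
import torch

batch_size = 4
# Per-sample data that belongs in .batch / .non_tensor_batch:
log_probs = torch.randn(batch_size, 128)
sample_ids = np.arange(batch_size)            # shape (4,), matches batch_size

# Per-worker metrics are scalars, not per-sample values:
metrics = {'actor/kl': 0.01, 'actor/entropy': 2.3}

# np.array(0.01) has shape (), so it cannot satisfy the "same batch size"
# invariant that non_tensor_batch entries must obey.
assert sample_ids.shape == (batch_size,)
assert np.array(metrics['actor/kl']).shape != (batch_size,)
```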
Metrics are so general that I suggest adding a new `.metrics` field to `DataProto`. The reasons are:
- metrics can't be set in `.non_tensor_batch` along with `.batch`, since these two fields need the same batch size
- metrics can't be set in `.meta_info`, since this field should not be concatenated

The new `.metrics` field should behave much like `.meta_info`, except that it will be concatenated in `DataProto.concat`.
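A rough sketch of that proposal, assuming a `DataProto`-like container (names and structure here are hypothetical, not the actual verl implementation): `.metrics` holds `dict[str, list]` and is merged during concat, while `.meta_info` is taken from the first chunk only.

```python
from dataclasses import dataclass, field

@dataclass
class ProtoSketch:
    meta_info: dict = field(default_factory=dict)  # not concatenated
    metrics: dict = field(default_factory=dict)    # dict[str, list], concatenated

    @staticmethod
    def concat(chunks: list['ProtoSketch']) -> 'ProtoSketch':
        merged = ProtoSketch(meta_info=dict(chunks[0].meta_info))
        for chunk in chunks:
            for name, values in chunk.metrics.items():
                merged.metrics.setdefault(name, []).extend(values)
        return merged

out = ProtoSketch.concat([
    ProtoSketch(meta_info={'temperature': 1.0}, metrics={'loss': [0.5]}),
    ProtoSketch(meta_info={'temperature': 1.0}, metrics={'loss': [0.4]}),
])
assert out.metrics == {'loss': [0.5, 0.4]}
assert out.meta_info == {'temperature': 1.0}
```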
Furthermore, should we remove `.non_tensor_batch`? It is designed as a dictionary of numpy arrays with the same batch size as `.batch`. For now it's mainly used to store `multi_modal_inputs`; can we also store those in `.batch`? cc @hiyouga
We cannot pad the multi-modal features because the padded tensor would consume too much VRAM/RAM.
I guess we can either create a new field in `DataProto` that is concatenated during collection, or modify the dispatch function to aggregate `meta_info`. However, it's hard to generalize what should be aggregated in `meta_info`.
It's not a good idea to aggregate `meta_info`; it may contain arbitrary Python objects.
Yeah. We expect the input and output of DataProto dispatch to be identical; simply concatenating the `meta_info` breaks that rule.
It seems that the best approach might be to perform allgather/allreduce inside the workers, as if there were no single controller :)
I don't want to allreduce metrics inside workers, because it introduces a strong barrier across all workers and makes async execution from the driver impossible.
@wuxibin89 is this PR ready for merge?
When multiple workers return metrics, DataProto.concat() now flattens the list
of metric dicts into a dict of lists using list_of_dict_to_dict_of_list(). This
ensures metrics have a consistent structure regardless of whether they come from
1 or N workers, and allows reduce_metrics() to work without modification.
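A standalone sketch of the flattening step described above; the helper name `list_of_dict_to_dict_of_list` mirrors the one referenced in this PR, but this version is illustrative rather than the verl implementation:

```python
def list_of_dict_to_dict_of_list(list_of_dict):
    """Turn [{'k': v1}, {'k': v2}] into {'k': [v1, v2]}."""
    if len(list_of_dict) == 0:
        return {}
    keys = list_of_dict[0].keys()
    return {key: [d[key] for d in list_of_dict] for key in keys}

# Metrics collected from N workers arrive as a list of dicts during concat ...
worker_metrics = [{'critic/loss': 0.31}, {'critic/loss': 0.28}]

# ... and are flattened to a single dict[str, list], so meta_info['metrics']
# has the same structure whether it came from 1 or N workers.
assert list_of_dict_to_dict_of_list(worker_metrics) == {'critic/loss': [0.31, 0.28]}
```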
This is a cleaner solution than handling list input in reduce_metrics(), as it:
- Keeps metrics aggregation logic in the data layer (single responsibility)
- Maintains a consistent API where meta_info["metrics"] is always dict[str, list]
- Avoids leaking DataProto's concat behavior to all metrics consumers
Changes:
- DataProto.concat(): Flatten list of metric dicts to dict of lists
- Update tests to expect flattened metrics format
Fixes the error:
File "verl/trainer/ppo/ray_trainer.py", line 1129, in fit
critic_output_metrics = reduce_metrics(critic_output.meta_info["metrics"])
AttributeError: 'list' object has no attribute 'items'
Related: volcengine#602
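With the flattening in place, `meta_info["metrics"]` is always a `dict[str, list]`, so a `reduce_metrics`-style mean over each key works unchanged. A minimal sketch of that reduction (the real verl helper may differ in details):

```python
import numpy as np

def reduce_metrics(metrics: dict) -> dict:
    # Average each metric's per-worker values into a single scalar.
    return {key: float(np.mean(values)) for key, values in metrics.items()}

flattened = {'critic/loss': [0.31, 0.28]}
assert abs(reduce_metrics(flattened)['critic/loss'] - 0.295) < 1e-9
```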
Metrics should be in `non_tensor_batch` instead of `meta_info`, as `DataProto` does not concat `meta_info`. If set in `meta_info`, PPOTrainer only gets the metrics of DP rank 0.
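A sketch of the failure mode described in that comment, assuming concat keeps only the first chunk's `meta_info` (structure is illustrative, not the actual `DataProto.concat` code):

```python
def concat_keep_first_meta_info(chunks):
    # Mimics a concat that ignores meta_info from every chunk but the first.
    return {'meta_info': chunks[0]['meta_info']}

rank_outputs = [
    {'meta_info': {'metrics': {'loss': 0.5}}},  # DP rank 0
    {'meta_info': {'metrics': {'loss': 0.9}}},  # DP rank 1, silently dropped
]
merged = concat_keep_first_meta_info(rank_outputs)
assert merged['meta_info']['metrics'] == {'loss': 0.5}  # rank 1's metrics are gone
```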