Support the combination of multiple datasets with the dataloader #99
@claude please review this PR (I'll review afterwards)
Claude finished @forklady42's task in 1m 58s
PR Review: Support the combination of multiple datasets with the dataloader
Overall this is a well-structured approach to multi-dataset support. The Bug:
forklady42 left a comment
Minor comment about clarifying the config interface. Also worth reviewing Claude's comments.
datasets:
  - root: /scratch/gpfs/ROSENGROUP/common/globus_share_OA/mp/chg_datasets/dataset_2/mp_filelist.txt
    split_file: /scratch/gpfs/ROSENGROUP/common/globus_share_OA/mp/chg_datasets/dataset_2/split.json
    val_frac: 0.005
val_frac isn't used when split_file is provided. This is at least worth a comment so no one is misled.
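A minimal sketch of the precedence described in this comment, so the behavior is explicit rather than silent. The helper name `resolve_split`, its signature, and the assumed split-file layout (`{"train": [...], "val": [...]}`) are hypothetical and not taken from the PR:

```python
import json
import random
import warnings


def resolve_split(samples, split_file=None, val_frac=None, seed=0):
    """Return (train, val) lists. Hypothetical helper, not the PR's code."""
    if split_file is not None:
        # split_file takes precedence; warn if val_frac was also set.
        if val_frac is not None:
            warnings.warn("val_frac is ignored because split_file is provided")
        with open(split_file) as f:
            split = json.load(f)  # assumed layout: {"train": [...], "val": [...]}
        return list(split["train"]), list(split["val"])

    if val_frac is None:
        raise ValueError("either split_file or val_frac must be given")

    # No split file: fall back to a seeded random split of size val_frac.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]
```

Emitting a warning is one option; a comment in the config next to val_frac, as suggested above, would also avoid misleading readers.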
Problem
The current dataloader only supports a single dataset. As a result, it is not possible to combine data coming from different paths.
Solution
This PR extends the dataloader to support multiple datasets. The root, split_file, and val_frac parameters retain the same meaning as before. In addition, this PR introduces a new parameter, dataset_id, which serves as an identifier for each dataset and will be useful for future development of multi-head models.
Example
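As a rough illustration of the idea rather than the PR's actual implementation, the sketch below flattens several datasets into one index and tags every item with its dataset_id. The class name `CombinedIndex`, the integer ids, and the sample strings are assumptions made for this example:

```python
class CombinedIndex:
    """Flat index over several datasets (hypothetical, not the PR's class).

    `datasets` maps a dataset_id to that dataset's list of samples.
    """

    def __init__(self, datasets):
        # Concatenate all datasets, remembering which one each sample came from.
        self._items = [
            (dataset_id, sample)
            for dataset_id, samples in datasets.items()
            for sample in samples
        ]

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        dataset_id, sample = self._items[index]
        # Exposing dataset_id per item lets a model route it, e.g. to a head.
        return {"dataset_id": dataset_id, "sample": sample}


combined = CombinedIndex({0: ["a.chg", "b.chg"], 1: ["c.chg"]})
```

Here `combined[2]` returns the single sample of dataset 1 together with its id, which is the information a multi-head model would need to pick a head.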