Fit chinese wwm to new datasets #9887
@sgugger @LysandreJik
sgugger
left a comment
Hi there! Thanks for updating your example. We have now created a research_projects folder for the examples not directly maintained by the core team, and I think the run_mlm_wwm script and the chinese_ref file could all go there in a new folder. Would you mind adjusting your PR in that direction?
Sure, maybe move
The
done!
Last thing is to run
Yeah, it seems my previous PR also failed the format check :( Maybe you could help me with this part?
sgugger
left a comment
Done the restyling! Re-reading one last time, I notice I missed that you changed the general Trainer. Such changes should be avoided when there is another way (which there is here), so please revert that part.
# And we need chinese reference inf when run wwm in chinese.
signature_columns += ["label", "label_ids", "chinese_ref"]
No, we can't have this here. Use remove_unused_columns = True in your script to stop the Trainer from removing this column, but it should not change the code for every user of the library.
Sorry, I didn't read these params carefully. It has been fixed now.
Also, I wonder why we still need LineByLineTextDataset if we have a dataset repo?
No, we can't have this here. Use remove_unused_columns = True in your script to stop the Trainer from removing this column, but it should not change the code for every user of the library.
remove_unused_columns = False ? :)
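To illustrate the point of this exchange, here is a minimal, standard-library-only sketch of the column pruning being discussed. It is a simplified stand-in, not the Trainer's actual code: the `forward` signature and the `keep_columns` helper are hypothetical, and only the `label` / `label_ids` extras and the `chinese_ref` column name come from the diff above.

```python
import inspect


def forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward method."""
    return None


def keep_columns(columns, forward_fn, remove_unused_columns=True):
    """Return the dataset columns that survive pruning (illustrative helper)."""
    if not remove_unused_columns:
        # Pruning disabled: every column, including chinese_ref, is kept.
        return list(columns)
    # Keep only columns that match an argument of forward, plus the label extras.
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids"}
    return [c for c in columns if c in accepted]


columns = ["input_ids", "attention_mask", "labels", "chinese_ref"]
pruned = keep_columns(columns, forward, remove_unused_columns=True)
kept = keep_columns(columns, forward, remove_unused_columns=False)

print(pruned)  # chinese_ref is dropped because forward has no such argument
print(kept)    # chinese_ref survives and can reach the data collator
```

In the example script itself, the equivalent would presumably be setting `remove_unused_columns=False` on the `TrainingArguments` passed to the Trainer, which leaves the library code unchanged for other users.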
sgugger
left a comment
Thanks! Looks ready to merge to me!
@sgugger My pleasure. Maybe you could help me fix the format error :(
LysandreJik
left a comment
This is great! It's cool that it is in the research projects now. The issue isn't related to style, merging!

Sorry for the late update.
I updated my code (especially the Chinese mlm_wwm part) to fit the newest code.
Here are the changes:
chinese_ref key to avoid missing ref info.
data_collator.py
run_chinese_ref.py, because it can run with the newest version of the code (4.2.2).