Fit chinese wwm to new datasets by wlhgtc · Pull Request #9887 · huggingface/transformers

wlhgtc · 2021-01-29T10:54:46Z

Sorry for the late update.
I have updated my code (especially the Chinese mlm_wwm part) to work with the newest code.
Here are the changes:

Add a chinese_ref key so the reference info is not lost.
Fix the type bug in data_collator.py.
Re-add run_chinese_ref.py, since it runs with the newest version of the code (4.2.2).
Update the README.
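
For readers unfamiliar with the chinese_ref key, here is a minimal sketch of how such a reference could drive whole word masking. It assumes chinese_ref holds the positions of tokens that continue a segmented Chinese word, and that those tokens are marked with a "##" prefix before grouping. The function names `mark_subwords` and `whole_word_groups` are hypothetical and are not the library's API.

```python
from typing import List


def mark_subwords(tokens: List[str], chinese_ref: List[int]) -> List[str]:
    """Prefix tokens at the reference positions with '##' (word continuation)."""
    ref = set(chinese_ref)
    return ["##" + tok if i in ref else tok for i, tok in enumerate(tokens)]


def whole_word_groups(tokens: List[str]) -> List[List[int]]:
    """Group token indices so that each group covers one whole word."""
    groups: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masking candidates
        if groups and tok.startswith("##"):
            groups[-1].append(i)  # continuation: extend the current word
        else:
            groups.append([i])  # start of a new word
    return groups


# Assumed example: position 3 continues the word started at position 2.
tokens = ["[CLS]", "我", "喜", "欢", "你", "[SEP]"]
chinese_ref = [3]

marked = mark_subwords(tokens, chinese_ref)
groups = whole_word_groups(marked)
print(marked)
print(groups)
```

Masking would then pick whole groups rather than single tokens, so both characters of a two-character word are masked together. If the chinese_ref column is missing, every Chinese character ends up in its own group, which is why losing the key matters.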

wlhgtc · 2021-01-29T10:56:24Z

@sgugger @LysandreJik
Could you help me review this code?

sgugger

Hi there! Thanks for updating your example. We have now created a research_projects project for the examples not directly maintained by the core team, and I think the run_mlm_wwm script and the chinese_ref file could all go there in a new folder. Would you mind adjusting your PR in that direction?

wlhgtc · 2021-01-29T13:59:48Z

Sure. Would it be better to move run_chinese_ref.py to the research_projects folder and leave run_mlm_wwm.py where it is? Also, I'm not sure which folder would be best.
The two files are independent, so we could move them anywhere.

sgugger · 2021-01-29T14:08:39Z

The run_mlm_wwm file is not maintained by us directly and it only works for BERT-models, compared to the other examples, so I think it can all go together there. You can create a new folder named mlm_wwm (since it's not just Chinese) for instance and have the specific requirements in the requirements.txt file there?

wlhgtc · 2021-01-29T14:31:57Z

Done!

sgugger · 2021-01-29T14:54:05Z

The last thing is to run make style to make sure the files are properly formatted. Let me know if you have any issues doing this!

wlhgtc · 2021-01-29T15:07:01Z

Yeah, it seems my previous PR also failed the format check :(
I got the following error:

#!/bin/bash -eo pipefail
black --check examples tests src utils
would reformat /home/circleci/transformers/examples/research_projects/mlm_wwm/run_chinese_ref.py
would reformat /home/circleci/transformers/src/transformers/trainer.py
Oh no! 💥 💔 💥
2 files would be reformatted, 706 files would be left unchanged.

Exited with code exit status 1

But I did format my code.

Maybe you could help me with this part?

sgugger

I've done the restyling! Re-reading one last time, I noticed that I had missed that you changed the general Trainer. Those changes should be avoided if there is another way (which there is here), so please revert that part.

sgugger · 2021-01-29T16:28:17Z

+        # And we need chinese reference inf when run wwm in chinese.
+        signature_columns += ["label", "label_ids", "chinese_ref"]


No, we can't have this here. Use remove_unused_columns = True in your script to prevent the Trainer from removing this column, but it should not change the code for every user of the library.

Sorry, I didn't read these params carefully. It has been fixed now.
Also, I wonder why we still need those LineByLineTextDataset classes if we have a dataset repo?

remove_unused_columns = False ? :)
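
To make the point of this exchange concrete, here is a simplified illustration of why the flag has to be False for chinese_ref to survive. It assumes the Trainer keeps only dataset columns that match the model's forward() arguments plus "label" and "label_ids", as the diff above suggests. `ToyModel` and `filter_columns` are hypothetical stand-ins, not the Trainer's actual code.

```python
import inspect
from typing import List


class ToyModel:
    """Hypothetical stand-in for a BERT-style model; only forward()'s signature matters."""

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        return None


def filter_columns(model, columns: List[str], remove_unused_columns: bool = True) -> List[str]:
    """Drop dataset columns that forward() does not accept, unless the flag is off."""
    if not remove_unused_columns:
        return list(columns)
    signature_columns = list(inspect.signature(model.forward).parameters.keys())
    signature_columns += ["label", "label_ids"]
    return [c for c in columns if c in signature_columns]


columns = ["input_ids", "attention_mask", "chinese_ref"]

kept_default = filter_columns(ToyModel(), columns)
kept_all = filter_columns(ToyModel(), columns, remove_unused_columns=False)
print(kept_default)  # chinese_ref is dropped
print(kept_all)      # chinese_ref is kept for the data collator
```

In the real script the equivalent would presumably be setting remove_unused_columns to False on the TrainingArguments, which keeps the change local to the example instead of touching the shared Trainer.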

sgugger

Thanks! Looks ready to merge to me!

wlhgtc · 2021-01-30T01:17:27Z

@sgugger My pleasure. Maybe you could help me fix the format error :(
I'm using Python 3.9.1 with black 20.8b1, so why do I get a different result in CI?

LysandreJik

This is great! It's cool that it is in the research projects now. The issue isn't related to style, merging!

MOD: fit chinese wwm to new datasets

5a63a9c

sgugger suggested changes Jan 29, 2021

MOD: move wwm to new folder

5345e29

MOD: formate code

5d9f076

Styling

a99d096

sgugger reviewed Jan 29, 2021

wlhgtc added 2 commits January 30, 2021 07:42

Merge branch 'master' of https://github.com/wlhgtc/transformers

cc87259

MOD add param and recover trainer

0294397

sgugger requested a review from LysandreJik January 30, 2021 01:07

sgugger approved these changes Jan 30, 2021

LysandreJik approved these changes Feb 1, 2021

LysandreJik merged commit 1682804 into huggingface:master Feb 1, 2021

LysandreJik mentioned this pull request Dec 22, 2021

[chinese wwm] load_datasets behavior not as expected when using run_mlm_wwm.py script huggingface/datasets#3411

Open
