Skip to content

Fit chinese wwm to new datasets#9887

Merged
LysandreJik merged 6 commits into
huggingface:masterfrom
wlhgtc:master
Feb 1, 2021
Merged

Fit chinese wwm to new datasets#9887
LysandreJik merged 6 commits into
huggingface:masterfrom
wlhgtc:master

Conversation

@wlhgtc
Copy link
Copy Markdown
Contributor

@wlhgtc wlhgtc commented Jan 29, 2021

Sorry for my later update.
I make my code(especially in chinese mlm_wwm) fit the newest code.
Here are the changes:

  1. add chinese_ref key to avoid miss ref inf.
  2. fix the type bug in data_collator.py
  3. re-add run_chinese_ref.py cause it could run with the newest version code (4.2.2).
  4. update readme

@wlhgtc
Copy link
Copy Markdown
Contributor Author

wlhgtc commented Jan 29, 2021

@sgugger @LysandreJik
Could you help me review these code ?

Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! Thanks for updating your example. We have now created a research_projects project for the examples not directly maintained by the core team, and I think the run_mlm_wwm script and the chine_ref file could all go there in a new folder. Would you mind adjusting your PR in that direction?

@wlhgtc
Copy link
Copy Markdown
Contributor Author

wlhgtc commented Jan 29, 2021

Hi there! Thanks for updating your example. We have now created a research_projects project for the examples not directly maintained by the core team, and I think the run_mlm_wwm script and the chine_ref file could all go there in a new folder. Would you mind adjusting your PR in that direction?

Sure, maybe move run_chinese_ref.py to research_projects folder and leave run_mlm_wwm.py in where it was would be better ? And I don't know which folder is better ?
The two files are independent, we could move it to anywhere.

@sgugger
Copy link
Copy Markdown
Collaborator

sgugger commented Jan 29, 2021

The run_mlm_wwm file is not maintained by us directly and it only works for BERT-models, compared to the other examples, so I think it can all go together there. You can create a new folder named mlm_wwm (since it's not just Chinese) for instance and have the specific requirements in the requirements.txt file there?

@wlhgtc
Copy link
Copy Markdown
Contributor Author

wlhgtc commented Jan 29, 2021

The run_mlm_wwm file is not maintained by us directly and it only works for BERT-models, compared to the other examples, so I think it can all go together there. You can create a new folder named mlm_wwm (since it's not just Chinese) for instance and have the specific requirements in the requirements.txt file there?

done!

@sgugger
Copy link
Copy Markdown
Collaborator

sgugger commented Jan 29, 2021

Last thing is to run make style to make sure the files are properly formatted, let me know if you have any issue doing this!

@wlhgtc
Copy link
Copy Markdown
Contributor Author

wlhgtc commented Jan 29, 2021

Last thing is to run make style to make sure the files are properly formatted, let me know if you have any issue doing this!

yeah, seem my previous PR also failed in format :(
I got error as follow:

#!/bin/bash -eo pipefail
black --check examples tests src utils
would reformat /home/circleci/transformers/examples/research_projects/mlm_wwm/run_chinese_ref.py
would reformat /home/circleci/transformers/src/transformers/trainer.py
Oh no! 💥 💔 💥
2 files would be reformatted, 706 files would be left unchanged.

Exited with code exit status 1

But I formate my code.
image

Maybe you could help me do this part ?

Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done the restyling! Re-reading one last time, I notice I didn't catch you changed the general Trainer. Those changes should be avoided if there is another way (which there is here) so please revert that part.

Comment thread src/transformers/trainer.py Outdated
Comment on lines +424 to +425
# And we need chinese reference inf when run wwm in chinese.
signature_columns += ["label", "label_ids", "chinese_ref"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't have this here. Use remove_unused_columns = True to avoid the Trainer remove this column in your script but it should not change the code for every user of the library.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't read these params carefully, now it has been fixed.
And I wonder why we need those LineByLineTextDataset if we have a dataset repo?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't have this here. Use remove_unused_columns = True to avoid the Trainer remove this column in your script but it should not change the code for every user of the library.

remove_unused_columns = False ? :)

@sgugger sgugger requested a review from LysandreJik January 30, 2021 01:07
Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks ready to merge to me!

@wlhgtc
Copy link
Copy Markdown
Contributor Author

wlhgtc commented Jan 30, 2021

@sgugger My pleasure. Maybe you could help me fix the formate error :(
My python version 3.9.1 black 20.8b1, why I got diff result in CI.

Copy link
Copy Markdown
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! It's cool that it is in the research projects now. The issue isn't related to style, merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants