Conversation

@yoquankara
Contributor

What does this PR do?

Because GPT2 and BERT share the same underlying issue of different tensor ordering in newer Megatron-LM checkpoints, modifications similar to those already made in convert_megatron_gpt2_checkpoint.py are needed to convert newer Megatron-LM BERT models.

I tested and confirmed that this fix was necessary to fine-tune Megatron-LM BERT correctly with the transformers API.
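For context, a hedged sketch of the kind of reordering involved, modeled on the fix_query_key_value_ordering helper in the GPT2 conversion script (the per-checkpoint_version layouts are my understanding, not copied from this PR):

def fix_query_key_value_ordering(param, checkpoint_version, num_splits, num_heads, hidden_size):
    # Newer Megatron-LM releases fuse the Q/K/V projection with a different
    # grouping of the first dimension than older releases, so the converter
    # permutes it back to the [num_splits * num_heads * hidden_size, :]
    # layout the HF checkpoint expects. hidden_size here is the per-head size.
    input_shape = param.size()
    if checkpoint_version == 1.0:
        # version 1.0 stores [num_heads * hidden_size * num_splits, :]
        saved_shape = (num_heads, hidden_size, num_splits) + input_shape[1:]
        param = param.view(*saved_shape)
        param = param.transpose(0, 2)
        param = param.transpose(1, 2).contiguous()
    elif checkpoint_version >= 2.0:
        # later versions store [num_heads * num_splits * hidden_size, :]
        saved_shape = (num_heads, num_splits, hidden_size) + input_shape[1:]
        param = param.view(*saved_shape)
        param = param.transpose(0, 1).contiguous()
    return param.view(*input_shape)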

Who can review?

@LysandreJik @jdemouth

@yoquankara yoquankara changed the title from "Fix convert for newer megatron-lm models" to "Fix convert for newer megatron-lm bert model" on Oct 20, 2021
@LysandreJik
Member

I believe @stas00 has worked on a conversion between Megatron and HF Transformers - Stas, can you confirm you have used it and if so, can you take a look at this PR?

@stas00
Contributor

stas00 commented Oct 25, 2021

Oh, I completely forgot I had a related PR waiting in Draft mode ;) I've switched it out of Draft: #13928

So, yes, did a lot of work on GPT2/Megatron recently.

No, haven't done any work on Bert/Megatron, so I'm not aware of the nuances to qualify as a reviewer for this one.

@yoquankara, from a quick look I'd only suggest adding a proper way of saving the config, which should be:

config.save_pretrained(path)

instead of the manual saving, which misses some important bits.

and maybe the tokenizer file addition code as well?

For reference, the two changes can be seen at the very end of my PR https://github.com/huggingface/transformers/pull/13928/files

Perhaps setting config.tokenizer_class as well.
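Putting those two suggestions together, a minimal sketch (the config values are illustrative defaults, not read from an actual checkpoint):

from transformers import MegatronBertConfig

# Illustrative values; the conversion script derives the real ones from
# the Megatron checkpoint's saved args.
config = MegatronBertConfig(
    vocab_size=29056,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
)
config.tokenizer_class = "BertTokenizer"  # optional, per the suggestion above

# Writes a complete config.json, unlike a hand-rolled json.dump:
config.save_pretrained("path/to/output")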

and you need to run make fixup and push again to fix the style.

@yoquankara
Contributor Author

@stas00 Thank you for your review and the pointer to a proper way of saving config!

Regarding the tokenizer: Nvidia's Megatron BERT models use their own tokenizer models, while their Megatron GPT2 model uses the default gpt2 tokenizer. So I didn't add a similar tokenizer_model_name here.
https://huggingface.co/nvidia/megatron-bert-cased-345m
https://huggingface.co/nvidia/megatron-bert-uncased-345m

I've also run make fixup but nothing was wrong...
make: Nothing to be done for `src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py'.

I will investigate more about ci/circleci: check_code_quality.

@yoquankara
Contributor Author

Code style has been fixed.

@yoquankara
Contributor Author

@LysandreJik @stas00
What else should I do to make progress on this PR?

@stas00
Contributor

stas00 commented Nov 8, 2021

I suppose we just want to make sure that the updated script works with:

  1. the old official nvidia bert checkpoint release
  2. a model trained on the modern Meg-LM code-base

so to test for both:

  1. convert to HF
  2. load in HF and test that it works; "works" could mean it gives the same loss, or generates the same output, on the Megatron and HF sides (a minimal sketch of the HF-side check follows below)

At least that's the validation process I did before proposing the GPT2 PR.
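
For illustration, a minimal sketch of the HF-side check in step 2 (paths and prompt are hypothetical, and I'm assuming the standard BERT uncased vocab matches the checkpoint):

import torch
from transformers import BertTokenizer, MegatronBertForMaskedLM

# Hypothetical path: the output directory of the conversion script.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = MegatronBertForMaskedLM.from_pretrained("path/to/converted-checkpoint")
model.eval()

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Compare this prediction (or the masked-LM loss) against the same prompt
# run through Megatron-LM itself.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.convert_ids_to_tokens(logits[0, mask_pos].argmax().item()))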

@yoquankara
Contributor Author

yoquankara commented Nov 9, 2021

Thank you, totally makes sense. I'll find time to also test the old official model and post the validation result when finished.

@huggingface huggingface deleted a comment from github-actions bot Dec 3, 2021
@stas00
Contributor

stas00 commented Dec 3, 2021

@LysandreJik - should we just merge this? Even if there are issues here, it's an improvement over the original version.

@LysandreJik
Member

@jdemouth, could you comment on this if you have a bit of bandwidth?

Otherwise, let's go ahead and merge this next week.

@stas00
Contributor

stas00 commented Dec 15, 2021

ping

@jdemouth
Contributor

@yoquankara were you able to run tests to validate that it does not break the code for the older models?

@huggingface huggingface deleted a comment from github-actions bot Jan 8, 2022
@stas00
Contributor

stas00 commented Jan 8, 2022

I have a feeling @yoquankara has either given up on our process or is busy with work/life.

@jdemouth, should we merge it and deal with any potential problems if they arise?

@jdemouth
Contributor

jdemouth commented Jan 8, 2022

@stas00 - I agree. I think we should merge and we'll fix things if something breaks.

@stas00 stas00 left a comment
Contributor

Thank you for bearing with us, @yoquankara

@stas00 stas00 merged commit 768e6c1 into huggingface:master Jan 8, 2022
@yoquankara
Contributor Author

@stas00 @jdemouth
I apologize for not being able to proceed. This was always on my mind, but I have been quite occupied by other things.
Thank you for your understanding and the decision to merge. I will try my best to follow up when things get better.

@stas00
Contributor

stas00 commented Jan 19, 2022

Thank you for taking the time to share your situation, @yoquankara

That is totally understandable - you had no obligation to do anything - we just wanted to know whether you wanted to continue being involved. But otherwise, we hope to see you in another PR in the future.

@kaushalshetty

kaushalshetty commented Mar 15, 2022

Getting this error while converting the BERT checkpoint:

io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

@stas00
Contributor

stas00 commented Mar 15, 2022

You probably don't realize it, but your comment is not actionable, @kaushalshetty, since we have no idea what you did.

Please file a proper bug report so that we could reproduce the problem and then it'd be possible to act on it and help you with your need. Make sure to include the full traceback in your report.

Thank you!

@kaushalshetty

kaushalshetty commented Mar 15, 2022

I am so sorry, I understand that. My bad!
So here's what I have:

  • transformers version: 4.17.0
  • Platform:
  • Python version: 3.6
  • PyTorch version (GPU?): 1.10.1+cu102
  • Tensorflow version (GPU?): 2.6

Who can help

@stas00 @LysandreJik

Information

Model I am using: Megatron-BERT (megatron-bert-uncased-345m)

The task I am working on:

  • getting Megatron embeddings

To reproduce

Steps to reproduce the behavior:

  1. export MYDIR=$HOME
  2. git clone https://github.com/huggingface/transformers.git $MYDIR/transformers
  3. mkdir -p $MYDIR/nvidia/megatron-bert-uncased-345m
  4. wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O $MYDIR/nvidia/megatron-bert-uncased-345m/checkpoint.zip
  5. python3 $MYDIR/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py $MYDIR/nvidia/megatron-bert-uncased-345m/checkpoint.zip. This gives the error below:
Traceback (most recent call last):
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 308, in _check_seekable
    f.seek(f.tell())
io.UnsupportedOperation: seek

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/omniai-jupyter/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 327, in <module>
    main()
  File "/home/omniai-jupyter/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 296, in main
    input_state_dict = torch.load(pytorch_dict, map_location="cpu")
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead

Expected behavior

The Megatron checkpoint is converted to Hugging Face format.

@stas00
Contributor

stas00 commented Mar 17, 2022

That's excellent, but in the future please open a new Issue. Once a PR is merged or an Issue is closed, it's very difficult to track comments there.

I tested your use case and it worked for me with python 3.8:

Extracting PyTorch state dictionary from "megatron-bert-uncased-345m/checkpoint.zip"
Converting
Saving config
Saving checkpoint to "megatron-bert-uncased-345m/pytorch_model.bin"

and it indeed fails with python-3.6:

Traceback (most recent call last):
  File "../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 327, in <module>
    main()
  File "../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 296, in main
    input_state_dict = torch.load(pytorch_dict, map_location="cpu")
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

So torch.load with python-3.6 doesn't like the zip file handle. Here is a quick workaround for you:

$ unzip megatron-bert-uncased-345m/checkpoint.zip
Archive:  megatron-bert-uncased-345m/checkpoint.zip
  inflating: config.json             
  inflating: latest_checkpointed_iteration.txt  
  inflating: release/mp_rank_00/model_optim_rng.pt
$ python ../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py release/mp_rank_00/model_optim_rng.pt
Extracting PyTorch state dictionary from "release/mp_rank_00/model_optim_rng.pt"
Converting
Saving config
Saving checkpoint to "release/mp_rank_00/pytorch_model.bin"

I'm passing the actual checkpoint file instead of the zip file, in case it wasn't clear from the command line.

Or you can upgrade to a newer python version; 3.6 is very old.
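
Alternatively, following the hint in the error message itself, you can pre-read the zip member into a seekable in-memory buffer (a sketch; the member path is taken from the unzip listing above):

import io
import zipfile
import torch

with zipfile.ZipFile("megatron-bert-uncased-345m/checkpoint.zip") as zf:
    with zf.open("release/mp_rank_00/model_optim_rng.pt") as f:
        buffer = io.BytesIO(f.read())  # fully seekable copy in memory
input_state_dict = torch.load(buffer, map_location="cpu")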


@LysandreJik, @sgugger - what do we want to do here as a long term fix?

I propose we detect python-3.6 and refuse to deal with the zipped checkpoint, asserting with a message that asks the user to unzip it first.

The same will need to be done for megatron_gpt.

There is no problem with py-3.7 and higher.
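
Something like this sketch of the proposed guard (not committed code; the function name is made up):

import sys
import zipfile

def _refuse_zip_on_py36(path):
    # torch.load on python-3.6 cannot read from a non-seekable zip member
    # handle, so fail early with an actionable message.
    if zipfile.is_zipfile(path) and sys.version_info < (3, 7):
        raise RuntimeError(
            f"python-3.6 cannot torch.load directly from {path}; please unzip "
            "it first and pass release/mp_rank_00/model_optim_rng.pt instead."
        )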

@sgugger
Collaborator

sgugger commented Mar 17, 2022

We might also start saying Transformers requires Python 3.7 or above since Python 3.6 is at the end of its life cycle.

@stas00
Contributor

stas00 commented Mar 17, 2022

oh, cool! thanks, @sgugger

So should we just change

"python>=3.6.0",

to 3.7.0? Then this Issue will get auto-resolved.
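
For concreteness, the change would be a one-liner along these lines (sketched against setup.py; the exact surrounding lines are hedged):

-    "python>=3.6.0",
+    "python>=3.7.0",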

@LysandreJik - is now a good time?
