Conversation

@yoquankara
Contributor

What does this PR do?

Because GPT2 and BERT share the same underlying issue of different tensor ordering in newer Megatron-LM checkpoints, modifications similar to those already made in convert_megatron_gpt2_checkpoint.py are needed to convert newer Megatron-LM BERT models.

I tested and confirmed that this fix was necessary to fine-tune Megatron-LM BERT correctly with the transformers API.
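For context, a hedged sketch of the kind of reordering involved, modeled on the fix_query_key_value_ordering helper in the GPT2 conversion script (the per-checkpoint_version layouts are my understanding, not copied from this PR):

def fix_query_key_value_ordering(param, checkpoint_version, num_splits, num_heads, hidden_size):
    # Newer Megatron-LM releases fuse the Q/K/V projection with a different
    # grouping of the first dimension than older releases, so the converter
    # permutes it back to the [num_splits * num_heads * hidden_size, :]
    # layout the HF checkpoint expects. hidden_size here is the per-head size.
    input_shape = param.size()
    if checkpoint_version == 1.0:
        # version 1.0 stores [num_heads * hidden_size * num_splits, :]
        saved_shape = (num_heads, hidden_size, num_splits) + input_shape[1:]
        param = param.view(*saved_shape)
        param = param.transpose(0, 2)
        param = param.transpose(1, 2).contiguous()
    elif checkpoint_version >= 2.0:
        # later versions store [num_heads * num_splits * hidden_size, :]
        saved_shape = (num_heads, num_splits, hidden_size) + input_shape[1:]
        param = param.view(*saved_shape)
        param = param.transpose(0, 1).contiguous()
    return param.view(*input_shape)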

Who can review?

@LysandreJik @jdemouth

@yoquankara yoquankara changed the title from "Fix convert for newer megatron-lm models" to "Fix convert for newer megatron-lm bert model" on Oct 20, 2021
@LysandreJik
Member

I believe @stas00 has worked on a conversion between Megatron and HF Transformers - Stas, can you confirm you have used it and if so, can you take a look at this PR?

@stas00
Contributor

stas00 commented Oct 25, 2021

Oh, I completely forgot I had a related PR waiting in Draft mode ;) I've switched it out of Draft: #13928

So, yes, did a lot of work on GPT2/Megatron recently.

No, haven't done any work on Bert/Megatron, so I'm not aware of the nuances to qualify as a reviewer for this one.

@yoquankara, from a quick look I'd only suggest adding a proper way of saving the config, which should be:

config.save_pretrained(path)

instead of the manual saving, which misses some important bits.

and maybe the tokenizer file addition code as well?

For reference, the two changes can be seen at the very end of my PR https://github.com/huggingface/transformers/pull/13928/files

Perhaps setting config.tokenizer_class as well.
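Putting those two suggestions together, a minimal sketch (the config values are illustrative defaults, not read from an actual checkpoint):

from transformers import MegatronBertConfig

# Illustrative values; the conversion script derives the real ones from
# the Megatron checkpoint's saved args.
config = MegatronBertConfig(
    vocab_size=29056,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
)
config.tokenizer_class = "BertTokenizer"  # optional, per the suggestion above

# Writes a complete config.json, unlike a hand-rolled json.dump:
config.save_pretrained("path/to/output")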

and you need to run make fixup and push again to fix the style.

@yoquankara
Contributor Author

@stas00 Thank you for your review and the pointer to a proper way of saving config!

Regarding the tokenizer: Nvidia's Megatron BERT models use their own tokenizer models, while their Megatron GPT2 model uses the default gpt2 tokenizer. So I didn't add a similar tokenizer_model_name here.
https://huggingface.co/nvidia/megatron-bert-cased-345m
https://huggingface.co/nvidia/megatron-bert-uncased-345m

I've also run make fixup but nothing was wrong...
make: Nothing to be done for `src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py'.

I will investigate more about ci/circleci: check_code_quality.

@yoquankara
Contributor Author

Code style has been fixed.

@yoquankara
Contributor Author

@LysandreJik @stas00
What else should I do to make progress on this PR?

@stas00
Contributor

stas00 commented Nov 8, 2021

I suppose we just want to make sure that the updated script works with:

  1. the old official nvidia bert checkpoint release
  2. a model trained on the modern Meg-LM code-base

so to test for both:

  1. convert to HF
  2. load in HF and test that it works; "works" could mean it gives the same loss, or generates the same output, on the Megatron and HF sides (a minimal sketch of the HF-side check follows below)

At least that's the validation process I did before proposing the GPT2 PR.
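
For illustration, a minimal sketch of the HF-side check in step 2 (paths and prompt are hypothetical, and I'm assuming the standard BERT uncased vocab matches the checkpoint):

import torch
from transformers import BertTokenizer, MegatronBertForMaskedLM

# Hypothetical path: the output directory of the conversion script.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = MegatronBertForMaskedLM.from_pretrained("path/to/converted-checkpoint")
model.eval()

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Compare this prediction (or the masked-LM loss) against the same prompt
# run through Megatron-LM itself.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.convert_ids_to_tokens(logits[0, mask_pos].argmax().item()))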

@yoquankara
Contributor Author

yoquankara commented Nov 9, 2021

Thank you, totally makes sense. I'll find time to also test the old official model and post the validation result when finished.

@huggingface huggingface deleted a comment from github-actions bot Dec 3, 2021
@stas00
Contributor

stas00 commented Dec 3, 2021

@LysandreJik - should we just merge this? Even if there are issues here, it's an improvement over the original version.

@LysandreJik
Member

@jdemouth, could you comment on this if you have a bit of bandwidth?

Otherwise, let's go ahead and merge this next week.

@stas00
Contributor

stas00 commented Dec 15, 2021

ping

@jdemouth
Contributor

@yoquankara were you able to run tests to validate that it does not break the code for the older models?

@huggingface huggingface deleted a comment from github-actions bot Jan 8, 2022
@stas00
Contributor

stas00 commented Jan 8, 2022

I have a feeling @yoquankara has either given up on our process or is busy with work/life.

@jdemouth, should we merge it and deal with any potential problems if they arise?

@jdemouth
Contributor

jdemouth commented Jan 8, 2022

@stas00 - I agree. I think we should merge and we'll fix things if something breaks.

@stas00 stas00 left a comment
Contributor

Thank you for bearing with us, @yoquankara

@stas00 stas00 merged commit 768e6c1 into huggingface:master Jan 8, 2022
@yoquankara
Contributor Author

@stas00 @jdemouth
I apologize for not being able to proceed. This was always on my mind, but I have been quite occupied by other things.
Thank you for your understanding and the decision to merge. I will try my best to follow up when things get better.

@stas00
Contributor

stas00 commented Jan 19, 2022

Thank you for taking the time to share your situation, @yoquankara

That is totally understandable - you had no obligation to do anything - we just wanted to know whether you wanted to continue being involved. But otherwise, we hope to see you in another PR in the future.

@kaushalshetty

kaushalshetty commented Mar 15, 2022

Getting this error while converting the BERT checkpoint:

io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

@stas00
Contributor

stas00 commented Mar 15, 2022

You probably don't realize it, but your comment is not actionable, @kaushalshetty, since we have no idea what you did.

Please file a proper bug report so that we could reproduce the problem and then it'd be possible to act on it and help you with your need. Make sure to include the full traceback in your report.

Thank you!

@kaushalshetty

kaushalshetty commented Mar 15, 2022

I am so sorry, I understand that. My bad!
So here's what I have:

  • transformers version: 4.17.0
  • Platform:
  • Python version: 3.6
  • PyTorch version (GPU?): 1.10.1+cu102
  • Tensorflow version (GPU?): 2.6

Who can help

@stas00 @LysandreJik

Information

Model I am using: Megatron-BERT (megatron-bert-uncased-345m)

The task I am working on:

  • getting Megatron embeddings

To reproduce

Steps to reproduce the behavior:

  1. export MYDIR=$HOME
  2. git clone https://github.com/huggingface/transformers.git $MYDIR/transformers
  3. mkdir -p $MYDIR/nvidia/megatron-bert-uncased-345m
  4. wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O $MYDIR/nvidia/megatron-bert-uncased-345m/checkpoint.zip
  5. python3 $MYDIR/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py $MYDIR/nvidia/megatron-bert-uncased-345m/checkpoint.zip. This gives the error below:
Traceback (most recent call last):
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 308, in _check_seekable
    f.seek(f.tell())
io.UnsupportedOperation: seek

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/omniai-jupyter/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 327, in <module>
    main()
  File "/home/omniai-jupyter/transformers/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 296, in main
    input_state_dict = torch.load(pytorch_dict, map_location="cpu")
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/opt/omniai/software/Miniconda/lib/python3.6/site-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead

Expected behavior

The Megatron checkpoint is converted to Hugging Face format.

@stas00
Contributor

stas00 commented Mar 17, 2022

That's excellent, but in the future please open a new Issue. Once a PR is merged or an Issue is closed, it's very difficult to track comments there.

I tested your use case and it worked for me with python 3.8:

Extracting PyTorch state dictionary from "megatron-bert-uncased-345m/checkpoint.zip"
Converting
Saving config
Saving checkpoint to "megatron-bert-uncased-345m/pytorch_model.bin"

and it indeed fails with python-3.6:

Traceback (most recent call last):
  File "../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 327, in <module>
    main()
  File "../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py", line 296, in main
    input_state_dict = torch.load(pytorch_dict, map_location="cpu")
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 235, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 220, in __init__
    _check_seekable(buffer)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 311, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/home/stas/anaconda3/envs/py36-pt18/lib/python3.6/site-packages/torch/serialization.py", line 304, in raise_err_msg
    raise type(e)(msg)
io.UnsupportedOperation: seek. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

So torch.load with python-3.6 doesn't like the zip file handle. Here is a quick workaround for you:

$ unzip megatron-bert-uncased-345m/checkpoint.zip
Archive:  megatron-bert-uncased-345m/checkpoint.zip
  inflating: config.json             
  inflating: latest_checkpointed_iteration.txt  
  inflating: release/mp_rank_00/model_optim_rng.pt
$ python ../../transformers-master/src/transformers/models/megatron_bert/convert_megatron_bert_checkpoint.py release/mp_rank_00/model_optim_rng.pt
Extracting PyTorch state dictionary from "release/mp_rank_00/model_optim_rng.pt"
Converting
Saving config
Saving checkpoint to "release/mp_rank_00/pytorch_model.bin"

I'm passing the actual checkpoint file instead of the zip file, in case it wasn't clear from the command line.

Or you can upgrade to a newer python version; 3.6 is very old.
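
Alternatively, following the hint in the error message itself, you can pre-read the zip member into a seekable in-memory buffer (a sketch; the member path is taken from the unzip listing above):

import io
import zipfile
import torch

with zipfile.ZipFile("megatron-bert-uncased-345m/checkpoint.zip") as zf:
    with zf.open("release/mp_rank_00/model_optim_rng.pt") as f:
        buffer = io.BytesIO(f.read())  # fully seekable copy in memory
input_state_dict = torch.load(buffer, map_location="cpu")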


@LysandreJik, @sgugger - what do we want to do here as a long term fix?

I propose we detect python-3.6 and refuse to deal with the zipped checkpoint, asserting with a message that asks the user to unzip it first.

The same will need to be done for megatron_gpt.

There is no problem with py-3.7 and higher.
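
Something like this sketch of the proposed guard (not committed code; the function name is made up):

import sys
import zipfile

def _refuse_zip_on_py36(path):
    # torch.load on python-3.6 cannot read from a non-seekable zip member
    # handle, so fail early with an actionable message.
    if zipfile.is_zipfile(path) and sys.version_info < (3, 7):
        raise RuntimeError(
            f"python-3.6 cannot torch.load directly from {path}; please unzip "
            "it first and pass release/mp_rank_00/model_optim_rng.pt instead."
        )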

@sgugger
Collaborator

sgugger commented Mar 17, 2022

We might also start saying Transformers requires Python 3.7 or above since Python 3.6 is at the end of its life cycle.

@stas00
Contributor

stas00 commented Mar 17, 2022

oh, cool! thanks, @sgugger

So should we just change

"python>=3.6.0",

to 3.7.0? Then this Issue will get auto-resolved.
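
For concreteness, the change would be a one-liner along these lines (sketched against setup.py; the exact surrounding lines are hedged):

-    "python>=3.6.0",
+    "python>=3.7.0",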

@LysandreJik - is now a good time?
