Allow converting HF Falcon models with only one shard in memory at a time (#1)
Merged
jploski merged 1 commit into jploski:falcon40b (Jun 14, 2023)
Conversation
Just as a side note, I could convert the 40B model on 64GB RAM; you just need plenty of fast swap and a bit of patience. In any case, great work splitting it up :)
Author
It's a shame I possess neither of those things! But yeah, assuming you're willing to throw enough time and swap at the problem, there's no memory issue that's insurmountable. :)
Some changes to the conversion script to allow loading just one part of the multi-part model into memory at a time. From what I have heard, Transformers shards don't split individual tensors across files, so this should be safe.
Without this change, I wasn't able to convert the real 40B Falcon model with 64GB RAM. I verified that the mini-Shakespeare model gets converted correctly and that I get reasonable output from the quantized version.
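The shard-at-a-time idea can be sketched roughly as follows. This is a minimal illustration, not the actual conversion script: it uses pickle files to stand in for the real Transformers checkpoint shards, and `convert_tensor`, the shard naming pattern, and the output format are all hypothetical placeholders. The point is only the memory discipline: load one shard, write its converted tensors out, then drop every reference to it before touching the next shard.

```python
import gc
import glob
import pickle


def convert_tensor(tensor):
    # Placeholder for whatever per-tensor work the real script does
    # (dtype conversion, transposition, quantization, ...).
    return tensor


def convert_shards(shard_pattern, out_path):
    """Convert a multi-part checkpoint one shard at a time.

    Only one shard's tensors are resident in memory at any moment;
    each shard is released before the next one is loaded, so peak
    memory is bounded by the largest single shard, not the full model.
    """
    converted_names = []
    with open(out_path, "wb") as out:
        for shard_file in sorted(glob.glob(shard_pattern)):
            with open(shard_file, "rb") as f:
                shard = pickle.load(f)  # load just this one part
            for name, tensor in shard.items():
                pickle.dump((name, convert_tensor(tensor)), out)
                converted_names.append(name)
            del shard       # drop references before loading the next part
            gc.collect()    # encourage immediate reclamation
    return converted_names
```

This only works because, as noted above, no single tensor is split across shard files; each shard can be processed in isolation and its results streamed to the output.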