fix llama model text generation error #1402
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
# TODO: remove the float() operation in v4.46
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
if reuse_cache:
```
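For context, the slicing above is the upstream `num_logits_to_keep` optimization: during decoding only the logits for the trailing positions are needed, so slicing `hidden_states` before the `lm_head` projection avoids allocating a full `(batch, seq_len, vocab)` tensor. A minimal standalone sketch (shapes are illustrative, not the actual model dimensions):

```python
import torch

# Illustrative toy shapes; the real model uses its own hidden and vocab sizes.
batch, seq_len, hidden, vocab = 2, 16, 8, 32
hidden_states = torch.randn(batch, seq_len, hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)

num_logits_to_keep = 1  # during decoding only the last position's logits are needed
# Slicing before the projection avoids a (batch, seq_len, vocab) allocation.
logits = lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
print(logits.shape)  # torch.Size([2, 1, 32])
```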
Can you also check if mixtral and mistral need this code.
Prior to this fix, for the command `pytest -s -v tests/transformers/tests/models/ -k "contrastive_generate_dynamic_shapes"`, we saw three failures:
```
FAILED tests/transformers/tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_contrastive_generate_dynamic_shapes - AssertionError: Lists differ: [[47, 95, 92, 35, 83, 82]] != [[47, 95, 92, 35, 66, 8]]
FAILED tests/transformers/tests/models/mistral/test_modeling_mistral.py::MistralModelTest::test_contrastive_generate_dynamic_shapes - AssertionError: Lists differ: [[55, 43, 83, 75, 15, 60]] != [[55, 43, 83, 19, 15, 94]]
FAILED tests/transformers/tests/models/mixtral/test_modeling_mixtral.py::MixtralModelTest::test_contrastive_generate_dynamic_shapes - AssertionError: Lists differ: [[41, 73, 42, 57, 65, 26]] != [[41, 73, 42, 17, 89, 3]]
==================================================================== 3 failed, 7 passed, 1425 deselected, 15 warnings in 19.74s ==
```
They pass if I add the same change to mixtral and mistral:
```
========================================================================= 10 passed, 1425 deselected, 15 warnings in 22.51s =========================================================================
```
Added the same change to mixtral, mistral, qwen, starcoder and phi, which resolved text generation on these models:
'mistralai/Mistral-7B-Instruct-v0.2'
'Qwen/Qwen2.5-Coder-1.5B'
'bigcode/starcoder2-15b'
Not sure if this is the best way, but a quick search shows the following models have this line.
But I am not sure if other models like t5 or gpt2 need this; they use a different implementation, i.e. they don't index into anything. Another list based on reuse_cache usage across models:
Gemma does seem to need it, but I am not sure about the others.
@vidyasiv @zongwave: Discussed further with @ssarkar2, and this logic is actually exactly the same as our --trim_logits. We should remove num_logits_to_keep from HPU and go back to the old logic (logits = self.lm_head(hidden_states)), since it will interfere with our --trim_logits and also --use_hpu_graphs. We might not gain anything here from num_logits_to_keep as long as --trim_logits is on. We can later work on combining trim_logits and num_logits_to_keep into one, so the later transformers upmerge is easier. @regisss could you also check on my command above?
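The overlap described here can be illustrated with a toy comparison: trimming hidden_states to the last token (the --trim_logits approach) and slicing with num_logits_to_keep=1 compute the same logits, so keeping both mechanisms active is redundant (names and shapes below are illustrative, not the actual Gaudi model code):

```python
import torch

batch, seq_len, hidden, vocab = 2, 16, 8, 32
hidden_states = torch.randn(batch, seq_len, hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)

# --trim_logits style: trim hidden_states to the last token before projecting.
trimmed = lm_head(hidden_states[:, -1:, :])

# num_logits_to_keep style: slice with a configurable count (1 during decoding).
num_logits_to_keep = 1
kept = lm_head(hidden_states[:, -num_logits_to_keep:, :])

# Both paths produce identical logits, which is why having both mechanisms
# active at once can interfere (e.g. with HPU graph caching).
print(torch.equal(trimmed, kept))  # True
```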
I removed the "num_logits_to_keep" indexing for "hidden_states" in PR #1359.
```diff
 # No upscaling to float was ever done for Persimmon
-logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
+logits = self.lm_head(hidden_states)
```
This model doesn't seem to have trim_logits; we can notify the author to add trim_logits support.
Use trim_logits on HPU to save memory (comment out the num_logits_to_keep in utils.py).
For now, we decided to just modify the utils.py file so it does not set num_logits_to_keep to 1; that way it has no effect on the run (the default is 0). We should revisit this to see if we can merge this new feature with trim_logits. The HPU call flows all differ depending on the combination of these three arguments (use_hpu_graphs, trim_logits, reuse_cache), so we should test every flow and make sure they all work.
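As a rough sketch of that decision (the helper name below is hypothetical, not the actual utils.py code), the idea is simply to leave num_logits_to_keep at its default of 0 so the slicing path becomes a no-op:

```python
# Sketch only (hypothetical helper name): leave num_logits_to_keep at its
# default of 0 so the slicing path is a no-op and does not interfere with
# --trim_logits or HPU graph replay.
def prepare_generation_kwargs(model_kwargs):
    # Previously the code forced something like:
    #   model_kwargs["num_logits_to_keep"] = 1
    model_kwargs.setdefault("num_logits_to_keep", 0)
    return model_kwargs

kwargs = prepare_generation_kwargs({})
print(kwargs["num_logits_to_keep"])  # 0
```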
I verified the output for one model for these cases, but we should add a test for the relevant model architectures to check functional accuracy:
@regisss, @jiminha: is it possible to add something like #1411? It is a mockup, and I am not sure of the variability of output across SW versions. If we are sure it should be the same, I can continue that PR.
Definitely useful! I've had this in mind for a while, and I fully agree that testing the outputs becomes more and more critical. Happy to review this PR when it's ready 🙂
@vidyasiv I haven't checked out the details of the code, but this is absolutely needed. We are only testing the performance of text-gen, not the generated output tokens, and we missed the opportunity to catch this bug early on. I would prefer, though, to extend the current text-gen tests to check at least the first output token rather than adding a new one, unless there is a specific reason to have a separate test just for accuracy checks.
The main reason to make it separate is that the current text-gen tests are slow tests and sometimes cover more than one case per model. We need a fast test that functionally checks select key models so it can run with every PR. It won't be exhaustive, but we have to start somewhere, so I thought of basing it on this particular failure; next time we fix a functional issue, we can update this test file with the relevant options, etc.
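A fast first-divergence check along those lines could look like the sketch below. The reference ids are taken from the expected lists in the failing test output earlier in this thread, and the helper name is hypothetical; a real test would record reference tokens from a validated build per model and SW version:

```python
# Sketch of a fast functional check: compare generated token ids against
# reference ids recorded from a known-good run, reporting the first
# divergent position so per-PR runs stay cheap. Reference ids below are
# the expected lists from the failing contrastive-generation test.
REFERENCE_TOKENS = {
    "meta-llama/Llama-2-7b-hf": [47, 95, 92, 35, 66, 8],
    "mistralai/Mistral-7B-Instruct-v0.2": [55, 43, 83, 19, 15, 94],
}

def check_model(model_name, generated_ids):
    """Return (ok, first_divergent_position) for one model's generated ids."""
    expected = REFERENCE_TOKENS[model_name]
    for pos, (got, want) in enumerate(zip(generated_ids, expected)):
        if got != want:
            return False, pos
    return True, None

# The buggy llama run from the failure above diverges at position 4.
ok, pos = check_model("meta-llama/Llama-2-7b-hf", [47, 95, 92, 35, 83, 82])
print(ok, pos)  # False 4
```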
What does this PR do?
PR #1359 introduced an error in the text generation of some models:
'meta-llama/Llama-2-7b-hf'
'mistralai/Mistral-7B-Instruct-v0.2'
'Qwen/Qwen2-7B'
'bigcode/starcoder2-15b'
Reproduce command:

```shell
python examples/text-generation/run_generation.py --use_hpu_graphs --model_name_or_path meta-llama/Llama-2-7b-hf
```

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkЉЉЉЉЉЉЉЉЉЉЉЉЉЉЉ\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n.............................\n\n............\n......',)
Fixes # (issue)
Only apply the slicing operation to hidden_states when computing logits if kv_cache and reuse_cache are enabled.
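The gist of that condition can be sketched as follows (flag names follow the PR description; the exact guard in the model code may differ, and shapes here are illustrative):

```python
import torch

batch, seq_len, hidden, vocab = 2, 16, 8, 32
hidden_states = torch.randn(batch, seq_len, hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)

use_cache, reuse_cache = True, True
num_logits_to_keep = 1

# Only slice hidden_states when both caching flags are enabled;
# otherwise fall back to projecting the full sequence.
if num_logits_to_keep and use_cache and reuse_cache:
    logits = lm_head(hidden_states[:, -num_logits_to_keep:, :])
else:
    logits = lm_head(hidden_states)

print(logits.shape)  # torch.Size([2, 1, 32])
```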
Before submitting