### System Info
The hardware environment and major software packages used are as follows:

- GPU: MI300
- ROCm: 6.10.5
- Python: 3.12.11
- numpy: 2.1.3
- tokenizers: 0.21.4
- torch: 2.7.1+rocm6.3
- torchaudio: 2.7.1+rocm6.3
- torchvision: 0.22.1+rocm6.3
- triton: 3.2.0+gite5begpu
### Who can help?
No response
### Information
- The official example scripts
- My own modified scripts
### Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
### Reproduction
- Loading the model:

```python
from transformers import Llama4ForConditionalGeneration

# `from_pretrained` takes the model id as its first positional argument
# (there is no `ckpt_path` keyword).
model = Llama4ForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
```
- Using the `lm_eval` library (lm-evaluation-harness) to run the evaluation.
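For reference, the three wikitext metrics reported below can all be derived from one summed token log-likelihood. This is a sketch assuming the lm-evaluation-harness definitions (natural-log likelihood normalized by byte and word counts); the numbers fed in here are hypothetical, purely for illustration:

```python
import math

def wikitext_metrics(total_loglikelihood, n_bytes, n_words):
    """Derive the wikitext metrics from a summed natural-log likelihood.

    Assumed definitions (lm-evaluation-harness style):
      word_perplexity = exp(-ll / n_words)
      byte_perplexity = exp(-ll / n_bytes)
      bits_per_byte   = -ll / (n_bytes * ln 2)  # = log2(byte_perplexity)
    """
    word_ppl = math.exp(-total_loglikelihood / n_words)
    byte_ppl = math.exp(-total_loglikelihood / n_bytes)
    bits_per_byte = -total_loglikelihood / (n_bytes * math.log(2))
    return word_ppl, byte_ppl, bits_per_byte

# Hypothetical inputs, for illustration only:
w, b, bpb = wikitext_metrics(-1000.0, n_bytes=2400, n_words=450)
print(f"word_ppl={w:.4f} byte_ppl={b:.4f} bits_per_byte={bpb:.4f}")
```

A lower value is better for all three, which is why a regression shows up as every metric increasing.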
### Expected behavior
With transformers==4.53.0, the harness perplexity of Llama-4-Scout-17B-16E-Instruct was as follows:
| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.6006|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.5164|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |9.2650|±  |   N/A|
After upgrading transformers to 4.55.1, the perplexity was:
| Tasks  |Version|Filter|n-shot|     Metric    |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 1.2306|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 2.3466|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |95.7053|±  |   N/A|
All perplexity metrics have deteriorated significantly.
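As a sanity check on the reported numbers (not a claim from the report itself): in both tables `bits_per_byte` equals `log2(byte_perplexity)`, so each run is internally consistent and the regression is not a metric-reporting glitch:

```python
import math

# (byte_perplexity, bits_per_byte) pairs as reported by the two runs above.
runs = {
    "transformers==4.53.0": (1.5164, 0.6006),
    "transformers==4.55.1": (2.3466, 1.2306),
}
for version, (byte_ppl, bpb) in runs.items():
    # bits_per_byte should be byte_perplexity expressed in log base 2.
    assert abs(math.log2(byte_ppl) - bpb) < 1e-3, version
    print(f"{version}: log2({byte_ppl}) = {math.log2(byte_ppl):.4f}")
```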