I quantized the Llama3-8B model using QAT. When I then tried to run inference with the quantized model, I encountered the following error.
Logs:
tune run generate --config llama3_generation_config.yaml
2024-10-09:08:06:20,408 INFO [_logging.py:101] Running InferenceRecipe with resolved config:
chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files:
  - meta_model_2-8da4w.pt
  model_type: LLAMA3
  output_dir: /QAT/output/llama3-8B/
device: cuda
dtype: bf16
enable_kv_cache: true
instruct_template: null
max_new_tokens: 300
model:
  _component_: torchtune.models.llama3.llama3_8b
prompt: Tell me a joke?
quantizer:
  _component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256
seed: 42
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
top_k: 300
2024-10-09:08:06:20,812 DEBUG [seed.py:60] Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Traceback (most recent call last):
  File "/usr/local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 196, in _run_cmd
    self._run_single_device(args, is_builtin=is_builtin)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 102, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 211, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 206, in main
    recipe.setup(cfg=cfg)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 55, in setup
    self._model = self._setup_model(
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 73, in _setup_model
    model.load_state_dict(model_state_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerDecoder:
    While copying the parameter named "layers.0.attn.q_proj.weight", whose dimensions in the model are torch.Size([4096, 4096]) and whose dimensions in the checkpoint are torch.Size([4096, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.k_proj.weight", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.v_proj.weight", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.output_proj.weight", whose dimensions in the model are torch.Size([4096, 4096]) and whose dimensions in the checkpoint are torch.Size([4096, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.mlp.w1.weight", whose dimensions in the model are torch.Size([14336, 4096]) and whose dimensions in the checkpoint are torch.Size([14336, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.mlp.w2.weight", whose dimensions in the model are torch.Size([4096, 14336]) and whose
My generation.yaml is:
# Config for running the InferenceRecipe in generate.py to generate output from an LLM
#
# To launch, run the following command from the root torchtune directory:
#   tune run generate --config generation

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files: [
    meta_model_2-8da4w.pt
  ]
  output_dir: /QAT/output/llama3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 42

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
  max_seq_len: null

# Generation arguments; defaults taken from gpt-fast
prompt: "Tell me a joke?"
instruct_template: null
chat_format: null
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
enable_kv_cache: True

quantizer:
  _component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256
Can anyone help me? Thanks very much.