I quantized the Llama3-8B model using QAT. When I then tried to run inference with the quantized model, I encountered the following error.
Logs:
tune run generate --config llama3_generation_config.yaml
2024-10-09:08:06:20,408 INFO [_logging.py:101] Running InferenceRecipe with resolved config:
chat_format: null
checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files:
  - meta_model_2-8da4w.pt
  model_type: LLAMA3
  output_dir: /QAT/output/llama3-8B/
device: cuda
dtype: bf16
enable_kv_cache: true
instruct_template: null
max_new_tokens: 300
model:
  _component_: torchtune.models.llama3.llama3_8b
prompt: Tell me a joke?
quantizer:
  _component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256
seed: 42
temperature: 0.6
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
top_k: 300
2024-10-09:08:06:20,812 DEBUG [seed.py:60] Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Traceback (most recent call last):
  File "/usr/local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 196, in _run_cmd
    self._run_single_device(args, is_builtin=is_builtin)
  File "/usr/local/lib/python3.10/dist-packages/torchtune/_cli/run.py", line 102, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/usr/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 211, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 206, in main
    recipe.setup(cfg=cfg)
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 55, in setup
    self._model = self._setup_model(
  File "/usr/local/lib/python3.10/dist-packages/recipes/generate.py", line 73, in _setup_model
    model.load_state_dict(model_state_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerDecoder:
    While copying the parameter named "layers.0.attn.q_proj.weight", whose dimensions in the model are torch.Size([4096, 4096]) and whose dimensions in the checkpoint are torch.Size([4096, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.k_proj.weight", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.v_proj.weight", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.attn.output_proj.weight", whose dimensions in the model are torch.Size([4096, 4096]) and whose dimensions in the checkpoint are torch.Size([4096, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.mlp.w1.weight", whose dimensions in the model are torch.Size([14336, 4096]) and whose dimensions in the checkpoint are torch.Size([14336, 4096]), an exception occurred : ('LinearActivationQuantizedTensor dispatch: attempting to run unimplemented operator/function: aten.copy_.default',).
    While copying the parameter named "layers.0.mlp.w2.weight", whose dimensions in the model are torch.Size([4096, 14336]) and whose
My generation.yaml is:
# Config for running the InferenceRecipe in generate.py to generate output from an LLM
#
# To launch, run the following command from the root torchtune directory:
#   tune run generate --config generation

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelTorchTuneCheckpointer
  checkpoint_dir: /QAT/output/llama3-8B/
  checkpoint_files: [
    meta_model_2-8da4w.pt
  ]
  output_dir: /QAT/output/llama3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 42

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /QAT/Meta-Llama-3-8B/original/tokenizer.model
  max_seq_len: null

# Generation arguments; defaults taken from gpt-fast
prompt: "Tell me a joke?"
instruct_template: null
chat_format: null
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
enable_kv_cache: True

quantizer:
  _component_: torchtune.training.quantization.Int8DynActInt4WeightQuantizer
  groupsize: 256
Can anyone help me? Thanks very much.