
Update IPEX to 2.1.10+xpu #4931

Merged
oobabooga merged 1 commit into oobabooga:dev from notsyncing:dev on Dec 15, 2023

Conversation

@notsyncing
Contributor

  * This will require Intel oneAPI Toolkit 2024.0
@notsyncing
Contributor Author

Tested OK on my A770 16GB. It feels a little faster on ChatGLM3-6B.

@oobabooga oobabooga merged commit 127c71a into oobabooga:dev Dec 15, 2023
@oobabooga
Owner

oobabooga commented Dec 15, 2023

Thanks for the update. @notsyncing since you have an A770, could you try the new HQQ loader at #4888? I think that it should allow you to run bigger models in 4-bit precision with your GPU.

I recommend the following model for testing:

https://huggingface.co/mobiuslabsgmbh/Llama-2-13b-chat-hf-4bit_g64-HQQ

@notsyncing
Contributor Author

@oobabooga I'm getting this error with HQQ loader:

2023-12-15 13:11:46 INFO:Loading Llama-2-13b-chat-hf-4bit_g64-HQQ...
2023-12-15 13:11:46 INFO:Loading HQQ model with backend: PYTORCH
[tgwui]    | Failed to load the weights Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[tgwui]    | Traceback (most recent call last):
[tgwui]    |   File "/opt/text-generation-webui/modules/ui_model_menu.py", line 209, in load_model_wrapper
[tgwui]    |     shared.model, shared.tokenizer = load_model(selected_model, loader)
[tgwui]    |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/modules/models.py", line 90, in load_model
[tgwui]    |     output = load_func_map[loader](model_name)
[tgwui]    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/modules/models.py", line 417, in HQQ_loader
[tgwui]    |     model = HQQModelForCausalLM.from_quantized(str(model_dir))
[tgwui]    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/installer_files/env/lib/python3.11/site-packages/hqq/engine/base.py", line 71, in from_quantized
[tgwui]    |     cls._make_quantizable(model, quantized=True)
[tgwui]    |   File "/opt/text-generation-webui/installer_files/env/lib/python3.11/site-packages/hqq/engine/hf.py", line 29, in _make_quantizable
[tgwui]    |     model.hqq_quantized  = quantized
[tgwui]    |     ^^^^^^^^^^^^^^^^^^^
[tgwui]    | AttributeError: 'NoneType' object has no attribute 'hqq_quantized'
[tgwui]    | 

Looks like it supports CUDA only.
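For context, the "Attempting to deserialize object on a CUDA device" failure above comes from `torch.load` trying to restore tensors on the device they were saved from. A minimal, self-contained sketch of the usual workaround (the checkpoint path here is a throwaway temp file for illustration, not the HQQ model's actual layout):

```python
import os
import tempfile

import torch

# torch.load defaults to restoring tensors on the device they were
# serialized from; on a machine where torch.cuda.is_available() is False,
# a CUDA-saved checkpoint raises a deserialization error unless
# map_location redirects the storages.
path = os.path.join(tempfile.gettempdir(), "demo_weights.pt")
torch.save({"w": torch.ones(2)}, path)

# Redirect all storages to CPU (on an Intel GPU with IPEX, "xpu" could
# be used as the target instead):
state_dict = torch.load(path, map_location=torch.device("cpu"))
print(state_dict["w"].device.type)  # -> cpu
```

This is exactly the knob the fix below turns: loading with `map_location='xpu'` instead of the hard-coded CUDA default.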

@oobabooga
Owner

That's a bummer. Thanks for the test!

@oobabooga oobabooga mentioned this pull request Dec 19, 2023
@mobicham

mobicham commented Dec 19, 2023

> @oobabooga I'm getting this error with HQQ loader: [...] Looks like it supports CUDA only.

Can you replace the from_quantized call with this:

import torch
from hqq.core.quantize import HQQLinear
from hqq.models.hf.llama import LlamaHQQ

class LlamaHQQXPU(LlamaHQQ):
    @classmethod
    def from_quantized(cls, save_dir_or_hub, cache_dir=''):
        # Get directory path
        save_dir = cls.try_snapshot_download(save_dir_or_hub, cache_dir)

        # Load model from config
        model = cls.create_model(save_dir)

        # Name the layers
        cls.autoname_modules(model)

        # Load weights onto the XPU device instead of the CUDA default
        try:
            weights = cls.load_weights(save_dir, map_location='xpu')
        except Exception as error:
            print("Failed to load the weights", error)
            return

        # load_state_dict() doesn't work with modules initialized with
        # init_empty_weights(), so we need to do this manually
        @torch.no_grad()
        def _load_module(module, params=None):
            if module.name not in weights:
                return module.half().to('xpu')

            state_dict = weights[module.name]
            if ('W_q' in state_dict) and ('meta' in state_dict):
                module = HQQLinear(linear_layer=None, quant_config=None)
                module.load_state_dict(state_dict)
            else:
                for key in state_dict:
                    setattr(module, key, torch.nn.Parameter(state_dict[key], requires_grad=False))

            return module

        # Load modules
        cls.patch_model(model, _load_module, _load_module, dict([(k, None) for k in cls.get_linear_tags()]))
        # Load other weights that are not part of any module
        cls.post_module_load(model, weights)

        return model


# model_path is the directory of the downloaded HQQ checkpoint
model = LlamaHQQXPU.from_quantized(model_path)
