
Update IPEX to 2.1.10+xpu #4931

Merged
oobabooga merged 1 commit into oobabooga:dev from notsyncing:dev on Dec 15, 2023

Conversation

@notsyncing
Contributor

  * This will require Intel oneAPI Toolkit 2024.0
@notsyncing
Contributor Author

Tested OK on my A770 16GB. It feels a little faster on ChatGLM3-6B.

@oobabooga oobabooga merged commit 127c71a into oobabooga:dev Dec 15, 2023
@oobabooga
Owner

oobabooga commented Dec 15, 2023

Thanks for the update. @notsyncing since you have an A770, could you try the new HQQ loader at #4888? I think that it should allow you to run bigger models in 4-bit precision with your GPU.

I recommend the following model for testing:

https://huggingface.co/mobiuslabsgmbh/Llama-2-13b-chat-hf-4bit_g64-HQQ

@notsyncing
Contributor Author

@oobabooga I'm getting this error with HQQ loader:

2023-12-15 13:11:46 INFO:Loading Llama-2-13b-chat-hf-4bit_g64-HQQ...
2023-12-15 13:11:46 INFO:Loading HQQ model with backend: PYTORCH
[tgwui]    | Failed to load the weights Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[tgwui]    | Traceback (most recent call last):
[tgwui]    |   File "/opt/text-generation-webui/modules/ui_model_menu.py", line 209, in load_model_wrapper
[tgwui]    |     shared.model, shared.tokenizer = load_model(selected_model, loader)
[tgwui]    |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/modules/models.py", line 90, in load_model
[tgwui]    |     output = load_func_map[loader](model_name)
[tgwui]    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/modules/models.py", line 417, in HQQ_loader
[tgwui]    |     model = HQQModelForCausalLM.from_quantized(str(model_dir))
[tgwui]    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[tgwui]    |   File "/opt/text-generation-webui/installer_files/env/lib/python3.11/site-packages/hqq/engine/base.py", line 71, in from_quantized
[tgwui]    |     cls._make_quantizable(model, quantized=True)
[tgwui]    |   File "/opt/text-generation-webui/installer_files/env/lib/python3.11/site-packages/hqq/engine/hf.py", line 29, in _make_quantizable
[tgwui]    |     model.hqq_quantized  = quantized
[tgwui]    |     ^^^^^^^^^^^^^^^^^^^
[tgwui]    | AttributeError: 'NoneType' object has no attribute 'hqq_quantized'
[tgwui]    | 

Looks like it supports CUDA only.
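For context, the "Attempting to deserialize object on a CUDA device" failure above comes from `torch.load` trying to restore tensors on the device they were saved from. A minimal, self-contained sketch of the usual workaround (the checkpoint path here is a throwaway temp file for illustration, not the HQQ model's actual layout):

```python
import os
import tempfile

import torch

# torch.load defaults to restoring tensors on the device they were
# serialized from; on a machine where torch.cuda.is_available() is False,
# a CUDA-saved checkpoint raises a deserialization error unless
# map_location redirects the storages.
path = os.path.join(tempfile.gettempdir(), "demo_weights.pt")
torch.save({"w": torch.ones(2)}, path)

# Redirect all storages to CPU (on an Intel GPU with IPEX, "xpu" could
# be used as the target instead):
state_dict = torch.load(path, map_location=torch.device("cpu"))
print(state_dict["w"].device.type)  # -> cpu
```

This is exactly the knob the fix below turns: loading with `map_location='xpu'` instead of the hard-coded CUDA default.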

@oobabooga
Owner

That's a bummer. Thanks for the test!

@oobabooga oobabooga mentioned this pull request Dec 19, 2023
@mobicham

mobicham commented Dec 19, 2023

> @oobabooga I'm getting this error with HQQ loader: [...] Looks like it supports CUDA only.

Can you replace the from_quantized call with this:

import torch
from hqq.core.quantize import HQQLinear
from hqq.models.hf.llama import LlamaHQQ

class LlamaHQQXPU(LlamaHQQ):
    @classmethod
    def from_quantized(cls, save_dir_or_hub, cache_dir=''):
        # Get directory path
        save_dir = cls.try_snapshot_download(save_dir_or_hub, cache_dir)

        # Load model from config
        model = cls.create_model(save_dir)

        # Name the layers
        cls.autoname_modules(model)

        # Load weights onto the XPU device instead of the CUDA default
        try:
            weights = cls.load_weights(save_dir, map_location='xpu')
        except Exception as error:
            print("Failed to load the weights", error)
            return

        # load_state_dict() doesn't work with modules initialized with
        # init_empty_weights(), so we need to do this manually
        @torch.no_grad()
        def _load_module(module, params=None):
            if module.name not in weights:
                return module.half().to('xpu')

            state_dict = weights[module.name]
            if ('W_q' in state_dict) and ('meta' in state_dict):
                module = HQQLinear(linear_layer=None, quant_config=None)
                module.load_state_dict(state_dict)
            else:
                for key in state_dict:
                    setattr(module, key, torch.nn.Parameter(state_dict[key], requires_grad=False))

            return module

        # Load modules
        cls.patch_model(model, _load_module, _load_module, dict([(k, None) for k in cls.get_linear_tags()]))
        # Load other weights that are not part of any module
        cls.post_module_load(model, weights)

        return model


# model_path is the directory of the downloaded HQQ checkpoint
model = LlamaHQQXPU.from_quantized(model_path)
