Merged

27 commits
a69fd22
add gptqmodel support
jiqing-feng Dec 3, 2024
3b87bae
enable gptqmodel tests
jiqing-feng Dec 4, 2024
e17c4ea
Fix Peft compat (#1)
LRL-ModelCloud Dec 5, 2024
01c7429
update gptqmodel version (#2)
ZX-ModelCloud Dec 16, 2024
c80fffd
check model has attr 'hf_device_map'
LRL-ModelCloud Jan 17, 2025
e887299
update gptqmodel version
Qubitium Jan 17, 2025
60c03af
revert test_common_gpu.py and test_gpu_examples.py
LRL-ModelCloud Jan 17, 2025
ec4d6fe
add test_gptqmodel.py
LRL-ModelCloud Jan 17, 2025
b54d034
PeftGPTQModelCommonTests require gptqmodel
LRL-ModelCloud Jan 17, 2025
12ab8a0
cleanup
LRL-ModelCloud Jan 17, 2025
c8c3d8e
use peft_model.device
LRL-ModelCloud Jan 17, 2025
946d1d7
format code
LRL-ModelCloud Jan 17, 2025
4f11d86
update copyright notice
LRL-ModelCloud Jan 17, 2025
4f13f7b
device_map is optional
LRL-ModelCloud Jan 17, 2025
9fcdd02
update Makefle, add test_gptqmodel_gpu
LRL-ModelCloud Jan 17, 2025
17440c4
Merge branch 'huggingface:main' into gptq
LRL-ModelCloud Jan 17, 2025
c206e7b
add get_gptqmodel_quant_linear to __all__
LRL-ModelCloud Jan 17, 2025
e0439fd
add gptq to quantization.md
Qubitium Jan 17, 2025
1f79dae
Update quantization.md
Qubitium Jan 17, 2025
c15a302
cleanup
LRL-ModelCloud Jan 17, 2025
6a3adc6
Merge remote-tracking branch 'origin/gptq' into gptq
LRL-ModelCloud Jan 17, 2025
59932fd
cleanup
LRL-ModelCloud Jan 17, 2025
75ddd5e
Update docs/source/developer_guides/quantization.md
Qubitium Jan 18, 2025
dab7a54
Update docs/source/developer_guides/quantization.md
Qubitium Jan 18, 2025
d5e55b6
Merge branch 'huggingface:main' into gptq
LRL-ModelCloud Jan 22, 2025
540e5af
pass device_map as a keyword argument
LRL-ModelCloud Jan 22, 2025
fa3ab05
add optimum version check for gptqmodel compatibility
LRL-ModelCloud Jan 22, 2025
3 changes: 3 additions & 0 deletions Makefile
@@ -35,6 +35,9 @@ tests_common_gpu:
python -m pytest tests/test_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
python -m pytest tests/test_encoder_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_encoder_decoder.log",)

test_gptqmodel_gpu:

@BenjaminBossan (Member) commented on Jan 17, 2025:

This won't work, as this make target is never called anywhere. What I meant is to just add the line to tests_common_gpu above, which is already called in the appropriate setting.

python -m pytest tests/test_gptqmodel.py $(if $(IS_GITHUB_CI),--report-log "gptqmodel_gpu.log",)

tests_examples_multi_gpu_bnb:
python -m pytest -m "multi_gpu_tests and bitsandbytes" tests/test_gpu_examples.py $(if $(IS_GITHUB_CI),--report-log "multi_gpu_examples.log",)

26 changes: 26 additions & 0 deletions docs/source/developer_guides/quantization.md
@@ -107,6 +107,32 @@ QLoRA adds trainable weights to all the linear layers in the transformer architecture
config = LoraConfig(target_modules="all-linear", ...)
```

## GPTQ quantization

You can learn more about GPTQ-based 2-, 3-, 4-, and 8-bit quantization at [GPTQModel](https://github.com/ModelCloud/GPTQModel) and in the [Hugging Face GPTQ documentation](https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization/gptq.md). PEFT post-quantization training works with either the [GPTQModel](https://github.com/ModelCloud/GPTQModel) or the [AutoGPTQ](https://github.com/autogptq/autogptq) library, but we recommend GPTQModel because AutoGPTQ will be deprecated in a future release.

```bash
# gptqmodel install
pip install gptqmodel --no-build-isolation
```

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)

A reviewer (Member) commented:

When trying to run this locally with 2 CUDA devices, I encountered a CUDA error after 50% progress:

File ~/work/forks/transformers/src/transformers/models/opt/modeling_opt.py:559, in OPTDecoderLayer.forward(self, hidden_states, attention_mask, layer_head_mask, past_key_value, output_attentions, use_cache, position_ids)
    556 hidden_states = self.fc2(hidden_states)
    557 hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
--> 559 hidden_states = (residual + hidden_states).view(hidden_states_shape)
    561 # 350m applies layer norm AFTER attention
    562 if not self.do_layer_norm_before:

RuntimeError: CUDA error: an illegal memory access was encountered

Is this a known problem? Using 1 CUDA device or setting CUDA_LAUNCH_BLOCKING=1 resolves the error.

I suspect that the error occurs at the "switch" from GPU 0 to GPU 1, since that's exactly after half the layers when using device_map="auto".

@Qubitium (Contributor) commented on Jan 23, 2025:

We will double-check this to see whether it is (a) accelerate-specific or (b) OPT-specific.

  • For GPTQModel, we do not test multi-GPU quantization since it is a net negative in terms of quantization speed.
  • For optimum, the GPU splitting is performed by accelerate, so this may be related to accelerate or to the OPT model.

For the next GPTQModel CI tests PR, I would recommend moving all model tests from OPT to Llama 1B. I believe OPT was chosen for its tiny size, but in our experience there are some quirks in the OPT modeling code (which I can't recall exactly) that cause strange issues here and there. We recently dropped all CI OPT tests in favor of Llama for this reason. Again, I can't seem to recall the exact details. =(

Basically no one uses OPT anymore, and modeling changes heavily favor Llama, so fringe bugs are much less likely to occur in Llama-class models.

The reviewer (Member) replied:

Agreed that OPT is very outdated at this point, and we mainly use it since it's small, but at least for PEFT it hasn't caused any problems yet.

I ran the code above using meta-llama/Llama-3.2-1B and again got an error after 50%:

File ~/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py:125, in Linear.forward(self, input)
    124 def forward(self, input: Tensor) -> Tensor:
--> 125     return F.linear(input, self.weight, self.bias)

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Thus it's unlikely to be related to the model architecture. CUDA_LAUNCH_BLOCKING=1 again was enough to resolve the issue.


# save quantized model
quantized_model.save_pretrained("./opt-125m-gptq")
tokenizer.save_pretrained("./opt-125m-gptq")
```

Once quantized, you can post-train GPTQ models using the normal PEFT APIs.
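
For example, here is a minimal sketch of attaching a LoRA adapter to the quantized checkpoint saved above; the `target_modules` and hyperparameters are illustrative assumptions for an OPT-style model:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# reload the GPTQ-quantized checkpoint saved above
quantized_model = AutoModelForCausalLM.from_pretrained("./opt-125m-gptq", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./opt-125m-gptq")

# illustrative LoRA settings; adjust target_modules for your model architecture
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(quantized_model, lora_config)
peft_model.print_trainable_parameters()
# peft_model can now be trained with transformers.Trainer or a custom training loop
```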

## AQLM quantization

Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. This allows it to compress models down to as low as 2-bit with considerably low accuracy losses.
14 changes: 14 additions & 0 deletions src/peft/import_utils.py
@@ -47,6 +47,20 @@ def is_auto_gptq_available():
)


@lru_cache
def is_gptqmodel_available():
if importlib.util.find_spec("gptqmodel") is not None:
GPTQMODEL_MINIMUM_VERSION = packaging.version.parse("1.7.0")
version_gptqmodel = packaging.version.parse(importlib_metadata.version("gptqmodel"))
if GPTQMODEL_MINIMUM_VERSION <= version_gptqmodel:
return True
else:
raise ImportError(
f"Found an incompatible version of gptqmodel. Found version {version_gptqmodel}, "
f"but only versions above {GPTQMODEL_MINIMUM_VERSION} are supported"
)


@lru_cache
def is_optimum_available() -> bool:
return importlib.util.find_spec("optimum") is not None
16 changes: 11 additions & 5 deletions src/peft/tuners/adalora/model.py
@@ -17,14 +17,15 @@
import torch
from transformers.pytorch_utils import Conv1D

from peft.import_utils import is_bnb_4bit_available, is_bnb_available
from peft.import_utils import is_bnb_4bit_available, is_bnb_available, is_gptqmodel_available
from peft.tuners.lora import LoraConfig, LoraModel
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import (
TRANSFORMERS_MODELS_TO_ADALORA_TARGET_MODULES_MAPPING,
_freeze_adapter,
_get_submodules,
get_auto_gptq_quant_linear,
get_gptqmodel_quant_linear,
get_quantization_config,
)
from peft.utils.integrations import gather_params_ctx
@@ -135,7 +136,8 @@ def _create_and_replace(

# If it is not an AdaLoraLayer, create a new module, else update it with new adapters
if not isinstance(target, AdaLoraLayer):
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
device_map = self.model.hf_device_map if hasattr(self.model, "hf_device_map") else None
new_module = self._create_new_module(lora_config, adapter_name, target, device_map, **kwargs)
if adapter_name not in self.active_adapters:
# adding an additional adapter: it is not automatically trainable
new_module.requires_grad_(False)
@@ -150,7 +152,7 @@
)

@staticmethod
def _create_new_module(lora_config, adapter_name, target, **kwargs):
def _create_new_module(lora_config, adapter_name, target, device_map=None, **kwargs):
# avoid eager bnb import
if is_bnb_available():
import bitsandbytes as bnb
@@ -160,7 +162,11 @@ def _create_new_module(lora_config, adapter_name, target, **kwargs):
from .bnb import SVDLinear4bit

gptq_quantization_config = kwargs.get("gptq_quantization_config", None)
AutoGPTQQuantLinear = get_auto_gptq_quant_linear(gptq_quantization_config)

if is_gptqmodel_available():
QuantLinear = get_gptqmodel_quant_linear(gptq_quantization_config, device_map)
else:
QuantLinear = get_auto_gptq_quant_linear(gptq_quantization_config)

loaded_in_8bit = kwargs.pop("loaded_in_8bit", False)
loaded_in_4bit = kwargs.pop("loaded_in_4bit", False)
@@ -189,7 +195,7 @@ def _create_new_module(lora_config, adapter_name, target, **kwargs):
}
)
new_module = SVDLinear4bit(target, adapter_name, **fourbit_kwargs)
elif AutoGPTQQuantLinear is not None and isinstance(target, AutoGPTQQuantLinear):
elif QuantLinear is not None and isinstance(target, QuantLinear):
new_module = SVDQuantLinear(target, adapter_name, **kwargs)
else:
if isinstance(target_base_layer, torch.nn.Linear):
14 changes: 10 additions & 4 deletions src/peft/tuners/lora/gptq.py
@@ -16,9 +16,10 @@

import torch

from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear


class QuantLinear(torch.nn.Module, LoraLayer):
@@ -106,10 +107,15 @@ def dispatch_gptq(
else:
target_base_layer = target

gptq_quantization_config = kwargs.get("gptq_quantization_config", None)
AutoGPTQQuantLinear = get_auto_gptq_quant_linear(gptq_quantization_config)
cfg = kwargs.get("gptq_quantization_config", None)

if AutoGPTQQuantLinear is not None and isinstance(target_base_layer, AutoGPTQQuantLinear):
if is_gptqmodel_available():
device_map = kwargs.get("device_map", None)
quant_linear = get_gptqmodel_quant_linear(cfg, device_map=device_map)
else:
quant_linear = get_auto_gptq_quant_linear(cfg)

if quant_linear is not None and isinstance(target_base_layer, quant_linear):
new_module = QuantLinear(target, adapter_name, **kwargs)
target.qweight = target_base_layer.qweight

3 changes: 2 additions & 1 deletion src/peft/tuners/lora/model.py
@@ -232,7 +232,8 @@ def _create_and_replace(
lora_bias=lora_config.lora_bias,
)
else:
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
device_map = self.model.hf_device_map if hasattr(self.model, "hf_device_map") else None
new_module = self._create_new_module(lora_config, adapter_name, target, device_map=device_map, **kwargs)
if adapter_name not in self.active_adapters:
# adding an additional adapter: it is not automatically trainable
new_module.requires_grad_(False)
2 changes: 2 additions & 0 deletions src/peft/utils/__init__.py
@@ -39,6 +39,7 @@
bloom_model_postprocess_past_key_value,
cast_mixed_precision_params,
get_auto_gptq_quant_linear,
get_gptqmodel_quant_linear,
get_quantization_config,
id_tensor_storage,
infer_device,
@@ -77,6 +78,7 @@
"bloom_model_postprocess_past_key_value",
"cast_mixed_precision_params",
"get_auto_gptq_quant_linear",
"get_gptqmodel_quant_linear",
"get_peft_model_state_dict",
"get_quantization_config",
"id_tensor_storage",
89 changes: 66 additions & 23 deletions src/peft/utils/other.py
@@ -30,7 +30,7 @@
from packaging import version
from safetensors.torch import storage_ptr, storage_size

from ..import_utils import is_auto_gptq_available, is_torch_tpu_available
from ..import_utils import is_auto_gptq_available, is_gptqmodel_available, is_torch_tpu_available
from .constants import (
CONFIG_NAME,
EMBEDDING_LAYER_NAMES,
@@ -608,30 +608,73 @@ def get_auto_gptq_quant_linear(gptq_quantization_config):
"""
Get the right AutoGPTQQuantLinear class based on the quantization config file
"""
if gptq_quantization_config is not None and is_auto_gptq_available():
if gptq_quantization_config is None:
return None

if is_auto_gptq_available():
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear
else:
return None

desc_act = gptq_quantization_config.desc_act
group_size = gptq_quantization_config.group_size
bits = gptq_quantization_config.bits
if hasattr(gptq_quantization_config, "use_exllama"):
use_exllama = gptq_quantization_config.use_exllama
else:
use_exllama = not gptq_quantization_config.disable_exllama
if hasattr(gptq_quantization_config, "exllama_config"):
exllama_version = gptq_quantization_config.exllama_config["version"]
else:
exllama_version = 1
AutoGPTQQuantLinear = dynamically_import_QuantLinear(
use_triton=False,
desc_act=desc_act,
group_size=group_size,
bits=bits,
disable_exllama=not (use_exllama and exllama_version == 1),
disable_exllamav2=not (use_exllama and exllama_version == 2),
)
return AutoGPTQQuantLinear
return None
desc_act = gptq_quantization_config.desc_act
group_size = gptq_quantization_config.group_size
bits = gptq_quantization_config.bits
if hasattr(gptq_quantization_config, "use_exllama"):
use_exllama = gptq_quantization_config.use_exllama
else:
use_exllama = not gptq_quantization_config.disable_exllama
if hasattr(gptq_quantization_config, "exllama_config"):
exllama_version = gptq_quantization_config.exllama_config["version"]
else:
exllama_version = 1

QuantLinear = dynamically_import_QuantLinear(
use_triton=False,
desc_act=desc_act,
group_size=group_size,
bits=bits,
disable_exllama=not (use_exllama and exllama_version == 1),
disable_exllamav2=not (use_exllama and exllama_version == 2),
)

return QuantLinear


def get_gptqmodel_quant_linear(gptq_quantization_config, device_map=None):
"""
Get the right GPTQQuantLinear class based on the quantization config file
"""
if gptq_quantization_config is None:
return None

if not is_gptqmodel_available():
return None

from gptqmodel.utils.importer import hf_select_quant_linear

desc_act = gptq_quantization_config.desc_act
group_size = gptq_quantization_config.group_size
bits = gptq_quantization_config.bits
checkpoint_format = (
gptq_quantization_config.checkpoint_format
if hasattr(gptq_quantization_config, "checkpoint_format")
else "gptq"
)
sym = gptq_quantization_config.sym
meta = gptq_quantization_config.meta if hasattr(gptq_quantization_config, "meta") else None

QuantLinear = hf_select_quant_linear(
bits=bits,
group_size=group_size,
desc_act=desc_act,
sym=sym,
device_map=device_map,
checkpoint_format=checkpoint_format,
meta=meta,
backend="auto_trainable",
)

return QuantLinear


def id_tensor_storage(tensor: torch.Tensor) -> tuple[torch.device, int, int]:
6 changes: 3 additions & 3 deletions tests/test_common_gpu.py
@@ -403,19 +403,19 @@ def test_lora_gptq_quantization_from_pretrained_safetensors(self):

config = LoraConfig(task_type="CAUSAL_LM")
peft_model = get_peft_model(model, config)
peft_model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(0))
peft_model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(peft_model.device))

with tempfile.TemporaryDirectory() as tmp_dir:
peft_model.save_pretrained(tmp_dir)
model = AutoModelForCausalLM.from_pretrained(**kwargs)
model = PeftModel.from_pretrained(model, tmp_dir)
model = prepare_model_for_kbit_training(model)
model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(0))
model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(peft_model.device))

# loading a 2nd adapter works, #1239
model.load_adapter(tmp_dir, "adapter2")
model.set_adapter("adapter2")
model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(0))
model.generate(input_ids=torch.LongTensor([[0, 2, 3, 1]]).to(peft_model.device))

# check that both adapters are in the same layer
assert "default" in model.base_model.model.model.decoder.layers[0].self_attn.q_proj.lora_A