
Conversation

@achew010 (Contributor) commented May 29, 2024

Description

This PR addresses #18 with the following contributions:

  • Introduce a patch on AutoGPTQ's `make_sure_no_tensor_in_meta_device` to avoid raising an error when the model has no bias in low-memory mode.
  • Workaround that sets `device_map` to `cpu` when loading checkpoints, to avoid GPU memory consumption before trainer initialization.
    Note: This approach diverts consumption to CPU memory, which could still be a bottleneck; a better approach would be to load to the meta device. QLoRA currently loads quantized models to CPU in low-memory mode as well. See here.
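The first bullet amounts to monkey-patching AutoGPTQ's check so that bias-free modules are tolerated. A minimal sketch of the patching pattern follows; the function name mirrors AutoGPTQ's, but the dict-based stand-in "model" and both function bodies are hypothetical simplifications so the example runs without AutoGPTQ installed (the real patch would be assigned over the function inside AutoGPTQ's modeling utilities before loading):

```python
def original_check(model):
    """Stand-in for AutoGPTQ's make_sure_no_tensor_in_meta_device:
    raises when a module has no bias tensor, which breaks loading
    bias-free models in low-memory mode."""
    for name, params in model.items():
        if params.get("bias") is None:
            raise ValueError(f"{name}: bias tensor is missing")

def patched_check(model):
    """Patched variant: skip bias-free modules instead of raising."""
    for name, params in model.items():
        if params.get("bias") is None:
            continue  # tolerate a missing bias in low-memory mode

# A Llama-style projection layer quantized without a bias:
biasless = {"q_proj": {"weight": "w", "bias": None}}
patched_check(biasless)  # no exception, unlike original_check(biasless)
```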

TODO:

  • Map the quantized model to the meta device instead of cpu
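The workaround and the TODO above both come down to retargeting the device map used at checkpoint load time. A stdlib-only sketch (the helper name `retarget_device_map` is hypothetical; the module names imitate a Hugging Face style device map):

```python
def retarget_device_map(device_map, device="cpu"):
    """Point every module of a Hugging Face style device_map at one device,
    so no weights land on a GPU before trainer initialization.
    The PR uses device='cpu'; the TODO above would pass device='meta'."""
    return {name: device for name in device_map}

# An inferred map that would otherwise place modules on GPUs 0 and 1:
inferred = {"model.embed_tokens": 0, "model.layers.0": 0, "lm_head": 1}
cpu_map = retarget_device_map(inferred)           # current workaround
meta_map = retarget_device_map(inferred, "meta")  # future meta-device mapping
print(cpu_map["lm_head"], meta_map["lm_head"])    # cpu meta
```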

Tests

Reproduction command

accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path TheBloke/Llama-2-70B-GPTQ --acceleration_framework_config_file /data/aaron/experimental/test3/scripts/benchmarks/../../sample-configurations/accelerated-peft-autogptq-sample-configuration.yaml --packing True --max_seq_len 4096 --learning_rate 2e-4 --fp16 True --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.0 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '\n### Response:' --dataset_text_field 'output' --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 10 --training_data_path /data/aaron/experimental/test3/benchmark_outputs_final/data/cache.json --per_device_train_batch_size 2 --output_dir benchmark_outputs/exp_57/hf --skip_memory_metrics False

Comparison

Before Fix:

Without low-memory mode, GPTQ-LoRA shows a memory explosion in the metrics, Nvidia (78.80 GiB) and Torch (36.14 GiB), compared to QLoRA with low-memory mode enabled.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 417 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 78.80 | 45.40 | 36.14 | 429 |

After Fix:

With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, Nvidia (49.44 GiB) and Torch (18.13 GiB), and is comparable with QLoRA.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
|---|---|---|---|---|---|---|---|
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 414 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 49.44 | 44.87 | 18.13 | 428 |

@achew010 achew010 requested a review from fabianlim as a code owner May 29, 2024 02:43
@achew010 achew010 force-pushed the gptq-low-mem-mode-fix branch from e4e32b6 to 6764755 Compare May 29, 2024 03:28
@achew010 achew010 self-assigned this May 29, 2024
@fabianlim (Contributor)

@achew010 can you update the top-level comment with what the previous memory allocation was, and verify that the new measurements were obtained after reversing the hack in 80d631e?

@fabianlim fabianlim linked an issue May 29, 2024 that may be closed by this pull request
@achew010 achew010 removed their assignment May 29, 2024
@fabianlim fabianlim merged commit 25171a0 into foundation-model-stack:dev May 29, 2024
fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request May 31, 2024
fabianlim added a commit that referenced this pull request Jun 2, 2024
* refactor

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* fixes

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor mistral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add mixtral

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* some refactoring after introducing mlp

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* remove extraneous files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add bnb

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint + fmt and improvements to readme

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* bench fixes

* need to handle lora adapters device due to #26

* allow replay of failed benches, addressing comment in #14

* update benches (remove l40)

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@achew010 achew010 deleted the gptq-low-mem-mode-fix branch July 26, 2024 04:05

Merging this pull request closes the following issue:

Allow AutoGPTQ to work in low cpu memory mode
