This codebase evaluates the efficiency of three quantization methods, Atom, QServe, and QuaRot, on GPUs by comparing their performance using industry-standard metrics. It focuses on key factors such as prefill tokens, decode length, and batch size, providing insight into how each method's throughput scales at different stages of inference and how each affects latency. The evaluation is designed to reflect practical industry benchmarks, offering a deeper understanding of the efficiency trade-offs involved when these methods are applied to large language models.
It also provides an easy way to set up and compare the three quantization methods:
- Prepare a Hugging Face user access token (required to download the Llama models).
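One way to authenticate is via the Hugging Face CLI (a sketch of this step, assuming the `huggingface_hub` client is available; any method that stores a valid token works):
```bash
# Install the Hugging Face hub client if it is not already available
pip install -U "huggingface_hub[cli]"

# Option 1: interactive login; paste your access token when prompted
huggingface-cli login

# Option 2: export the token for the current shell session
export HF_TOKEN=<your_access_token>
```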
- To get Llama-7b, use:
```bash
python download_models.py
```
- To get Llama2-7b, 13b, and 70b, use:
```bash
python download_models_llama2.py
```
(Please check out the individual README files for each method. Alternatives are also provided below.)
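If you prefer to fetch the weights manually, the Hugging Face CLI offers an equivalent route (a sketch, not part of this repo's scripts; `huggyllama/llama-7b` is one commonly used mirror, and the gated `meta-llama` repos require approved access):
```bash
# Download the full model snapshot (weights + tokenizer files)
# into the directory mounted into the containers below
huggingface-cli download huggyllama/llama-7b --local-dir ./llama_model
```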
- Build and run the Atom docker container:
```bash
# Build from the repository root so the $(pwd)-based mounts below resolve correctly
sudo docker build -t atom-image ./Atom
sudo docker run --gpus all -v $(pwd)/Atom:/workspace/Atom \
    -v $(pwd)/llama_model:/workspace/llama_model \
    -v $(pwd)/llama_tokenizer:/workspace/llama_tokenizer \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -it atom-image
```
- Build and run the QuaRot docker container:
```bash
# Also run from the repository root; use an absolute mount path,
# since older Docker versions reject relative bind-mount paths
sudo docker build --no-cache -t quarot_image ./QuaRot
sudo docker run --rm -it \
    --gpus all \
    -v $(pwd)/QuaRot:/workspace/QuaRot \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    quarot_image
```
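Once inside either container, a quick sanity check confirms that the GPUs are visible (assuming the images ship with the NVIDIA tools and PyTorch, which these CUDA kernels require):
```bash
# Should list the host GPUs passed through by --gpus all
nvidia-smi

# Should print True if PyTorch can reach the GPUs
python -c "import torch; print(torch.cuda.is_available())"
```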