opea-project · ftian1 · Jan 9, 2025 · Dec 26, 2024 · Dec 31, 2024 · Jan 7, 2025
diff --git a/comps/sqft/Dockerfile b/comps/sqft/Dockerfile
@@ -0,0 +1,55 @@
+# Copyright (C) 2025 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+# Use the same Python version as Ray
+FROM python:3.10.14
+
+ARG HF_TOKEN
+
+ENV HF_TOKEN=$HF_TOKEN
+
+RUN useradd -m -s /bin/bash user && \
+    mkdir -p /home/user && \
+    chown -R user /home/user/
+
+COPY comps /home/user/comps
+
+ENV SQFT_PATH=/home/user/comps/sqft
+
+RUN chown -R user $SQFT_PATH
+
+USER user
+
+ENV PATH=$PATH:/home/user/.local/bin
+
+RUN python -m pip install --no-cache-dir --upgrade pip && \
+    python -m pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu && \
+    python -m pip install --no-cache-dir intel-extension-for-pytorch && \
+    python -m pip install --no-cache-dir oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
+
+WORKDIR $SQFT_PATH
+
+# Clone and install GenAIComps
+RUN git clone https://github.com/opea-project/GenAIComps $SQFT_PATH/GenAIComps && \
+    cd $SQFT_PATH/GenAIComps && \
+    git checkout v1.1 && \
+    pip install -e . && \
+    pip install --no-cache-dir -r $SQFT_PATH/GenAIComps/comps/finetuning/requirements.txt
+
+# Clone and install peft with patch
+RUN git clone https://github.com/huggingface/peft.git $SQFT_PATH/peft && \
+    cd $SQFT_PATH/peft && \
+    git checkout v0.10.0 && \
+    git apply --ignore-space-change --ignore-whitespace $SQFT_PATH/patches/peft-v0.10.0.patch && \
+    pip install -e .
+
+ENV PYTHONPATH=$PYTHONPATH:$SQFT_PATH/GenAIComps
+
+RUN echo PKGPATH=$(python3 -c "import pkg_resources; print(pkg_resources.get_distribution('oneccl-bind-pt').location)") >> run.sh && \
+    echo 'export LD_LIBRARY_PATH=$PKGPATH/oneccl_bindings_for_pytorch/opt/mpi/lib/:$LD_LIBRARY_PATH' >> run.sh && \
+    echo 'source $PKGPATH/oneccl_bindings_for_pytorch/env/setvars.sh' >> run.sh && \
+    echo ray start --head --dashboard-host=0.0.0.0 >> run.sh && \
+    echo export RAY_ADDRESS=http://localhost:8265 >> run.sh && \
+    echo python sqft_service.py >> run.sh
+
+CMD ["bash", "run.sh"]
diff --git a/comps/sqft/README.md b/comps/sqft/README.md
@@ -0,0 +1,317 @@
+# SQFT Microservice
+
+This repository introduces the microservice for the innovative fine-tuning method **SQFT**:
+- **Paper**: [SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models](https://arxiv.org/abs/2410.03750)
+- **Official implementation**: [https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT)
+
+SQFT is an end-to-end solution for low-precision sparse parameter-efficient fine-tuning of LLMs. It allows for effective model manipulation in resource-constrained environments.
+Specifically, the highlights of SQFT include:
+
+- **SparsePEFT**, an efficient and effective strategy for fine-tuning sparse models. It ensures the preservation of the base model's sparsity during merging through the use of sparse adapters.
+- **QA-SparsePEFT**, built on SparsePEFT, which extends the approach to combined sparse and quantized scenarios. It allows PEFT fine-tuning to produce a single INT4 and sparse model adapted to the target domain (pending support).
+- The **Neural Low-rank Adapter Search (NLS)** strategy, adopted for all pipelines and solutions.
+
+Please refer to the [paper](https://arxiv.org/abs/2410.03750) and official [code](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT) for more details.
+
+## 🚀 1. Start SQFT Microservice with Python (Option 1)
+
+### 1.1 Install Requirements
+
+```bash
+
+SQFT_path=$PWD
+git clone https://github.com/opea-project/GenAIComps
+cd GenAIComps
+git checkout v1.1
+pip install -e .
+pushd ${SQFT_path}
+
+# same as https://github.com/opea-project/GenAIComps/tree/main/comps/finetuning#11-install-requirements
+python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+python -m pip install intel-extension-for-pytorch
+python -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
+pip install -r GenAIComps/comps/finetuning/requirements.txt
+
+# Install the peft library and apply modifications to support the SparsePEFT strategy in SQFT
+git clone https://github.com/huggingface/peft.git
+pushd peft
+git checkout v0.10.0 
+git apply --ignore-space-change --ignore-whitespace ${SQFT_path}/patches/peft-v0.10.0.patch 
+pip install -e . 
+```
+
+### 1.2 Start SQFT Service with Python Script
+
+#### 1.2.1 Start Ray Cluster
+
+OneCCL and Intel MPI libraries must be dynamically linked on every node before Ray starts:
+
+```bash
+source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl; print(torch_ccl.cwd)")/env/setvars.sh
+```
+
+Start Ray locally using the following command.
+
+```bash
+ray start --head
+```
+
+For a multi-node cluster, start additional Ray worker nodes with the command below.
+
+```bash
+ray start --address="${head_node_ip}:6379"
+```
+
+#### 1.2.2 Start SQFT Service
+
+```bash
+export HF_TOKEN=${your_huggingface_token}
+python sqft_service.py
+```
+
+## 🚀 2. Start SQFT Microservice with Docker (Option 2)
+
+### 2.1 Setup on CPU
+
+#### 2.1.1 Build Docker Image
+
+Build the Docker image with the command below:
+
+```bash
+export HF_TOKEN=${your_huggingface_token}
+cd ../../
+docker build -t opea/sqft:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg HF_TOKEN=$HF_TOKEN -f comps/sqft/Dockerfile .
+```
+
+#### 2.1.2 Run Docker with CLI
+
+Start a Docker container with the command below:
+
+```bash
+docker run -d --name="sqft-server" -p 8015:8015 --runtime=runc --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/sqft:latest
+```
+
+## 🚀 3. Consume the SQFT Service
+
+We use the [Arc-E](https://huggingface.co/datasets/allenai/ai2_arc/viewer/ARC-Easy) dataset as a simple example to demonstrate how to use the SQFT microservice.
+
+### 3.1 Prepare and upload a training file
+
+First, we need to process the training dataset into the instruction format, for example:
+
+```json
+{
+    "instruction": "Which factor will most likely cause a person to develop a fever?",
+    "input": "",
+    "output": "a bacterial population in the bloodstream"
+}
+```
+
+The Arc-E training set is processed with the script [example_dataset/preprocess_arc.py](./example_dataset/preprocess_arc.py).
+After obtaining the processed dataset file [arce_train_instruct.json](./example_dataset/arce_train_instruct.json), upload it to the server with this command:
+
+```bash
+# upload a training file
+curl http://<your ip>:8015/v1/files -X POST -H "Content-Type: multipart/form-data" -F "file=@example_dataset/arce_train_instruct.json" -F purpose="fine-tune"
+```
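The instruction format described in this section can be produced from an ARC-style record with a few lines of Python. The following is a minimal sketch, not the actual `preprocess_arc.py`: the function name `arc_to_instruction` is hypothetical, the field names (`question`, `choices`, `answerKey`) are assumed to follow the `allenai/ai2_arc` dataset layout, and the distractor texts are placeholders.

```python
import json


def arc_to_instruction(record: dict) -> dict:
    """Map one ARC-style record to the instruction/input/output format."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    # Look up the answer text that belongs to the answer key (e.g. "B").
    answer = texts[labels.index(record["answerKey"])]
    return {"instruction": record["question"], "input": "", "output": answer}


# Hand-written sample record; the distractors are placeholders.
sample = {
    "question": "Which factor will most likely cause a person to develop a fever?",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": [
            "<distractor 1>",
            "a bacterial population in the bloodstream",
            "<distractor 2>",
            "<distractor 3>",
        ],
    },
    "answerKey": "B",
}

converted = arc_to_instruction(sample)
print(json.dumps(converted, indent=4))
```

Applying such a function to every training record and dumping the resulting list to a JSON file would give a file in the shape the upload command expects.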
+
+### 3.2 Create a fine-tuning job
+
+After uploading a training file, use the following commands to launch fine-tuning jobs.
+
+#### 3.2.1 Neural LoRA Search (NLS)
+
+Here is an example of using the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model with NLS fine-tuning. We use _elastic_ LoRA target modules for `q_proj`, `k_proj`, and `v_proj` with a low-rank search space of `[16, 12, 8]`. The result of this step is a trained super-adapter.
+
+```bash
+# Max LoRA rank: 16
+#   LoRA target modules            -> Low-rank search space
+#   ["q_proj", "k_proj", "v_proj"] -> [16,12,8]
+curl http://<your ip>:8015/v1/sqft/jobs \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "training_file": "arce_train_instruct.json",
+    "model": "meta-llama/Llama-3.2-1B",
+    "General": {
+      "lora_config": {
+        "r": 16,
+        "lora_alpha": 32,
+        "target_modules": ["q_proj", "k_proj", "v_proj"],
+        "neural_lora_search": true,
+        "nls_target_modules": ["q_proj", "k_proj", "v_proj"],
+        "search_space": [16, 12, 8]
+      }
+    },
+    "Training": {
+      "learning_rate": 1e-04, "epochs": 5, "batch_size": 16
+    },
+    "Dataset": {
+      "max_length": 256
+    }
+  }'
+```
+
+Below are some explanations for the parameters related to the NLS algorithm:
+
+- `neural_lora_search` indicates whether the Neural LoRA Search (NLS) algorithm is enabled.
+- `nls_target_modules` specifies the target modules for the NLS strategy, indicating which adapters need to become elastic.
+- `search_space` specifies the candidate ranks for each target module (adapter). Here, we use `[16, 12, 8]`, meaning that each elastic adapter can take rank 16, 12, or 8.
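To make the relationship between these parameters concrete, here is a small sketch that builds the same request body as the `curl` example and counts how many sub-adapter configurations a search space implies. The helper names `build_nls_job` and `count_sub_adapters` are hypothetical, and setting `r` to the largest rank in the search space is an assumption drawn from the "Max LoRA rank" comment in the example, not a documented rule.

```python
import json


def build_nls_job(training_file, model, search_space, modules, lora_alpha=32):
    """Build the JSON body for an NLS fine-tuning job (field names from the README)."""
    return {
        "training_file": training_file,
        "model": model,
        "General": {
            "lora_config": {
                # Assumption: the LoRA rank is the largest rank in the search space.
                "r": max(search_space),
                "lora_alpha": lora_alpha,
                "target_modules": list(modules),
                "neural_lora_search": True,
                "nls_target_modules": list(modules),
                "search_space": list(search_space),
            }
        },
        "Training": {"learning_rate": 1e-4, "epochs": 5, "batch_size": 16},
        "Dataset": {"max_length": 256},
    }


def count_sub_adapters(search_space, num_elastic_adapters):
    """Each elastic adapter picks one rank independently from the search space."""
    return len(search_space) ** num_elastic_adapters


payload = build_nls_job(
    "arce_train_instruct.json",
    "meta-llama/Llama-3.2-1B",
    search_space=[16, 12, 8],
    modules=["q_proj", "k_proj", "v_proj"],
)
body = json.dumps(payload, indent=2)
print(body)
print(count_sub_adapters([16, 12, 8], num_elastic_adapters=4))
```

The count grows exponentially with the number of elastic adapters, which is why the sub-adapter is later chosen by a heuristic or a search rather than by enumeration.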
+
+#### 3.2.2 SparsePEFT
+
+SparsePEFT is designed for foundation models that have already been sparsified, regardless of the sparsification algorithm used.
+SQFT offers several ready-to-use sparse (and quantized) models, which also serve as examples of how to sparsify your own models.
+See the [released foundation models](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT#released-foundation-models-) or the [HuggingFace SQFT Model Collection](https://huggingface.co/collections/IntelLabs/sqft-66cd56f90b240963f9cf1a67).
+The following example enables SparsePEFT by setting `sparse_adapter` to `true`, which allows the adapter to be merged into the base model without losing sparsity.
+
+```bash
+# LoRA rank: 16
+#   LoRA target modules: ["q_proj", "k_proj", "v_proj"]
+#   sparse_adapter: true (this example does not use a low-rank search space)
+curl http://<your ip>:8015/v1/sqft/jobs \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "training_file": "arce_train_instruct.json",
+    "model": "<path to sparse model>",
+    "General": {
+      "lora_config": {
+        "r": 16,
+        "lora_alpha": 32,
+        "target_modules": ["q_proj", "k_proj", "v_proj"],
+        "sparse_adapter": true
+      }
+    },
+    "Training": {
+      "learning_rate": 1e-04, "epochs": 5, "batch_size": 16
+    },
+    "Dataset": {
+      "max_length": 256
+    }
+  }'
+```
+
+Note that the NLS strategy can also be applied to SparsePEFT.
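The idea of merging an adapter without losing sparsity can be illustrated with a tiny numeric sketch. It assumes the mask-based formulation in which the low-rank update `B @ A` is multiplied element-wise by the sparsity mask of the base weight before merging. This is an illustration in plain Python, not the code from the peft patch, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_sparse_adapter(weight, lora_b, lora_a, scaling=1.0):
    """Merge a LoRA update into `weight`, keeping every zero of `weight` at zero."""
    delta = matmul(lora_b, lora_a)
    return [
        # The mask is (w != 0): pruned positions receive no update.
        [w + scaling * d if w != 0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# 2x3 sparse base weight and a rank-1 adapter (B is 2x1, A is 1x3).
weight = [[0.5, 0.0, -1.0], [0.0, 2.0, 0.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.1, 0.2, 0.3]]

merged = merge_sparse_adapter(weight, lora_b, lora_a)
print(merged)
```

A dense merge (`w + d` everywhere) would turn the pruned positions into non-zero values; the masked version leaves them at zero, so the merged model keeps the sparsity pattern of the base model.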
+
+### 3.3 Leverage the Fine-tuned Super-Adapter
+
+#### 3.3.1 Extract a Sub-Adapter
+
+After completing the fine-tuning stage and obtaining an NLS super-adapter, the next step is to extract a desired sub-adapter. The following command demonstrates how to extract the heuristic sub-adapter.
+**More powerful sub-adapters can also be obtained through advanced search algorithms** (see [Search the Optimal Sub-Adapter Configuration](#333-search-the-optimal-sub-adapter-configuration)).
+
+```bash
+curl http://<your ip>:8015/v1/sqft/extract_sub_adapter \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "fine_tuning_job_id": "<fine tuning job id>",
+    "adapter_version": "heuristic"
+  }'
+```
+
+`adapter_version` can be `heuristic`, `minimal`, or a custom name.
+When `adapter_version` is set to a custom name, a specific configuration must be provided in `custom_config`.
+The extracted adapter will be saved in `<path to the output directory for this job> / <adapter_version>`.
+
+<details>
+<summary>An example of a custom configuration</summary>
+
+```bash
+curl http://<your ip>:8015/v1/sqft/extract_sub_adapter \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "fine_tuning_job_id": "<fine tuning job id>",
+    "adapter_version": "optimal",
+    "custom_config": [8, 16, 8, 12, 16, 12, 12, 12, 12, 12, 8, 12, 12, 12, 12, 12]
+  }'
+```
+
+When a fine-tuning job uses the Neural Low-rank Adapter Search algorithm, the `elastic_adapter_config.json` file (which contains the elastic adapter information) is saved in the job's output directory.
+The `custom_config` must correspond with the `target` (adapter modules) or `search_space` (search space for the rank of adapter modules) in `elastic_adapter_config.json`.
+In the [NLS example](#321-neural-lora-search-nls), the custom config in the above command `[8, 16, 8, 12, 16, 12, 12, 12, 12, 12, 8, 12, 12, 12, 12, 12]` represents the LoRA rank of the adapters for `q_proj`, `k_proj`, and `v_proj` in each layer.
+The sub-adapter is saved to `<path to the output directory for this job> / optimal`.
+
+</details>
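The three kinds of `adapter_version` can be pictured as three ways of choosing one rank per elastic adapter. The sketch below is illustrative only: it assumes that `heuristic` picks the middle rank of the search space and `minimal` the smallest, which is not confirmed by this README, and it assumes a single search space shared by all adapters. All function names are hypothetical.

```python
def heuristic_config(search_space, num_adapters):
    """Assumption: the heuristic sub-adapter uses the middle rank everywhere."""
    ranked = sorted(search_space)
    return [ranked[len(ranked) // 2]] * num_adapters


def minimal_config(search_space, num_adapters):
    """Assumption: the minimal sub-adapter uses the smallest rank everywhere."""
    return [min(search_space)] * num_adapters


def validate_custom_config(config, search_space, num_adapters):
    """Check that a custom config has one valid rank per elastic adapter."""
    if len(config) != num_adapters:
        raise ValueError(f"expected {num_adapters} ranks, got {len(config)}")
    invalid = [rank for rank in config if rank not in search_space]
    if invalid:
        raise ValueError(f"ranks outside the search space: {invalid}")
    return list(config)


search_space = [16, 12, 8]
# The custom configuration from the example request above.
custom = [8, 16, 8, 12, 16, 12, 12, 12, 12, 12, 8, 12, 12, 12, 12, 12]

print(heuristic_config(search_space, 3))
print(minimal_config(search_space, 3))
print(validate_custom_config(custom, search_space, num_adapters=len(custom)))
```

Checking a custom configuration against `elastic_adapter_config.json` before sending the request catches length or rank mismatches early.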
+
+#### 3.3.2 Merge Adapter to Base Model
+
+The following command demonstrates how to merge a sub-adapter (the heuristic sub-adapter in this example) into the base pre-trained model to obtain the final fine-tuned model:
+
+```bash
+curl http://<your ip>:8015/v1/sqft/merge_adapter \
+  -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "fine_tuning_job_id": "<fine tuning job id>",
+    "adapter_version": "heuristic"
+  }'
+```
+
+The merged model will be saved in `<path to the output directory for this job> / <adapter_version> / merged_model`.
+
+
+#### 3.3.3 Search the Optimal Sub-Adapter Configuration
+
+To discover higher-performing sub-adapter configurations within the super-adapter, more advanced search algorithms can be applied to it.
+Because search settings are flexible and offer a wide range of choices, the service does not run the search itself; it only supports extracting a sub-adapter from a given configuration (see [Extract a Sub-Adapter](#331-extract-a-sub-adapter)).
+The search must be conducted outside the service, according to user preferences.
+
+In our example, we provide a simple script ([search.py](./search.py)) that runs a hill-climbing search on the Arc-E validation set to find well-performing sub-adapters.
+The command is as follows:
+
+```bash
+python search.py \
+  --base_model_path meta-llama/Llama-3.2-1B \
+  --super_adapter_model_path <path to super adapter> \
+  --nls_target_modules q_proj k_proj v_proj \
+  --search_space 16 12 8
+```
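For readers unfamiliar with hill climbing, the following sketch shows the general shape of such a search over per-adapter ranks. It is not the logic of `search.py`: the function `hill_climb` is hypothetical, and the toy `score_fn` (closeness to a hidden target configuration) stands in for what would in practice be an evaluation of the sub-adapter on a validation set.

```python
def hill_climb(search_space, num_adapters, score_fn, start=None, max_rounds=10):
    """Greedy coordinate-wise search: change one adapter's rank at a time
    and keep the change only if the score improves."""
    middle = sorted(search_space)[len(search_space) // 2]
    current = list(start) if start is not None else [middle] * num_adapters
    best = score_fn(current)
    for _ in range(max_rounds):
        improved = False
        for i in range(num_adapters):
            for rank in search_space:
                if rank == current[i]:
                    continue
                candidate = current[:i] + [rank] + current[i + 1:]
                score = score_fn(candidate)
                if score > best:
                    current, best, improved = candidate, score, True
        if not improved:
            break  # local optimum: no single-rank change helps
    return current, best


# Toy objective: higher is better, maximal when the config equals TARGET.
TARGET = [8, 16, 8, 12]


def toy_score(config):
    return -sum(abs(a - b) for a, b in zip(config, TARGET))


best_config, best_score = hill_climb([16, 12, 8], num_adapters=4, score_fn=toy_score)
print(best_config, best_score)
```

With a real objective each call to `score_fn` is an evaluation run, so the number of candidates tried per round (adapters times alternative ranks) determines the cost of the search.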
+
+
+## Toy Experiment Results (NLS)
+
+- w/o tuning
+
+```bash
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B --tasks arc_easy --batch_size 32
+
+| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
+|--------|------:|------|-----:|--------|-----:|---|-----:|
+|arc_easy|      1|none  |     0|acc     |0.6528|±  |0.0098|
+|        |       |none  |     0|acc_norm|0.6065|±  |0.0100|
+```
+
+- with vanilla LoRA tuning
+
+```bash
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B,peft=<path to adapter> --tasks arc_easy --batch_size 32
+
+| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
+|--------|------:|------|-----:|--------|-----:|---|-----:|
+|arc_easy|      1|none  |     0|acc     |0.6894|±  |0.0095|
+|        |       |none  |     0|acc_norm|0.6852|±  |0.0095|
+```
+
+- with NLS tuning (heuristic, no search)
+
+```bash
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B,peft=<path to heuristic adapter> --tasks arc_easy --batch_size 32
+
+| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
+|--------|------:|------|-----:|--------|-----:|---|-----:|
+|arc_easy|      1|none  |     0|acc     |0.6911|±  |0.0095|
+|        |       |none  |     0|acc_norm|0.6881|±  |0.0095|
+```
+
+- with NLS tuning (search)
+
+```bash
+lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-1B,peft=<path to search optimal adapter> --tasks arc_easy --batch_size 32
+
+| Tasks  |Version|Filter|n-shot| Metric |Value |   |Stderr|
+|--------|------:|------|-----:|--------|-----:|---|-----:|
+|arc_easy|      1|none  |     0|acc     |0.6911|±  |0.0095|
+|        |       |none  |     0|acc_norm|0.6902|±  |0.0095|
+```
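The tables above are easier to compare as differences. The short sketch below copies the reported `acc` and `acc_norm` values and computes the gain of each tuned variant over a chosen baseline; the variant labels and the `gain` helper are naming choices made here for illustration.

```python
# Values copied from the lm_eval tables above (arc_easy, 0-shot).
RESULTS = {
    "w/o tuning": {"acc": 0.6528, "acc_norm": 0.6065},
    "vanilla LoRA": {"acc": 0.6894, "acc_norm": 0.6852},
    "NLS (heuristic)": {"acc": 0.6911, "acc_norm": 0.6881},
    "NLS (search)": {"acc": 0.6911, "acc_norm": 0.6902},
}


def gain(metric, variant, baseline="w/o tuning"):
    """Absolute difference in `metric` between a variant and the baseline."""
    return round(RESULTS[variant][metric] - RESULTS[baseline][metric], 4)


for variant in RESULTS:
    if variant == "w/o tuning":
        continue
    print(
        f"{variant:<16} acc +{gain('acc', variant):.4f}  "
        f"acc_norm +{gain('acc_norm', variant):.4f}"
    )
```

Passing `baseline="vanilla LoRA"` gives the difference between NLS and plain LoRA tuning; the reported stderr of about 0.0095 is useful context when reading differences of that size.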