 # yapf: disable
 """
 tensorize_vllm_model.py is a script that can be used to serialize and
-deserialize vLLM models. These models can be loaded using tensorizer directly
-to the GPU extremely quickly. Tensor encryption and decryption is also
-supported, although libsodium must be installed to use it. Install
-vllm with tensorizer support using `pip install vllm[tensorizer]`.
+deserialize vLLM models. These models can be loaded using tensorizer
+to the GPU extremely quickly over an HTTP/HTTPS endpoint, an S3 endpoint,
+or locally. Tensor encryption and decryption are also supported, although
+libsodium must be installed to use it. Install vLLM with tensorizer support
+using `pip install vllm[tensorizer]`.

-To serialize a model, you can run something like this:
+To serialize a model, install vLLM from source, then run something
+like this from the root level of this repository:

-python tensorize_vllm_model.py \
+python -m examples.tensorize_vllm_model \
    --model EleutherAI/gpt-j-6B \
    --dtype float16 \
    serialize \
    --serialized-directory s3://my-bucket/ \
    --suffix vllm

 Which downloads the model from HuggingFace, loads it into vLLM, serializes it,
-and saves it to your S3 bucket. A local directory can also be used.
+and saves it to your S3 bucket. A local directory can also be used. This
+assumes your S3 credentials are specified as the environment variables
+`S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, and `S3_ENDPOINT`.
+To pass S3 credentials directly, you can instead provide
+`--s3-access-key-id` and `--s3-secret-access-key`, as well as
+`--s3-endpoint`, as CLI args to this script.
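+
+For example, credentials can be supplied inline as environment variables
+(all values below are placeholders):
+
+S3_ACCESS_KEY_ID=<access-key-id> \
+S3_SECRET_ACCESS_KEY=<secret-access-key> \
+S3_ENDPOINT=<endpoint-url> \
+python -m examples.tensorize_vllm_model ...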

 You can also encrypt the model weights with a randomly-generated key by
 providing a `--keyfile` argument.
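+
+As a rough sketch of what `--keyfile` does under the hood (this assumes
+tensorizer's encryption API; `keyfile_path` is an illustrative name), a
+random key is generated and written out before serialization:
+
+    from tensorizer import EncryptionParams
+
+    encryption_params = EncryptionParams.random()
+    # Persist the randomly-generated key so the tensors can be
+    # decrypted again later; keyfile_path is illustrative
+    with open(keyfile_path, "wb") as f:
+        f.write(encryption_params.key)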

-To deserialize a model, you can run something like this:
+To deserialize a model, you can run something like this from the root
+level of this repository:

-python tensorize_vllm_model.py \
+python -m examples.tensorize_vllm_model \
    --model EleutherAI/gpt-j-6B \
    --dtype float16 \
    deserialize \
    --path-to-tensors s3://my-bucket/vllm/EleutherAI/gpt-j-6B/vllm/model.tensors

 Which downloads the model tensors from your S3 bucket and deserializes them.
-To provide S3 credentials, you can provide `--s3-access-key-id` and
-`--s3-secret-access-key`, as well as `--s3-endpoint` as CLI args to this script,
-the OpenAI entrypoint, as arguments for LLM(), or as environment variables
-in the form of `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, and `S3_ENDPOINT`.
-

 You can also provide a `--keyfile` argument to decrypt the model weights if
 they were serialized with encryption.
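+
+On the deserialize side, the keyfile is consumed roughly like this (again a
+sketch assuming tensorizer's encryption API; `keyfile_path` is illustrative):
+
+    from tensorizer import DecryptionParams
+
+    # Read the key back and build decryption parameters for the
+    # deserializer
+    with open(keyfile_path, "rb") as f:
+        decryption_params = DecryptionParams.from_key(f.read())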

-For more information on the available arguments, run
-`python tensorize_vllm_model.py --help`.
+For more information on the available arguments for serializing, run
+`python -m examples.tensorize_vllm_model serialize --help`.
+
+Or for deserializing:
+
+`python -m examples.tensorize_vllm_model deserialize --help`.
+
+Once a model is serialized, it can be loaded by the OpenAI inference server
+at `vllm/entrypoints/openai/api_server.py` by providing the `--tensorizer-uri`
+CLI argument, which is functionally the same as the `--path-to-tensors`
+argument in this script, along with `--vllm-tensorized` to signify that the
+model to be deserialized is a vLLM model rather than a HuggingFace
+`PreTrainedModel`. HuggingFace models can also be deserialized using
+tensorizer in the same inference server, albeit without the speed
+optimizations. To deserialize an encrypted file, pass the
+`--encryption-keyfile` argument with the path to the keyfile used to encrypt
+the model weights. For information on all the arguments that can be used to
+configure tensorizer's deserialization, check out the tensorizer options
+argument group in `vllm/entrypoints/openai/api_server.py` with `--help`.
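+
+Putting that together, a hypothetical server invocation might look like the
+following (this assumes the server accepts the engine's `--load-format`
+flag; the URI mirrors the serialize example above):
+
+python -m vllm.entrypoints.openai.api_server \
+   --model EleutherAI/gpt-j-6B \
+   --load-format tensorizer \
+   --tensorizer-uri s3://my-bucket/vllm/EleutherAI/gpt-j-6B/vllm/model.tensors \
+   --vllm-tensorized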
+
+Tensorizer can also be invoked with the `LLM` class directly to load models:
+
+    llm = LLM(model="facebook/opt-125m",
+              load_format="tensorizer",
+              tensorizer_uri=path_to_opt_tensors,
+              num_readers=3,
+              vllm_tensorized=True)
 """