
Commit 4df9a94

sangstar authored and ywang96 committed
[Doc] Add better clarity for tensorizer usage (vllm-project#4090)
Co-authored-by: Roger Wang <[email protected]>
1 parent b17e123 commit 4df9a94

File tree: 3 files changed, +46 −22


docs/source/models/engine_args.rst

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ Below, you can find an explanation of every engine argument for vLLM:
   * "safetensors" will load the weights in the safetensors format.
   * "npcache" will load the weights in pytorch format and store a numpy cache to speed up the loading.
   * "dummy" will initialize the weights with random values, mainly for profiling.
-  * "tensorizer" will load serialized weights using `CoreWeave's Tensorizer model deserializer. <https://github.com/coreweave/tensorizer>`_. See `tensorized_vllm_model.py` in the examples folder to serialize a vLLM model, and for more information. Tensorizer support for vLLM can be installed with `pip install vllm[tensorizer]`.
+  * "tensorizer" will load serialized weights using `CoreWeave's Tensorizer model deserializer. <https://github.com/coreweave/tensorizer>`_ See `examples/tensorize_vllm_model.py <https://github.com/vllm-project/vllm/blob/main/examples/tensorize_vllm_model.py>`_ to serialize a vLLM model, and for more information.
 
 .. option:: --dtype {auto,half,float16,bfloat16,float,float32}
 
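Note: the values documented above are also accepted programmatically. A minimal sketch, assuming the `LLM` constructor forwards `load_format` to the engine the same way the `--load-format` flag does (the docstring change below uses the same kwarg with "tensorizer"); the model name and prompt are arbitrary:

    from vllm import LLM

    # "dummy" initializes the weights with random values, mainly for
    # profiling, so no real checkpoint download is needed.
    llm = LLM(model="facebook/opt-125m", load_format="dummy")
    outputs = llm.generate("Hello, my name is")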
examples/tensorize_vllm_model.py

Lines changed: 44 additions & 16 deletions
@@ -23,46 +23,74 @@
 # yapf: disable
 """
 tensorize_vllm_model.py is a script that can be used to serialize and
-deserialize vLLM models. These models can be loaded using tensorizer directly
-to the GPU extremely quickly. Tensor encryption and decryption is also
-supported, although libsodium must be installed to use it. Install
-vllm with tensorizer support using `pip install vllm[tensorizer]`.
+deserialize vLLM models. These models can be loaded using tensorizer
+to the GPU extremely quickly over an HTTP/HTTPS endpoint, an S3 endpoint,
+or locally. Tensor encryption and decryption is also supported, although
+libsodium must be installed to use it. Install vllm with tensorizer support
+using `pip install vllm[tensorizer]`.
 
-To serialize a model, you can run something like this:
+To serialize a model, install vLLM from source, then run something
+like this from the root level of this repository:
 
-python tensorize_vllm_model.py \
+python -m examples.tensorize_vllm_model \
    --model EleutherAI/gpt-j-6B \
    --dtype float16 \
    serialize \
    --serialized-directory s3://my-bucket/ \
    --suffix vllm
 
 Which downloads the model from HuggingFace, loads it into vLLM, serializes it,
-and saves it to your S3 bucket. A local directory can also be used.
+and saves it to your S3 bucket. A local directory can also be used. This
+assumes your S3 credentials are specified as environment variables
+in the form of `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, and `S3_ENDPOINT`.
+To provide S3 credentials directly, you can provide `--s3-access-key-id` and
+`--s3-secret-access-key`, as well as `--s3-endpoint` as CLI args to this
+script.
 
 You can also encrypt the model weights with a randomly-generated key by
 providing a `--keyfile` argument.
 
-To deserialize a model, you can run something like this:
+To deserialize a model, you can run something like this from the root
+level of this repository:
 
-python tensorize_vllm_model.py \
+python -m examples.tensorize_vllm_model \
    --model EleutherAI/gpt-j-6B \
    --dtype float16 \
    deserialize \
    --path-to-tensors s3://my-bucket/vllm/EleutherAI/gpt-j-6B/vllm/model.tensors
 
 Which downloads the model tensors from your S3 bucket and deserializes them.
-To provide S3 credentials, you can provide `--s3-access-key-id` and
-`--s3-secret-access-key`, as well as `--s3-endpoint` as CLI args to this script,
-the OpenAI entrypoint, as arguments for LLM(), or as environment variables
-in the form of `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, and `S3_ENDPOINT`.
-
 
 You can also provide a `--keyfile` argument to decrypt the model weights if
 they were serialized with encryption.
 
-For more information on the available arguments, run
-`python tensorize_vllm_model.py --help`.
+For more information on the available arguments for serializing, run
+`python -m examples.tensorize_vllm_model serialize --help`.
+
+Or for deserializing:
+
+`python -m examples.tensorize_vllm_model deserialize --help`.
+
+Once a model is serialized, it can be used to load the model when running the
+OpenAI inference client at `vllm/entrypoints/openai/api_server.py` by providing
+the `--tensorizer-uri` CLI argument that is functionally the same as the
+`--path-to-tensors` argument in this script, along with `--vllm-tensorized`, to
+signify that the model to be deserialized is a vLLM model, rather than a
+HuggingFace `PreTrainedModel`, which can also be deserialized using tensorizer
+in the same inference server, albeit without the speed optimizations. To
+deserialize an encrypted file, the `--encryption-keyfile` argument can be used
+to provide the path to the keyfile used to encrypt the model weights. For
+information on all the arguments that can be used to configure tensorizer's
+deserialization, check out the tensorizer options argument group in the
+`vllm/entrypoints/openai/api_server.py` script with `--help`.
+
+Tensorizer can also be invoked with the `LLM` class directly to load models:
+
+    llm = LLM(model="facebook/opt-125m",
+              load_format="tensorizer",
+              tensorizer_uri=path_to_opt_tensors,
+              num_readers=3,
+              vllm_tensorized=True)
 """
 
 
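Note: the serialize/deserialize flow described in this docstring can also be reproduced with tensorizer's library API. A minimal round-trip sketch, assuming tensorizer is installed via `pip install vllm[tensorizer]`; the toy `Linear` module and local path are illustrative stand-ins for a real vLLM model and an S3 URI:

    import torch
    from tensorizer import TensorDeserializer, TensorSerializer
    from tensorizer.stream_io import open_stream

    # Toy module standing in for a real model.
    model = torch.nn.Linear(4, 4)

    # Serialize to a local path; an s3:// or https:// URI works the
    # same way, with S3 credentials taken from S3_ACCESS_KEY_ID,
    # S3_SECRET_ACCESS_KEY, and S3_ENDPOINT as described above.
    uri = "/tmp/toy.tensors"  # hypothetical path
    with open_stream(uri, mode="wb") as stream:
        serializer = TensorSerializer(stream)
        serializer.write_module(model)
        serializer.close()

    # Deserialize back into a freshly constructed module.
    fresh = torch.nn.Linear(4, 4)
    with open_stream(uri, mode="rb") as stream:
        deserializer = TensorDeserializer(stream, device="cpu")
        deserializer.load_into_module(fresh)
        deserializer.close()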
vllm/model_executor/tensorizer_loader.py

Lines changed: 1 addition & 5 deletions
@@ -126,7 +126,6 @@ def __post_init__(self):
             "s3_endpoint": self.s3_endpoint,
         }
 
-        # Omitting self.dtype and self.device as this behaves weirdly
         self.deserializer_params = {
             "verify_hash": self.verify_hash,
             "encryption": self.encryption_keyfile,
@@ -145,7 +144,7 @@ def add_cli_args(
             parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
         """Tensorizer CLI arguments"""
 
-        # Create the argument group
+        # Tensorizer options arg group
         group = parser.add_argument_group(
             'tensorizer options',
             description=('Options for configuring the behavior of the'
@@ -205,9 +204,7 @@ def add_cli_args(
 
     @classmethod
     def from_cli_args(cls, args: argparse.Namespace) -> "TensorizerArgs":
-        # Get the list of attributes of this dataclass.
         attrs = [attr.name for attr in dataclasses.fields(cls)]
-        # Set the attributes from the parsed arguments.
         tensorizer_args = cls(**{
             attr: getattr(args, attr)
             for attr in attrs if hasattr(args, attr)
@@ -291,7 +288,6 @@ def deserialize(self):
             nn.Module: The deserialized model.
         """
         before_mem = get_mem_usage()
-        # Lazy load the tensors from S3 into the model.
         start = time.perf_counter()
         with open_stream(
             self.tensorizer_args.tensorizer_uri,
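Note: `from_cli_args` above uses a compact pattern: intersect the dataclass field names with whatever argparse parsed, then construct the instance from the matches. A self-contained sketch of the same pattern (the `ExampleArgs` class and its fields are illustrative, not vLLM's `TensorizerArgs`):

    import argparse
    import dataclasses

    @dataclasses.dataclass
    class ExampleArgs:
        tensorizer_uri: str = ""
        num_readers: int = 1

        @classmethod
        def from_cli_args(cls, args: argparse.Namespace) -> "ExampleArgs":
            # Only pull attributes that both the dataclass declares and
            # argparse actually produced, so unrelated flags are ignored.
            attrs = [field.name for field in dataclasses.fields(cls)]
            return cls(**{
                attr: getattr(args, attr)
                for attr in attrs if hasattr(args, attr)
            })

    parser = argparse.ArgumentParser()
    parser.add_argument("--tensorizer-uri", default="")
    parser.add_argument("--num-readers", type=int, default=1)
    example = ExampleArgs.from_cli_args(parser.parse_args([]))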
