A few important things to consider when using EAGLE-based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
   used directly with vLLM due to differences in the expected layer names and model definition.
   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.

   In the above example, use the script to first convert
   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
   and then use the converted checkpoint as the draft model in vLLM (see the sketch after this list).
2. The EAGLE-based draft models need to be run without tensor parallelism
   (i.e. `speculative_draft_tensor_parallel_size` is set to 1), although
   it is possible to run the main model using tensor parallelism (see example above).
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
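
For reference, here is a minimal sketch of how the converted checkpoint might be wired up as the draft model, assuming vLLM's offline `LLM` API with the top-level `speculative_model` and `speculative_draft_tensor_parallel_size` arguments (some vLLM releases group these settings under a `speculative_config` dictionary instead, so check the version you are running). The checkpoint path is a placeholder, not a real Hugging Face repo.

```python
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    # The main (target) model may use tensor parallelism.
    tensor_parallel_size=4,
    # Placeholder path to the checkpoint produced by the conversion script above.
    speculative_model="/path/to/converted/EAGLE-LLaMA3-Instruct-8B",
    # The EAGLE draft model itself must run without tensor parallelism.
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```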
A variety of EAGLE draft models are available on the Hugging Face hub:
| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |