[Speculative decoding] feat: add EAGLE3 speculative decoding support #18039
ichbinhandsome wants to merge 17 commits into ggml-org:master
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via the d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for the speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
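The d2t mapping can be illustrated with a toy lookup. The tensor name d2t comes from the PR; treating it as a plain table of target vocab ids indexed by draft vocab id is an assumption for illustration, and the real tensor layout may differ:

```python
import numpy as np

# Toy illustration of draft-to-target vocabulary mapping.
# Assumption: d2t is a lookup table of target vocab ids, indexed by draft
# vocab id (the actual layout in the PR may use offsets instead).
def map_draft_to_target(draft_logits: np.ndarray, d2t: np.ndarray) -> int:
    draft_id = int(np.argmax(draft_logits))  # greedy pick in the draft vocab
    return int(d2t[draft_id])                # translate to the target vocab

# toy draft vocab of 4 tokens mapped into a larger target vocab
d2t = np.array([10, 42, 7, 99])
print(map_draft_to_target(np.array([0.1, 3.0, 0.2, -1.0]), d2t))  # argmax is 1 -> 42
```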
Judging by the description of this PR, I believe many models with multi-token prediction also have the same strategy of reusing hidden features from the main model. It could be quite interesting to generalize this feature to support other models. I would expect some kind of sub-
I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next days.
Thanks @ggerganov @ngxson for your inputs. Definitely looking forward to hearing your feedback and improving this PR.
```cpp
// TODO: refactor into llm_graph_input
ggml_tensor * inp_g = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
ggml_set_input(inp_g);
cb(inp_g, "inp_g_embeddings", -1); // TODO: do not change the name! refactor into llm_graph_input
```
I will change this to llm_graph_input in order to remove the extra "set input" logic in llama_context::process_ubatch.
```cpp
// EAGLE3: Extract intermediate layer features from target model at layer INPUT
if (eagle3 && cparams.eagle3_extract_enabled && !eagle3->extract_layer_indices.empty()) {
    static const char * eagle3_extract_names[] = {"eagle3_extract_0", "eagle3_extract_1", "eagle3_extract_2"};
    for (size_t i = 0; i < eagle3->extract_layer_indices.size() && i < 3; ++i) {
        if (eagle3->extract_layer_indices[i] == il) {
            cb(inpL, eagle3_extract_names[i], il);
            break;
        }
    }
}
```
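The extraction pattern above can be mimicked with a small framework-agnostic sketch (the layer callables and shapes here are illustrative only, not llama.cpp code):

```python
def forward_with_taps(layers, x, tap_indices):
    """Run a stack of layer callables, capturing the *input* of each tapped
    layer, mirroring how the snippet above tags inpL at layer entry."""
    taps = []
    for il, layer in enumerate(layers):
        if il in tap_indices:
            taps.append(x)
        x = layer(x)
    return x, taps

# three toy "layers" that each add a constant
layers = [lambda v: v + 1, lambda v: v + 1, lambda v: v + 1]
out, taps = forward_with_taps(layers, 0, {0, 2})
print(out, taps)  # 3 [0, 2]
```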
I will next look to remove this ad hoc logic and generalize it in some way. Likely by passing the extraction points in a more generic way during llama_context creation. TBD
```cpp
// EAGLE3 draft model - target model hidden size
uint32_t eagle3_target_hidden_size = 0;
```
This can become more generic by renaming it to n_embd_enc and utilizing the n_embd_inp() call.
```cpp
// Get pointer to target model features extracted for EAGLE3 encoder
// Returns NULL if no features are available
// Format: [3*n_embd, n_tokens] - use model.hparams.n_embd and batch.n_tokens for dimensions
LLAMA_API const float * llama_get_eagle3_target_features(struct llama_context * ctx);
```
This call should become more generic and not Eagle3 specific. Will be looking into how to achieve this in the best way.
```cpp
// Set g_embeddings from EAGLE3 encoder output for decoder input
// g_embd: pointer to encoder output embeddings
LLAMA_API void llama_set_eagle3_g_embeddings(
    struct llama_context * ctx,
    const float * g_embd,
    int32_t n_embd,
    int32_t n_tokens);
```
Might be possible to avoid this API if we combine the Eagle encoder and decoder in a single context. TBD
When combining the Eagle3 encoder and decoder into a single context, note that the Eagle3 encoder is used only to fuse the extracted features from the target model, i.e. it is invoked as many times as the target model itself. The Eagle3 decoder, on the other hand, is solely responsible for generating draft tokens in an autoregressive way.
llama_set_eagle3_g_embeddings() sets the g_embeddings both from the Eagle3 encoder (used in the first generation step of the Eagle3 decoder) and from the Eagle3 decoder itself (used in subsequent generation steps).
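The invocation cadence described here can be sketched as one toy speculation round. All the function names below are hypothetical stand-ins, not the PR's API; the point is only that the encoder runs once per target pass while the decoder refreshes g_embeddings on every draft step:

```python
def eagle3_round(target_forward, encoder, decoder, tokens, k):
    """One speculation round: run the target once (tapping features), fuse
    them once with the encoder, then draft k tokens autoregressively."""
    _logits, feats = target_forward(tokens)   # target pass, features tapped
    g = encoder(feats)                        # fuse features -> initial g_embeddings
    drafts = []
    for _ in range(k):
        tok, g = decoder(g)                   # decoder output feeds the next step
        drafts.append(tok)
    return drafts

# stub components: encoder sums features, decoder emits g and increments it
drafts = eagle3_round(lambda t: (None, [1, 2, 3]),
                      lambda f: sum(f),
                      lambda g: (g, g + 1),
                      tokens=[0], k=3)
print(drafts)  # [6, 7, 8]
```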
Yup, I noticed this interaction. We don't have a previous use case similar to this, but I think the enc-dec context could be adapted accordingly.
Bumping, is there any progress on this? It's probably one of the more coveted features to have right now.

I'm currently side-tracked by some graph reallocation optimizations. Will probably come back to this after that.
Eagle3 checkpoints for the Qwen3 series (including both dense and MoE models) are now supported, see the updated PR description for details.
One question: it seems that CUDA Graphs are disabled when the input n_tokens > 1. During the target model verification stage of speculative decoding, CUDA Graphs are therefore always disabled for the target model, since it is only used for verification with multiple draft tokens. However, we could fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graphs (this may require removing the n_tokens > 1 constraint)? @ggerganov

Context: I'm testing GPT-OSS-120B Eagle3 with llama.cpp, and I found that even with Eagle3 (accept rate 86%), the performance is worse than the naive llama-cli. After profiling, I discovered that CUDA Graphs are consistently disabled for the target model during speculative decoding, whereas they remain enabled in llama-cli. This results in the target model's verification (prefill) phase being roughly >5x slower compared to a normal autoregressive decoding step. I've only observed this performance issue with GPT-OSS-120B Eagle3. For other models, even without CUDA Graphs enabled for the target model in Eagle3 speculative decoding, the performance remains great.
I think the small-batch
Possibly, but to me this sounds like second-order optimization. Optimizing the
Hm, this is a somewhat surprising observation. Can you run:

```shell
llama-batched-bench -m [gpt-oss-120b] -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4,5,6,7,8
```
Thanks very much for your inputs! @ggerganov
I double-checked the run today. The previous statement about CUDA Graphs was incorrect due to instability and concurrent CPU activity in my test environment, sorry about that! Currently, enabling or disabling CUDA Graphs doesn't have much impact in llama-cli for the GPT-OSS-120B model. (I am testing on DGX Spark)
Also, the results for llama-batched-bench:
I agree. CUDA Graphs could be a second-order optimization.
For MoE models, prefill becomes the main performance bottleneck because more active experts are involved. As a result, the assumption that "processing multiple draft tokens concurrently is as fast as processing a single token" no longer holds, which is an important condition for effective speculative decoding. I also saw that as the draft token length increases, the verification cost of the target model also rises. Do you have any rough ideas of how much performance gain we can get through improving
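A back-of-envelope model (my own simplification, not from the PR) makes the trade-off concrete: with per-token acceptance probability p and draft length k, a round yields on average (1 - p^(k+1)) / (1 - p) tokens, so the k-token verification pass must not cost much more than one plain step for a net win:

```python
def expected_speedup(p, k, t_draft, t_verify, t_base):
    """Rough speculative-decoding speedup estimate.
    p        per-token acceptance probability (0 <= p < 1)
    k        draft length
    t_draft  time per draft-model step
    t_verify time for one k-token verification pass of the target model
    t_base   time for one plain autoregressive target step
    """
    tokens_per_round = (1 - p ** (k + 1)) / (1 - p)  # truncated geometric sum
    return tokens_per_round * t_base / (k * t_draft + t_verify)

# if verifying 8 tokens cost the same as one plain step and drafting were free,
# the speedup would simply equal the expected accepted tokens per round
print(round(expected_speedup(0.86, 8, 0.0, 1.0, 1.0), 2))
```

For MoE models the point is that t_verify grows with k (more active experts touched), which eats into the numerator's gain.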
(I have to split up my comment otherwise it's too long)

My proposal is that we must design this function + the API in a way that is flexible enough for future models. For EAGLE3, the MTP model is technically a

For the API, we must avoid leaking information about the implementation under the hood. The downstream code must only know how many tokens can be generated; it doesn't need to know how to generate these extra tokens. So an API along the following lines should be enough:
All the info about embeddings and the draft model must be kept private.

CC @ggerganov maybe this is helpful for you
As far as I know, the Eagle3 authors did not discuss their approach to MoE model performance in their paper. I am currently cross-checking the performance of GPT-OSS-120B Eagle3 on DGX Spark using SGLang, which essentially employs the same GPT-OSS-120B-Eagle3 draft model as I used for llama.cpp testing. The running commands I used are as follows:

- Baseline:

```shell
python3 -m sglang.launch_server --model-path gpt-oss-120b --host 0.0.0.0 --port 30000 --trust-remote-code
```

- Eagle3: Set the draft size to 8 and disable tree decoding to ensure a fair comparison with our tests on llama.cpp.

```shell
python3 -m sglang.launch_server --model gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 8 --speculative-eagle-topk 1 --speculative-num-draft-tokens 8 --trust-remote-code --host 0.0.0.0 --port 30000
```

I am using the following request to measure throughput:

```shell
curl -sS -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Write a quicksort algorithm in Python. Write code only.",
    "sampling_params": {
      "max_new_tokens": 256
    }
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
tokens = d['meta_info']['completion_tokens']
latency = d['meta_info']['e2e_latency']
tps = tokens / latency
print(f'completion_tokens: {tokens}')
print(f'e2e_latency: {latency:.3f}s')
print(f'token/s: {tps:.2f}')
"
```

Here are the test results on DGX Spark:
I also tested shorter draft sizes using the following command:

```shell
python3 -m sglang.launch_server --model /home/nvidia/models/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path /home/nvidia/ruixiangw/models/lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

The results:
From the tables above, we observed similar performance degradation for GPT-OSS-120B-Eagle3 on a single GPU device in SGLang as well. However, in their blog post, they claimed to have achieved some speedups for GPT-OSS-120B-Eagle3 inference using

In summary, I believe that for large MoE models such as GPT-OSS-120B, Eagle3 may not provide a performance gain on a single GPU device in the single-prompt use case. However, this does not apply to all MoE models; for example, we observed a performance improvement with Qwen3-30B-A3B_eagle3. This might be related to the number of active experts per token and the overall model size, where loading active experts (a memory-bound operation) dominates the inference time.
Thank you very much for taking the time for this insightful proposal. Although we discussed the Eagle3 design (#15902 (reply in thread)) several months ago, it's still great to hear your perspective. @ggerganov These might be things worth considering.
The mentioned discussion only discusses the internal design, not the public API design. It's probably best to open a dedicated discussion on the public API design to avoid going too far in a wrong direction. Even after reading #15902,
```cpp
llm_build_eagle3_encode::llm_build_eagle3_encode(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
    ggml_tensor * cur = nullptr;

    cur = build_inp_embd();

    // Feature fusion layer
    cur = build_lora_mm(model.fc, cur);
    cb(cur, "fc_out", -1);

    // Output: g_embeddings e.g. [4096, n_tokens]
    res->t_embd = cur;

    ggml_build_forward_expand(gf, cur);
}
```
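As a toy picture of what this fusion does, the sketch below concatenates three tapped hidden states and projects them back down, following the [3*n_embd, n_tokens] format mentioned earlier in the PR. The plain matmul is an assumption for illustration; it ignores LoRA and any normalization the real graph applies:

```python
import numpy as np

def fuse_features(taps, W_fc):
    """Concatenate tapped hidden states and project back to n_embd.
    taps: list of 3 arrays, each [n_tokens, n_embd]; W_fc: [3*n_embd, n_embd]."""
    stacked = np.concatenate(taps, axis=-1)   # [n_tokens, 3*n_embd]
    return stacked @ W_fc                     # g_embeddings: [n_tokens, n_embd]

n_tokens, n_embd = 2, 4
taps = [np.ones((n_tokens, n_embd)) for _ in range(3)]
g = fuse_features(taps, np.zeros((3 * n_embd, n_embd)))
print(g.shape)  # (2, 4)
```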
If the whole point of the encoder is just to do a projection, I think it isn't truly an encoder in transformer terms.
An encoder is responsible for populating the KV cache; here, we do not touch the KV cache at all. Instead, I believe this projection can be part of the decoder.
If we need to allow larger input embeddings than n_embd, there is an interface called n_embd_inp that allows doing just that.
```cpp
// Single decoder layer (il = 0)
const int il = 0;
{
```
Hmm ok, I thought that we could fuse this cgraph with the main LLM cgraph. But that won't work very well because we need to call the sampling system to sample a new token for each decoding pass of eagle3.
In that case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it to the public API)
I think the best could be to have a notion of sub-llama_context, where one llama_context can encapsulate another llama_context. Will see if this is something that can easily be implemented or not.
> In such case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it to the public API)
I think it can be avoided using an enc-dec context:
It's not necessary because my comment above suggests that eagle3 is not exactly an enc-dec model, but more like a decoder-only model with n_embd_inp > n_embd.
What I'm suggesting here is to pass the embeddings from the main LLM to the smaller speculative LLM. Because they are currently in 2 different llama_context, we have no better way at the moment than passing them via a public API (which makes it less future-proof).
(I think I'm commenting on the wrong line, this comment should be placed on llama_get_eagle3_target_features)
I looked deeper into the GLM-4.6 implementation today, and I'm pretty confident that eagle3 is almost the same as the MTP model of GLM-4.6.
The "encoder" here is basically equivalent to nextn.eh_proj. It is not an enc-dec in transformer terms (i.e. unlike T5), just bad naming.
And the rest is the same as the deepseekv3 MTP style, except that instead of passing the hidden state from one MTP pass to another MTP pass, eagle3 uses the KV cache.
I'm playing around with an implementation on my side that will expose just a single llama_decode_mtp call that handles hidden state passing under the hood (based on llama_cross), so you can think of the main LLM as the encoder, which populates the cross, and the MTP as the decoder, in transformer terms.
Will push it when I have a working version.
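A minimal sketch of that hidden-state handoff idea. llama_cross is mentioned in the discussion, but the classes and functions below are hypothetical toys invented for illustration, not real llama.cpp API:

```python
class Cross:
    """Toy stand-in for a buffer shared between the main and MTP contexts."""
    def __init__(self):
        self.hidden = None

def main_decode(cross, x):
    h = [v * 2 for v in x]   # pretend target forward pass
    cross.hidden = h         # populate the cross, like an encoder would
    return h

def mtp_decode(cross):
    # draft pass consumes the stashed hidden state instead of a public API call
    return [v + 1 for v in cross.hidden]

cross = Cross()
main_decode(cross, [1, 2])
print(mtp_decode(cross))  # [3, 5]
```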
In any case, I'm still not convinced that the linear projection should be a dedicated "encoder" cgraph. As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model.
Solution 2 in my last comment seems to be the most feasible; will try to implement that in my PR.
> As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model
No, it is not. As mentioned earlier, the performance degradation occurs only with the MoE model (#18039 (comment)). This is because the MoE model requires significantly more time for draft token verification compared to the dense model.
If you perform profiling, you will notice that the backend synchronization between the encode and decode passes of the Eagle3 model is relatively negligible.

Of course it is negligible if you compare it to the time it takes for the verification pass, but I don't believe it is negligible compared to the time it takes to generate one single draft token. The draft model is very small and CPU time can have a significant impact on it.
But even if you say that's not important for whatever reason, the more important thing is that copying data to host memory is redundant. At this point, I think it's a better use of my time to just improve this in my implementation instead of arguing here.
Also, from your profiling screenshot, it seems like there is a big gap between the large cudaMemcpyAsync and the run after it (I suppose that's the encoder pass of eagle3). I'm curious what happens in that big gap, probably some calculations on the CPU?
> I'm curious what happens in that big gap, probably some calculations on the CPU?
Yes. It is the rejection sampling phase during speculative decoding. Once we obtain the logits for the draft tokens from the target model, we need to verify which tokens are accepted and which need to be rejected, and prepare these as input for the draft model. Note that the token_id-to-embedding mapping also happens on the CPU.
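That CPU-side verification step can be illustrated with a greedy accept/reject loop. This is a simplification of real rejection sampling (which also handles non-greedy sampling), written only to show the shape of the computation:

```python
def verify_greedy(draft_tokens, target_logits):
    """Accept draft tokens while each matches the target model's greedy choice;
    on the first mismatch, return the target's own token as the correction."""
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits):
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != tok:
            return accepted, best
        accepted.append(tok)
    return accepted, None

# drafts 1 and 2 match the target's greedy picks; draft 3 is rejected,
# and the target's choice (token 0) replaces it
acc, fix = verify_greedy([1, 2, 3], [[0, 9], [0, 0, 9], [9, 0, 0, 0]])
print(acc, fix)  # [1, 2] 0
```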
@ichbinhandsome thank you for looking into the Baichuan model.
It took me a bit because I had to download the gguf of Qwen3. It does appear to work, but I'm noticing somewhat of a slowdown:
With EAGLE3:

Without EAGLE3:
By the way, it says "inf tokens per second" under eval time, is that being replaced with the "decoded" section on top? Just making sure I'm reading it correctly.
Thank you very much for testing this! The slowdown may be due to the short prompt (“Hello!”) or potential MoE performance issues mentioned in this comment. Could you try running the experiments using the same prompts I provided as examples in this PR? I’d expect a higher accept rate with those prompts, which might result in some speedups.
I'm using the same metrics as the original code. I think the reason for the inf value is that the target model is only used for draft token verification (prefill) rather than autoregressive decoding. Since no actual decode steps are performed, the eval time is recorded as 0 ms, resulting in inf t/s.
@ichbinhandsome No problem!
Additional info for Qwen3-235B-A22B EAGLE3

Additional info for Qwen3-1.7B EAGLE3
Thanks for testing! Glad to see the model works. Though the speedup for these models is relatively small or even worse, and the accept rate is quite low.
This was just linked on Reddit today: https://z-lab.ai/projects/dflash/ https://github.com/z-lab/dflash and seems worth thinking about for any future MTP/Eagle API:
I think the conversion script needs to be updated to yield values as they seem to be missing:

```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d59426343..7e84a764a 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -2703,20 +2703,22 @@ class LlamaModel(TextModel):
         # Eagle-3 llama checkpoint special weights handling
         # fc.weight: feature fusion layer
         if name == "fc.weight":
-            return [(name, data_torch)]
+            yield (name, data_torch)
+            return
         # d2t: draft to target vocabulary mapping
         elif name == "d2t":
             # Skip parent class processing (store for manual handling in prepare_tensors)
             if not hasattr(self, '_eagle3_int_tensors'):
                 self._eagle3_int_tensors = {}
             self._eagle3_int_tensors[name] = data_torch
-            return []
+            return
         # t2d: target to draft vocabulary mapping (not used, skip completely)
         elif name == "t2d":
-            return []
+            return
         # hidden_norm: EAGLE-3 specific layer normalization
         elif name == "model.layers.0.hidden_norm.weight":
-            return [("blk.0.hidden_norm.weight", data_torch)]
+            yield ("blk.0.hidden_norm.weight", data_torch)
+            return

         n_head = self.find_hparam(["n_heads", "num_attention_heads"])
         n_kv_head = self.find_hparam(["n_kv_heads", "num_key_value_heads"])
```

Without this change the following error occurs:

```
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
llama_model_load: error loading model: missing tensor 'fc.weight'
llama_model_load_from_file_impl: failed to load model
failed to load draft model, 'models/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf'
```
Thanks @danbev Now it should be fixed. The upstream merge converted this function to a generator via
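The pitfall the diff above fixes can be reproduced in isolation: once a function body contains a yield anywhere, a `return [...]` no longer produces items, it only sets the (usually ignored) StopIteration value. The function names below are made up for the demonstration:

```python
def broken_modify(name, tensor):
    if name == "fc.weight":
        return [(name, tensor)]  # BUG: discarded, because the yield below
                                 # makes this whole function a generator
    yield (name, tensor)

def fixed_modify(name, tensor):
    if name == "fc.weight":
        yield (name, tensor)     # correct: actually emits the pair
        return
    yield (name, tensor)

print(list(broken_modify("fc.weight", 0)))  # [] -> tensor silently dropped
print(list(fixed_modify("fc.weight", 0)))   # [('fc.weight', 0)]
```

This is exactly why the converted script hit "missing tensor 'fc.weight'": the returned list was thrown away instead of being yielded.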
Nvidia released an EAGLE3 model trained to predict Kimi-K2.5. https://huggingface.co/nvidia/Kimi-K2.5-Thinking-Eagle3
An EAGLE3 model for Qwen3-Coder-Next was released a few days ago and I wanted to give it a try. I got pretty mixed results, so I assume I did something wrong. I am running on Ubuntu 24.04 with a CPU-only setup for a first test. So here is what I did:
I got many of the following errors, but the execution continued nonetheless:

```
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 107
 - the tokens for sequence 0 in the input batch have a starting position of Y = 107
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
```

It is hard to read the generated code in between all these error messages, and it looks like it is not the same between all the different draft runs and the baseline run. The draft runs appear to generate the same output as the baseline model, but then they output some additional (wrong) code. The wrong code is different between the models, but the number of decoded tokens in the end stays the same.

Update: I just realized that I am forcing the number of tokens to be 256 (by using the command given by the OP), so that could be the reason for the constant length.

I got the following results (I got the baseline values using llama-cli, is that the intended way?):
I am surprised that there appears to be a twofold increase in speed for draft 4 while only about 1/4 of the output tokens were predicted correctly. I even measured the time with a stopwatch and the t/s are correct, as long as the number of decoded tokens is correct. So does anyone have an idea what I did wrong? I would love a speedup of 2x, however I would have to try it with Vulkan as well. Maybe someone else who actually knows what to do could give this a try instead of me?
As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp enhances its performance and strengthens its competitiveness among leading inference frameworks. With Eagle3 speculative decoding now integrated into llama.cpp, inference performance has been significantly improved, achieving a 2–3× speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.
The following provides a brief overview of this PR:
EAGLE3 is an encoder-decoder based speculative decoding method:
Key changes:
EAGLE3 Architecture Overview :
How to run EAGLE3 in llama.cpp
Requirements
This PR currently ~~only supports two~~ supports the following EAGLE3 models:

The following eagle3 models should also work out of the box, though they haven't been tested yet:
Step 1: Convert Models to GGUF Format
Step 2: Compile llama.cpp
[Optional] Step 3: Quantize the GGUF model
Step 4: Run EAGLE3 Speculative Decoding
Performance Evaluation (RTX A6000 48GB)
Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.
- BF16, its Eagle3 with FP16
- Q4_K_M, its Eagle3 with Q4_K_M
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16
- BF16, its Eagle3 with BF16
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, speedup might be better on other hardware)
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, similar performance issue as GPT-OSS-120B Eagle3)

Details of GGML backend modifications (Fixed, no longer needed)

In the Eagle3 decoder, two parallel inputs are processed. When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto-syncs between subgraphs).

Solution: Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing. This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.

Example results
Future Steps