[Speculative decoding] feat: add EAGLE3 speculative decoding support #18039
ichbinhandsome wants to merge 17 commits into ggml-org:master
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via the d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for the speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
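The d2t mapping can be illustrated with a toy lookup. The tensor name d2t comes from the PR; treating it as a plain table of target vocab ids indexed by draft vocab id is an assumption for illustration, and the real tensor layout may differ:

```python
import numpy as np

# Toy illustration of draft-to-target vocabulary mapping.
# Assumption: d2t is a lookup table of target vocab ids, indexed by draft
# vocab id (the actual layout in the PR may use offsets instead).
def map_draft_to_target(draft_logits: np.ndarray, d2t: np.ndarray) -> int:
    draft_id = int(np.argmax(draft_logits))  # greedy pick in the draft vocab
    return int(d2t[draft_id])                # translate to the target vocab

# toy draft vocab of 4 tokens mapped into a larger target vocab
d2t = np.array([10, 42, 7, 99])
print(map_draft_to_target(np.array([0.1, 3.0, 0.2, -1.0]), d2t))  # argmax is 1 -> 42
```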
Judging by the description of this PR, I believe many models with multi-token prediction also have the same strategy of reusing hidden features from the main model. It could be quite interesting to generalize this feature to support other models. I would expect some kind of sub-
I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next days.
Thanks @ggerganov @ngxson for your inputs. Definitely looking forward to hearing your feedback and improving this PR.
```cpp
// TODO: refactor into llm_graph_input
ggml_tensor * inp_g = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
ggml_set_input(inp_g);
cb(inp_g, "inp_g_embeddings", -1); // TODO: do not change the name! refactor into llm_graph_input
```
I will change this to llm_graph_input in order to remove the extra "set input" logic in llama_context::process_ubatch.
```cpp
// EAGLE3: Extract intermediate layer features from target model at layer INPUT
if (eagle3 && cparams.eagle3_extract_enabled && !eagle3->extract_layer_indices.empty()) {
    static const char * eagle3_extract_names[] = {"eagle3_extract_0", "eagle3_extract_1", "eagle3_extract_2"};
    for (size_t i = 0; i < eagle3->extract_layer_indices.size() && i < 3; ++i) {
        if (eagle3->extract_layer_indices[i] == il) {
            cb(inpL, eagle3_extract_names[i], il);
            break;
        }
    }
}
```
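The extraction pattern above can be mimicked with a small framework-agnostic sketch (the layer callables and shapes here are illustrative only, not llama.cpp code):

```python
def forward_with_taps(layers, x, tap_indices):
    """Run a stack of layer callables, capturing the *input* of each tapped
    layer, mirroring how the snippet above tags inpL at layer entry."""
    taps = []
    for il, layer in enumerate(layers):
        if il in tap_indices:
            taps.append(x)
        x = layer(x)
    return x, taps

# three toy "layers" that each add a constant
layers = [lambda v: v + 1, lambda v: v + 1, lambda v: v + 1]
out, taps = forward_with_taps(layers, 0, {0, 2})
print(out, taps)  # 3 [0, 2]
```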
I will next look to remove this ad hoc logic and generalize it in some way. Likely by passing the extraction points in a more generic way during llama_context creation. TBD
```cpp
// EAGLE3 draft model - target model hidden size
uint32_t eagle3_target_hidden_size = 0;
```
This can become more generic by renaming it to n_embd_enc and utilizing the n_embd_inp() call.
```cpp
// Get pointer to target model features extracted for EAGLE3 encoder
// Returns NULL if no features are available
// Format: [3*n_embd, n_tokens] - use model.hparams.n_embd and batch.n_tokens for dimensions
LLAMA_API const float * llama_get_eagle3_target_features(struct llama_context * ctx);
```
This call should become more generic and not Eagle3 specific. Will be looking into how to achieve this in the best way.
```cpp
// Set g_embeddings from EAGLE3 encoder output for decoder input
// g_embd: pointer to encoder output embeddings
LLAMA_API void llama_set_eagle3_g_embeddings(
    struct llama_context * ctx,
    const float * g_embd,
    int32_t n_embd,
    int32_t n_tokens);
```
Might be possible to avoid this API if we combine the Eagle encoder and decoder in a single context. TBD
When combining the Eagle3 encoder and decoder into a single context, note that the Eagle3 encoder is used only to fuse the extracted features from the target model, i.e. it is invoked as many times as the target model itself. The Eagle3 decoder, on the other hand, is solely responsible for generating draft tokens in an autoregressive way.
llama_set_eagle3_g_embeddings() sets the g_embeddings both from the Eagle3 encoder (used in the first generation step of the Eagle3 decoder) and from the Eagle3 decoder itself (used in subsequent generation steps).
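The invocation cadence described here can be sketched as one toy speculation round. All the function names below are hypothetical stand-ins, not the PR's API; the point is only that the encoder runs once per target pass while the decoder refreshes g_embeddings on every draft step:

```python
def eagle3_round(target_forward, encoder, decoder, tokens, k):
    """One speculation round: run the target once (tapping features), fuse
    them once with the encoder, then draft k tokens autoregressively."""
    _logits, feats = target_forward(tokens)   # target pass, features tapped
    g = encoder(feats)                        # fuse features -> initial g_embeddings
    drafts = []
    for _ in range(k):
        tok, g = decoder(g)                   # decoder output feeds the next step
        drafts.append(tok)
    return drafts

# stub components: encoder sums features, decoder emits g and increments it
drafts = eagle3_round(lambda t: (None, [1, 2, 3]),
                      lambda f: sum(f),
                      lambda g: (g, g + 1),
                      tokens=[0], k=3)
print(drafts)  # [6, 7, 8]
```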
Yup, I noticed this interaction. We don't have a previous use case similar to this, but I think the enc-dec context could be adapted accordingly.
Bumping, is there any progress on this? It's probably one of the more coveted features to have right now.

I'm currently side-tracked by some graph reallocation optimizations. Will probably come back to this after that.
Eagle3 checkpoints for the Qwen3 series (including both dense and MoE models) are now supported, see the updated PR description for details.
One question: it seems that CUDA Graphs are disabled when the input n_tokens > 1. During the target model verification stage of speculative decoding, CUDA Graphs are therefore always disabled for the target model, since it is only used for verification with multiple draft tokens. However, we could fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graphs (this may require removing the n_tokens > 1 constraint)? @ggerganov

Context: I'm testing GPT-OSS-120B Eagle3 with llama.cpp, and I found that even with Eagle3 (accept rate 86%), the performance is worse than the naive llama-cli. After profiling, I discovered that CUDA Graphs are consistently disabled for the target model during speculative decoding, whereas they remain enabled in llama-cli. This results in the target model's verification (prefill) phase being roughly >5x slower compared to a normal autoregressive decoding step. I've only observed this performance issue with GPT-OSS-120B Eagle3. For other models, even without CUDA Graphs enabled for the target model in Eagle3 speculative decoding, the performance remains great.
I think the small-batch
Possibly, but to me this sounds like second-order optimization. Optimizing the
Hm, this is a somewhat surprising observation. Can you run:

```shell
llama-batched-bench -m [gpt-oss-120b] -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4,5,6,7,8
```
Thanks very much for your inputs! @ggerganov
I double-checked the run today. The previous statement about CUDA Graphs was incorrect due to instability and concurrent CPU activity in my test environment, sorry about that! Currently, enabling or disabling CUDA Graphs doesn't have much impact in llama-cli for the GPT-OSS-120B model. (I am testing on DGX Spark)
Also, the results for llama-batched-bench:
I agree. CUDA Graphs could be a second-order optimization.
For MoE models, prefill becomes the main performance bottleneck because more active experts are involved. As a result, the assumption that "processing multiple draft tokens concurrently is as fast as processing a single token" no longer holds, which is an important condition for effective speculative decoding. I also saw that as the draft token length increases, the verification cost of the target model also rises. Do you have any rough ideas of how much performance gain we can get through improving
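A back-of-envelope model (my own simplification, not from the PR) makes the trade-off concrete: with per-token acceptance probability p and draft length k, a round yields on average (1 - p^(k+1)) / (1 - p) tokens, so the k-token verification pass must not cost much more than one plain step for a net win:

```python
def expected_speedup(p, k, t_draft, t_verify, t_base):
    """Rough speculative-decoding speedup estimate.
    p        per-token acceptance probability (0 <= p < 1)
    k        draft length
    t_draft  time per draft-model step
    t_verify time for one k-token verification pass of the target model
    t_base   time for one plain autoregressive target step
    """
    tokens_per_round = (1 - p ** (k + 1)) / (1 - p)  # truncated geometric sum
    return tokens_per_round * t_base / (k * t_draft + t_verify)

# if verifying 8 tokens cost the same as one plain step and drafting were free,
# the speedup would simply equal the expected accepted tokens per round
print(round(expected_speedup(0.86, 8, 0.0, 1.0, 1.0), 2))
```

For MoE models the point is that t_verify grows with k (more active experts touched), which eats into the numerator's gain.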
(I have to split up my comment otherwise it's too long)

My proposal is that we must design this function + the API in a way that is flexible enough for future models. For EAGLE3, the MTP model is technically a

For the API, we must avoid leaking information about the implementation under the hood. The downstream code must only know how many tokens can be generated; it doesn't need to know how to generate these extra tokens. So an API along the following lines should be enough:
All the info about embeddings and the draft model must be kept private.

CC @ggerganov maybe this is helpful for you
As far as I know, the Eagle3 authors did not discuss their approach to MoE model performance in their paper. I am currently cross-checking the performance of GPT-OSS-120B Eagle3 on DGX Spark using SGLang, which essentially employs the same GPT-OSS-120B-Eagle3 draft model as I used for llama.cpp testing. The running commands I used are as follows:

- Baseline:

```shell
python3 -m sglang.launch_server --model-path gpt-oss-120b --host 0.0.0.0 --port 30000 --trust-remote-code
```

- Eagle3: Set the draft size to 8 and disable tree decoding to ensure a fair comparison with our tests on llama.cpp.

```shell
python3 -m sglang.launch_server --model gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 8 --speculative-eagle-topk 1 --speculative-num-draft-tokens 8 --trust-remote-code --host 0.0.0.0 --port 30000
```

I am using the following request to measure throughput:

```shell
curl -sS -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Write a quicksort algorithm in Python. Write code only.",
    "sampling_params": {
      "max_new_tokens": 256
    }
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
tokens = d['meta_info']['completion_tokens']
latency = d['meta_info']['e2e_latency']
tps = tokens / latency
print(f'completion_tokens: {tokens}')
print(f'e2e_latency: {latency:.3f}s')
print(f'token/s: {tps:.2f}')
"
```

Here are the test results on DGX Spark:
I also tested shorter draft sizes using the following command:

```shell
python3 -m sglang.launch_server --model /home/nvidia/models/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path /home/nvidia/ruixiangw/models/lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

The results:
From the tables above, we observed similar performance degradation for GPT-OSS-120B-Eagle3 on a single GPU device in SGLang as well. However, in their blog post, they claimed to have achieved some speedups for GPT-OSS-120B-Eagle3 inference using

In summary, I believe that for large MoE models such as GPT-OSS-120B, Eagle3 may not provide a performance gain on a single GPU device in the single-prompt use case. However, this does not apply to all MoE models; for example, we observed a performance improvement with Qwen3-30B-A3B_eagle3. This might be related to the number of active experts per token and the overall model size, where loading active experts (a memory-bound operation) dominates the inference time.
Thank you very much for taking the time for this insightful proposal. Although we discussed the Eagle3 design (#15902 (reply in thread)) several months ago, it's still great to hear your perspective. @ggerganov These might be things worth considering.
The mentioned discussion only discusses the internal design, not the public API design. It's probably best to open a dedicated discussion on the public API design to avoid going too far in a wrong direction. Even after reading #15902,
```cpp
llm_build_eagle3_encode::llm_build_eagle3_encode(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
    ggml_tensor * cur = nullptr;

    cur = build_inp_embd();

    // Feature fusion layer
    cur = build_lora_mm(model.fc, cur);
    cb(cur, "fc_out", -1);

    // Output: g_embeddings e.g. [4096, n_tokens]
    res->t_embd = cur;

    ggml_build_forward_expand(gf, cur);
}
```
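As a toy picture of what this fusion does, the sketch below concatenates three tapped hidden states and projects them back down, following the [3*n_embd, n_tokens] format mentioned earlier in the PR. The plain matmul is an assumption for illustration; it ignores LoRA and any normalization the real graph applies:

```python
import numpy as np

def fuse_features(taps, W_fc):
    """Concatenate tapped hidden states and project back to n_embd.
    taps: list of 3 arrays, each [n_tokens, n_embd]; W_fc: [3*n_embd, n_embd]."""
    stacked = np.concatenate(taps, axis=-1)   # [n_tokens, 3*n_embd]
    return stacked @ W_fc                     # g_embeddings: [n_tokens, n_embd]

n_tokens, n_embd = 2, 4
taps = [np.ones((n_tokens, n_embd)) for _ in range(3)]
g = fuse_features(taps, np.zeros((3 * n_embd, n_embd)))
print(g.shape)  # (2, 4)
```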
If the whole point of the encoder is just to do a projection, I think it isn't truly an encoder in transformer terms.
An encoder is responsible for populating the KV cache; here, we do not touch the KV cache at all. Instead, I believe this projection can be part of the decoder.
If we need to allow larger input embeddings than n_embd, there is an interface called n_embd_inp that allows doing just that.
```cpp
// Single decoder layer (il = 0)
const int il = 0;
{
```
Hmm ok, I thought that we could fuse this cgraph with the main LLM cgraph. But that won't work very well because we need to call the sampling system to sample a new token for each decoding pass of eagle3.
In that case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it to the public API)
I think the best could be to have a notion of sub-llama_context, where one llama_context can encapsulate another llama_context. Will see if this is something that can easily be implemented or not.
> In such case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it to the public API)
I think it can be avoided using an enc-dec context:
It's not necessary because my comment above suggests that eagle3 is not exactly an enc-dec model, but more like a decoder-only model with n_embd_inp > n_embd.
What I'm suggesting here is to pass the embeddings from the main LLM to the smaller speculative LLM. Because they are currently in 2 different llama_context, we have no better way at the moment than passing them via a public API (which makes it less future-proof).
(I think I'm commenting on the wrong line, this comment should be placed on llama_get_eagle3_target_features)
I looked deeper into the GLM-4.6 implementation today, and I'm pretty confident that eagle3 is almost the same as the MTP model of GLM-4.6.
The "encoder" here is basically equivalent to nextn.eh_proj. It is not an enc-dec in transformer terms (i.e. unlike T5), just bad naming.
And the rest is the same as the deepseekv3 MTP style, except that instead of passing the hidden state from one MTP pass to another MTP pass, eagle3 uses the KV cache.
I'm playing around with an implementation on my side that will expose just a single llama_decode_mtp call that handles hidden state passing under the hood (based on llama_cross), so you can think of the main LLM as the encoder, which populates the cross, and the MTP as the decoder, in transformer terms.
Will push it when I have a working version.
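A minimal sketch of that hidden-state handoff idea. llama_cross is mentioned in the discussion, but the classes and functions below are hypothetical toys invented for illustration, not real llama.cpp API:

```python
class Cross:
    """Toy stand-in for a buffer shared between the main and MTP contexts."""
    def __init__(self):
        self.hidden = None

def main_decode(cross, x):
    h = [v * 2 for v in x]   # pretend target forward pass
    cross.hidden = h         # populate the cross, like an encoder would
    return h

def mtp_decode(cross):
    # draft pass consumes the stashed hidden state instead of a public API call
    return [v + 1 for v in cross.hidden]

cross = Cross()
main_decode(cross, [1, 2])
print(mtp_decode(cross))  # [3, 5]
```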
In any case, I'm still not convinced that the linear projection should be a dedicated "encoder" cgraph. As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model.
Solution 2 in my last comment seems to be the most feasible; will try to implement that in my PR.
> As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model
No, it is not. As mentioned earlier, the performance degradation occurs only with the MoE model (#18039 (comment)). This is because the MoE model requires significantly more time for draft token verification compared to the dense model.
If you perform profiling, you will notice that the backend synchronization between the encode and decode passes of the Eagle3 model is relatively negligible.

Of course it is negligible if you compare it to the time it takes for the verification pass, but I don't believe it is negligible compared to the time it takes to generate one single draft token. The draft model is very small and CPU time can have a significant impact on it.
But even if you say that's not important for whatever reason, the more important thing is that copying data to host memory is redundant. At this point, I think it's a better use of my time to just improve this in my implementation instead of arguing here.
Also, from your profiling screenshot, it seems like there is a big gap between the large cudaMemcpyAsync and the run after it (I suppose that's the encoder pass of eagle3). I'm curious what happens in that big gap, probably some calculations on the CPU?
> I'm curious what happens in that big gap, probably some calculations on the CPU?
Yes. It is the rejection sampling phase during speculative decoding. Once we obtain the logits for the draft tokens from the target model, we need to verify which tokens are accepted and which need to be rejected, and prepare these as input for the draft model. Note that the token_id-to-embedding mapping also happens on the CPU.
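That CPU-side verification step can be illustrated with a greedy accept/reject loop. This is a simplification of real rejection sampling (which also handles non-greedy sampling), written only to show the shape of the computation:

```python
def verify_greedy(draft_tokens, target_logits):
    """Accept draft tokens while each matches the target model's greedy choice;
    on the first mismatch, return the target's own token as the correction."""
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits):
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != tok:
            return accepted, best
        accepted.append(tok)
    return accepted, None

# drafts 1 and 2 match the target's greedy picks; draft 3 is rejected,
# and the target's choice (token 0) replaces it
acc, fix = verify_greedy([1, 2, 3], [[0, 9], [0, 0, 9], [9, 0, 0, 0]])
print(acc, fix)  # [1, 2] 0
```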
@ichbinhandsome thank you for looking into the Baichuan model.
It took me a bit because I had to download the gguf of Qwen3. It does appear to work, but I'm noticing somewhat of a slowdown:
With EAGLE3:

Without EAGLE3:
By the way, it says "inf tokens per second" under eval time, is that being replaced with the "decoded" section on top? Just making sure I'm reading it correctly.
Thank you very much for testing this! The slowdown may be due to the short prompt (“Hello!”) or potential MoE performance issues mentioned in this comment. Could you try running the experiments using the same prompts I provided as examples in this PR? I’d expect a higher accept rate with those prompts, which might result in some speedups.
I'm using the same metrics as the original code. I think the reason for the inf value is that the target model is only used for draft token verification (prefill) rather than autoregressive decoding. Since no actual decode steps are performed, the eval time is recorded as 0 ms, resulting in inf t/s.
@ichbinhandsome No problem!
Additional info for Qwen3-235B-A22B EAGLE3

Additional info for Qwen3-1.7B EAGLE3
Thanks for testing! Glad to see the model works. Though the speedup for these models is relatively small or even worse, and the accept rate is quite low.
This was just linked on Reddit today: https://z-lab.ai/projects/dflash/ https://github.com/z-lab/dflash and seems worth thinking about for any future MTP/Eagle API:
I think the conversion script needs to be updated to yield values as they seem to be missing:

```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d59426343..7e84a764a 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -2703,20 +2703,22 @@ class LlamaModel(TextModel):
         # Eagle-3 llama checkpoint special weights handling
         # fc.weight: feature fusion layer
         if name == "fc.weight":
-            return [(name, data_torch)]
+            yield (name, data_torch)
+            return
         # d2t: draft to target vocabulary mapping
         elif name == "d2t":
             # Skip parent class processing (store for manual handling in prepare_tensors)
             if not hasattr(self, '_eagle3_int_tensors'):
                 self._eagle3_int_tensors = {}
             self._eagle3_int_tensors[name] = data_torch
-            return []
+            return
         # t2d: target to draft vocabulary mapping (not used, skip completely)
         elif name == "t2d":
-            return []
+            return
         # hidden_norm: EAGLE-3 specific layer normalization
         elif name == "model.layers.0.hidden_norm.weight":
-            return [("blk.0.hidden_norm.weight", data_torch)]
+            yield ("blk.0.hidden_norm.weight", data_torch)
+            return

         n_head = self.find_hparam(["n_heads", "num_attention_heads"])
         n_kv_head = self.find_hparam(["n_kv_heads", "num_key_value_heads"])
```

Without this change the following error occurs:

```
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
llama_model_load: error loading model: missing tensor 'fc.weight'
llama_model_load_from_file_impl: failed to load model
failed to load draft model, 'models/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf'
```
Thanks @danbev Now it should be fixed. The upstream merge converted this function to a generator via
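The pitfall the diff above fixes can be reproduced in isolation: once a function body contains a yield anywhere, a `return [...]` no longer produces items, it only sets the (usually ignored) StopIteration value. The function names below are made up for the demonstration:

```python
def broken_modify(name, tensor):
    if name == "fc.weight":
        return [(name, tensor)]  # BUG: discarded, because the yield below
                                 # makes this whole function a generator
    yield (name, tensor)

def fixed_modify(name, tensor):
    if name == "fc.weight":
        yield (name, tensor)     # correct: actually emits the pair
        return
    yield (name, tensor)

print(list(broken_modify("fc.weight", 0)))  # [] -> tensor silently dropped
print(list(fixed_modify("fc.weight", 0)))   # [('fc.weight', 0)]
```

This is exactly why the converted script hit "missing tensor 'fc.weight'": the returned list was thrown away instead of being yielded.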
Nvidia released an EAGLE3 model trained to predict Kimi-K2.5. https://huggingface.co/nvidia/Kimi-K2.5-Thinking-Eagle3
An EAGLE3 model for Qwen3-Coder-Next was released a few days ago and I wanted to give it a try. I got pretty mixed results, so I assume I did something wrong. I am running on Ubuntu 24.04 with a CPU-only setup for a first test. So here is what I did:
I got many of the following errors, but the execution continued nonetheless:

```
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 107
 - the tokens for sequence 0 in the input batch have a starting position of Y = 107
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
```

It is hard to read the generated code in between all these error messages, and it looks like it is not the same between all the different draft runs and the baseline run. The draft runs appear to generate the same output as the baseline model, but then they output some additional (wrong) code. The wrong code is different between the models, but the number of decoded tokens in the end stays the same.

Update: I just realized that I am forcing the number of tokens to be 256 (by using the command given by the OP), so that could be the reason for the constant length.

I got the following results (I got the baseline values using llama-cli, is that the intended way?):
I am surprised that there appears to be a twofold increase in speed for draft 4 while only about 1/4 of the output tokens were predicted correctly. I even measured the time with a stopwatch and the t/s are correct, as long as the number of decoded tokens is correct. So does anyone have an idea what I did wrong? I would love a speedup of 2x, however I would have to try it with Vulkan as well. Maybe someone else who actually knows what to do could give this a try instead of me?
As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp enhances its performance and strengthens its competitiveness among leading inference frameworks. With Eagle3 speculative decoding now integrated into llama.cpp, inference performance has been significantly improved, achieving a 2–3× speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.
The following provides a brief overview of this PR:
EAGLE3 is an encoder-decoder based speculative decoding method:
Key changes:
EAGLE3 Architecture Overview :
How to run EAGLE3 in llama.cpp
Requirements
This PR currently ~~only supports two~~ supports the following EAGLE3 models:

The following eagle3 models should also work out of the box, though they haven't been tested yet:
Step 1: Convert Models to GGUF Format
Step 2: Compile llama.cpp
[Optional] Step 3: Quantize the GGUF model
Step 4: Run EAGLE3 Speculative Decoding
Performance Evaluation (RTX A6000 48GB)
Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.
- BF16, its Eagle3 with FP16
- Q4_K_M, its Eagle3 with Q4_K_M
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16
- BF16, its Eagle3 with BF16
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, speedup might be better on other hardware)
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, similar performance issue as GPT-OSS-120B Eagle3)

Details of GGML backend modifications (Fixed, no longer needed)

In the Eagle3 decoder, two parallel inputs are processed. When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto-syncs between subgraphs).

Solution: Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing. This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.

Example results
Future Steps