DFlash: Block Diffusion for Flash Speculative Decoding

Paper | Blog | Models

DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient and high-quality parallel drafting.

(Figure: DFlash architecture)

(Demo video: DFlash_demo.mp4)

Supported Models

| Model | DFlash Draft |
|---|---|
| Kimi-K2.5 (Preview) | z-lab/Kimi-K2.5-DFlash |
| Qwen3.5-4B | z-lab/Qwen3.5-4B-DFlash |
| Qwen3.5-9B | z-lab/Qwen3.5-9B-DFlash |
| Qwen3.5-27B | z-lab/Qwen3.5-27B-DFlash |
| Qwen3.5-35B-A3B | z-lab/Qwen3.5-35B-A3B-DFlash |
| Qwen3-Coder-Next | z-lab/Qwen3-Coder-Next-DFlash |
| Qwen3-Coder-30B-A3B | z-lab/Qwen3-Coder-30B-A3B-DFlash |
| gpt-oss-20b | z-lab/gpt-oss-20b-DFlash |
| gpt-oss-120b | z-lab/gpt-oss-120b-DFlash |
| Qwen3-4B | z-lab/Qwen3-4B-DFlash-b16 |
| Qwen3-8B | z-lab/Qwen3-8B-DFlash-b16 |
| Llama-3.1-8B-Instruct | z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat |
| Qwen3.5-122B-A10B | Coming soon |
| Qwen3.5-397B-A17B | Coming soon |
| GLM-5.1 | Coming soon |

Feel free to open a GitHub issue to request support for additional models. We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.

📦 Installation

Use a separate virtual environment for each backend to avoid dependency conflicts.

| Backend | Install command |
|---|---|
| Transformers | uv pip install -e . |
| SGLang | uv pip install -e ".[sglang]" |
| vLLM | See below |

vLLM: DFlash support requires the nightly build:

uv pip install -e ".[vllm]"
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

🚀 Quick Start

vLLM

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
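
Once the server is up, it speaks the standard OpenAI-compatible API. Below is a minimal client sketch using only the standard library; the address and model name are assumed to match the serve command above (the same request shape works against the SGLang server's OpenAI-compatible endpoint on port 30000):

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=256, temperature=0.0):
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    "Qwen/Qwen3.5-27B",
    "How many positive whole-number divisors does 196 have?",
)

# Uncomment to send the request to the running server:
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```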

SGLang

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-35B-A3B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend trtllm_mha \
    --speculative-draft-attention-backend fa4 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code

Transformers

Only Qwen3 and LLaMA-3.1 models support the Transformers backend.

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

draft = AutoModel.from_pretrained("z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True, dtype="auto", device_map="cuda:0").eval()
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(draft.device)

output = draft.spec_generate(input_ids=input_ids, max_new_tokens=2048, temperature=0.0, target=target, stop_token_ids=[tokenizer.eos_token_id])
print(tokenizer.decode(output[0], skip_special_tokens=False))
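
To see the benefit of speculative decoding end to end, you can compare wall-clock time against the plain Hugging Face generate baseline. The helper below is not part of the DFlash API, just a small timing utility; the commented usage reuses `draft`, `target`, `tokenizer`, and `input_ids` from the snippet above:

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start

# Reusing the objects defined above:
# spec_out, spec_s = timed(lambda: draft.spec_generate(
#     input_ids=input_ids, max_new_tokens=512, temperature=0.0,
#     target=target, stop_token_ids=[tokenizer.eos_token_id]))
# base_out, base_s = timed(lambda: target.generate(
#     input_ids, max_new_tokens=512, do_sample=False))
# print(f"speculative: {spec_s:.1f}s  baseline: {base_s:.1f}s  "
#       f"speedup: {base_s / spec_s:.2f}x")
```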

📊 Evaluation

All benchmarks share the same datasets (gsm8k, math500, humaneval, mbpp, mt-bench). Datasets are automatically downloaded and cached as JSONL in cache/ on first run.
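
If you want to inspect the cached data, each file is plain JSONL (one JSON object per line). A small reader sketch; the file name in the example is hypothetical, as actual names depend on the benchmark run:

```python
import json

def load_jsonl(path):
    """Load one JSON object per non-empty line of a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (hypothetical file name):
# rows = load_jsonl("cache/gsm8k.jsonl")
# print(len(rows), list(rows[0].keys()))
```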

vLLM:

python -m dflash.benchmark --backend vllm \
    --base-url http://127.0.0.1:8000 --model Qwen/Qwen3.5-27B \
    --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking

SGLang:

python -m dflash.benchmark --backend sglang \
    --base-url http://127.0.0.1:30000 --model Qwen/Qwen3.5-35B-A3B \
    --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking

Transformers (Qwen3 and LLaMA only):

torchrun --nproc_per_node=8 -m dflash.benchmark --backend transformers \
    --model Qwen/Qwen3-8B --draft-model z-lab/Qwen3-8B-DFlash-b16 \
    --dataset gsm8k --max-samples 128

Acknowledgement

Huge thanks to @dcw02, @gongy, and the team at @modal-labs for their fast, high-quality support in bringing DFlash to SGLang, and to @benchislett at NVIDIA for bringing DFlash to vLLM and making it available to the broader serving community.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
