spiritbuun/Qwen3.5-27B-DFlash-GGUF
Qwen3.5-27B-DFlash — GGUF (Q4_K_M)
Q4_K_M llama.cpp quantization of z-lab/Qwen3.5-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with any Qwen3.5-27B target model (e.g. Qwen/Qwen3.5-27B).
Only Q4_K_M is published here — on our RTX 3090 benchmarks Q4_K_M was the optimal operating point for this drafter; Q8_0 and F16 did not produce better end-to-end decode throughput, so they're intentionally omitted to keep the pairing simple.
Requirements
DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork that carries the tape-replay rollback, hidden-state capture, and tree-aware SSM kernels:
- Fork: spiritbuun/buun-llama-cpp (branch master)
- Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
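A full build might look like this (assuming the fork is hosted on GitHub under the name above; adjust the URL if it lives elsewhere):

```bash
# Clone the fork (GitHub URL assumed from the repo name above)
git clone https://github.com/spiritbuun/buun-llama-cpp
cd buun-llama-cpp

# Configure with CUDA and flash-attention kernels, then build
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
```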
Usage
llama-server
./build/bin/llama-server \
-m /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
-md /path/to/dflash-draft-q4_k_m.gguf \
--spec-type dflash \
-ngl 99 -ngld 99 \
-np 1 -c 6048 -cd 256 \
-fa on -b 256 -ub 64 \
--host 0.0.0.0 --port 8080 --jinja
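Once the server is up, a quick smoke test against llama-server's native /completion endpoint, using a raw prompt (which keeps draft acceptance high; see the chat-template note below):

```bash
# Raw-prompt completion; n_predict caps the number of generated tokens
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python mergesort.", "n_predict": 256}'
```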
llama-speculative-simple
./build/bin/llama-speculative-simple \
-m /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
-md /path/to/dflash-draft-q4_k_m.gguf \
--spec-type dflash \
-ngl 99 -ngld 99 \
-c 4096 --draft-max 16 --draft-min 1 \
-p "Write a Python mergesort."
Observed performance (RTX 3090, Qwen3.5-27B-heretic target Q4_K_M)
| Workload | Draft ON (t/s) | Draft OFF (t/s) | Acceptance |
|---|---|---|---|
| Code (raw prompt) | 140 | 40 | 69 % |
| Code (chat template) | 99 | 40 | 37 % |
| Prose (raw prompt) | 60 | 40 | 77 % |
Heads-up on chat templates. The drafter was trained on raw continuations; wrapping the prompt in Qwen chat-template tokens (<|im_start|> etc.) shifts the hidden-state distribution the drafter's cross-attention expects. Acceptance on code drops from 69 % → 37 % when the template is applied, and end-to-end throughput drops correspondingly. Raw-mode generation is still a strong speedup; chat-mode is a more modest one. A drafter retrained on chat-formatted data would be needed to recover the full speedup under a chat template.
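For reference, the difference between the two modes looks roughly like this; the chat form is the ChatML-style wrapping Qwen templates use (the exact template may differ, e.g. by adding a system turn):

```text
# Raw prompt (69 % acceptance on code):
Write a Python mergesort.

# Chat-templated prompt (37 % acceptance on code):
<|im_start|>user
Write a Python mergesort.<|im_end|>
<|im_start|>assistant
```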
Quantization details
- Source: z-lab/Qwen3.5-27B-DFlash (BF16 safetensors, 2B parameters)
- Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp (adds the DFlashDraftModel arch)
- Quant: llama-quantize → Q4_K_M
- File size: ~1.0 GB
- Tensors: drafter transformer + projection heads + cross-attention layers targeting Qwen3.5-27B layer ids [1, 16, 31, 46, 61]
Reproducing the conversion
Heads-up (tokenizer error): the upstream z-lab/Qwen3.5-27B-DFlash repo ships only config.json, model.safetensors, dflash.py, and a README; there are no tokenizer files. The drafter shares the target model's tokenizer, so running convert_hf_to_gguf.py directly against the z-lab repo fails with a vocab/tokenizer error. Copy the Qwen3.5 tokenizer files into the drafter directory first.
# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.5-27B-DFlash --local-dir ./dflash-drafter
# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.5-27B \
tokenizer.json tokenizer_config.json vocab.json merges.txt \
special_tokens_map.json \
--local-dir ./dflash-drafter
# 3. Convert to GGUF (BF16/F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter \
--outtype f16 \
--outfile dflash-draft-f16.gguf
# 4. Quantize to Q4_K_M
./build/bin/llama-quantize dflash-draft-f16.gguf dflash-draft-q4_k_m.gguf Q4_K_M
Required files in ./dflash-drafter/ before step 3:
| File | Source |
|---|---|
| config.json | z-lab/Qwen3.5-27B-DFlash (has architectures: ["DFlashDraftModel"]) |
| model.safetensors | z-lab/Qwen3.5-27B-DFlash |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen/Qwen3.5-27B |
No special converter flag is needed — convert_hf_to_gguf.py auto-detects the DFlashDraftModel architecture from config.json and registers the correct tensor mappings + GGUF hparams (dflash.block_size, dflash.mask_token_id, dflash.target_layer_ids, dflash.n_target_features).
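To confirm those keys made it into the quantized file, the gguf Python package (llama.cpp's gguf-py, `pip install gguf`) can read them back. A minimal sketch, assuming the step-4 output is in the working directory:

```python
# Sanity-check the dflash.* metadata keys in the quantized drafter GGUF.
from gguf import GGUFReader

reader = GGUFReader("dflash-draft-q4_k_m.gguf")
for field in reader.fields.values():
    if field.name.startswith("dflash."):
        # field.data indexes the payload parts; this prints the raw payload
        # (string values appear as byte arrays)
        print(field.name, [field.parts[i].tolist() for i in field.data])
```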
Original model card — z-lab/Qwen3.5-27B-DFlash
Reproduced from the upstream model page. License: MIT.
Overview
Qwen3.5-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.5-27B.
- Paper: https://arxiv.org/abs/2602.06036
- GitHub: https://github.com/z-lab/dflash
- Blog: https://z-lab.ai/projects/dflash/
- Model Size: 2B parameters (BF16)
- Context Length: 4096 tokens
What is DFlash?
DFlash is a speculative decoding method that uses a lightweight block diffusion model as the drafter: instead of proposing tokens one at a time, it drafts an entire block in parallel, which the target model then verifies, significantly speeding up inference.
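In pseudocode, one decode step looks roughly like the sketch below. This is a conceptual illustration only; every method name (hidden_states, draft_block, forward) is a hypothetical stand-in, not the DFlash or llama.cpp API, and the accept rule shown is the simple greedy-prefix variant of speculative verification:

```python
# Conceptual sketch of block-diffusion speculative decoding (hypothetical API).
def decode_step(target, drafter, ctx, block_size=16):
    # 1. Drafter denoises a whole block of masked positions in parallel,
    #    conditioned on the target's hidden states for the current context.
    hidden = target.hidden_states(ctx)                    # stand-in
    draft = drafter.draft_block(ctx, hidden, block_size)  # stand-in

    # 2. Target scores context + draft in a single forward pass.
    logits = target.forward(ctx + draft)                  # stand-in

    # 3. Keep the longest draft prefix the target agrees with; on the first
    #    mismatch, take the target's own token instead and stop.
    accepted = []
    for i, tok in enumerate(draft):
        top = int(logits[len(ctx) + i - 1].argmax())
        if tok != top:
            accepted.append(top)
            break
        accepted.append(tok)
    return ctx + accepted
```

Because the block is drafted in parallel rather than token by token, the drafter's cost per block stays low, which is the core of the claimed speedup.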
Upstream Quick Start (vLLM / SGLang)
vLLM
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
vllm serve Qwen/Qwen3.5-27B \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768
SGLang
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-27B \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
--speculative-num-draft-tokens 16 \
--tp-size 1 \
--attention-backend fa3 \
--mem-fraction-static 0.75 \
--trust-remote-code
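Both servers expose an OpenAI-compatible HTTP API once running; a minimal smoke test (vLLM defaults to port 8000, SGLang to 30000):

```bash
# OpenAI-compatible completion request; swap the port for SGLang (:30000)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-27B", "prompt": "Write a Python mergesort.", "max_tokens": 256}'
```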
Upstream Benchmarks (NVIDIA B200, BF16)
Up to 5.2× speedup over the autoregressive baseline at concurrency 1 on HumanEval with block size 16. Throughput figures below are tokens/second.
| Task | Concurrency | AR (t/s) | DFlash (t/s) | Speedup |
|---|---|---|---|---|
| Math500 | 1 | 84 | 397 | 4.7× |
| HumanEval | 1 | 83 | 427 | 5.2× |
| GSM8K | 1 | 83 | 330 | 4.0× |
Citation
@article{chen2026dflash,
title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
journal = {arXiv preprint arXiv:2602.06036},
year = {2026}
}
License
MIT — inherited from the upstream model. This repository redistributes a quantized derivative under the same terms.