
spiritbuun/Qwen3.5-27B-DFlash-GGUF


Qwen3.5-27B-DFlash — GGUF (Q4_K_M)

Q4_K_M llama.cpp quantization of z-lab/Qwen3.5-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with any Qwen3.5-27B target model (e.g. Qwen/Qwen3.5-27B).

Only Q4_K_M is published here. In our RTX 3090 benchmarks, Q4_K_M was the optimal operating point for this drafter: Q8_0 and F16 did not produce better end-to-end decode throughput, so they are intentionally omitted to keep the pairing simple.

Requirements

DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork that carries the tape-replay rollback, hidden-state capture, and tree-aware SSM kernels:

  • Fork: spiritbuun/buun-llama-cpp (branch master)
  • Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

Usage

llama-server

./build/bin/llama-server \
    -m   /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-q4_k_m.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -np 1 -c 6048 -cd 256 \
    -fa on -b 256 -ub 64 \
    --host 0.0.0.0 --port 8080 --jinja
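With the server up, any OpenAI-compatible client works; a minimal stdlib sketch against llama-server's native /completion endpoint is below (host and port match the flags above; the actual network call is left commented out so the snippet runs stand-alone):

```python
import json
import urllib.request

def completion_request(prompt, n_predict=128, base_url="http://localhost:8080"):
    """Build a POST request for llama-server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = completion_request("Write a Python mergesort.")
print(req.full_url)  # http://localhost:8080/completion
# with urllib.request.urlopen(req) as resp:  # uncomment with the server running
#     print(json.load(resp)["content"])
```

Speculative decoding is transparent to the client: the response is identical to a non-speculative server, just faster.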

llama-speculative-simple

./build/bin/llama-speculative-simple \
    -m   /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-q4_k_m.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -c 4096 --draft-max 16 --draft-min 1 \
    -p "Write a Python mergesort."

Observed performance (RTX 3090, Qwen3.5-27B-heretic target Q4_K_M)

Workload               Draft ON (t/s)   Draft OFF (t/s)   Acceptance
Code (raw prompt)      140              40                69 %
Code (chat template)   99               40                37 %
Prose (raw prompt)     60               40                77 %

Heads-up on chat templates. The drafter was trained on raw continuations; wrapping the prompt in Qwen chat-template tokens (<|im_start|> etc.) shifts the hidden-state distribution the drafter's cross-attention expects. Acceptance on code drops from 69 % → 37 % when the template is applied, and end-to-end throughput drops correspondingly. Raw-mode generation is still a strong speedup; chat-mode is a more modest one. A drafter retrained on chat-formatted data would be needed to recover the full speedup under a chat template.
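How much acceptance matters can be estimated with the standard speculative-decoding expectation: if each drafted token is accepted independently with probability α and the drafter proposes k tokens, the target commits about (1 − α^(k+1)) / (1 − α) tokens per verification pass. A rough back-of-envelope with the table's acceptance rates, treating them as per-token probabilities (which DFlash's block verification only approximates):

```python
def expected_tokens_per_pass(alpha, k):
    """Expected committed tokens per target pass under i.i.d. acceptance."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Draft length 16, matching --draft-max above.
print(f"{expected_tokens_per_pass(0.69, 16):.2f}")  # 3.22 (raw code, 69 %)
print(f"{expected_tokens_per_pass(0.37, 16):.2f}")  # 1.59 (chat code, 37 %)
```

The roughly 2× gap in expected tokens per pass lines up directionally with the observed 140 → 99 t/s drop; the drafter's own forward passes are not free, which narrows the realized gap.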

Quantization details

  • Source: z-lab/Qwen3.5-27B-DFlash (BF16 safetensors, 2 B parameters)
  • Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp (adds DFlashDraftModel arch)
  • Quant: Q4_K_M via llama-quantize
  • File size: ~1.0 GB
  • Tensors: drafter transformer + projection heads + cross-attention layers targeting Qwen3.5-27B layer ids [1, 16, 31, 46, 61]
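Incidentally, the tapped target layers are evenly spaced at a stride of 15; whether that spacing is a design requirement is not stated upstream, but it is easy to verify:

```python
target_layer_ids = [1, 16, 31, 46, 61]  # from dflash.target_layer_ids
strides = {b - a for a, b in zip(target_layer_ids, target_layer_ids[1:])}
print(strides)  # {15}
assert target_layer_ids == list(range(1, 62, 15))
```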

Reproducing the conversion

Heads-up (tokenizer error): the upstream z-lab/Qwen3.5-27B-DFlash repo ships only config.json, model.safetensors, dflash.py, and a README — no tokenizer files. The drafter shares the target model's tokenizer. If you run convert_hf_to_gguf.py directly against the z-lab repo you'll get a vocab / tokenizer error. Copy the Qwen3.5 tokenizer files into the drafter directory first.

# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.5-27B-DFlash --local-dir ./dflash-drafter

# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.5-27B \
    tokenizer.json tokenizer_config.json vocab.json merges.txt \
    special_tokens_map.json \
    --local-dir ./dflash-drafter

# 3. Convert to GGUF (BF16/F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter \
    --outtype f16 \
    --outfile dflash-draft-f16.gguf

# 4. Quantize to Q4_K_M
./build/bin/llama-quantize dflash-draft-f16.gguf dflash-draft-q4_k_m.gguf Q4_K_M

Required files in ./dflash-drafter/ before step 3:

File                                      Source
config.json                               z-lab/Qwen3.5-27B-DFlash (has architectures: ["DFlashDraftModel"])
model.safetensors                         z-lab/Qwen3.5-27B-DFlash
tokenizer.json, tokenizer_config.json,    Qwen/Qwen3.5-27B
vocab.json, merges.txt,
special_tokens_map.json

No special converter flag is needed — convert_hf_to_gguf.py auto-detects the DFlashDraftModel architecture from config.json and registers the correct tensor mappings + GGUF hparams (dflash.block_size, dflash.mask_token_id, dflash.target_layer_ids, dflash.n_target_features).
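To sanity-check the converted file without loading it, you can read the fixed 24-byte GGUF header with the stdlib (layout per the GGUF spec: 4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count); for the full dflash.* metadata, use gguf-py's GGUFReader or llama.cpp's gguf_dump.py instead. A minimal sketch, demonstrated on a synthetic header so it runs stand-alone; point it at dflash-draft-q4_k_m.gguf to check the real file:

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", f.read(24))
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Demo on a synthetic header; swap in the real .gguf path to inspect it.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(struct.pack("<4sIQQ", b"GGUF", 3, 0, 0))
print(read_gguf_header(f.name))  # (3, 0, 0)
os.unlink(f.name)
```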


Original model card — z-lab/Qwen3.5-27B-DFlash

Reproduced from the upstream model page. License: MIT.

Overview

Qwen3.5-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.5-27B.

What is DFlash?

DFlash is a novel speculative decoding method using a lightweight block diffusion model for drafting, enabling efficient, high-quality parallel drafting that significantly speeds up inference.
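Independent of the block-diffusion drafter itself, the draft-then-verify loop that all speculative decoders share can be sketched with toy stand-in models (in a real implementation the verification is a single batched target forward pass; the sequential calls here are only for readability):

```python
def target(ctx):
    """Slow, authoritative 'model': next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10

def drafter(ctx, k):
    """Cheap drafter: agrees with the target except on its last guess."""
    out, last = [], ctx[-1]
    for i in range(k):
        last = (last + 1) % 10 if i < k - 1 else 0  # final guess is wrong
        out.append(last)
    return out

def spec_step(ctx, k=4):
    """One speculative step: keep the agreeing prefix + one target token."""
    draft, accepted = drafter(ctx, k), []
    for tok in draft:
        want = target(ctx + accepted)
        if tok != want:
            accepted.append(want)  # replace first mismatch with target token
            break
        accepted.append(tok)
    else:
        accepted.append(target(ctx + accepted))  # bonus token: all accepted
    return accepted

print(spec_step([0], k=4))  # [1, 2, 3, 4]
```

One target verification pass here commits four tokens instead of one, which is the entire source of the speedup; DFlash's contribution is a drafter whose proposals are accepted often enough for this to pay off.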

Upstream Quick Start (vLLM / SGLang)

vLLM

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

Upstream Benchmarks (NVIDIA B200, BF16)

Up to 5.2× speedup over autoregressive baseline at concurrency 1 on HumanEval with block size 16.

Task        Concurrency   AR    DFlash (B16)   Speedup
Math        500           184   397            4.7×
HumanEval   1             83    427            5.2×
GSM8K       1             83    330            4.0×

Citation

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

License

MIT — inherited from the upstream model. This repository redistributes a quantized derivative under the same terms.
