spiritbuun/Qwen3.5-27B-DFlash-GGUF
Qwen3.5-27B-DFlash — GGUF (Q4_K_M)
Q4_K_M llama.cpp quantization of z-lab/Qwen3.5-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with any Qwen3.5-27B target model (e.g. Qwen/Qwen3.5-27B).
Only Q4_K_M is published here — on our RTX 3090 benchmarks Q4_K_M was the optimal operating point for this drafter; Q8_0 and F16 did not produce better end-to-end decode throughput, so they're intentionally omitted to keep the pairing simple.
Requirements
DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork that carries the tape-replay rollback, hidden-state capture, and tree-aware SSM kernels:
- Fork: spiritbuun/buun-llama-cpp (branch master)
- Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
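A full build might look like this (assuming the fork is hosted on GitHub under the name above; adjust the URL if it lives elsewhere):

```bash
# Clone the fork (GitHub URL assumed from the repo name above)
git clone https://github.com/spiritbuun/buun-llama-cpp
cd buun-llama-cpp

# Configure with CUDA and flash-attention kernels, then build
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
```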
Usage
llama-server
./build/bin/llama-server \
-m /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
-md /path/to/dflash-draft-q4_k_m.gguf \
--spec-type dflash \
-ngl 99 -ngld 99 \
-np 1 -c 6048 -cd 256 \
-fa on -b 256 -ub 64 \
--host 0.0.0.0 --port 8080 --jinja
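Once the server is up, a quick smoke test against llama-server's native /completion endpoint, using a raw prompt (which keeps draft acceptance high; see the chat-template note below):

```bash
# Raw-prompt completion; n_predict caps the number of generated tokens
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a Python mergesort.", "n_predict": 256}'
```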
llama-speculative-simple
./build/bin/llama-speculative-simple \
-m /path/to/Qwen3.5-27B-target.Q4_K_M.gguf \
-md /path/to/dflash-draft-q4_k_m.gguf \
--spec-type dflash \
-ngl 99 -ngld 99 \
-c 4096 --draft-max 16 --draft-min 1 \
-p "Write a Python mergesort."
Observed performance (RTX 3090, Qwen3.5-27B-heretic target Q4_K_M)
| Workload | Draft ON (t/s) | Draft OFF (t/s) | Acceptance |
|---|---|---|---|
| Code (raw prompt) | 140 | 40 | 69 % |
| Code (chat template) | 99 | 40 | 37 % |
| Prose (raw prompt) | 60 | 40 | 77 % |
Heads-up on chat templates. The drafter was trained on raw continuations; wrapping the prompt in Qwen chat-template tokens (<|im_start|> etc.) shifts the hidden-state distribution the drafter's cross-attention expects. Acceptance on code drops from 69 % → 37 % when the template is applied, and end-to-end throughput drops correspondingly. Raw-mode generation is still a strong speedup; chat-mode is a more modest one. A drafter retrained on chat-formatted data would be needed to recover the full speedup under a chat template.
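For reference, the difference between the two modes looks roughly like this; the chat form is the ChatML-style wrapping Qwen templates use (the exact template may differ, e.g. by adding a system turn):

```text
# Raw prompt (69 % acceptance on code):
Write a Python mergesort.

# Chat-templated prompt (37 % acceptance on code):
<|im_start|>user
Write a Python mergesort.<|im_end|>
<|im_start|>assistant
```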
Quantization details
- Source: z-lab/Qwen3.5-27B-DFlash (BF16 safetensors, 2B parameters)
- Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp (adds the DFlashDraftModel arch)
- Quant: llama-quantize → Q4_K_M
- File size: ~1.0 GB
- Tensors: drafter transformer + projection heads + cross-attention layers targeting Qwen3.5-27B layer ids [1, 16, 31, 46, 61]
Reproducing the conversion
Heads-up (tokenizer error): the upstream z-lab/Qwen3.5-27B-DFlash repo ships only config.json, model.safetensors, dflash.py, and a README; there are no tokenizer files. The drafter shares the target model's tokenizer, so running convert_hf_to_gguf.py directly against the z-lab repo fails with a vocab/tokenizer error. Copy the Qwen3.5 tokenizer files into the drafter directory first.
# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.5-27B-DFlash --local-dir ./dflash-drafter
# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.5-27B \
tokenizer.json tokenizer_config.json vocab.json merges.txt \
special_tokens_map.json \
--local-dir ./dflash-drafter
# 3. Convert to GGUF (BF16/F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter \
--outtype f16 \
--outfile dflash-draft-f16.gguf
# 4. Quantize to Q4_K_M
./build/bin/llama-quantize dflash-draft-f16.gguf dflash-draft-q4_k_m.gguf Q4_K_M
Required files in ./dflash-drafter/ before step 3:
| File | Source |
|---|---|
| config.json | z-lab/Qwen3.5-27B-DFlash (has architectures: ["DFlashDraftModel"]) |
| model.safetensors | z-lab/Qwen3.5-27B-DFlash |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen/Qwen3.5-27B |
No special converter flag is needed — convert_hf_to_gguf.py auto-detects the DFlashDraftModel architecture from config.json and registers the correct tensor mappings + GGUF hparams (dflash.block_size, dflash.mask_token_id, dflash.target_layer_ids, dflash.n_target_features).
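To confirm those keys made it into the quantized file, the gguf Python package (llama.cpp's gguf-py, `pip install gguf`) can read them back. A minimal sketch, assuming the step-4 output is in the working directory:

```python
# Sanity-check the dflash.* metadata keys in the quantized drafter GGUF.
from gguf import GGUFReader

reader = GGUFReader("dflash-draft-q4_k_m.gguf")
for field in reader.fields.values():
    if field.name.startswith("dflash."):
        # field.data indexes the payload parts; this prints the raw payload
        # (string values appear as byte arrays)
        print(field.name, [field.parts[i].tolist() for i in field.data])
```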
Original model card — z-lab/Qwen3.5-27B-DFlash
Reproduced from the upstream model page. License: MIT.
Overview
Qwen3.5-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.5-27B.
- Paper: https://arxiv.org/abs/2602.06036
- GitHub: https://github.com/z-lab/dflash
- Blog: https://z-lab.ai/projects/dflash/
- Model Size: 2B parameters (BF16)
- Context Length: 4096 tokens
What is DFlash?
DFlash is a speculative decoding method that uses a lightweight block diffusion model as the drafter: instead of proposing tokens one at a time, it drafts an entire block in parallel, which the target model then verifies, significantly speeding up inference.
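In pseudocode, one decode step looks roughly like the sketch below. This is a conceptual illustration only; every method name (hidden_states, draft_block, forward) is a hypothetical stand-in, not the DFlash or llama.cpp API, and the accept rule shown is the simple greedy-prefix variant of speculative verification:

```python
# Conceptual sketch of block-diffusion speculative decoding (hypothetical API).
def decode_step(target, drafter, ctx, block_size=16):
    # 1. Drafter denoises a whole block of masked positions in parallel,
    #    conditioned on the target's hidden states for the current context.
    hidden = target.hidden_states(ctx)                    # stand-in
    draft = drafter.draft_block(ctx, hidden, block_size)  # stand-in

    # 2. Target scores context + draft in a single forward pass.
    logits = target.forward(ctx + draft)                  # stand-in

    # 3. Keep the longest draft prefix the target agrees with; on the first
    #    mismatch, take the target's own token instead and stop.
    accepted = []
    for i, tok in enumerate(draft):
        top = int(logits[len(ctx) + i - 1].argmax())
        if tok != top:
            accepted.append(top)
            break
        accepted.append(tok)
    return ctx + accepted
```

Because the block is drafted in parallel rather than token by token, the drafter's cost per block stays low, which is the core of the claimed speedup.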
Upstream Quick Start (vLLM / SGLang)
vLLM
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
vllm serve Qwen/Qwen3.5-27B \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768
SGLang
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-27B \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
--speculative-num-draft-tokens 16 \
--tp-size 1 \
--attention-backend fa3 \
--mem-fraction-static 0.75 \
--trust-remote-code
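Both servers expose an OpenAI-compatible HTTP API once running; a minimal smoke test (vLLM defaults to port 8000, SGLang to 30000):

```bash
# OpenAI-compatible completion request; swap the port for SGLang (:30000)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-27B", "prompt": "Write a Python mergesort.", "max_tokens": 256}'
```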
Upstream Benchmarks (NVIDIA B200, BF16)
Up to 5.2× speedup over the autoregressive baseline at concurrency 1 on HumanEval with block size 16. Throughput figures below are tokens/second.
| Task | Concurrency | AR (t/s) | DFlash (t/s) | Speedup |
|---|---|---|---|---|
| Math500 | 1 | 84 | 397 | 4.7× |
| HumanEval | 1 | 83 | 427 | 5.2× |
| GSM8K | 1 | 83 | 330 | 4.0× |
Citation
@article{chen2026dflash,
title = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
journal = {arXiv preprint arXiv:2602.06036},
year = {2026}
}
License
MIT — inherited from the upstream model. This repository redistributes a quantized derivative under the same terms.