spiritbuun/Qwen3.6-27B-DFlash-GGUF

Qwen3.6-27B-DFlash — GGUF (Q4_K_M + Q8_0)

llama.cpp quantizations of z-lab/Qwen3.6-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with Qwen/Qwen3.6-27B (or a quant of it).

Two quants are published:

| File | Size | Recommended? |
|---|---|---|
| dflash-draft-3.6-q8_0.gguf | 1.75 GB | Yes — use this. Matches F16 acceptance. |
| dflash-draft-3.6-q4_k_m.gguf | 1.03 GB | Only if VRAM-constrained; acceptance drops ~17 points. |
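
Both quants can be fetched with the Hugging Face CLI (the same hf download tool used in the conversion steps below; filenames as listed in the table above):

# Pull the recommended Q8_0 drafter quant
hf download spiritbuun/Qwen3.6-27B-DFlash-GGUF \
    dflash-draft-3.6-q8_0.gguf \
    --local-dir ./models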

Unlike the 3.5 drafter (all full-attention, Q4-robust), the 3.6 drafter introduces causal sliding-window attention layers (pattern [S,S,S,S,F], window = 2048). Those SWA layers are Q4-fragile — Q4_K_M collapses acceptance from ~43 % → ~28 % on the same workload. Q8_0 is the smallest quant that preserves F16 quality and happens to run slightly faster than F16 in our benchmarks.

Requirements

DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork:

  • Fork: spiritbuun/buun-llama-cpp (branch master)
  • SWA support for the DFlash drafter landed in commit b9d01582b (SD-073). Builds without that commit will load the drafter but produce garbage.
  • Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
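
Putting those requirements together, a minimal clone-and-build sketch (the GitHub URL is assumed from the fork name above; adjust if the fork lives elsewhere):

# Clone the fork and confirm the SWA commit (SD-073) is in history
git clone https://github.com/spiritbuun/buun-llama-cpp.git
cd buun-llama-cpp
git merge-base --is-ancestor b9d01582b HEAD && echo "SWA support present"

# Configure and build with the flags listed above
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j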

Usage

llama-server

./build/bin/llama-server \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -np 1 -c 6048 -cd 256 \
    -fa on -b 256 -ub 64 \
    --host 0.0.0.0 --port 8080 --jinja \
    --chat-template-kwargs '{"enable_thinking": false}'

Thinking footgun: the Qwen3.6 chat template enables <think>…</think> by default. That collapses DFlash acceptance because the drafter wasn't trained on the think-wrapped distribution. Pass --chat-template-kwargs '{"enable_thinking": false}' to disable it (≈1.8× throughput uplift).
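
With the server up, a quick smoke test against its OpenAI-compatible endpoint (llama-server serves /v1/chat/completions; the request below mirrors the benchmark settings further down, and thinking is already disabled server-side by the flag above):

# Simple chat request: temperature 0, 400 new tokens, code prompt
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Write a Python mergesort."}],
        "temperature": 0,
        "max_tokens": 400
    }'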

llama-speculative-simple

./build/bin/llama-speculative-simple \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -c 4096 --draft-max 16 --draft-min 1 \
    -p "Write a Python mergesort."

Observed performance (RTX 3090, llama-server, Qwen3.6-27B UD-Q4_K_XL target, Python BST code prompt, temp = 0, 400 tokens, thinking OFF)

| Drafter quant | Raw (t/s) | Raw accept | Chat (t/s) | Chat accept |
|---|---|---|---|---|
| Q8_0 (recommended) | 87 | 37 % | 97 | 43 % |
| F16 | 80 | 36 % | 93 | 45 % |
| Q4_K_M | 73 | 29 % | 70 | 28 % |

Q8_0 tracks F16 within noise and is half the size.

Note on comparison with the 3.5 drafter

Short-context code prompts do not exercise the sliding-window attention (most queries fall inside the 2048-token window anyway), so the 3.6 drafter's architectural change doesn't produce a dramatic win on this benchmark. The SWA infrastructure is expected to matter on longer-context workloads (> 2 k generated tokens). On short code, Q8_0 on 3.6 is ≈1.3× the throughput of Q4_K_M on 3.5 because the 3.6 target pairs slightly better with the retrained drafter.

Quantization details

  • Source: z-lab/Qwen3.6-27B-DFlash (BF16 safetensors, 2 B parameters)
  • Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp — emits qwen35.attention.sliding_window + qwen35.attention.sliding_window_pattern so the runtime builds per-layer SWA masks
  • Quants: Q4_K_M and Q8_0, produced with llama-quantize
  • Tensors: drafter transformer (5 layers, pattern [S,S,S,S,F], window = 2048) + projection heads + cross-attention layers targeting Qwen3.6-27B layer ids [1, 16, 31, 46, 61]
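
To confirm those sliding-window metadata keys are present in a converted file, the gguf Python package from llama.cpp's gguf-py ships a gguf-dump CLI (assumed installed via pip):

# pip install gguf   (provides the gguf-dump CLI)
# The per-layer SWA keys listed above should appear in the metadata
gguf-dump dflash-draft-3.6-q8_0.gguf | grep -i sliding_window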

Reproducing the conversion

Tokenizer heads-up: the upstream z-lab/Qwen3.6-27B-DFlash repo ships only config.json, model.safetensors, and a README — no tokenizer files. The drafter shares the target model's tokenizer. Copy the Qwen3.6 tokenizer files into the drafter directory first.

# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./dflash-drafter-3.6

# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.6-27B \
    tokenizer.json tokenizer_config.json vocab.json merges.txt \
    special_tokens_map.json \
    --local-dir ./dflash-drafter-3.6

# 3. Convert to GGUF (F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter-3.6 \
    --outtype f16 \
    --outfile dflash-draft-3.6-f16.gguf

# 4. Quantize
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q8_0.gguf Q8_0
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q4_k_m.gguf Q4_K_M

Required files in ./dflash-drafter-3.6/ before step 3:

| File | Source |
|---|---|
| config.json | z-lab/Qwen3.6-27B-DFlash (has architectures: ["DFlashDraftModel"], use_sliding_window: true, layer_types: [...]) |
| model.safetensors | z-lab/Qwen3.6-27B-DFlash |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen/Qwen3.6-27B |

The converter auto-detects DFlashDraftModel from config.json and emits the SWA metadata when use_sliding_window is set.
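
A quick pre-conversion check of those detection fields (plain Python, which is already required for the conversion script; expected output is ['DFlashDraftModel'] and True):

# Print the architecture and the sliding-window flag from config.json
python -c "import json; c = json.load(open('./dflash-drafter-3.6/config.json')); print(c['architectures'], c.get('use_sliding_window'))"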


Original model card — z-lab/Qwen3.6-27B-DFlash

Reproduced from the upstream model page. License: MIT.

Overview

Qwen3.6-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.6-27B.

What is DFlash?

DFlash is a novel speculative decoding method using a lightweight block diffusion model for drafting, enabling efficient, high-quality parallel drafting that significantly speeds up inference.

Upstream Quick Start (vLLM / SGLang)

vLLM

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

vllm serve Qwen/Qwen3.6-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.6-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

Citation

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

License

MIT — inherited from the upstream model. This repository redistributes quantized derivatives under the same terms.
