spiritbuun/Qwen3.6-27B-DFlash-GGUF

Qwen3.6-27B-DFlash — GGUF (Q4_K_M + Q8_0)

llama.cpp quantizations of z-lab/Qwen3.6-27B-DFlash, the block-diffusion drafter for DFlash speculative decoding. Pair it with Qwen/Qwen3.6-27B (or a quant of it).

Two quants are published:

| File | Size | Recommended? |
|---|---|---|
| dflash-draft-3.6-q8_0.gguf | 1.75 GB | Yes — use this. Matches F16 acceptance. |
| dflash-draft-3.6-q4_k_m.gguf | 1.03 GB | Only if VRAM-constrained; acceptance drops ~17 points. |
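
Both quants can be fetched with the Hugging Face CLI (the same hf download tool used in the conversion steps below; filenames as listed in the table above):

# Pull the recommended Q8_0 drafter quant
hf download spiritbuun/Qwen3.6-27B-DFlash-GGUF \
    dflash-draft-3.6-q8_0.gguf \
    --local-dir ./models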

Unlike the 3.5 drafter (all full-attention, Q4-robust), the 3.6 drafter introduces causal sliding-window attention layers (pattern [S,S,S,S,F], window = 2048). Those SWA layers are Q4-fragile — Q4_K_M collapses acceptance from ~43 % → ~28 % on the same workload. Q8_0 is the smallest quant that preserves F16 quality and happens to run slightly faster than F16 in our benchmarks.

Requirements

DFlash speculative decoding is not yet in upstream llama.cpp. You need the fork:

  • Fork: spiritbuun/buun-llama-cpp (branch master)
  • SWA support for the DFlash drafter landed in commit b9d01582b (SD-073). Builds without that commit will load the drafter but produce garbage.
  • Built with: cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
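
Putting those requirements together, a minimal clone-and-build sketch (the GitHub URL is assumed from the fork name above; adjust if the fork lives elsewhere):

# Clone the fork and confirm the SWA commit (SD-073) is in history
git clone https://github.com/spiritbuun/buun-llama-cpp.git
cd buun-llama-cpp
git merge-base --is-ancestor b9d01582b HEAD && echo "SWA support present"

# Configure and build with the flags listed above
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j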

Usage

llama-server

./build/bin/llama-server \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -np 1 -c 6048 -cd 256 \
    -fa on -b 256 -ub 64 \
    --host 0.0.0.0 --port 8080 --jinja \
    --chat-template-kwargs '{"enable_thinking": false}'

Thinking footgun: the Qwen3.6 chat template enables <think>…</think> by default. That collapses DFlash acceptance because the drafter wasn't trained on the think-wrapped distribution. Pass --chat-template-kwargs '{"enable_thinking": false}' to disable it (≈1.8× throughput uplift).
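
With the server up, a quick smoke test against its OpenAI-compatible endpoint (llama-server serves /v1/chat/completions; the request below mirrors the benchmark settings further down, and thinking is already disabled server-side by the flag above):

# Simple chat request: temperature 0, 400 new tokens, code prompt
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Write a Python mergesort."}],
        "temperature": 0,
        "max_tokens": 400
    }'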

llama-speculative-simple

./build/bin/llama-speculative-simple \
    -m   /path/to/Qwen3.6-27B-target.Q4_K_M.gguf \
    -md  /path/to/dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash \
    -ngl 99 -ngld 99 \
    -c 4096 --draft-max 16 --draft-min 1 \
    -p "Write a Python mergesort."

Observed performance (RTX 3090, llama-server, Qwen3.6-27B UD-Q4_K_XL target, Python BST code prompt, temp = 0, 400 tokens, thinking OFF)

| Drafter quant | Raw (t/s) | Raw accept | Chat (t/s) | Chat accept |
|---|---|---|---|---|
| Q8_0 (recommended) | 87 | 37 % | 97 | 43 % |
| F16 | 80 | 36 % | 93 | 45 % |
| Q4_K_M | 73 | 29 % | 70 | 28 % |

Q8_0 tracks F16 within noise and is half the size.

Note on comparison with the 3.5 drafter

Short-context code prompts do not exercise the sliding-window attention (most queries fall inside the 2048-token window anyway), so the 3.6 drafter's architectural change doesn't produce a dramatic win on this benchmark. The SWA infrastructure is expected to matter on longer-context workloads (> 2 k generated tokens). On short code, Q8_0 on 3.6 is ≈1.3× the throughput of Q4_K_M on 3.5 because the 3.6 target pairs slightly better with the retrained drafter.

Quantization details

  • Source: z-lab/Qwen3.6-27B-DFlash (BF16 safetensors, 2 B parameters)
  • Converter: convert_hf_to_gguf.py from spiritbuun/buun-llama-cpp — emits qwen35.attention.sliding_window + qwen35.attention.sliding_window_pattern so the runtime builds per-layer SWA masks
  • Quants: Q4_K_M and Q8_0, produced with llama-quantize
  • Tensors: drafter transformer (5 layers, pattern [S,S,S,S,F], window = 2048) + projection heads + cross-attention layers targeting Qwen3.6-27B layer ids [1, 16, 31, 46, 61]
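
To confirm those sliding-window metadata keys are present in a converted file, the gguf Python package from llama.cpp's gguf-py ships a gguf-dump CLI (assumed installed via pip):

# pip install gguf   (provides the gguf-dump CLI)
# The per-layer SWA keys listed above should appear in the metadata
gguf-dump dflash-draft-3.6-q8_0.gguf | grep -i sliding_window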

Reproducing the conversion

Tokenizer heads-up: the upstream z-lab/Qwen3.6-27B-DFlash repo ships only config.json, model.safetensors, and a README — no tokenizer files. The drafter shares the target model's tokenizer. Copy the Qwen3.6 tokenizer files into the drafter directory first.

# 1. Pull the DFlash drafter weights
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./dflash-drafter-3.6

# 2. Pull tokenizer files from the target model into the same directory
hf download Qwen/Qwen3.6-27B \
    tokenizer.json tokenizer_config.json vocab.json merges.txt \
    special_tokens_map.json \
    --local-dir ./dflash-drafter-3.6

# 3. Convert to GGUF (F16 first, then quantize)
python convert_hf_to_gguf.py ./dflash-drafter-3.6 \
    --outtype f16 \
    --outfile dflash-draft-3.6-f16.gguf

# 4. Quantize
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q8_0.gguf Q8_0
./build/bin/llama-quantize dflash-draft-3.6-f16.gguf dflash-draft-3.6-q4_k_m.gguf Q4_K_M

Required files in ./dflash-drafter-3.6/ before step 3:

| File | Source |
|---|---|
| config.json | z-lab/Qwen3.6-27B-DFlash (has architectures: ["DFlashDraftModel"], use_sliding_window: true, layer_types: [...]) |
| model.safetensors | z-lab/Qwen3.6-27B-DFlash |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen/Qwen3.6-27B |

The converter auto-detects DFlashDraftModel from config.json and emits the SWA metadata when use_sliding_window is set.
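
A quick pre-conversion check of those detection fields (plain Python, which is already required for the conversion script; expected output is ['DFlashDraftModel'] and True):

# Print the architecture and the sliding-window flag from config.json
python -c "import json; c = json.load(open('./dflash-drafter-3.6/config.json')); print(c['architectures'], c.get('use_sliding_window'))"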


Original model card — z-lab/Qwen3.6-27B-DFlash

Reproduced from the upstream model page. License: MIT.

Overview

Qwen3.6-27B-DFlash is a lightweight drafter component for DFlash speculative decoding. It must be used with the target model Qwen/Qwen3.6-27B.

What is DFlash?

DFlash is a novel speculative decoding method using a lightweight block diffusion model for drafting, enabling efficient, high-quality parallel drafting that significantly speeds up inference.

Upstream Quick Start (vLLM / SGLang)

vLLM

uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

vllm serve Qwen/Qwen3.6-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.6-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

Citation

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}

License

MIT — inherited from the upstream model. This repository redistributes quantized derivatives under the same terms.
