# zlaabsi/Qwen3.6-27B-OTQ-GGUF

OpenTQ TurboQuant dynamic-compatible GGUFs for Qwen/Qwen3.6-27B.
This is the stock llama.cpp release track. OpenTQ chooses the tensor-level allocation policy, but the files themselves use standard GGUF tensor types (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16). No custom OpenTQ runtime is required for these GGUF files.
The Hugging Face `pipeline_tag` follows the official Qwen3.6-27B card (image-text-to-text). These GGUF artifacts are validated here for local text inference with stock llama.cpp; vision tensors are not part of this text-focused release track.
## Why This Release Exists

These builds target MacBook-class Apple Silicon where wall-clock time matters, especially with long prompts, large system messages, and agent/tool context. The goal is not to publish another uniform quant; it is to provide a stock-compatible GGUF family where OpenTQ spends precision on the tensors that matter most for local inference.
## What Is OpenTQ?
OpenTQ is an open quantization toolchain for TurboQuant-style low-bit model releases. For this GGUF track, OpenTQ does not introduce a custom file format: it audits the model tensor map, assigns standard GGUF tensor types per tensor family, validates the resulting files in stock llama.cpp, and publishes the allocation/evaluation evidence next to the model.
| Field | Value |
|---|---|
| Release track | Qwen3.6-27B-OTQ-GGUF |
| Method | OpenTQ / TurboQuant-inspired dynamic tensor allocation |
| Runtime | stock llama.cpp with Metal and FlashAttention |
| Compatibility boundary | standard GGUF only; no native OpenTQ kernel required |
| Current public variants | Q3_K_M compact, Q4_K_M balanced, and Q5_K_M quality-first |
| Validation machine | M1 Max, 8K prefill gate, bounded generation, deterministic release suites |
## Paired BF16-vs-GGUF Quality Signal
These are small paired release signals, not full benchmark replacements. They use the same pinned task IDs, the `qwen3-no-think` prompt format, deterministic decoding, and local scoring rules for both BF16 and the GGUF artifacts.
BF16 sidecar: Hugging Face Jobs H200 run `69f235d2d2c8bd8662bd320e`, model Qwen/Qwen3.6-27B. Reproducibility data is published in zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.

| Benchmark | BF16 | Q3_K_M | Delta Q3 | Q4_K_M | Delta Q4 | Q5_K_M | Delta Q5 |
|---|---|---|---|---|---|---|---|
| mmlu | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| mmlu_pro | 13/24 (54.2%) | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% |
| arc | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| hellaswag | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 14/16 (87.5%) | -6.2% | 15/16 (93.8%) | +0.0% |
| gsm8k | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 6/16 (37.5%) | +0.0% | 6/16 (37.5%) | +0.0% |
| math | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 7/16 (43.8%) | +6.2% | 6/16 (37.5%) | +0.0% |
| bbh | 18/24 (75.0%) | 18/24 (75.0%) | +0.0% | 18/24 (75.0%) | +0.0% | 19/24 (79.2%) | +4.2% |
| gpqa | 0/24 (0.0%) | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% |
| truthfulqa | 14/16 (87.5%) | 13/16 (81.2%) | -6.2% | 13/16 (81.2%) | -6.2% | 14/16 (87.5%) | +0.0% |
| winogrande | 14/16 (87.5%) | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% |
| drop | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% | 11/16 (68.8%) | -12.5% |
| piqa | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| commonsenseqa | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% |
| TOTAL | 157/232 (67.7%) | 154/232 (66.4%) | -1.3% | 155/232 (66.8%) | -0.9% | 155/232 (66.8%) | -0.9% |
Aggregate deltas on this practical subset are small: Q3 is -1.3 points, Q4 is -0.9 points, and Q5 is -0.9 points vs BF16. Per-benchmark rows still have small-N variance and should not be used as leaderboard claims.
Official Qwen3.6-27B full-harness scores remain the baseline for model capability claims. This table measures same-subset quantization regression only.
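The deltas above follow directly from the raw counts. A minimal sketch of the scoring arithmetic (counts copied from the table; the delta rule is assumed to be plain accuracy difference in percentage points on the same pinned subset):

```python
# Recompute a few rows of the table above from raw correct counts.
# Assumed scoring rule: delta = quant accuracy - BF16 accuracy, reported
# in percentage points on the same pinned task subset.
bf16 = {"gsm8k": (6, 16), "math": (6, 16), "truthfulqa": (14, 16)}
q3km = {"gsm8k": (5, 16), "math": (5, 16), "truthfulqa": (13, 16)}

def acc(correct: int, total: int) -> float:
    return 100.0 * correct / total

for task in bf16:
    delta = acc(*q3km[task]) - acc(*bf16[task])
    print(f"{task}: {delta:+.1f} pts")  # gsm8k: -6.2 pts, ...

# Aggregate over the full 232-task subset:
print(f"Q3_K_M total: {acc(154, 232) - acc(157, 232):+.1f} pts")  # -1.3
```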
## Allocation Transparency
| Variant | Mapped tensors | F16 | Q3_K | Q4_K | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|
| Q3_K_M | 851 | 353 | 180 | 252 | 65 | 1 | 0 |
| Q4_K_M | 851 | 353 | 0 | 180 | 237 | 80 | 1 |
| Q5_K_M | 851 | 353 | 0 | 0 | 180 | 237 | 81 |


The allocation plots show where OpenTQ spends precision. For example, the compact profile pushes bulk MLP tensors lower while preserving attention anchors and output-sensitive tensors at higher precision.
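The per-variant histograms can also be recounted directly from a downloaded file. A sketch using the `gguf` Python package (`pip install gguf`; this assumes a recent gguf-py release where `ReaderTensor.tensor_type` is a `GGMLQuantizationType` enum):

```python
# Sketch: recount tensor types from a downloaded GGUF to cross-check the
# allocation table above.
from collections import Counter
from gguf import GGUFReader

path = "models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf"
reader = GGUFReader(path)
counts = Counter(t.tensor_type.name for t in reader.tensors)
print(f"mapped tensors: {sum(counts.values())}")  # expect 851
for qtype, n in counts.most_common():
    print(f"{qtype}: {n}")                        # expect F16: 353, Q4_K: 252, ...
```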
## Files
| File | Quant | Size | SHA256 | Target |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192 | 32 GB Apple Silicon first pick |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 6b1b9bcbb987e8861c9727488b320e90446d1610a6d3341e3c2185e7388bc2e9 | 32 GB moderate context; 48 GB+ preferred |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | aaf270a91d943e9f26692f267aa9ccaa5359ae2084abb8ba76d84d56b660ab16 | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom |
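Downloads can be verified against the published digests before first use; `shasum -a 256 <file>` on macOS prints the same value. A small Python sketch for the Q3_K_M hash from the table:

```python
# Verify a download against the published SHA256 (Q3_K_M shown; swap in
# the hash and filename for other variants).
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192"
path = "models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf"
assert sha256_of(path) == expected, "checksum mismatch -- re-download the file"
```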
## Variant Family
| File | Quant | Size | Apple Silicon target | Role |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 32 GB Apple Silicon first pick | smallest public OpenTQ dynamic-compatible release |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 32 GB moderate context; 48 GB+ preferred | quality-balanced public release |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom | quality-first public release for larger unified-memory Macs |
## Naming

- OTQ: OpenTQ, the release/tooling brand.
- TurboQuant: the quantization family and design direction.
- DYN: dynamic tensor-level allocation; different tensor families receive different GGUF quant types.
- Q3_K_M / Q4_K_M / Q5_K_M: standard GGUF quant names recognized by Hugging Face and stock llama.cpp.
## Which File Should I Use?

- Q3_K_M: first pick for 32 GB Apple Silicon and larger app/tool contexts.
- Q4_K_M: quality-balanced pick; usable on 32 GB at moderate context, more comfortable on 48 GB+.
- Q5_K_M: quality-first pick; measured on M1 Max 32 GB, but 48 GB+ is the practical target.
## Hardware Compatibility
| Hardware | Status | Recommended artifact | Notes |
|---|---|---|---|
| M1 Max 32 GB | Measured | Q3_K_M; Q4_K_M; Q5_K_M (tight) | Q5_K_M passed 8K gates but leaves limited headroom for other apps. |
| 32 GB Apple Silicon | Expected | Q3_K_M; Q4_K_M only with care | Capacity guidance for M-series systems with similar usable unified memory. |
| 48 GB Apple Silicon | Expected | Q4_K_M; Q5_K_M | Recommended floor for comfortable Q5 use. |
| 64 GB+ Apple Silicon | Expected | Q5_K_M quality-first | Best local target for Q5 plus larger contexts and other apps. |
| 16 GB Apple Silicon | Not recommended | None | Current 27B artifacts leave too little memory headroom. |
Expected rows are capacity guidance, not measured benchmark claims.
Q5_K_M is measured on M1 Max 32 GB, but 48 GB+ is the practical recommendation for comfortable use.
## Model Overview
| Base model field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameter class | 27B dense model |
| HF architecture | Qwen3_5ForConditionalGeneration |
| Layer count | 64 language layers |
| Hidden size | 5120 |
| Native context | 262,144 tokens in the base model; practical local context depends on RAM, KV-cache settings, and other running apps |
| Public GGUF modality | text inference release track |
| Runtime target | Apple Silicon Metal through stock llama.cpp |
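Because practical local context is memory-bound, a back-of-envelope KV-cache estimate helps pick a context size. The GQA parameters in this sketch are illustrative assumptions, not values read from the Qwen3.6-27B config; substitute the real head counts from the GGUF metadata:

```python
# Back-of-envelope KV-cache sizing for context planning.
# ASSUMPTION: n_kv_heads and head_dim are illustrative guesses, not values
# read from the Qwen3.6-27B config; check the GGUF metadata before relying
# on the result.
n_layers   = 64    # from the table above
n_kv_heads = 8     # assumed for illustration
head_dim   = 128   # assumed for illustration
bytes_el   = 2     # f16 K/V entries (llama.cpp default cache type)

def kv_cache_bytes(ctx_tokens: int) -> int:
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_el

for ctx in (8192, 32768, 131072):
    print(f"ctx={ctx:>6}: ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# With these assumed parameters: ~2 GiB at 8K, ~8 GiB at 32K, ~32 GiB at
# 128K -- consistent with raising -c only after checking memory headroom.
```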
## Runtime Compatibility

- llama.cpp, llama-cli, llama-server: supported.
- LM Studio and Ollama local GGUF import: expected to work as standard GGUF loaders.
- OpenTQ custom runtime: not required for this repo.
- Native TurboQuant/OpenTQ tensor formats: separate release track, not mixed into this GGUF repo.
- MLX: not the target runtime for this GGUF track.
## Quick Start

### 1. Download a GGUF

```bash
hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf --local-dir models/Qwen3.6-27B-OTQ-GGUF
```
Use Q3_K_M first on 32 GB Macs. Use Q4_K_M when you can afford the extra memory. Use Q5_K_M for quality-first local inference when headroom matters less than fidelity.
### 2. Build llama.cpp With Metal

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build -j
```
### 3. Run Locally

```bash
./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --temp 0.6 \
  --top-p 0.95 \
  -p "<|im_start|>user\nExplain the tradeoff between prefill and decode throughput.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
```
### 4. Serve an OpenAI-Compatible API

```bash
./build/bin/llama-server \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"Give me a 3-bullet summary of OpenTQ."}],"temperature":0.6}'
```
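Any OpenAI-compatible client can target the same endpoint. A sketch with the official `openai` Python package (the API key is unused unless the server was started with `--api-key`, and the model name is effectively free-form when a single model is loaded):

```python
# Query the local llama-server endpoint with the official `openai` client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3.6-27b-otq",
    messages=[{"role": "user", "content": "Give me a 3-bullet summary of OpenTQ."}],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```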
## llama.cpp Settings
| Setting | Recommended value | Why |
|---|---|---|
| GPU layers | -ngl 99 | Offload all supported layers to Metal on Apple Silicon |
| FlashAttention | -fa / -fa on | Critical for long-context prefill wall-clock |
| Context | -c 8192 first | Validated release gate; increase only after checking memory headroom |
| Prompt format | Qwen chat template | Keep the `<\|im_start\|>` / `<\|im_end\|>` turn structure intact |
| Sampling | --temp 0.6 --top-p 0.95 | Good default for general chat; tighten for deterministic evals |
| Server | llama-server | Use for OpenAI-compatible local apps and agents |
## Apple Silicon Guide
| Machine class | Recommendation |
|---|---|
| 32 GB MacBook Pro / Mac Studio | Prefer Q3_K_M for headroom, especially with agentic prompts and other apps open. |
| 48-64 GB Apple Silicon | Prefer Q4_K_M for balance; use Q5_K_M for quality-first local inference. |
| 96 GB+ Apple Silicon | Prefer Q5_K_M; larger native/custom candidates remain separate until runtime gates pass. |
| Agent workloads with large tool context | Measure total wall-clock time. Decode-only tok/s hides prefill cost. |
## Benchmarks
| Variant | Test | Throughput (tok/s) | Backend | Size |
|---|---|---|---|---|
| Q3_K_M | pp8192 | 107.09 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q3_K_M | tg128 | 10.19 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q4_K_M | pp8192 | 106.98 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q4_K_M | tg128 | 9.62 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q5_K_M | pp8192 | 93.94 +/- 0.00 | MTL,BLAS | 19.91 GiB |
| Q5_K_M | tg128 | 8.87 +/- 0.00 | MTL,BLAS | 19.91 GiB |



The plots compare the quantized OTQ artifacts against each other on measured release data. Official Qwen scores are kept as a reference table, not plotted as a fake delta.
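The measured rates also give a first-order wall-clock model for agent-style workloads: total time ≈ prompt tokens / prefill rate + new tokens / decode rate. This ignores sampling overhead and any prompt-cache reuse, so treat it as a lower-bound sketch rather than a measurement:

```python
# First-order wall-clock estimate from the measured rates above:
#   total_seconds ~= prompt_tokens / prefill_rate + new_tokens / decode_rate
rates = {  # (pp8192 tok/s, tg128 tok/s) copied from the table
    "Q3_K_M": (107.09, 10.19),
    "Q4_K_M": (106.98, 9.62),
    "Q5_K_M": (93.94, 8.87),
}

prompt_tokens, new_tokens = 8192, 512  # e.g. a large agent/tool context
for name, (pp, tg) in rates.items():
    total = prompt_tokens / pp + new_tokens / tg
    print(f"{name}: ~{total:.0f} s wall clock")
# Q3_K_M lands around ~127 s, most of it prefill -- decode-only tok/s
# alone would hide roughly 60% of the wall-clock cost here.
```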
## Practical Mini-Subset Quality Signals
See Paired BF16-vs-GGUF Quality Signal. The table and chart are placed near the top of this card because they are the main same-subset quantization-regression evidence.
## Release Evaluation
| Variant | Suite | Passed | Pass rate | Mean latency | p95 latency |
|---|---|---|---|---|---|
| Q3_K_M | smoke | 5/5 | 1.0 | 7.605s | 22.371s |
| Q3_K_M | release | 10/10 | 1.0 | 9.325s | 26.905s |
| Q4_K_M | smoke | 5/5 | 1.0 | 8.333s | 23.826s |
| Q4_K_M | release | 10/10 | 1.0 | 9.907s | 21.395s |
| Q5_K_M | smoke | 5/5 | 1.0 | 16.046s | 34.387s |
| Q5_K_M | release | 10/10 | 1.0 | 16.955s | 34.580s |
## Release Gate
| Variant | Metadata | Bounded generation | 8K llama-bench | Smoke gate | Release gate | Timestamp |
|---|---|---|---|---|---|---|
| Q3_K_M | passed | passed (24.246s) | passed (91.371s) | 5/5 | 10/10 | 2026-04-27T19:38:50.320253+00:00 |
| Q4_K_M | passed | passed (22.348s) | passed (93.163s) | 5/5 | 10/10 | 2026-04-27T19:43:25.174228+00:00 |
| Q5_K_M | passed | passed (44.272s) | passed (119.964s) | 5/5 | 10/10 | 2026-04-28T23:18:17.700281+00:00 |


## Official Baseline vs OTQ Claims
| Item | Status |
|---|---|
| Official Qwen3.6-27B source scores | Imported from the official model card into benchmarks/official_qwen36_baseline.csv |
| OTQ Q3_K_M / Q4_K_M / Q5_K_M runtime | Measured with llama-bench on M1 Max |
| OTQ functional release gates | Measured with deterministic smoke and extended suites |
| Official benchmark deltas | Not claimed yet; requires running the same tasks/scoring on the GGUF artifacts |
## Transparency Files

Each variant has full release evidence under `evidence/<quant>/`:

- `validation.json`
- `quality-eval.json`
- `release-eval.json`
- `opentq-plan.json`
- `tensor-types.txt`
- `tensor-types.annotated.tsv`
- `quantize-dry-run.log`
## Reproduce Release Evidence

```bash
git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run python scripts/stage_qwen36_otq_gguf_repo.py
uv run python scripts/build_qwen36_release_report.py --repo artifacts/hf-gguf-canonical/Qwen3.6-27B-OTQ-GGUF
```

Run the same style of OTQ release evaluation:

```bash
LLAMA_CPP_DIR=/path/to/llama.cpp ./scripts/run_qwen36_otq_eval.sh
```
Run the long-context benchmark directly:

```bash
./build/bin/llama-bench \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa on \
  -p 8192 \
  -n 128 \
  -r 1 \
  --no-warmup
```