# zlaabsi/Qwen3.6-27B-OTQ-GGUF

OpenTQ TurboQuant dynamic-compatible GGUFs for Qwen/Qwen3.6-27B.
This is the stock llama.cpp release track. OpenTQ chooses the tensor-level allocation policy, but the files themselves use standard GGUF tensor types (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16). No custom OpenTQ runtime is required for these GGUF files.
The Hugging Face `pipeline_tag` follows the official Qwen3.6-27B card (image-text-to-text). These GGUF artifacts are validated here for local text inference with stock llama.cpp; vision tensors are not part of this text-focused release track.
## Why This Release Exists

These builds target MacBook-class Apple Silicon where wall-clock time matters, especially with long prompts, large system messages, and agent/tool context. The goal is not to publish another uniform quant; it is to provide a stock-compatible GGUF family where OpenTQ spends precision on the tensors that matter most for local inference.
## What Is OpenTQ?
OpenTQ is an open quantization toolchain for TurboQuant-style low-bit model releases. For this GGUF track, OpenTQ does not introduce a custom file format: it audits the model tensor map, assigns standard GGUF tensor types per tensor family, validates the resulting files in stock llama.cpp, and publishes the allocation/evaluation evidence next to the model.
| Field | Value |
|---|---|
| Release track | Qwen3.6-27B-OTQ-GGUF |
| Method | OpenTQ / TurboQuant-inspired dynamic tensor allocation |
| Runtime | stock llama.cpp with Metal and FlashAttention |
| Compatibility boundary | standard GGUF only; no native OpenTQ kernel required |
| Current public variants | Q3_K_M compact, Q4_K_M balanced, and Q5_K_M quality-first |
| Validation machine | M1 Max, 8K prefill gate, bounded generation, deterministic release suites |
## Paired BF16-vs-GGUF Quality Signal
These are small paired release signals, not full benchmark replacements. They use the same pinned task IDs, the `qwen3-no-think` prompt format, deterministic decoding, and local scoring rules for both BF16 and the GGUF artifacts.
BF16 sidecar: Hugging Face Jobs H200 run `69f235d2d2c8bd8662bd320e`, model Qwen/Qwen3.6-27B. Reproducibility data is published in zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.

| Benchmark | BF16 | Q3_K_M | Delta Q3 | Q4_K_M | Delta Q4 | Q5_K_M | Delta Q5 |
|---|---|---|---|---|---|---|---|
| mmlu | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| mmlu_pro | 13/24 (54.2%) | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% |
| arc | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| hellaswag | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 14/16 (87.5%) | -6.2% | 15/16 (93.8%) | +0.0% |
| gsm8k | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 6/16 (37.5%) | +0.0% | 6/16 (37.5%) | +0.0% |
| math | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 7/16 (43.8%) | +6.2% | 6/16 (37.5%) | +0.0% |
| bbh | 18/24 (75.0%) | 18/24 (75.0%) | +0.0% | 18/24 (75.0%) | +0.0% | 19/24 (79.2%) | +4.2% |
| gpqa | 0/24 (0.0%) | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% |
| truthfulqa | 14/16 (87.5%) | 13/16 (81.2%) | -6.2% | 13/16 (81.2%) | -6.2% | 14/16 (87.5%) | +0.0% |
| winogrande | 14/16 (87.5%) | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% |
| drop | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% | 11/16 (68.8%) | -12.5% |
| piqa | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| commonsenseqa | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% |
| TOTAL | 157/232 (67.7%) | 154/232 (66.4%) | -1.3% | 155/232 (66.8%) | -0.9% | 155/232 (66.8%) | -0.9% |
Aggregate deltas on this practical subset are small: Q3 is -1.3 points, Q4 is -0.9 points, and Q5 is -0.9 points vs BF16. Per-benchmark rows still have small-N variance and should not be used as leaderboard claims.
Official Qwen3.6-27B full-harness scores remain the baseline for model capability claims. This table measures same-subset quantization regression only.
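The deltas above follow directly from the raw counts. A minimal sketch of the scoring arithmetic (counts copied from the table; the delta rule is assumed to be plain accuracy difference in percentage points on the same pinned subset):

```python
# Recompute a few rows of the table above from raw correct counts.
# Assumed scoring rule: delta = quant accuracy - BF16 accuracy, reported
# in percentage points on the same pinned task subset.
bf16 = {"gsm8k": (6, 16), "math": (6, 16), "truthfulqa": (14, 16)}
q3km = {"gsm8k": (5, 16), "math": (5, 16), "truthfulqa": (13, 16)}

def acc(correct: int, total: int) -> float:
    return 100.0 * correct / total

for task in bf16:
    delta = acc(*q3km[task]) - acc(*bf16[task])
    print(f"{task}: {delta:+.1f} pts")  # gsm8k: -6.2 pts, ...

# Aggregate over the full 232-task subset:
print(f"Q3_K_M total: {acc(154, 232) - acc(157, 232):+.1f} pts")  # -1.3
```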
## Allocation Transparency
| Variant | Mapped tensors | F16 | Q3_K | Q4_K | Q5_K | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|---|
| Q3_K_M | 851 | 353 | 180 | 252 | 65 | 1 | 0 |
| Q4_K_M | 851 | 353 | 0 | 180 | 237 | 80 | 1 |
| Q5_K_M | 851 | 353 | 0 | 0 | 180 | 237 | 81 |


The allocation plots show where OpenTQ spends precision. For example, the compact profile pushes bulk MLP tensors lower while preserving attention anchors and output-sensitive tensors at higher precision.
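The per-variant histograms can also be recounted directly from a downloaded file. A sketch using the `gguf` Python package (`pip install gguf`; this assumes a recent gguf-py release where `ReaderTensor.tensor_type` is a `GGMLQuantizationType` enum):

```python
# Sketch: recount tensor types from a downloaded GGUF to cross-check the
# allocation table above.
from collections import Counter
from gguf import GGUFReader

path = "models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf"
reader = GGUFReader(path)
counts = Counter(t.tensor_type.name for t in reader.tensors)
print(f"mapped tensors: {sum(counts.values())}")  # expect 851
for qtype, n in counts.most_common():
    print(f"{qtype}: {n}")                        # expect F16: 353, Q4_K: 252, ...
```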
## Files
| File | Quant | Size | SHA256 | Target |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192 | 32 GB Apple Silicon first pick |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 6b1b9bcbb987e8861c9727488b320e90446d1610a6d3341e3c2185e7388bc2e9 | 32 GB moderate context; 48 GB+ preferred |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | aaf270a91d943e9f26692f267aa9ccaa5359ae2084abb8ba76d84d56b660ab16 | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom |
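Downloads can be verified against the published digests before first use; `shasum -a 256 <file>` on macOS prints the same value. A small Python sketch for the Q3_K_M hash from the table:

```python
# Verify a download against the published SHA256 (Q3_K_M shown; swap in
# the hash and filename for other variants).
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192"
path = "models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf"
assert sha256_of(path) == expected, "checksum mismatch -- re-download the file"
```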
## Variant Family
| File | Quant | Size | Apple Silicon target | Role |
|---|---|---|---|---|
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 32 GB Apple Silicon first pick | smallest public OpenTQ dynamic-compatible release |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 32 GB moderate context; 48 GB+ preferred | quality-balanced public release |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom | quality-first public release for larger unified-memory Macs |
## Naming

- OTQ: OpenTQ, the release/tooling brand.
- TurboQuant: the quantization family and design direction.
- DYN: dynamic tensor-level allocation; different tensor families receive different GGUF quant types.
- Q3_K_M / Q4_K_M / Q5_K_M: standard GGUF quant names recognized by Hugging Face and stock llama.cpp.
## Which File Should I Use?

- Q3_K_M: first pick for 32 GB Apple Silicon and larger app/tool contexts.
- Q4_K_M: quality-balanced pick; usable on 32 GB at moderate context, more comfortable on 48 GB+.
- Q5_K_M: quality-first pick; measured on M1 Max 32 GB, but 48 GB+ is the practical target.
## Hardware Compatibility
| Hardware | Status | Recommended artifact | Notes |
|---|---|---|---|
| M1 Max 32 GB | Measured | Q3_K_M; Q4_K_M; Q5_K_M (tight) | Q5_K_M passed 8K gates but leaves limited headroom for other apps. |
| 32 GB Apple Silicon | Expected | Q3_K_M; Q4_K_M only with care | Capacity guidance for M-series systems with similar usable unified memory. |
| 48 GB Apple Silicon | Expected | Q4_K_M; Q5_K_M | Recommended floor for comfortable Q5 use. |
| 64 GB+ Apple Silicon | Expected | Q5_K_M quality-first | Best local target for Q5 plus larger contexts and other apps. |
| 16 GB Apple Silicon | Not recommended | None | Current 27B artifacts leave too little memory headroom. |
Expected rows are capacity guidance, not measured benchmark claims.
Q5_K_M is measured on M1 Max 32 GB, but 48 GB+ is the practical recommendation for comfortable use.
## Model Overview
| Base model field | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Parameter class | 27B dense model |
| HF architecture | Qwen3_5ForConditionalGeneration |
| Layer count | 64 language layers |
| Hidden size | 5120 |
| Native context | 262,144 tokens in the base model; practical local context depends on RAM, KV-cache settings, and other running apps |
| Public GGUF modality | text inference release track |
| Runtime target | Apple Silicon Metal through stock llama.cpp |
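Because practical local context is memory-bound, a back-of-envelope KV-cache estimate helps pick a context size. The GQA parameters in this sketch are illustrative assumptions, not values read from the Qwen3.6-27B config; substitute the real head counts from the GGUF metadata:

```python
# Back-of-envelope KV-cache sizing for context planning.
# ASSUMPTION: n_kv_heads and head_dim are illustrative guesses, not values
# read from the Qwen3.6-27B config; check the GGUF metadata before relying
# on the result.
n_layers   = 64    # from the table above
n_kv_heads = 8     # assumed for illustration
head_dim   = 128   # assumed for illustration
bytes_el   = 2     # f16 K/V entries (llama.cpp default cache type)

def kv_cache_bytes(ctx_tokens: int) -> int:
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_el

for ctx in (8192, 32768, 131072):
    print(f"ctx={ctx:>6}: ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# With these assumed parameters: ~2 GiB at 8K, ~8 GiB at 32K, ~32 GiB at
# 128K -- consistent with raising -c only after checking memory headroom.
```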
## Runtime Compatibility

- llama.cpp, llama-cli, llama-server: supported.
- LM Studio and Ollama local GGUF import: expected to work as standard GGUF loaders.
- OpenTQ custom runtime: not required for this repo.
- Native TurboQuant/OpenTQ tensor formats: separate release track, not mixed into this GGUF repo.
- MLX: not the target runtime for this GGUF track.
## Quick Start

### 1. Download a GGUF

```bash
hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf --local-dir models/Qwen3.6-27B-OTQ-GGUF
```
Use Q3_K_M first on 32 GB Macs. Use Q4_K_M when you can afford the extra memory. Use Q5_K_M for quality-first local inference when headroom matters less than fidelity.
### 2. Build llama.cpp With Metal

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build -j
```
### 3. Run Locally

```bash
./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --temp 0.6 \
  --top-p 0.95 \
  -p "<|im_start|>user\nExplain the tradeoff between prefill and decode throughput.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
```
### 4. Serve an OpenAI-Compatible API

```bash
./build/bin/llama-server \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
```

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"Give me a 3-bullet summary of OpenTQ."}],"temperature":0.6}'
```
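Any OpenAI-compatible client can target the same endpoint. A sketch with the official `openai` Python package (the API key is unused unless the server was started with `--api-key`, and the model name is effectively free-form when a single model is loaded):

```python
# Query the local llama-server endpoint with the official `openai` client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3.6-27b-otq",
    messages=[{"role": "user", "content": "Give me a 3-bullet summary of OpenTQ."}],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```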
## llama.cpp Settings
| Setting | Recommended value | Why |
|---|---|---|
| GPU layers | -ngl 99 | Offload all supported layers to Metal on Apple Silicon |
| FlashAttention | -fa / -fa on | Critical for long-context prefill wall-clock |
| Context | -c 8192 first | Validated release gate; increase only after checking memory headroom |
| Prompt format | Qwen chat template | Keep the `<\|im_start\|>` / `<\|im_end\|>` turn structure intact |
| Sampling | --temp 0.6 --top-p 0.95 | Good default for general chat; tighten for deterministic evals |
| Server | llama-server | Use for OpenAI-compatible local apps and agents |
## Apple Silicon Guide
| Machine class | Recommendation |
|---|---|
| 32 GB MacBook Pro / Mac Studio | Prefer Q3_K_M for headroom, especially with agentic prompts and other apps open. |
| 48-64 GB Apple Silicon | Prefer Q4_K_M for balance; use Q5_K_M for quality-first local inference. |
| 96 GB+ Apple Silicon | Prefer Q5_K_M; larger native/custom candidates remain separate until runtime gates pass. |
| Agent workloads with large tool context | Measure total wall-clock time. Decode-only tok/s hides prefill cost. |
## Benchmarks
| Variant | Test | Throughput (tok/s) | Backend | Size |
|---|---|---|---|---|
| Q3_K_M | pp8192 | 107.09 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q3_K_M | tg128 | 10.19 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q4_K_M | pp8192 | 106.98 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q4_K_M | tg128 | 9.62 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q5_K_M | pp8192 | 93.94 +/- 0.00 | MTL,BLAS | 19.91 GiB |
| Q5_K_M | tg128 | 8.87 +/- 0.00 | MTL,BLAS | 19.91 GiB |



The plots compare the quantized OTQ artifacts against each other on measured release data. Official Qwen scores are kept as a reference table, not plotted as a fake delta.
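The measured rates also give a first-order wall-clock model for agent-style workloads: total time ≈ prompt tokens / prefill rate + new tokens / decode rate. This ignores sampling overhead and any prompt-cache reuse, so treat it as a lower-bound sketch rather than a measurement:

```python
# First-order wall-clock estimate from the measured rates above:
#   total_seconds ~= prompt_tokens / prefill_rate + new_tokens / decode_rate
rates = {  # (pp8192 tok/s, tg128 tok/s) copied from the table
    "Q3_K_M": (107.09, 10.19),
    "Q4_K_M": (106.98, 9.62),
    "Q5_K_M": (93.94, 8.87),
}

prompt_tokens, new_tokens = 8192, 512  # e.g. a large agent/tool context
for name, (pp, tg) in rates.items():
    total = prompt_tokens / pp + new_tokens / tg
    print(f"{name}: ~{total:.0f} s wall clock")
# Q3_K_M lands around ~127 s, most of it prefill -- decode-only tok/s
# alone would hide roughly 60% of the wall-clock cost here.
```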
## Practical Mini-Subset Quality Signals
See Paired BF16-vs-GGUF Quality Signal. The table and chart are placed near the top of this card because they are the main same-subset quantization-regression evidence.
## Release Evaluation
| Variant | Suite | Passed | Pass rate | Mean latency | p95 latency |
|---|---|---|---|---|---|
| Q3_K_M | smoke | 5/5 | 1.0 | 7.605s | 22.371s |
| Q3_K_M | release | 10/10 | 1.0 | 9.325s | 26.905s |
| Q4_K_M | smoke | 5/5 | 1.0 | 8.333s | 23.826s |
| Q4_K_M | release | 10/10 | 1.0 | 9.907s | 21.395s |
| Q5_K_M | smoke | 5/5 | 1.0 | 16.046s | 34.387s |
| Q5_K_M | release | 10/10 | 1.0 | 16.955s | 34.580s |
## Release Gate
| Variant | Metadata | Bounded generation | 8K llama-bench | Smoke gate | Release gate | Timestamp |
|---|---|---|---|---|---|---|
| Q3_K_M | passed | passed (24.246s) | passed (91.371s) | 5/5 | 10/10 | 2026-04-27T19:38:50.320253+00:00 |
| Q4_K_M | passed | passed (22.348s) | passed (93.163s) | 5/5 | 10/10 | 2026-04-27T19:43:25.174228+00:00 |
| Q5_K_M | passed | passed (44.272s) | passed (119.964s) | 5/5 | 10/10 | 2026-04-28T23:18:17.700281+00:00 |


## Official Baseline vs OTQ Claims
| Item | Status |
|---|---|
| Official Qwen3.6-27B source scores | Imported from the official model card into benchmarks/official_qwen36_baseline.csv |
| OTQ Q3_K_M / Q4_K_M / Q5_K_M runtime | Measured with llama-bench on M1 Max |
| OTQ functional release gates | Measured with deterministic smoke and extended suites |
| Official benchmark deltas | Not claimed yet; requires running the same tasks/scoring on the GGUF artifacts |
## Transparency Files

Each variant has full release evidence under `evidence/<quant>/`:

- `validation.json`
- `quality-eval.json`
- `release-eval.json`
- `opentq-plan.json`
- `tensor-types.txt`
- `tensor-types.annotated.tsv`
- `quantize-dry-run.log`
## Reproduce Release Evidence

```bash
git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run python scripts/stage_qwen36_otq_gguf_repo.py
uv run python scripts/build_qwen36_release_report.py --repo artifacts/hf-gguf-canonical/Qwen3.6-27B-OTQ-GGUF
```

Run the same style of OTQ release evaluation:

```bash
LLAMA_CPP_DIR=/path/to/llama.cpp ./scripts/run_qwen36_otq_eval.sh
```
Run the long-context benchmark directly:

```bash
./build/bin/llama-bench \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa on \
  -p 8192 \
  -n 128 \
  -r 1 \
  --no-warmup
```