Qwen3.6-27B-OTQ-GGUF

OpenTQ TurboQuant dynamic-compatible GGUFs for Qwen/Qwen3.6-27B.

This is the stock llama.cpp release track. OpenTQ chooses the tensor-level allocation policy, but the files themselves use standard GGUF tensor types (Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16). No custom OpenTQ runtime is required for these GGUF files.
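One optional way to check that claim locally, assuming the `gguf` Python package and its gguf-dump utility (an assumption, not a project requirement), is to dump the tensor map and confirm only standard types appear:

# Optional: dump the tensor map and inspect per-tensor GGUF types.
# Assumes the `gguf` package from PyPI, which ships a gguf-dump command.
pip install gguf
gguf-dump models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | less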

The Hugging Face pipeline_tag follows the official Qwen3.6-27B card (image-text-to-text). These GGUF artifacts are validated here for local text inference with stock llama.cpp; vision tensors are not part of this text-focused release track.

Why This Release Exists

These builds target MacBook-class Apple Silicon where wall-clock time matters, especially with long prompts, large system messages, and agent/tool context. The goal is not to publish another uniform quant; it is to provide a stock-compatible GGUF family where OpenTQ spends precision on the tensors that matter most for local inference.

What Is OpenTQ?

OpenTQ is an open quantization toolchain for TurboQuant-style low-bit model releases. For this GGUF track, OpenTQ does not introduce a custom file format: it audits the model tensor map, assigns standard GGUF tensor types per tensor family, validates the resulting files in stock llama.cpp, and publishes the allocation/evaluation evidence next to the model.

| Field | Value |
| --- | --- |
| Release track | Qwen3.6-27B-OTQ-GGUF |
| Method | OpenTQ / TurboQuant-inspired dynamic tensor allocation |
| Runtime | stock llama.cpp with Metal and FlashAttention |
| Compatibility boundary | standard GGUF only; no native OpenTQ kernel required |
| Current public variants | Q3_K_M (compact), Q4_K_M (balanced), Q5_K_M (quality-first) |
| Validation machine | M1 Max; 8K prefill gate, bounded generation, deterministic release suites |

Paired BF16-vs-GGUF Quality Signal

These are small paired release signals, not full benchmark replacements. They use the same pinned task IDs, the qwen3-no-think prompt format, deterministic decoding, and the same local scoring rules for both BF16 and the GGUF artifacts.

BF16 sidecar: Hugging Face Jobs H200 run 69f235d2d2c8bd8662bd320e, model Qwen/Qwen3.6-27B. Reproducibility data is published in zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.
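As an illustrative sketch only (the pinned eval flags live in the benchmarks repo; these values are not the published config), deterministic decoding with stock llama-cli looks like:

# Illustrative deterministic decoding: greedy sampling plus a fixed seed.
# The actual pinned configuration is published in
# zlaabsi/Qwen3.6-27B-OTQ-GGUF-benchmarks.
./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf \
  -ngl 99 -fa -c 8192 \
  --temp 0.0 --seed 42 \
  -p "..."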

Paired BF16 vs GGUF quantization deltas

| Benchmark | BF16 | Q3_K_M | Delta Q3 | Q4_K_M | Delta Q4 | Q5_K_M | Delta Q5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mmlu | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| mmlu_pro | 13/24 (54.2%) | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% | 13/24 (54.2%) | +0.0% |
| arc | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| hellaswag | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 14/16 (87.5%) | -6.2% | 15/16 (93.8%) | +0.0% |
| gsm8k | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 6/16 (37.5%) | +0.0% | 6/16 (37.5%) | +0.0% |
| math | 6/16 (37.5%) | 5/16 (31.2%) | -6.2% | 7/16 (43.8%) | +6.2% | 6/16 (37.5%) | +0.0% |
| bbh | 18/24 (75.0%) | 18/24 (75.0%) | +0.0% | 18/24 (75.0%) | +0.0% | 19/24 (79.2%) | +4.2% |
| gpqa | 0/24 (0.0%) | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% | 0/24 (0.0%) | +0.0% |
| truthfulqa | 14/16 (87.5%) | 13/16 (81.2%) | -6.2% | 13/16 (81.2%) | -6.2% | 14/16 (87.5%) | +0.0% |
| winogrande | 14/16 (87.5%) | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% | 14/16 (87.5%) | +0.0% |
| drop | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% | 11/16 (68.8%) | -12.5% |
| piqa | 15/16 (93.8%) | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% | 15/16 (93.8%) | +0.0% |
| commonsenseqa | 13/16 (81.2%) | 13/16 (81.2%) | +0.0% | 13/16 (81.2%) | +0.0% | 12/16 (75.0%) | -6.2% |
| TOTAL | 157/232 (67.7%) | 154/232 (66.4%) | -1.3% | 155/232 (66.8%) | -0.9% | 155/232 (66.8%) | -0.9% |

Aggregate deltas on this practical subset are small: Q3 is -1.3 points, Q4 is -0.9 points, and Q5 is -0.9 points vs BF16. Per-benchmark rows still have small-N variance and should not be used as leaderboard claims.

Official Qwen3.6-27B full-harness scores remain the baseline for model capability claims. This table measures same-subset quantization regression only.

Allocation Transparency

| Variant | Mapped tensors | F16 | Q3_K | Q4_K | Q5_K | Q6_K | Q8_0 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Q3_K_M | 851 | 353 | 180 | 252 | 65 | 1 | 0 |
| Q4_K_M | 851 | 353 | 0 | 180 | 237 | 80 | 1 |
| Q5_K_M | 851 | 353 | 0 | 0 | 180 | 237 | 81 |

Plots: tensor allocation; allocation policy.

The allocation plots show where OpenTQ spends precision. For example, the compact profile pushes bulk MLP tensors lower while preserving attention anchors and output-sensitive tensors at higher precision.
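To recompute the counts in the table above, a minimal sketch against the published evidence files (assuming tensor-types.txt lists one tensor per line with the GGUF type as the last field, and a lowercase evidence directory; both are assumptions about the repo layout):

# Tally GGUF tensor types from the published evidence file.
awk '{print $NF}' evidence/q3_k_m/tensor-types.txt | sort | uniq -c | sort -rn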

Files

| File | Quant | Size | SHA256 | Target |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192 | 32 GB Apple Silicon first pick |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 6b1b9bcbb987e8861c9727488b320e90446d1610a6d3341e3c2185e7388bc2e9 | 32 GB moderate context; 48 GB+ preferred |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | aaf270a91d943e9f26692f267aa9ccaa5359ae2084abb8ba76d84d56b660ab16 | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom |
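To verify a download against its published digest (shasum ships with macOS; use sha256sum on Linux):

# Compare the local file's SHA256 with the table above (Q3_K_M shown).
shasum -a 256 models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf
# Expected: 0088e8884a0593b6720a58e2e0ab91a1dd216dfb80942b698f9ddee5dc8b3192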

Variant Family

| File | Quant | Size | Apple Silicon target | Role |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf | Q3_K_M | 13.48 GiB | 32 GB Apple Silicon first pick | smallest public OpenTQ dynamic-compatible release |
| Qwen3.6-27B-OTQ-DYN-Q4_K_M.gguf | Q4_K_M | 16.82 GiB | 32 GB moderate context; 48 GB+ preferred | quality-balanced public release |
| Qwen3.6-27B-OTQ-DYN-Q5_K_M.gguf | Q5_K_M | 19.92 GiB | 48 GB+ preferred; measured on M1 Max 32 GB with tight headroom | quality-first public release for larger unified-memory Macs |

Naming

  • OTQ: OpenTQ, the release/tooling brand.
  • TurboQuant: the quantization family and design direction.
  • DYN: dynamic tensor-level allocation; different tensor families receive different GGUF quant types.
  • Q3_K_M / Q4_K_M / Q5_K_M: standard GGUF quant names recognized by Hugging Face and stock llama.cpp.

Which File Should I Use?

  • Q3_K_M: first pick for 32 GB Apple Silicon and larger app/tool contexts.
  • Q4_K_M: quality-balanced pick; usable on 32 GB at moderate context, more comfortable on 48 GB+.
  • Q5_K_M: quality-first pick; measured on M1 Max 32 GB, but 48 GB+ is the practical target.

Hardware Compatibility

| Hardware | Status | Recommended artifact | Notes |
| --- | --- | --- | --- |
| M1 Max 32 GB | Measured | Q3_K_M; Q4_K_M; Q5_K_M (tight) | Q5_K_M passed 8K gates but leaves limited app headroom. |
| 32 GB Apple Silicon | Expected | Q3_K_M; Q4_K_M only with care | Capacity guidance for M-series systems with similar usable unified memory. |
| 48 GB Apple Silicon | Expected | Q4_K_M; Q5_K_M | Recommended floor for comfortable Q5 use. |
| 64 GB+ Apple Silicon | Expected | Q5_K_M (quality-first) | Best local target for Q5 plus larger contexts and other apps. |
| 16 GB Apple Silicon | Not recommended | None | Current 27B artifacts leave too little memory headroom. |

Expected rows are capacity guidance, not measured benchmark claims. Q5_K_M is measured on M1 Max 32 GB, but 48 GB+ is the practical recommendation for comfortable use.
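For a rough pre-flight check (weights only; the KV cache, the OS, and other apps need headroom on top of the artifact size), compare total unified memory against the file sizes above:

# Print total unified memory in GiB on macOS (sysctl is built in).
sysctl -n hw.memsize | awk '{printf "%.1f GiB unified memory\n", $1/1073741824}'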

Model Overview

| Base model field | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-27B |
| Parameter class | 27B dense model |
| HF architecture | Qwen3_5ForConditionalGeneration |
| Layer count | 64 language layers |
| Hidden size | 5120 |
| Native context | 262,144 tokens in the base model; practical local context depends on RAM, KV-cache settings, and other open apps |
| Public GGUF modality | text inference release track |
| Runtime target | Apple Silicon Metal through stock llama.cpp |

Runtime Compatibility

  • llama.cpp, llama-cli, llama-server: supported.
  • LM Studio and Ollama local GGUF import: expected to work as standard GGUF loaders (an untested Ollama sketch follows this list).
  • OpenTQ custom runtime: not required for this repo.
  • Native TurboQuant/OpenTQ tensor formats: separate release track, not mixed into this GGUF repo.
  • MLX: not the target runtime for this GGUF track.
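For Ollama, local import typically goes through a Modelfile; this is an untested sketch of that path, with a model alias of my own choosing:

# Untested sketch: register the Q3_K_M file as a local Ollama model.
cat > Modelfile <<'EOF'
FROM ./models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf
EOF
ollama create qwen3.6-27b-otq -f Modelfile
ollama run qwen3.6-27b-otq "Say hello."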

Quick Start

1. Download A GGUF

hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf --local-dir models/Qwen3.6-27B-OTQ-GGUF

Use Q3_K_M first on 32 GB Macs. Use Q4_K_M when you can afford the extra memory. Use Q5_K_M for quality-first local inference when headroom matters less than fidelity.

2. Build llama.cpp With Metal

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON
cmake --build build -j

3. Run Locally

./build/bin/llama-cli \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --temp 0.6 \
  --top-p 0.95 \
  -p "<|im_start|>user\nExplain the tradeoff between prefill and decode throughput.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

4. Serve An OpenAI-Compatible API

./build/bin/llama-server \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080
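# In a second terminal, exercise the OpenAI-compatible endpoint: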
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-27b-otq","messages":[{"role":"user","content":"Give me a 3-bullet summary of OpenTQ."}],"temperature":0.6}'

llama.cpp Settings

| Setting | Recommended value | Why |
| --- | --- | --- |
| GPU layers | -ngl 99 | Offload all supported layers to Metal on Apple Silicon |
| FlashAttention | -fa / -fa on | Critical for long-context prefill wall-clock |
| Context | -c 8192 first | Validated release gate; increase only after checking memory headroom |
| Prompt format | Qwen chat template | Keep the im_start/im_end chat markers intact, as in the Quick Start prompt |
| Sampling | --temp 0.6 --top-p 0.95 | Good default for general chat; tighten for deterministic evals |
| Server | llama-server | Use for OpenAI-compatible local apps and agents |

Apple Silicon Guide

| Machine class | Recommendation |
| --- | --- |
| 32 GB MacBook Pro / Mac Studio | Prefer Q3_K_M for headroom, especially with agentic prompts and other apps open. |
| 48-64 GB Apple Silicon | Prefer Q4_K_M for balance; use Q5_K_M for quality-first local inference. |
| 96 GB+ Apple Silicon | Prefer Q5_K_M; larger native/custom candidates remain separate until runtime gates pass. |
| Agent workloads with large tool context | Measure total wall-clock time; decode-only tok/s hides prefill cost (see the timing sketch below). |
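To follow the last row's advice, a minimal sketch against a running llama-server (request.json stands in for your own long tool-context payload; it is a hypothetical file, not part of this release):

# Time the whole request (prefill + decode), not just decode throughput.
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json > /dev/null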

Benchmarks

| Variant | Test | Throughput (t/s) | Backend | Size |
| --- | --- | --- | --- | --- |
| Q3_K_M | pp8192 | 107.09 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q3_K_M | tg128 | 10.19 +/- 0.00 | MTL,BLAS | 13.47 GiB |
| Q4_K_M | pp8192 | 106.98 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q4_K_M | tg128 | 9.62 +/- 0.00 | MTL,BLAS | 16.81 GiB |
| Q5_K_M | pp8192 | 93.94 +/- 0.00 | MTL,BLAS | 19.91 GiB |
| Q5_K_M | tg128 | 8.87 +/- 0.00 | MTL,BLAS | 19.91 GiB |

Plots: runtime frontier; prefill/decode tradeoff; release scorecard.

The plots compare the quantized OTQ artifacts against each other on measured release data. Official Qwen scores are kept as a reference table, not plotted as a fake delta.

Practical Mini-Subset Quality Signals

See Paired BF16-vs-GGUF Quality Signal. The table and chart are placed near the top of this card because they are the main same-subset quantization-regression evidence.

Release Evaluation

| Variant | Suite | Passed | Pass rate | Mean latency | p95 latency |
| --- | --- | --- | --- | --- | --- |
| Q3_K_M | smoke | 5/5 | 1.0 | 7.605s | 22.371s |
| Q3_K_M | release | 10/10 | 1.0 | 9.325s | 26.905s |
| Q4_K_M | smoke | 5/5 | 1.0 | 8.333s | 23.826s |
| Q4_K_M | release | 10/10 | 1.0 | 9.907s | 21.395s |
| Q5_K_M | smoke | 5/5 | 1.0 | 16.046s | 34.387s |
| Q5_K_M | release | 10/10 | 1.0 | 16.955s | 34.580s |

Release Gate

| Variant | Metadata | Bounded generation | 8K llama-bench | Smoke gate | Release gate | Timestamp |
| --- | --- | --- | --- | --- | --- | --- |
| Q3_K_M | passed | passed (24.246s) | passed (91.371s) | 5/5 | 10/10 | 2026-04-27T19:38:50.320253+00:00 |
| Q4_K_M | passed | passed (22.348s) | passed (93.163s) | 5/5 | 10/10 | 2026-04-27T19:43:25.174228+00:00 |
| Q5_K_M | passed | passed (44.272s) | passed (119.964s) | 5/5 | 10/10 | 2026-04-28T23:18:17.700281+00:00 |

Plots: release gate latency; release gate coverage.

Official Baseline vs OTQ Claims

| Item | Status |
| --- | --- |
| Official Qwen3.6-27B source scores | Imported from the official model card into benchmarks/official_qwen36_baseline.csv |
| OTQ Q3_K_M / Q4_K_M / Q5_K_M runtime | Measured with llama-bench on M1 Max |
| OTQ functional release gates | Measured with deterministic smoke and extended suites |
| Official benchmark deltas | Not claimed yet; requires running the same tasks/scoring on the GGUF artifacts |

Transparency Files

Each variant has full release evidence under evidence/<quant>/:

  • validation.json
  • quality-eval.json
  • release-eval.json
  • opentq-plan.json
  • tensor-types.txt
  • tensor-types.annotated.tsv
  • quantize-dry-run.log
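To fetch the evidence tree without pulling the multi-GiB weights, one option (assuming a recent huggingface_hub CLI with pattern filtering) is:

# Download only the per-variant evidence files from the repo.
hf download zlaabsi/Qwen3.6-27B-OTQ-GGUF --include "evidence/*" --local-dir models/Qwen3.6-27B-OTQ-GGUF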

Reproduce Release Evidence

git clone https://github.com/zlaabsi/opentq
cd opentq
uv sync
uv run python scripts/stage_qwen36_otq_gguf_repo.py
uv run python scripts/build_qwen36_release_report.py --repo artifacts/hf-gguf-canonical/Qwen3.6-27B-OTQ-GGUF

Run the same style of OTQ release evaluation:

LLAMA_CPP_DIR=/path/to/llama.cpp ./scripts/run_qwen36_otq_eval.sh

Run the long-context benchmark directly:

./build/bin/llama-bench \
  -m models/Qwen3.6-27B-OTQ-GGUF/Qwen3.6-27B-OTQ-DYN-Q3_K_M.gguf \
  -ngl 99 \
  -fa on \
  -p 8192 \
  -n 128 \
  -r 1 \
  --no-warmup