🚀 Qwen3.6-27B-GPTQ-Pro-4Bit
Welcome to Qwen3.6-27B-GPTQ-Pro-4Bit – a titan of reasoning and generation, elegantly squeezed into a remarkably efficient 4-bit package. It punches leagues above its weight class while keeping your VRAM happy and your inference speeds blazingly fast! Thank you to the Qwen team for another amazing model.
🌟 Why the "Pro"?
This isn't your average quantization. We used the GPTQ-Pro framework combined with the FOEM (First-Order Error Metric) approach. This advanced technique carefully preserves the most critical weights during the 4-bit compression process by evaluating the exact impact of quantization on the model's loss landscape.
The result?
- Near-Lossless Performance: Enjoy the profound reasoning, coding prowess, and vast knowledge of a 27-billion-parameter model, but with a drastically reduced memory footprint.
- Marlin Optimized: Ready out-of-the-box for Marlin kernels to deliver maximum token-per-second throughput in serving engines like vLLM.
- Consumer Hardware Friendly: Fit a massive 27B powerhouse on consumer GPUs with room to spare for long context lengths!
This repository contains a 4-bit GPTQ-Pro quantization of unsloth/Qwen3.6-27B, produced with GPTQModel and the FOEM/GPTAQ-style quality settings used in the GPTQ-Pro project.
Source project: https://github.com/groxaxo/GPTQ-Pro
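For reference, here is a minimal sketch of how a 4-bit quantization like this one can be produced with GPTQModel. The group size, calibration texts, and output path below are illustrative assumptions, not the exact GPTQ-Pro recipe.

```python
# Minimal GPTQModel quantization sketch.
# group_size, calibration data, and paths are assumptions, not the exact GPTQ-Pro settings.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)  # 4-bit weights, assumed group size

# Tiny placeholder calibration set; a real run uses a much larger, curated corpus.
calibration = [
    "Quantization compresses model weights while trying to preserve output quality.",
    "Tensor parallelism splits a model's layers across multiple GPUs.",
]

model = GPTQModel.load("unsloth/Qwen3.6-27B", quant_config)
model.quantize(calibration)
model.save("./Qwen3.6-27B-GPTQ-Pro-4Bit")
```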
Deployment
vLLM
```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.92
```
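Once the server is up, it exposes the standard OpenAI-compatible API. A minimal query sketch is below; the port assumes vLLM's default (8000), and the model name matches the served repo id.

```python
# Query the vLLM OpenAI-compatible endpoint (default port 8000 assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    messages=[{"role": "user", "content": "Write a short deployment checklist."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```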
Local path
```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144
```
Transformers
```python
from gptqmodel import BACKEND, GPTQModel

model = GPTQModel.load(
    "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    backend=BACKEND.GPTQ_MARLIN,
    device="cuda:0",
)

# generate() returns token ids; decode them to get text
tokens = model.generate("Write a short deployment checklist.", max_new_tokens=64)[0]
print(model.tokenizer.decode(tokens))
```
Notes
- Tested with tensor parallel size 2 on RTX 3090 GPUs.
- Use `float16` and `gptq_marlin` for the most reliable vLLM startup path.
- The quantization and serving workflow lives in the `GPTQ-Pro` repository above.
- MTP/speculative decoding is detected by vLLM for this model, but on 2× RTX 3090 the exact `--max-model-len 262144` launch OOMs during KV-cache setup.
- The working local vLLM configuration I verified is `--max-model-len 65536` with `--enforce-eager`; that starts and serves, but the current metrics showed `spec_decode_num_accepted_tokens_total=0`, so it does not improve speed yet.
- If you test MTP, use `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'` and disable thinking in the request payload when you want a plain answer (see the sketch after this list).
⚡ Speed Benchmarks
Tested on 2× NVIDIA RTX 3090 with vLLM (gptq_marlin, tensor-parallel=2, float16).
| Metric | Value |
|---|---|
| Avg Generation Speed | 64.0 tok/s |
| Median Generation Speed | 64.0 tok/s |
| Peak Generation Speed | 65.0 tok/s |
| Avg Time-to-First-Token | 54 ms |
| Median TTFT | 56 ms |
📋 Detailed Run Results
Test 1: Short Prompt → 256 Tokens (Streaming)
| Run | TTFT | Tokens | Speed | Total Time |
|---|---|---|---|---|
| 1 | 60 ms | 256 | 64.0 tok/s | 4.04s |
| 2 | 55 ms | 256 | 64.0 tok/s | 4.04s |
| 3 | 56 ms | 256 | 62.4 tok/s | 4.14s |
Test 2: Medium Prompt → 512 Tokens (Non-Streaming)
| Run | Tokens | Speed | Total Time |
|---|---|---|---|
| 1 | 512 | 62.9 tok/s | 8.15s |
| 2 | 512 | 63.0 tok/s | 8.13s |
| 3 | 512 | 62.9 tok/s | 8.14s |
Test 3: Short Burst → 64 Tokens (Streaming)
| Run | TTFT | Tokens | Speed |
|---|---|---|---|
| 1 | 50 ms | 64 | 65.0 tok/s |
| 2 | 56 ms | 64 | 64.9 tok/s |
| 3 | 56 ms | 64 | 64.7 tok/s |
| 4 | 54 ms | 64 | 64.9 tok/s |
| 5 | 48 ms | 64 | 64.9 tok/s |
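If you want to reproduce numbers like these, a rough streaming benchmark sketch against the running server is below. The prompt, port, and the one-token-per-chunk assumption are simplifications.

```python
# Rough streaming benchmark sketch (assumes a local vLLM server on port 8000).
# TTFT = time to first streamed chunk; speed = chunks / generation time.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # rough proxy: roughly one token per streamed chunk

end = time.perf_counter()
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Speed: {chunks / (end - first_token_at):.1f} tok/s over {chunks} chunks")
```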
📊 Quality Evaluation
- Wikitext-2 test perplexity: 6.366 (n_ctx=1024)
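For context, a rough sketch of how a Wikitext-2 perplexity figure at n_ctx=1024 can be reproduced is below. The exact harness behind this number is not specified, so the non-overlapping 1024-token windowing, dataset split, and transformers-based loading (which needs the GPTQ integration installed) are assumptions.

```python
# Rough Wikitext-2 perplexity sketch (n_ctx=1024, non-overlapping windows).
# Windowing, split, and loading path are assumptions, not the exact harness used.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

n_ctx, nlls, count = 1024, [], 0
for start in range(0, ids.size(1) - n_ctx, n_ctx):
    window = ids[:, start : start + n_ctx].to(model.device)
    with torch.no_grad():
        loss = model(window, labels=window).loss  # mean next-token NLL over the window
    nlls.append(loss.float() * n_ctx)
    count += n_ctx

print("perplexity:", torch.exp(torch.stack(nlls).sum() / count).item())
```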