lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled
A reasoning-distilled variant of Qwen3.6-35B-A3B taught to imitate the chain-of-thought style of Kimi K2.6, the frontier reasoning model from Moonshot AI. The goal: port Kimi-grade reasoning behavior into a permissively-licensed Mixture-of-Experts model that an individual can actually run.
This is the second model in the same lineup as lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled. Same base model, same training pipeline, same Unsloth + LoRA recipe — only the teacher differs. The two are designed to be directly compared, so users can see how reasoning style transfers from different upstream teachers into the same student architecture.
Why this model
- Kimi-style reasoning, open weights. Kimi K2.6 is one of the strongest open-style reasoning models available, but only via the Moonshot API. This model has been fine-tuned on ~7.8k high-quality reasoning traces produced by Kimi K2.6, teaching the base to think before answering, with explicit <think>…</think> blocks, in Kimi's structure and cadence.
- Verbose, deliberate reasoning. Empirically, Kimi K2.6 produces ~3.4× longer reasoning chains than Claude Opus 4.7 at "max" effort (mean 2,933 tokens/row vs 849; p95 9,764 vs 2,404, measured on this dataset's tokenized output). The student model trained here inherits that verbosity. If you want long, careful, deliberate chains of thought, this is the variant of the lineup to use.
- Sparse activation, dense knowledge. The base is a 35B-parameter MoE with 256 experts, 8 routed + 1 shared, of which only about 3B parameters are active per token. You get the capacity of a 35B model at the inference cost of a small dense model. Full-quality bf16 inference runs on 2× 80GB A100, 1× H200, or any 96GB+ single GPU. Quantized variants fit smaller setups (see below).
- Long thinking supported. 64k token context. The model routinely emits 5–30k tokens of <think> reasoning on hard problems before giving the final answer, which is the whole point of reasoning models, and why this one was specifically trained end-to-end with an upstream teacher that also reasons explicitly.
- Companion to the Claude variant. Use this when you want Kimi's longer, more deliberate reasoning. Use the Claude variant when you want shorter, tighter chains. Same base, same conversational interface, fully interchangeable for serving.
Intended use
Built for hard reasoning: graduate-level STEM, competition math (AIME / MATH), code reasoning with explicit walk-through, multi-step logic puzzles, and agentic planning where explicit <think> helps correctness.
For short-turn, latency-sensitive conversational workloads, note that the thinking budget can be large (longer than the Claude variant's); cap max_new_tokens, or post-process to strip <think>…</think> blocks if you only want final answers in production (a minimal stripping sketch follows the usage example below).
How to use
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Greedy decoding with a generous budget: the model may emit thousands of <think> tokens before the final answer.
out = model.generate(inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
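If you only want the final answer in production, strip the reasoning block after generation. A minimal sketch continuing from the snippet above, assuming the <think>…</think> tags survive decoding and that the model closes its </think> tag before answering:

import re

def strip_think(text: str) -> str:
    # DOTALL so the pattern spans the often multi-thousand-token reasoning block.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

answer_only = strip_think(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
print(answer_only)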
Serving with vLLM
Recommended serving backend: vLLM. The MoE routing and KV cache benefit significantly from continuous batching.
vllm serve lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled \
--dtype bfloat16 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--trust-remote-code
The --trust-remote-code flag is required: the Qwen3.6 tokenizer ships custom code that vLLM and transformers need explicit permission to execute.
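Once up, the server exposes an OpenAI-compatible API. A minimal client sketch using the openai Python package; the port (8000) and the placeholder API key are vLLM defaults, adjust for your deployment:

from openai import OpenAI

# vLLM serves the OpenAI chat-completions protocol; "EMPTY" is a placeholder key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    max_tokens=32768,
    temperature=0.0,
)
print(resp.choices[0].message.content)  # includes the <think>…</think> block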
GGUF (LM Studio / llama.cpp / Ollama)
Quantized GGUF weights for llama.cpp, LM Studio, and Ollama are published in a sibling repo:
lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-GGUF
| Quant | Approx size | Use case | File |
|---|---|---|---|
| IQ4_XS | 18.94 GB | Smallest — fits on a single 24 GB consumer GPU; LM Studio default | Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf |
| Q5_K_M | 24.73 GB | Balanced quality / size, recommended sweet spot | Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.Q5_K_M.gguf |
| Q8_0 | 36.90 GB | Near-lossless, closest to bf16 quality | Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.Q8_0.gguf |
LM Studio
Search lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled in LM Studio's model browser. The IQ4_XS will show up as the default suggestion.
llama.cpp
huggingface-cli download lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-GGUF \
Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf \
--local-dir ./models
llama-server -m ./models/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf \
--ctx-size 65536 \
--n-gpu-layers -1 \
--jinja
The --jinja flag is recommended so llama.cpp uses the model's bundled chat template (which preserves <think> blocks correctly).
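llama-server also exposes an OpenAI-compatible endpoint, by default on port 8080. A quick smoke test in Python (endpoint path and port are llama.cpp defaults; adjust if you changed them):

import requests

# llama-server speaks the OpenAI chat-completions protocol.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
        "max_tokens": 8192,
    },
)
print(resp.json()["choices"][0]["message"]["content"])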
APEX-GGUF (community)
An APEX-GGUF variant maintained by @mudler — the canonical MoE-aware quantization recipe — may follow once the community picks it up. The Claude variant's APEX quant is the precedent in this lineup.
Training
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B (loaded via unsloth/Qwen3.6-35B-A3B for faster finetuning) |
| Teacher | Kimi K2.6 (Moonshot AI), accessed via OpenRouter |
| Training dataset | lordx64/reasoning-distill-kimi-k2-6-max-sft — reasoning traces from Kimi K2.6 reformatted into SFT conversations (ChatML + <think>…</think>) |
| Source dataset | lordx64/reasoning-distill-kimi-k2-6-max — raw teacher traces (pre-SFT formatting) |
| Dataset size | 7,836 full conversations, assistant side trained including <think>…</think> |
| Source prompts | Drawn from Delta-Vector/Tauri-Physical-Reasoning, multiple TeichAI Claude reasoning sets, and Crownelius Opus-4.6-Reasoning-2100x — same prompt distribution as the Claude variant for direct teacher-comparability |
| Method | SFT with Unsloth + TRL SFTTrainer + train_on_responses_only (loss only on assistant tokens) |
| LoRA config | r=16, alpha=16, dropout=0.0, targets=["q_proj","k_proj","v_proj","o_proj"] (attention-only) |
| Hyperparameters | lr=2e-5, cosine schedule, warmup_ratio=0.03, weight_decay=0.01, optimizer adamw_8bit |
| Batch | per_device=1, grad_accum=16, effective=16, 2 epochs = 980 steps |
| Sequence | 4,096 tokens during training (64k usable at inference — base supports it natively) |
| Precision | bf16 on 1× H200 141GB (HF Inference Endpoint, custom container) |
| Trainable | 3.44M params out of 35.1B (0.01%) |
| Wall-clock | ~21 hours on H200 |
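For reproducibility, the recipe above corresponds roughly to the following Unsloth + TRL setup. A minimal sketch, not the exact training script: the dataset column name, the ChatML instruction/response markers, and the formatting step are assumptions inferred from the table above.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.6-35B-A3B", max_seq_length=4096, dtype=None,
)
# Attention-only LoRA, matching the table: r=16, alpha=16, no dropout.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

ds = load_dataset("lordx64/reasoning-distill-kimi-k2-6-max-sft", split="train")
# Assumes a "messages" column; render each conversation to ChatML text.
ds = ds.map(lambda r: {"text": tok.apply_chat_template(r["messages"], tokenize=False)})

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="outputs",
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        weight_decay=0.01,
        optim="adamw_8bit",
        bf16=True,
    ),
)
# Mask everything except assistant turns (including <think> spans) from the loss.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()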
Training-time observations
- Loss curve: descended cleanly from ~0.95 (warmup) → ~0.83 (mid-training), gradient norms steady at ~0.005, no instability throughout 980 steps. Cosine LR decayed from peak 2e-5 to ~0 by the final step.
- FLA fast-path disabled: Unsloth's runtime check rejected the compiled causal-conv1d==1.6.1 binary on H200/cc-9.0 as ABI-incompatible, forcing the Gated DeltaNet linear-attention layers to run on the slower torch fallback. This is a known issue for this stack and added an estimated ~30–50% to per-step time. Future runs in this lineup will pin causal-conv1d to a binary-compatible version.
- Token verbosity: Kimi K2.6 traces averaged 2,933 tokens (mean) and 9,764 tokens (p95) versus 849 / 2,404 for the matched Opus 4.7 dataset, an effective ~2.5× compute multiplier for distillation runs at fixed MAX_SEQ_LENGTH. Treat this as a budgeting prior when scoping future verbose-teacher distillations; a worked estimate follows below.
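The ~2.5× effective multiplier sits below the ~3.4× raw mean ratio, presumably because truncation at the training cap compresses the longest Kimi traces. A back-of-envelope sketch; the sample lengths below are hypothetical stand-ins for the real distributions (real stats: Kimi mean 2,933 / p95 9,764; Opus mean 849 / p95 2,404):

MAX_SEQ_LENGTH = 4096

def effective_tokens(lengths, cap=MAX_SEQ_LENGTH):
    # Mean training tokens per row once sequences are truncated at the cap.
    return sum(min(n, cap) for n in lengths) / len(lengths)

# Hypothetical per-row token counts, illustrative only.
kimi = [1200, 2000, 2933, 6000, 9764]
opus = [400, 700, 849, 1500, 2404]

print(effective_tokens(kimi) / effective_tokens(opus))  # ~2.4x, below the raw ~3.45x mean ratio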
Why attention-only LoRA on a MoE
The initial plan was full LoRA including the MoE expert FFNs (gate_proj/up_proj/down_proj). In the course of the sister Claude project I filed and upstreamed a shape-mismatch fix to unsloth-zoo's MoE+LoRA grouped-mm path — unslothai/unsloth-zoo#601 — without which the expert-LoRA forward crashes on Qwen3.6's 256-expert layout. Even with that fix, single-GPU memory made expert-LoRA impractical for this run. Attention-only captures most of the signal on style distillation at a fraction of the trainable parameter count and memory footprint, and matches the recipe used for the Claude sibling so the two student runs are directly comparable.
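To sanity-check footprints like the 3.44M / 0.01% figure above, a generic PyTorch count works on any PEFT-wrapped model (minimal sketch; assumes model is the get_peft_model output from a setup like the training sketch above):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable / 1e6:.2f}M trainable / {total / 1e9:.1f}B total ({100 * trainable / total:.2f}%)")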
Evaluation
Read this first — methodology limitations
I evaluated the Kimi-distill and the base under the same pipeline (vLLM + lm-eval-harness, <think> stripping wrapper, max_gen_toks=16384, single H200, training/eval.py). The intent was a fair head-to-head. The pipeline turned out to systematically understate the base model's capability when compared against the numbers Qwen published on the base's model card. I'm reporting that gap honestly here, and removing benchmarks where my pipeline's numbers were unreliable enough to be misleading.
| Benchmark | My pipeline → base | Qwen-published → base | Gap | Why |
|---|---|---|---|---|
| MMLU-Pro overall | 6.35% | 85.2% | -78.85 pp | extractor regex fails on this model's output format; numbers are essentially unusable. Removed from head-to-head below. |
| GPQA Diamond | 79.29% | 86.0% | -6.71 pp | flexible-extract works, but my pipeline doesn't fully invoke the base's thinking mode |
| AIME 2026 | (not measured) | 92.7% | — | I ran AIME 2024 / 2025; both extractor-broken (see below) |
So treat the numbers below as "what my specific evaluation pipeline measured," not "what these models actually do at peak." For absolute capability numbers on the base, refer to the official Qwen3.6-35B-A3B card.
What I'm willing to publish
| Benchmark | Setup | Base Qwen3.6-35B-A3B | Kimi-Distill | Confidence |
|---|---|---|---|---|
| MATH-500 | 0-shot, 100 problems, math_verify (symbolic equivalence) | 53.00% | 47.00% | ✅ Most defensible — both runs through same pipeline, metric handles formatting variance |
| GPQA Diamond | 0-shot CoT, 198 problems, flex-extract | 79.29% | 75.25% | 🟡 Below Qwen-published 86.0% — pipeline appears to underweight base capability; relative gap may still be informative |
| GSM8K | 8-shot CoT, 300 examples, strict-match | 64.00% | 92.67% | 🟠 Base 64% is suspiciously low for a frontier 35B-A3B (typical thinking-mode eval: 95%+); I believe the fewshot template doesn't trigger the base's thinking mode under this pipeline. Treat the +28.67 pp delta as "my pipeline rewards always-think models" not as a capability claim. |
| AIME 2024 / 2025 | 0-shot, 30 problems | 0% / 0% | 0% / 0% | ❌ Removed. Strict-match extractor expects literal \boxed{N}; both models reason correctly and arrive at correct integers in prose form (verified via log_samples on Kimi-distill — e.g. AIME 2024-II-4 model produced "$m + n = 25 + 8 = 33$", target=33). 0% is cosmetic, not capability. Needs custom thinking-aware extractor. |
| MMLU-Pro | 5-shot CoT, custom-extract | (unreliable) | (unreliable) | ❌ Removed. My pipeline scored the base ~13× lower than Qwen's published 85.2%. Whatever delta my pipeline produces between Kimi-distill and base on this benchmark is not credible. Needs a thinking-aware extractor before re-publishing. |
What this set of numbers actually supports
The only methodologically clean head-to-head I have right now is MATH-500: base 53% vs Kimi-distill 47%, where the base wins by 6 pp. On GPQA flex-extract, the base also edges the distill (79.29% vs 75.25%) but both numbers are below the base's published peak.
Under this evidence, the honest claim is not "this distillation makes the model significantly better than the base." It's:
What I can defensibly say: The Kimi-distill reliably emits <think> blocks regardless of prompt pattern, while the base's thinking mode is conditional on the prompt format. On the one benchmark in this run that fairly compares both models with their reasoning invoked (MATH-500), the base outperforms the distill by 6 pp. My pipeline does not yet provide evidence that this distillation improves raw reasoning capability over the base.
The distillation may still be the right choice if you want predictable <think>-block reasoning under fewshot or prompt-pattern templates that don't trigger the base's thinking mode. That's a real, useful property. But the +28 / +37 / +41 pp wins I previously cited on GSM8K and MMLU-Pro Math/CS are likely artifacts of the base's thinking mode not being invoked under those prompts, not capability gains.
What I'd need to fix to publish stronger claims
- Re-run the base with proper thinking-mode prompting: a system prompt or /think tag that reliably triggers the base's <think> blocks under fewshot evals. If the base then jumps from 64% to 95%+ on GSM8K, the +28 pp Kimi-distill "win" disappears. If it doesn't, the win is real.
- Replace lm-eval's strict-match / custom-extract regexes with thinking-aware extractors for MMLU-Pro and AIME. Required before either benchmark can be reported; a minimal extractor sketch follows below.
- Larger sample sizes (full GSM8K test split, full MATH-500, full MMLU-Pro per subject) to tighten standard errors below 3 pp.
Until those three are done, the numbers above are what I'm willing to stand behind, with all caveats stated.
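For reference, a minimal sketch of what such a thinking-aware extractor could look like. The fallback heuristics (last \boxed{…}, else the last integer in the post-think text) are assumptions for illustration, not the harness's actual logic:

import re

def extract_answer(completion: str) -> str | None:
    # 1. Strip the <think>…</think> block so reasoning-internal numbers
    #    (intermediate sums, case counts) can't be mistaken for the answer.
    visible = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    # 2. Prefer an explicit \boxed{...} if present.
    boxed = re.findall(r"\\boxed\{([^{}]+)\}", visible)
    if boxed:
        return boxed[-1].strip()
    # 3. Fall back to the last integer, handling prose answers like
    #    "$m + n = 25 + 8 = 33$" from the AIME log_samples example above.
    ints = re.findall(r"-?\d+", visible)
    return ints[-1] if ints else None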
Limitations and caveats
- Inherits base limitations. Anything Qwen/Qwen3.6-35B-A3B is bad at, this model is also bad at. Distillation transfers reasoning style; it does not add factual knowledge.
- Not safety-tuned beyond the base. No additional RLHF or safety-alignment pass was performed. The model will reason out loud about anything it's asked to. Add your own guardrails before exposing it to end users.
- Long generations. As noted, Kimi-style reasoning is verbose. Plan tokens accordingly; the default max_new_tokens=32768 is recommended for hard problems, lower for shorter Q&A.
- Apache-2.0 license, matching the base. Use freely for commercial and research work; attribution appreciated but not required.
Roadmap
The next iteration in this lineup will:
- Bump training context to MAX_SEQ_LENGTH=8192 to capture far more of Kimi's reasoning-length distribution (the p95 is ~9.7k tokens; see Training-time observations). This will let the student learn from complete chains on most of the longest, hardest problems.
- Pin a binary-compatible causal-conv1d version on H200 to re-enable the FLA fast path and roughly halve per-step training time.
- Eval-driven dataset curation: once formal benchmark numbers land for both this and the Claude sibling, the next dataset will be biased toward the question categories where each teacher most outperforms the base, making each successive distillation more efficient per training token.
- Companion adapter releases: stand-alone LoRA adapter weights (separate from the merged model published here) so users can stack the Kimi reasoning style on top of other Qwen3.6-35B-A3B fine-tunes.
Citation
@misc{lordx64_qwen36_kimi_distill_2026,
author = {lordx64},
title = {Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
year = {2026},
url = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
}
Acknowledgements
- Moonshot AI for Kimi K2.6, the teacher whose reasoning style this model emulates.
- Qwen team for the strong open-weights MoE base.
- Unsloth for the fast-finetuning stack that made this run tractable.
- The wider open-weights reasoning-distillation community whose prompt sets and methodology informed the dataset construction.