
🔥 Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled

A reasoning SFT fine-tune of Qwen/Qwen3.6-35B-A3B, trained on chain-of-thought (CoT) distillation data sourced mostly from Claude Opus 4.6. The goal is to preserve Qwen3.6's strong agentic coding and reasoning base while nudging the model toward structured, Claude Opus-style reasoning traces and more stable long-form problem solving.

The training path is text-only. The Qwen3.6 base architecture includes a vision encoder, but this fine-tuning run did not train on image or video examples.

This fine-tuning run is inspired by Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, including its notebook/training workflow and its Claude Opus reasoning-distillation direction.


Benchmark Results

The MMLU-Pro pass used 70 total questions per model: --limit 5 across 14 MMLU-Pro subjects. Treat this as a smoke/comparative check, not a release-quality full benchmark.

| Benchmark | Harness | Samples per model | Setting | Metric | Base model | Fine-tuned merged model | Delta |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro overall | lm-evaluation-harness | 70 | --limit 5 across 14 subjects | exact_match, custom-extract | 42.86% | 75.71% | +32.85 pp |

Base model: Qwen/Qwen3.6-35B-A3B. Fine-tuned model: hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.
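The headline numbers in the table compose directly; as a quick sanity check (plain arithmetic, no evaluation harness needed):

```python
# MMLU-Pro smoke-check arithmetic from the table above.
SUBJECTS = 14           # MMLU-Pro subjects evaluated
LIMIT = 5               # --limit 5 questions per subject
total_questions = SUBJECTS * LIMIT
print(total_questions)  # 70 questions per model

base_acc = 42.86        # base model exact_match (%)
tuned_acc = 75.71       # fine-tuned merged model exact_match (%)
delta_pp = round(tuned_acc - base_acc, 2)
print(delta_pp)         # +32.85 percentage points
```

With only 5 questions per subject, per-subject scores are extremely noisy, which is why the table should be read as a comparative smoke check rather than a final number.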

> [!WARNING]
> Community benchmarks welcome

To better understand this fine-tuned model's capabilities, I welcome independent benchmark results. If you run evaluations, please include the benchmark name, harness/script, sample count, decoding settings, and raw logs or result files when possible.

Share results by opening a PR/discussion or DMing @hesamation on X.

Base Qwen3.6 Highlights

The base Qwen3.6 release delivers substantial upgrades, particularly in:

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
  • Thinking Preservation: Qwen introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For detailed base-model benchmarks, please refer to the Qwen blog post for Qwen3.6-35B-A3B.

Base Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Training Stage: Pre-training & Post-training
  • Language Model:
    • Number of Parameters: 35B in total and 3B activated
    • Hidden Dimension: 2048
    • Token Embedding: 248320 (Padded)
    • Number of Layers: 40
    • Hidden Layout: 10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))
    • Gated DeltaNet:
      • Number of Linear Attention Heads: 32 for V and 16 for QK
      • Head Dimension: 128
    • Gated Attention:
      • Number of Attention Heads: 16 for Q and 2 for KV
      • Head Dimension: 256
      • Rotary Position Embedding Dimension: 64
    • Mixture Of Experts:
      • Number of Experts: 256
      • Number of Activated Experts: 8 Routed + 1 Shared
      • Expert Intermediate Dimension: 512
    • LM Output: 248320 (Padded)
    • MTP (Multi-Token Prediction): trained with multiple steps
  • Context Length: 262,144 natively and extensible up to 1,010,000 tokens.
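The hybrid layer layout above can be sanity-checked against the stated layer count. A small illustrative sketch (the variable names are mine, not from the model code):

```python
# Layer-count arithmetic for the layout
# "10 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE))".
BLOCKS = 10                     # repeated macro-blocks
DELTANET_PER_BLOCK = 3          # Gated DeltaNet (linear attention) layers
ATTENTION_PER_BLOCK = 1         # Gated Attention (full attention) layers

deltanet_layers = BLOCKS * DELTANET_PER_BLOCK    # 30 linear-attention layers
attention_layers = BLOCKS * ATTENTION_PER_BLOCK  # 10 full-attention layers
total_layers = deltanet_layers + attention_layers

assert total_layers == 40       # matches "Number of Layers: 40"
print(deltanet_layers, attention_layers, total_layers)  # 30 10 40
```

So three out of every four layers use linear attention, with every layer followed by a MoE block that activates 8 routed experts plus 1 shared expert.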

Base Benchmark Results

The following table is from the upstream Qwen3.6-35B-A3B release context and is included for base-model reference. It is not a benchmark of this fine-tuned checkpoint unless explicitly stated in the fine-tune benchmark table above.

| Category | Benchmark | Qwen3.5-27B | Gemma4-31B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Qwen3.6-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| Coding Agent | SWE-bench Verified | 75.0 | 52.0 | 70.0 | 17.4 | 73.4 |
| Coding Agent | SWE-bench Multilingual | 69.3 | 51.7 | 60.3 | 17.3 | 67.2 |
| Coding Agent | SWE-bench Pro | 51.2 | 35.7 | 44.6 | 13.8 | 49.5 |
| Coding Agent | Terminal-Bench 2.0 | 41.6 | 42.9 | 40.5 | 34.2 | 51.5 |
| Coding Agent | Claw-Eval Avg | 64.3 | 48.5 | 65.4 | 58.8 | 68.7 |
| Coding Agent | Claw-Eval Pass^3 | 46.2 | 25.0 | 51.0 | 28.0 | 50.0 |
| Coding Agent | SkillsBench Avg5 | 27.2 | 23.6 | 4.4 | 12.3 | 28.7 |
| Coding Agent | QwenClawBench | 52.2 | 41.7 | 47.7 | 38.7 | 52.6 |
| Coding Agent | NL2Repo | 27.3 | 15.5 | 20.5 | 11.6 | 29.4 |
| Coding Agent | QwenWebBench | 1068 | 1197 | 978 | 1178 | 1397 |
| General Agent | TAU3-Bench | 68.4 | 67.5 | 68.9 | 59.0 | 67.2 |
| General Agent | VITA-Bench | 41.8 | 43.0 | 29.1 | 36.9 | 35.6 |
| General Agent | DeepPlanning | 22.6 | 24.0 | 22.8 | 16.2 | 25.9 |
| General Agent | Tool Decathlon | 31.5 | 21.2 | 28.7 | 12.0 | 26.9 |
| General Agent | MCPMark | 36.3 | 18.1 | 27.0 | 14.2 | 37.0 |
| General Agent | MCP-Atlas | 68.4 | 57.2 | 62.4 | 50.0 | 62.8 |
| General Agent | WideSearch | 66.4 | 35.2 | 59.1 | 38.3 | 60.1 |
| Knowledge | MMLU-Pro | 86.1 | 85.2 | 85.3 | 82.6 | 85.2 |
| Knowledge | MMLU-Redux | 93.2 | 93.7 | 93.3 | 92.7 | 93.3 |
| Knowledge | SuperGPQA | 65.6 | 65.7 | 63.4 | 61.4 | 64.7 |
| Knowledge | C-Eval | 90.5 | 82.6 | 90.2 | 82.5 | 90.0 |
| STEM & Reasoning | GPQA | 85.5 | 84.3 | 84.2 | 82.3 | 86.0 |
| STEM & Reasoning | HLE | 24.3 | 19.5 | 22.4 | 8.7 | 21.4 |
| STEM & Reasoning | LiveCodeBench v6 | 80.7 | 80.0 | 74.6 | 77.1 | 80.4 |
| STEM & Reasoning | HMMT Feb 25 | 92.0 | 88.7 | 89.0 | 91.7 | 90.7 |
| STEM & Reasoning | HMMT Nov 25 | 89.8 | 87.5 | 89.2 | 87.5 | 89.1 |
| STEM & Reasoning | HMMT Feb 26 | 84.3 | 77.2 | 78.7 | 79.0 | 83.6 |
| STEM & Reasoning | IMOAnswerBench | 79.9 | 74.5 | 76.8 | 74.3 | 78.9 |
| STEM & Reasoning | AIME26 | 92.6 | 89.2 | 91.0 | 88.3 | 92.7 |

Notes from the upstream Qwen3.6 release:

  • SWE-Bench Series: internal agent scaffold with bash and file-edit tools; temp=1.0, top_p=0.95, 200K context window.
  • Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; average of 5 runs.
  • SkillsBench: evaluated via OpenCode on 78 tasks, using a self-contained subset excluding API-dependent tasks; average of 5 runs.
  • NL2Repo: evaluated via Claude Code for other models, with temp=1.0, top_p=0.95, max_turns=900.
  • QwenClawBench: internal real-user-distribution Claw agent benchmark; temp=0.6, 256K ctx.
  • QwenWebBench: internal front-end code generation benchmark; bilingual EN/CN, seven categories, auto-render plus multimodal judge, BT/Elo rating system.
  • TAU3-Bench: official user model with gpt-5.2 low reasoning effort and default BM25 retrieval.
  • VITA-Bench: average subdomain scores, using claude-4-sonnet as judge.
  • MCPMark: GitHub MCP v0.30.3, Playwright responses truncated at 32K tokens.
  • MCP-Atlas: public set score, gemini-2.5-pro judge.
  • AIME 26: full AIME 2026 I and II.

Training Pipeline

Qwen/Qwen3.6-35B-A3B
  -> supervised fine-tuning with LoRA
  -> merged full model
  -> Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
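The merge step folds the LoRA update back into the base weights as W' = W + (alpha/rank) * B @ A. A minimal pure-Python sketch with toy matrices (illustrative only; real merges operate on the actual model tensors via peft/Unsloth utilities):

```python
# Minimal LoRA merge sketch: W' = W + (alpha/rank) * B @ A.
# Tiny dimensions for illustration; real weights are large tensors.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w, lora_a, lora_b, rank, alpha):
    """Fold a LoRA adapter into a base weight matrix."""
    scale = alpha / rank                  # for this run: 32 / 32 = 1.0
    delta = matmul(lora_b, lora_a)        # (out, rank) @ (rank, in)
    return [[w[i][j] + scale * delta[i][j]
             for j in range(len(w[0]))] for i in range(len(w))]

# 2x2 base weight, rank-1 adapter
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.5]]                     # A: (rank=1, in=2)
lora_b = [[1.0], [2.0]]                   # B: (out=2, rank=1)
merged = merge_lora(w, lora_a, lora_b, rank=1, alpha=1)
print(merged)  # [[1.5, 0.5], [1.0, 2.0]]
```

With rank and alpha both set to 32, the scale factor is 1.0, so the adapter's low-rank update is added to the attention weights unscaled.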

Training configuration:

| Setting | Value |
| --- | --- |
| Fine-tuning method | Supervised fine-tuning with LoRA |
| LoRA target | Attention-only modules |
| LoRA rank / alpha | 32 / 32 |
| Micro-batch size | 1 |
| Gradient accumulation | 32 |
| Epochs | 2 |
| Completed steps | 762 / 762 |
| Final reported training loss | 0.3362497625740494 |
| Dataset max tokens | 8192 |
| Max sequence length | 32768 |
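The batch-size settings compose as follows (illustrative arithmetic; the per-epoch estimate assumes every optimizer step consumed a full effective batch):

```python
# Effective batch size and sample-count arithmetic from the config table.
micro_batch = 1
grad_accum = 32
effective_batch = micro_batch * grad_accum   # examples per optimizer step
assert effective_batch == 32

steps = 762
epochs = 2
examples_seen = steps * effective_batch      # across all epochs
per_epoch = examples_seen // epochs
print(effective_batch, examples_seen, per_epoch)  # 32 24384 12192
```

The roughly 12,192 examples per epoch sits below the 14,233 samples requested across the three datasets, which is consistent with some examples being dropped or truncated under the 8,192-token dataset cap, though the card does not state the exact filtering.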

Training Data

The recipe samples and normalizes reasoning conversations from three datasets, then renders them with the qwen3-thinking chat template and response-only SFT masking.
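Response-only masking means the loss is computed only on assistant tokens: prompt positions get an ignore label (commonly -100 in PyTorch-style trainers). A minimal sketch, assuming pre-tokenized ID lists (the token values below are made up):

```python
IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch convention)

def mask_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking the prompt span so the loss
    is computed only on the response tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# e.g. 4 prompt tokens followed by 3 response tokens
ids = [101, 7592, 2088, 102, 2023, 2003, 102]
print(mask_labels(ids, prompt_len=4))
# [-100, -100, -100, -100, 2023, 2003, 102]
```

In the actual recipe the prompt span is whatever the qwen3-thinking chat template renders before the assistant turn, so the model is trained to produce the reasoning trace and answer but not to regenerate the conversation history.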

| Dataset | Requested sample count | Role |
| --- | --- | --- |
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 3,900 | Claude Opus reasoning trajectories |
| Jackrong/Qwen3.5-reasoning-700x | 700 | Curated Qwen reasoning samples |
| Roman1111111/claude-opus-4.6-10000x | 9,633 | Additional Claude Opus reasoning examples |

Intended Use

This model is intended for reasoning-heavy text workflows such as coding assistance, planning, math-style reasoning, and structured analytical responses. Because the fine-tune is text-only, image/video behavior should be treated as inherited from the base model rather than improved by this training run.

Acknowledgements

Thanks to the Qwen team for the base model, Unsloth for the training stack, and Jackrong for the public reasoning-distillation workflow that inspired this fine-tune.
