prism-ml/Ternary-Bonsai-8B-gguf

Prism ML Website  |  White Paper  |  Demo & Examples  |  Discord

Ternary-Bonsai-8B-gguf

Ternary (1.58-bit) language model in GGUF Q2_0 format for llama.cpp

[Figure: Pareto frontier]

Resources

  • White Paper
  • Demo repo — examples for serving, benchmarking, and integrating Bonsai
  • Discord — community support and updates
  • Kernels: Q2_0 is not yet in mainline llama.cpp. Use our fork at PrismML-Eng/llama.cpp (the prism branch, which is the default), which adds Q2_0 support for CPU (NEON/generic) and Metal. An upstream PR is coming soon.

Model Overview

Item                Specification
Base model          Qwen3-8B
Parameters          8.19B (~6.95B non-embedding)
Architecture        GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm
Layers              36 Transformer decoder blocks
Context length      65,536 tokens
Vocab size          151,936
Weight format       GGUF Q2_0 g128: {-1, 0, +1} with FP16 group-wise scaling
Packed Q2_0 size    2.03 GiB (2.18 GB)
Ternary coverage    Embeddings, attention projections, MLP projections, LM head
License             Apache 2.0
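
These specs can be checked against the file itself. Below is a minimal sketch using the gguf Python package that ships with llama.cpp (pip install gguf); it targets the F16 file, since stock readers may not yet recognize the not-yet-upstreamed Q2_0 tensor type, and the exact metadata key names depend on the converter version:

# Sketch: inspect GGUF metadata with the `gguf` package from llama.cpp.
from gguf import GGUFReader

reader = GGUFReader("Ternary-Bonsai-8B-F16.gguf")

# Key-value metadata such as context length and vocab size; key names
# are converter-dependent (assumed architecture prefix: qwen3).
for name in reader.fields:
    print(name)

# First few tensors: name, shape, and storage type.
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)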

Quantization Format: GGUF Q2_0 (g128)

Each weight takes a value from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights:

w_i = scale_g * t_i,    t_i in {-1, 0, +1}

Q2_0 encodes each weight as a 2-bit code q in {0, 1, 2, 3}, dequantized via w = (q - 1) * scale. One 128-element block is 34 bytes (2 bytes FP16 scale + 32 bytes of packed 2-bit codes) for an effective 2.125 bits/weight. The fourth code point (q = 3, reconstructing to +2 * scale) is reserved for future extensions; for ternary weights it is unused.
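
To make the layout concrete, here is a minimal NumPy sketch of dequantizing a single block. The within-byte code order is an assumption (the fork's kernels define the authoritative layout), and the names are illustrative, not from llama.cpp:

import numpy as np

BLOCK_WEIGHTS = 128
BLOCK_BYTES = 34  # 2 bytes FP16 scale + 32 bytes of packed 2-bit codes

def dequantize_q2_0_block(block: bytes) -> np.ndarray:
    """Reconstruct 128 FP32 weights from one 34-byte Q2_0 block."""
    assert len(block) == BLOCK_BYTES
    scale = np.frombuffer(block[:2], dtype=np.float16)[0]
    packed = np.frombuffer(block[2:], dtype=np.uint8)
    # Unpack four 2-bit codes q in {0..3} per byte (assumed: low bits first).
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = ((packed[:, None] >> shifts) & 0b11).reshape(BLOCK_WEIGHTS)
    # w = (q - 1) * scale; ternary weights only use q in {0, 1, 2}.
    return (codes.astype(np.float32) - 1.0) * np.float32(scale)

Real kernels operate on the packed codes directly; the sketch only illustrates the storage format.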

Memory

Format          Size                Reduction  Ratio
FP16            16.38 GB            --         1.0x
GGUF Q2_0 g128  2.03 GiB (2.18 GB)  86.7%      7.5x
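
As a sanity check, 34 bytes per 128 weights works out to 2.125 bits/weight, so with essentially all 8.19B parameters stored as Q2_0 blocks:

8.19e9 weights * 2.125 bits / 8 bits/byte ≈ 2.18e9 bytes = 2.18 GB ≈ 2.03 GiB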

Files in this repo

File                         Format       Size      Recommended
Ternary-Bonsai-8B-F16.gguf   FP16         16.38 GB  baseline / re-quantization source
Ternary-Bonsai-8B-Q2_0.gguf  Q2_0 (g128)  2.03 GiB  recommended (lossless for ternary)

Quickstart

Build from the Prism fork

git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON   # or -DGGML_CUDA=ON, -DGGML_VULKAN=ON
cmake --build build -j

llama.cpp CLI

./build/bin/llama-cli \
  -m Ternary-Bonsai-8B-Q2_0.gguf \
  -p "Explain quantum computing in simple terms." \
  -n 256

llama.cpp server

./build/bin/llama-server -m Ternary-Bonsai-8B-Q2_0.gguf -c 4096
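
The server exposes an OpenAI-compatible HTTP API. A minimal Python query, assuming the server's default binding of http://localhost:8080 (standard library only; adjust host/port if you changed them):

import json
import urllib.request

payload = {
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])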

Throughput (llama.cpp, Apple M4 Pro 48 GB)

Backend          PP512 (tok/s)  TG128 (tok/s)
Metal (GPU)      455            76
NEON CPU (10 t)  146            32

Flags: -ngl 99 -fa 1 for Metal; -ngl 0 -fa 1 -t 10 for CPU.

Benchmarks

Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100 under identical infrastructure, generation parameters, and scoring. All models are in the 6B-9B parameter range.

Model                    Size      Avg   MMLU-R  MuSR  GSM8K  HE+   IFEval  BFCL
Qwen 3 8B                16.38 GB  79.3  83      55    93     82.3  81.5    81
Ternary Bonsai 8B        2.18 GB   75.5  72.6    56.2  91     77.4  81.8    73.9
1-bit Bonsai 8B (prior)  1.15 GB   70.5  65.7    50    88     73.8  79.8    65.7
RNJ 8B                   16.63 GB  73.1  75.5    50.4  93.7   84.2  73.8    61.1
Ministral3 8B            16.04 GB  71.0  68.9    53.8  87.9   72.6  67.4    75.4
Olmo 3 7B                14.60 GB  70.9  72      56.1  92.5   79.3  87.1    38.4

Ternary Bonsai 8B ranks 2nd among all compared models despite being roughly 1/8th the size of the FP16 models it is compared against.

Intelligence Density

density = -ln(1 - score/100) / size_GB
Model                    Size      Intelligence Density (1/GB)
Ternary Bonsai 8B        2.18 GB   0.645
1-bit Bonsai 8B (prior)  1.15 GB   1.062
Qwen 3 8B                16.38 GB  0.096
RNJ 8B                   16.62 GB  0.079
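
The log transform weights gains near the 100% ceiling more heavily than gains far from it. Plugging in Ternary Bonsai 8B's numbers as a worked example:

density = -ln(1 - 75.5/100) / 2.18 ≈ 1.41 / 2.18 ≈ 0.645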

Citation

@techreport{ternarybonsai,
    title   = {Ternary Bonsai: 1.58-bit Language Models at 8B, 4B, and 1.7B Scale},
    author  = {Prism ML},
    year    = {2026},
    month   = {April},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
