CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF

⚔ Gemma 4 31B IT NVFP4 Turbo GGUF

Requires ggml-org/llama.cpp#21971

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (šŸŽ‰).

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
  2. Updated the architecture to Gemma4ForCausalLM and the quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, and all norms are preserved, so all of the nvidia/Gemma-4-31B-IT-NVFP4 optimizations are retained.
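Step 1 above can be sketched in a few lines. This is a minimal NumPy sketch, not the actual conversion script: it keeps the per-group scale in full precision, whereas the real NVFP4 format stores E4M3 (FP8) group scales plus a per-tensor scale.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_fp4(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Round-to-nearest FP4 quantization with per-group scaling.

    Simplified sketch: each group of `group_size` weights shares one
    scale chosen so the group's max magnitude maps to 6.0 (FP4 max).
    Assumes weights.size is divisible by group_size. Returns the
    dequantized weights so the error can be inspected directly.
    """
    flat = weights.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                        # all-zero group: avoid /0
    scaled = flat / scale
    # Round each magnitude to the nearest FP4 grid point, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quant = np.sign(scaled) * FP4_GRID[idx]
    return (quant * scale).reshape(weights.shape)

w = np.random.randn(4, 32).astype(np.float32)
wq = rtn_fp4(w)
print(np.abs(w - wq).max())   # quantization error stays bounded per group
```

No calibration data appears anywhere, which is the point of RTN: the result depends only on the weights themselves and is fully reproducible.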

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
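The second bullet can be checked numerically. The sketch below uses a synthetic standard-normal weight distribution as a stand-in for real attention weights, scaled as in per-group NVFP4 quantization:

```python
import numpy as np

# E2M1 (FP4) positive grid: step sizes double each binade, so the
# resolution is finest near zero (0.5 up to 2.0, then 1.0, then 2.0).
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
print(np.diff(grid))   # steps: 0.5 x4, then 1.0 x2, then 2.0

# Zero-mean, roughly normal weights, scaled so the max magnitude
# maps to 6.0 (the FP4 maximum), as per-group scaling does.
rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)
scaled = w / np.abs(w).max() * 6.0
fine = np.mean(np.abs(scaled) <= 2.0)
print(f"{fine:.1%} of weights fall in the finest-resolution region")
```

Because normally distributed weights concentrate near zero while the scale is set by the group maximum, the bulk of the values land where the FP4 step is 0.5, not where it widens to 1.0 or 2.0.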

License

Apache 2.0 — same as the base model.
