
nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF

DeepSeek-V4-Flash native FP4 / FP8 GGUF

A 1:1 conversion of deepseek-ai/DeepSeek-V4-Flash from the original safetensors into a single GGUF file that preserves the model's native low-precision weights:

  • Dense weights: FP8 E4M3 (F8_E4M3_B128, 128-element blocks with one E8M0 scale)
  • MoE expert weights: MXFP4 (FP4 E2M1, 32-element blocks with one E8M0 scale)

This file is not derived from a higher-precision intermediate; the FP4 and FP8 codes from the upstream checkpoint are written directly into the GGUF.
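
For reference, here is a minimal NumPy sketch (not the llama.cpp kernels) of how the two block formats decode back to float, assuming the layouts described above: F8_E4M3_B128 as 128 FP8 E4M3 codes plus one shared E8M0 scale, and MXFP4 as the OCP MX layout of 32 FP4 E2M1 codes plus one shared E8M0 scale.

```python
import numpy as np

def decode_e8m0(scale_byte: int) -> float:
    # E8M0 is a bare 8-bit exponent: value = 2**(e - 127); no sign, no mantissa.
    return float(2.0 ** (int(scale_byte) - 127))

def decode_fp8_e4m3(byte: int) -> float:
    # OCP FP8 E4M3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    s = -1.0 if byte & 0x80 else 1.0
    e = (byte >> 3) & 0x0F
    m = byte & 0x07
    if e == 0:                      # subnormal range
        return s * (m / 8.0) * 2.0 ** -6
    if e == 0x0F and m == 0x07:     # E4M3 encodes NaN but no infinities
        return float("nan")
    return s * (1.0 + m / 8.0) * 2.0 ** (e - 7)

# FP4 E2M1 has only 16 code points, so decoding is a table lookup.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                    dtype=np.float32)

def dequant_f8_e4m3_b128(codes, scale_byte):
    # One 128-element FP8 block -> float32, times the shared E8M0 scale.
    vals = np.array([decode_fp8_e4m3(int(b)) for b in codes], dtype=np.float32)
    return vals * decode_e8m0(scale_byte)

def dequant_mxfp4(nibbles, scale_byte):
    # One 32-element MXFP4 block: each 4-bit code indexes the E2M1 table.
    return FP4_E2M1[np.asarray(nibbles)] * np.float32(decode_e8m0(scale_byte))
```

Because both scale types are pure powers of two (E8M0), dequantization never multiplies by an arbitrary float; on GPU it reduces to an exponent adjustment plus a small table lookup.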

File

File                                    Size      Quant
DeepSeek-V4-Flash-FP4-FP8-native.gguf   ~146 GB   F8_E4M3 + MXFP4

Loading

This GGUF requires a llama.cpp build with native F8_E4M3_B128 and MXFP4 support and the DeepSeek V4 Flash architecture. Stock upstream llama.cpp cannot load this file.

Reference (WIP) build that can both produce and run this GGUF:

https://github.com/nisparks/llama.cpp/tree/wip/deepseek-v4-support

That branch adds:

  • GGML_TYPE_F8_E4M3_B128 (ggml type 42)
  • LLAMA_FTYPE_MOSTLY_F8_E4M3_MXFP4 (ftype 41, exposed as F8_E4M3_MXFP4 / moe-f8-e4m3-mxfp4)
  • CUDA dequant / MMVQ kernels for F8_E4M3_B128
  • Loader / converter / gguf-py support
  • Custom DeepSeek V4 Flash model graph

The branch is an active WIP; expect rough edges.
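
To sanity-check what is actually inside the file, the branch's gguf-py can enumerate tensor types. A minimal sketch, assuming the reference branch's gguf-py is installed (stock gguf-py will reject the unknown type id 42):

```python
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-V4-Flash-FP4-FP8-native.gguf")

# Count tensors per ggml quantization type; expect the dense weights on
# type 42 (F8_E4M3_B128) and the MoE expert weights on MXFP4.
counts = Counter(int(t.tensor_type) for t in reader.tensors)
for type_id, n in sorted(counts.items()):
    print(f"ggml type {type_id:3d}: {n} tensors")
```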

Notes

  • DeepSeek V4 Flash is a custom architecture (MoE + sliding-window attention + compressor + indexer). The runtime in the reference branch implements that graph as a custom model path.
  • To match the HF implementation's activation behavior, the runtime also applies HF's blockwise FP8 / FP4 fake activation quantization to the attention KV and the indexer Q/KV after the Hadamard rotation (a sketch of this step follows below).
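
The following is a minimal, illustrative NumPy sketch of blockwise fake activation quantization, not the HF code itself: each block of activations is scaled into the E4M3 range, snapped to the nearest representable FP8 value, and scaled back, so the tensor stays in float but carries FP8 rounding error. The block size of 128 and the E4M3 max of 448 mirror the weight format above; the exact blocking and rounding rules in the HF code may differ, and the FP4 variant would snap to the E2M1 grid instead.

```python
import numpy as np

def _e4m3_grid() -> np.ndarray:
    # All finite non-negative E4M3 magnitudes (bias 7, 3 mantissa bits).
    vals = [(m / 8.0) * 2.0 ** -6 for m in range(8)]          # subnormals
    vals += [(1.0 + m / 8.0) * 2.0 ** (e - 7)
             for e in range(1, 16) for m in range(8)
             if not (e == 15 and m == 7)]                     # skip the NaN code
    return np.array(sorted(vals), dtype=np.float32)

_GRID = _e4m3_grid()  # largest entry is 448.0

def fake_quant_fp8_blockwise(x: np.ndarray, block: int = 128) -> np.ndarray:
    # Quantize-dequantize in float; assumes x.size is a multiple of `block`.
    flat = x.astype(np.float32).reshape(-1, block)
    out = np.empty_like(flat)
    for i, b in enumerate(flat):
        amax = float(np.abs(b).max())
        scale = amax / 448.0 if amax > 0.0 else 1.0
        a = np.abs(b) / scale
        # Snap each magnitude to the nearest point on the E4M3 grid.
        idx = np.searchsorted(_GRID, a).clip(1, len(_GRID) - 1)
        lo, hi = _GRID[idx - 1], _GRID[idx]
        q = np.where(a - lo <= hi - a, lo, hi)
        out[i] = np.sign(b) * q * scale
    return out.reshape(x.shape)
```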

