# RedHatAI/gemma-4-31B-it-speculator.dflash
This is a preliminary DFlash speculator model for google/gemma-4-31B-it and is subject to change.
It was trained with the Speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the train_sft split of the HuggingFaceH4/ultrachat_200k dataset, with responses generated by the gemma-4-31B-it model (no reasoning).
This model should be used with the google/gemma-4-31b-it chat template, specifically through the /chat/completions endpoint.
Note:
- Validated on NVIDIA H100; validation on other hardware is pending.
- We are continuing to train this model and will update it with further evaluations and new weights.
## Deployment
Deploy with vLLM (main/nightly), using the speculator as the draft model.

First, install the vLLM nightly build:

```shell
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

Then run:

```shell
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```
It can also be deployed with a quantized verifier for even better speedups:

```shell
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
}'
```
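Once the server is up, requests should go through the OpenAI-compatible /v1/chat/completions route so the gemma chat template is applied server-side. A minimal sketch of a request body follows; the port, prompt, and max_tokens are illustrative assumptions, not values from this card:

```python
import json

# Illustrative request body for vLLM's OpenAI-compatible chat endpoint.
# The model name matches the serve command above; the prompt and
# max_tokens are arbitrary example values.
payload = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "messages": [
        {"role": "user", "content": "Write a haiku about speculative decoding."}
    ],
    "max_tokens": 128,
}

# POST this as JSON to http://localhost:8000/v1/chat/completions
# (assuming vLLM's default port); the chat template is applied by the
# server before the verifier and speculator run.
body = json.dumps(payload)
```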
## Preliminary Evaluations
Evaluation command:

```shell
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
    --dataset-name hf --tokenizer google/gemma-4-31B-it \
    --dataset-path "philschmid/mt-bench" --num-prompts 80 \
    --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
    --hf-output-len 2048 \
    --temperature 0 --save-result --save-detailed
```
### Per-Position Acceptance Rate
| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
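As a sanity check on the table, the reported average length is consistent with one bonus token from the verifier per step plus the sum of the per-position acceptance rates. This reading of the numbers is our interpretation, not something stated elsewhere in the card, but it reproduces the reported values:

```python
# Per-position acceptance rates from the table above.
humaneval = [0.858, 0.721, 0.603, 0.504, 0.418, 0.343, 0.269, 0.196]
math_reasoning = [0.887, 0.761, 0.648, 0.549, 0.455, 0.365, 0.288, 0.215]

# Expected accepted length per verification step:
# 1 bonus token emitted by the verifier, plus the expected number of
# accepted draft tokens, which is the sum of the per-position rates.
def avg_length(rates):
    return 1 + sum(rates)

print(round(avg_length(humaneval), 2))       # 4.91, as reported
print(round(avg_length(math_reasoning), 2))  # 5.17, as reported
```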