Granite-Speech-4.1-2B-Plus

Model Summary

Granite-Speech-4.1-2B-Plus has similar capabilities to the Granite-Speech-4.1-2B model. The plus model adds two new community-requested rich transcription features that can be activated with a simple prompt change: speaker-attributed ASR (speaker labels and word transcripts) and word-level timing information. Unlike the base model, the plus model does not provide punctuation and capitalization.

The model was trained on corpora similar to the Granite-Speech-4.1-2B model which were augmented with speaker turns and word-level timestamp tags. This allows the model to provide different modes of functionality controlled by different prompts.

Two additional model variants explore different capabilities and inference optimization:

  • Granite-Speech-4.1-2B for applications where accuracy is the primary concern, with support for punctuated, capitalized transcripts, AST, keyword-biased recognition, and Japanese.
  • Granite-Speech-4.1-2B-NAR, which introduces a novel non-autoregressive architecture for higher throughput.

ASR only mode

In this mode, the model generates only the text transcript, similar to the Granite-Speech-4.1-2B model.

Speaker attributed ASR (SAA)

In this mode, the model adds speaker tags in the format of [Speaker N]: where $N$ is the speaker number, before each speaker turn. The speakers are numbered by their order of appearance so the first speaker will always be marked with [Speaker 1]: and the second with [Speaker 2]:, etc. For example: "[Speaker 1]: Hello how are you [Speaker 2]: I'm fine and how are you feeling [Speaker 1]: I feel wonderful".

See Resources for more information about SAA.
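For downstream use, the tagged output can be split back into structured speaker turns. The following is a minimal sketch assuming the output strictly follows the [Speaker N]: format described above; the saa_text value is the illustrative example from this section, not real model output.

import re

# Illustrative SAA output following the format described above.
saa_text = ("[Speaker 1]: Hello how are you "
            "[Speaker 2]: I'm fine and how are you feeling "
            "[Speaker 1]: I feel wonderful")

# Capture each speaker number together with the text up to the next speaker tag.
turns = re.findall(r"\[Speaker (\d+)\]:\s*(.*?)(?=\[Speaker \d+\]:|$)", saa_text, flags=re.S)
for speaker, text in turns:
    print(f"Speaker {speaker}: {text.strip()}")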

Word-level timestamps

In this mode, the model adds a timestamp tag after each word indicating the end of the word in the audio. Silences are transcribed as _ and their end is also indicated by a timestamp tag. The format of the tag is [T:N], where $N$ is an integer indicating the time in centiseconds (1/100th of a second). To reduce the number of generated tokens, only the last three digits of $N$ are provided, which causes a rollover every 10 seconds.

The conversion from time $t$ in seconds to a timestamp value is $N = \mathrm{round}(100t) \bmod 1000$. To convert back to seconds, use $t = N/100 + 10R$, where $R$ is the rollover counter. See the code in the Usage section below for an example implementation in Python.
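As a quick illustration of just the formula (hypothetical helper names; the full decoding loop is shown in the Usage section), a word ending at $t = 12.34$ s is emitted as [T:234]:

def time_to_tag(t: float) -> int:
    # N = round(100 * t) mod 1000: centiseconds, rolling over every 10 seconds
    return round(t * 100) % 1000

def tag_to_time(n: int, rollover: int) -> float:
    # t = N / 100 + 10 * R, where R counts how many 10-second rollovers occurred
    return n / 100 + 10 * rollover

print(time_to_tag(12.34))                     # 234
print(f"{tag_to_time(234, rollover=1):.2f}")  # 12.34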

See Resources for more information about timestamps.

Incremental decoding

There are cases where we want to transcribe a new audio segment together with previous segments that we have already transcribed. This can be useful for providing longer context to the model in order to improve transcription accuracy, or to maintain the speaker numbering in SAA mode. To avoid re-decoding the previous segments, we can provide the previous transcription in the prefix_text field of the conversation template; the model will then decode only the content that follows the prefix. See the code below for examples.

Keyword list biasing (KWB)

Keyword list biasing capability is available to enhance the recognition of keywords, such as names and technical terms. This is particularly useful in tasks where complex terms may otherwise be misrecognized. Keyword biasing can be applied by including the keywords directly in the prompt; for example, in ASR mode: Can you transcribe the speech into a written format? Keywords: …

Users may provide either a single keyword or a list of keywords, which may also include terms that do not appear in the input audio, making the feature well suited for batch processing or recurring domain-specific use cases.
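As a minimal sketch (the keywords below are hypothetical placeholders, and a comma-separated list is assumed as the separator), a keyword-biased prompt can be built by appending the keyword list to the ASR prompt and then passed as the prompt argument of the transcribe() helper shown in the Usage section below:

# Hypothetical domain terms to bias recognition towards; replace with your own.
keywords = ["Granite", "torchcodec", "diarization"]

# Append the keyword list to the standard ASR prompt, as described above.
kwb_prompt = ("<|audio|> can you transcribe the speech into a written format? "
              "Keywords: " + ", ".join(keywords))
print(kwb_prompt)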

See Resources for more information about keyword list biasing.

Evaluations

Our evaluations showed that this model works well with audio segments up to 9 minutes long for ASR and SAA, and up to 5 minutes for timestamps.

ASR

Performance on HuggingFace Open ASR leaderboard:

| Model | Average WER | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | Tedlium | Voxpopuli |
|---|---|---|---|---|---|---|---|---|---|
| ibm-granite/granite-speech-4.1-2b-plus | 5.71 | 8.63 | 8.68 | 10.38 | 1.44 | 3.06 | 3.72 | 3.89 | 5.9 |
| ibm-granite/granite-speech-4.1-2b | 5.33 | 8.09 | 8.37 | 9.8 | 1.33 | 2.5 | 3.78 | 3.07 | 5.7 |
| ibm-granite/granite-speech-4.1-2b-nar | 5.44 | 8.03 | 8.44 | 10.16 | 1.28 | 2.77 | 3.33 | 3.62 | 5.86 |

(Using speculative decoding)

Keyword list biasing accuracy - Keyword F1 score (%, ↑ higher is better):

| Mode | Gigaspeech | LS-C | LS-O | SPGISpeech | VOX | TED_LIUM | Earnings22 | CV-en | CV-de | CV-es | CV-fr | CV-pt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Without KWB | 74.2 | 89.1 | 78.2 | 80.8 | 93.9 | 87.9 | 68.8 | 74.6 | 78.5 | 83.1 | 74.5 | 90.0 |
| With KWB | 84.1 | 96.1 | 93.0 | 92.5 | 96.3 | 94.9 | 81.5 | 91.5 | 92.9 | 93.9 | 90.6 | 95.0 |

Speaker Attributed ASR

Speaker Attributed ASR performance - WDER (%, ↓ lower is better):

| Model | FISHER | CALLHOME English | AMI-SDM | GALE |
|---|---|---|---|---|
| VibeVoice ASR [1] | 2.8 | 7.1 | 27.4 | 44.8 |
| Granite-speech-4.1-2b-plus | 0.9 | 2.2 | 14.6 | 30.2 |

The results are averaged over 2-5 minute speech segments.

(Evaluation metric: Word Diarization Error Rate [WDER], the percentage of words attributed to the wrong speaker.)

Timestamps

Word-level timestamp accuracy - AAS (ms, ↓ lower is better):

| Model | AMI-I | AMI-S | LS-C | LS-O | VOX | CV | MLS | TMT | En Avg | MLS-fr | MLS-es | MLS-de | MLS-pt | CV-fr | CV-es | CV-de | CV-pt | ML Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-FA [2] | 48.1 | 82.5 | 27.8 | 29.3 | 41.0 | 48.4 | 34.3 | 29.9 | 42.7 | 38.1 | 27.0 | 31.2 | 26.3 | 30.3 | 40.0 | 29.4 | 34.2 | 33.3 |
| CrisperWhisper [3] | 55.7 | 64.3 | 35.9 | 40.1 | 47.2 | 97.4 | 46.4 | 42.7 | 53.7 | 35.6 | 28.0 | 31.2 | 36.8 | 62.9 | 58.9 | 60.9 | 83.8 | 50.1 |
| Canary-v2 [4] | 127.8 | 129.7 | 92.5 | 89.2 | 109.9 | 110.3 | 94.3 | 86.1 | 105.0 | 85.0 | 81.1 | 80.2 | 86.8 | 88.5 | 91.5 | – | – | – |
| WhisperX [5] | 107.1 | 150.2 | 71.7 | 72.0 | 78.8 | 91.2 | 79.2 | 63.6 | 89.2 | 117.3 | 84.7 | 132.2 | 75.0 | 104.2 | 88.1 | 126.8 | 79.5 | 101.0 |
| Granite-speech-4.1-2b-plus | 43.4 | 69.0 | 11.4 | 14.6 | 80.2 | 43.3 | 24.3 | 24.5 | 38.8 | 45.4 | 23.0 | 41.3 | 47.1 | 18.6 | 19.3 | 19.5 | 24.2 | 29.8 |

(Evaluation metric: Accumulated Average Shift [AAS], the average absolute shift of each word timestamp relative to the reference, in milliseconds.)

Release Date

April 28, 2026

License

Apache 2.0

Supported Languages

English, French, German, Spanish, Portuguese

Intended Use

The model is intended to be used in enterprise applications that involve processing of speech input especially when a rich transcript adding speaker turns and time stamps is desired. In particular, the model is well-suited for English, French, German, Spanish, and Portuguese speech-to-text.

Usage

The Granite Speech model is supported natively in transformers>=5.8. Below is a simple example of how to use the different modes of the model.

Usage with transformers

First install PyTorch.

Then install transformers. The code for the granite-speech-plus model was added recently, so you might need to install it from source until the PyPI package is updated.
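A common way to install from source is directly from the GitHub repository, for example:

pip install git+https://github.com/huggingface/transformers.git

The examples below also use a few additional packages: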

pip install torchaudio datasets accelerate torchcodec

Setup — load the model and a test audio clip:

import re
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

Load the model and define a general function for decoding the audio:

MODEL_NAME = "ibm-granite/granite-speech-4.1-2b-plus"

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, device_map=device, dtype=torch.bfloat16)
model.eval()

SYSTEM_PROMPT = "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant"

@torch.inference_mode()
def transcribe(audio, prompt, max_new_tokens=2000, prefix_text=None):
    chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
    extra = {"prefix_text": prefix_text} if prefix_text is not None else {}
    prompt_text = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True, **extra)
    inputs = processor(prompt_text, audio, device=device, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
    new_tokens = outputs[0, inputs["input_ids"].shape[-1]:]
    output_text = tokenizer.decode(new_tokens, add_special_tokens=False, skip_special_tokens=True)
    return output_text

Load some example audio data from the AMI dataset:

SAMPLE_RATE = 16000

ds = load_dataset("diarizers-community/ami", "ihm", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=SAMPLE_RATE, num_channels=1))

TEST_SAMPLE = 0
START_TIME, END_TIME = 5 * 60, 6 * 60
audio = ds["audio"][TEST_SAMPLE].get_samples_played_in_range(START_TIME, END_TIME)

Task 1: ASR — plain speech-to-text transcription:

ASR_PROMPT = "<|audio|> can you transcribe the speech into a written format?"

asr_text = transcribe(audio.data, ASR_PROMPT)
print(asr_text)

Task 2: Speaker Attributed ASR — transcription with speaker labels:

SAA_PROMPT = "<|audio|> Speaker attribution: Transcribe and denote who is speaking by adding [Speaker 1]: and [Speaker 2]: tags before speaker turns."

saa_text = transcribe(audio.data, SAA_PROMPT)
for segment in re.split(r"(\[Speaker \d+\]:)", saa_text):
    print(segment.strip())

Task 3: Word-level timestamps — transcription with per-word timing:

The timestamps are given in centiseconds, modulo 1000 (i.e., 10 seconds), so we need to unwrap them by adding multiples of 10 seconds.

TS_PROMPT = "<|audio|> Timestamps: Transcribe the speech. After each word, add a timestamp tag showing the end time in centiseconds, e.g. hello [T:45] world [T:82]"

ts_text = transcribe(audio.data, TS_PROMPT, max_new_tokens=10000)
ts_words = re.split(r"\[T:(\d+)\]", ts_text)
last_word_end_time = 0
offset_time = 0
for word, ts in zip(ts_words[::2], ts_words[1::2]):
    word_end_time = float(ts) / 100
    while word_end_time + offset_time < last_word_end_time:
        offset_time += 10
    last_word_end_time = word_end_time + offset_time
    print(f"{word}\t{last_word_end_time:.2f}s")

Task 4: Incremental decoding — transcribe segments while accumulating audio context:

NUM_SEGMENTS = 3
previous_transcript = ""
all_audio = None

for k in range(NUM_SEGMENTS):
    t1 = START_TIME + (END_TIME - START_TIME) * k / NUM_SEGMENTS
    t2 = START_TIME + (END_TIME - START_TIME) * (k + 1) / NUM_SEGMENTS
    new_audio = ds["audio"][TEST_SAMPLE].get_samples_played_in_range(t1, t2)
    all_audio = new_audio.data if all_audio is None else torch.cat([all_audio, new_audio.data], dim=-1)
    saa_text = transcribe(all_audio, SAA_PROMPT, prefix_text=previous_transcript)
    print(f"{t1:06.2f}-{t2:06.2f}:\t{saa_text}")
    previous_transcript = (previous_transcript + " " + saa_text).strip()

Model Architecture

The model shares the same architecture as the Granite-Speech-4.1-2B model.

Training Data

The model was trained on the same datasets as Granite-Speech-4.1-2B.

Additional training data for SAA was created using audio segments from datasets that include speaker identification (e.g., Multilingual-Librispeech). Segments with alternating speakers were concatenated to create long multi-speaker samples.
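A minimal sketch of how such a sample could be assembled (hypothetical field names; this illustrates the described procedure, not the actual data pipeline): concatenate audio segments from alternating speakers and build the matching speaker-tagged target text.

import torch

def build_saa_sample(segments):
    # segments: list of (speaker_id, waveform, transcript) tuples, ordered so that
    # consecutive segments come from different speakers.
    speaker_numbers = {}  # original speaker id -> [Speaker N] number
    audio_parts, text_parts = [], []
    for speaker_id, waveform, transcript in segments:
        # Number speakers by order of first appearance, matching the SAA output format.
        n = speaker_numbers.setdefault(speaker_id, len(speaker_numbers) + 1)
        audio_parts.append(waveform)
        text_parts.append(f"[Speaker {n}]: {transcript}")
    return torch.cat(audio_parts, dim=-1), " ".join(text_parts)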

Training Data for Timestamps

Word-level timestamping capabilities are achieved by using a combination of publicly available speech corpora: LibriSpeech, MLS (en, fr, de, pt, es), CommonVoice (en, fr, de, pt, es), VoxPopuli (en, fr, de, es), AMI-IHM, Switchboard, TIMIT and YODAS. For AMI-IHM, Switchboard and TIMIT, we use the available timestamp annotations. For all other datasets, we obtain word-level alignments using the Montreal Forced Aligner (MFA), a GMM-HMM based forced alignment tool. We also use MFA to insert silence boundaries into the manually annotated datasets.

To ensure high-quality training data, we validate the MFA-derived alignments using forced alignments with our CTC-based speech encoder. We compute the Accumulated Average Shift (AAS), the mean absolute error between timestamps in milliseconds, for the CTC and MFA alignments and retain only samples with the lowest alignment error: the top 95% for English and top 70% for non-English data. For the larger datasets (YODAS and MLS-en), we cap the training data at 4M and 5M samples, respectively.
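A minimal sketch of this AAS-based filtering, assuming each sample carries per-word timestamps (in seconds) from both the CTC and MFA alignments; the field names and helper are illustrative only:

import numpy as np

def aas_ms(ctc_times, mfa_times):
    # Mean absolute difference between two alignments of the same words, in milliseconds.
    return float(np.mean(np.abs(np.asarray(ctc_times) - np.asarray(mfa_times))) * 1000.0)

def filter_by_aas(samples, keep_fraction):
    # Keep the fraction of samples with the lowest CTC-vs-MFA alignment error,
    # e.g. 0.95 for English and 0.70 for non-English data.
    scored = sorted(samples, key=lambda s: aas_ms(s["ctc_times"], s["mfa_times"]))
    return scored[: int(len(scored) * keep_fraction)]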

Additional training data containing long audio samples with timestamps was generated by concatenating short segments.

The model was trained on audio samples up to 10 minutes for ASR and SAA, and up to 5 minutes for timestamps.

Infrastructure

We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. The training of this particular model was completed in about 5 days on 32 H100 GPUs.

Ethical Considerations and Limitations

The use of Large Speech and Language Models can trigger certain risks and ethical considerations. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models exhibit increased susceptibility to hallucination in generation scenarios due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.

IBM recommends using this model for automatic speech recognition and translation tasks. The model's design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply ignores it and performs transcription, which is the default fallback mode. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.

To enhance safety, we recommend using Granite-Speech-4.1-2B-Plus alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas.

Resources

References

[1] VibeVoice-ASR (Transformers-compatible version). Available online: https://huggingface.co/microsoft/VibeVoice-ASR-HF.

[2] X. Shi et al., "Qwen3-ASR technical report," 2026. arXiv

[3] M. Zusag, L. Wagner, and B. Thallinger, "CrisperWhisper: Accurate timestamps on verbatim speech transcriptions," in Proc. Interspeech, 2024.

[4] M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, "Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and high-performance models for multilingual ASR and AST," 2025. arXiv

[5] M. Bain, J. Huh, T. Han, and A. Zisserman, "WhisperX: Time-accurate speech transcription of long-form audio," 2023. arXiv
