
permitt/galton-modernbertic-large


ModernBERTić - the first modern long-context encoder for BCMS

ModernBERTić-large

The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 395M parameters, native 8192-token context, FlashAttention 2.

State of the art on SuperGLUE-SR. Live leaderboard: balkanbench.com/leaderboard.

Looking for the smaller variant? See permitt/galton-modernbertic-base (149M params).

TL;DR

Architecture        ModernBERT-large (28 layers, 1024 hidden, 16 heads)
Parameters          395M
Context length      8192 tokens (RoPE base 160K)
Attention           Sliding window 256 + global every 2nd layer, FlashAttention 2
Tokenizer           BPE, 50,304 vocab, Latin-only, cased
Pretraining tokens  66B BCMS tokens, 22 sources
Compute             64× A100-64GB on Leonardo HPC, ~10h wall clock
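
To sanity-check the table against the shipped checkpoint, the config can be inspected directly (a minimal sketch; attribute names assume the standard ModernBERT config schema in transformers):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("permitt/galton-modernbertic-large")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 28, 1024, 16
print(cfg.max_position_embeddings)                                      # 8192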

Why this model exists

The de facto encoder for BCMS has been classla/bcms-bertic since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope. Insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases).

ModernBERTić ports the ModernBERT recipe to BCMS:

  • 8K native context instead of 512, via RoPE
  • FlashAttention 2 + unpadding for ~3.5× faster inference on identical hardware
  • Alternating attention (sliding window 256 + full attention every 2nd layer) for O(n) cost in the sliding-window layers (see the cost sketch after this list)
  • Latin-only, BCMS-native tokenizer with ~31% fewer tokens per character than mmBERT's multilingual SentencePiece
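
To make the attention savings concrete, a back-of-the-envelope sketch (illustrative arithmetic only, not a profiled benchmark):

# Per-layer attention cost scales with (query positions × attended positions).
n, window, layers = 8192, 256, 28
global_layers = layers // 2              # full attention every 2nd layer
local_layers = layers - global_layers    # sliding-window layers
full_cost = layers * n * n               # all-global baseline: n^2 pairs per layer
mixed_cost = global_layers * n * n + local_layers * n * window
print(f"mixed/full attention cost at 8K tokens: {mixed_cost / full_cost:.2f}")  # ~0.52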

Results: SuperGLUE Serbian edition

Evaluation from BalkanBench v1.0: 5 random seeds per cell, with means reported on the website and standard deviations in the leaderboard UI.

Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: balkanbench.com/leaderboard.

Quickstart

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
 
model_id = "permitt/galton-modernbertic-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",   # falls back to sdpa if FA2 unavailable
    torch_dtype=torch.bfloat16,
).to("cuda")
 
text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
 
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # expected completion: "Podgorica" (actual top-1 may vary)
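
Equivalently, the fill-mask pipeline wraps the same tokenize-predict-decode steps:

from transformers import pipeline

fill = pipeline("fill-mask", model="permitt/galton-modernbertic-large", device=0)
print(fill("Glavni grad Crne Gore je [MASK].")[0]["token_str"])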

Long context

# 8192 tokens supported natively, no positional interpolation needed
tokenizer.model_max_length = 8192
very_long_document = "..."  # any BCMS document; input beyond 8192 tokens is truncated
inputs = tokenizer(very_long_document, return_tensors="pt", truncation=True).to("cuda")

Fine-tuning

from transformers import AutoModelForSequenceClassification
 
model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-large",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here; a sketch follows the hyperparameter table below

Recommended hyperparameters (from our SuperGLUE-SR sweeps):

Task type                          Learning rate   Batch size   Epochs
Sequence classification            2e-5 to 5e-5    16-32        3-5
Token classification (NER, POS)    3e-5            32           5-10
Long-context tasks (>512 tok)      1e-5 to 3e-5    8-16         3-5
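
As a starting point, a minimal Trainer sketch using the sequence-classification row above (the datasets are placeholders, not artifacts shipped with this model):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="modernbertic-seqcls",
    learning_rate=2e-5,               # sequence-classification range above
    per_device_train_batch_size=16,
    num_train_epochs=3,
    bf16=True,                        # matches pretraining precision
)
trainer = Trainer(
    model=model,                      # from the fine-tuning snippet above
    args=args,
    train_dataset=train_dataset,      # placeholder: your tokenized train split
    eval_dataset=eval_dataset,        # placeholder: your tokenized eval split
)
trainer.train()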

Tokenizer

Tokenizer       Vocab      Tokens/character   OOV rate
ModernBERTić    50,304     0.229              0.000%
BERTić          32,000     0.242              0.006%
XLM-R-BERTić    250,002    0.274              0.008%
mmBERT          256,000    0.334              0.000%

Measured on 55.8M characters of held-out BCMS text.
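
The tokens-per-character figure is straightforward to reproduce on any sample (a sketch; the 55.8M-character held-out corpus is not distributed with the model):

# Fewer tokens per character means more of the 8192-token budget goes to content.
sample = "Zagreb je glavni grad Hrvatske, a Beograd glavni grad Srbije."
ids = tokenizer(sample, add_special_tokens=False).input_ids
print(len(ids) / len(sample))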

The vocabulary is Latin-only; Cyrillic input should be transliterated upstream. Cased input is preferred: uncasing reduces tokenizer efficiency by ~14%.
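
One way to handle that transliteration step upstream (a sketch assuming the third-party cyrtranslit package, which is not a dependency of this model):

import cyrtranslit

# Serbian Cyrillic → Latin before tokenization; raw Cyrillic would otherwise
# fall back to byte-level encoding (see Limitations below).
latin = cyrtranslit.to_latin("Београд је главни град Србије.", "sr")
print(latin)  # Beograd je glavni grad Srbije.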

Pretraining

  • Corpus: 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), MinHash LSH cross-source deduplication at 0.8 Jaccard threshold.
  • Objective: Masked Language Modeling, 30% masking ratio (a collator sketch follows this list).
  • Optimizer: AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase.
  • Batch: 4096 sequences global, kept constant across GPU counts (strong scaling).
  • Precision: bfloat16.
  • Framework: MosaicML Composer + FlexBERT. MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit.
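
The masking objective corresponds to the standard Hugging Face masked-LM collator at a 30% ratio (a sketch of the objective only; the actual pretraining ran on Composer + FlexBERT, not this collator):

from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,     # tokenizer from the Quickstart above
    mlm=True,
    mlm_probability=0.3,     # 30% masking, vs. the original BERT's 15%
)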

Intended uses and limitations

Intended uses. Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon).

Out of scope. This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.

Limitations.

  • Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
  • Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Heavy code, conversational chat, or highly technical scientific text are underrepresented.
  • Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.

Production note

ModernBERTić powers production features at Recrewty, an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back.

Citation

@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-large},
  note   = {Recrewty, EU-funded grant}
}

Acknowledgments

This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

Standing on the shoulders of:

  • Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
  • The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
  • MosaicML / Databricks for Composer and the MDS streaming format.
  • HuggingFace for the model hub, datasets, and tokenizers library.
  • JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
  • EuroHPC and the Leonardo consortium for compute access.

