
permitt/galton-modernbertic-large


ModernBERTić - the first modern long-context encoder for BCMS

ModernBERTić-large

The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 395M parameters, native 8192-token context, FlashAttention 2.

State of the art on SuperGLUE-SR. Live leaderboard: balkanbench.com/leaderboard.

Looking for the smaller variant? See permitt/galton-modernbertic-base (149M params).

TL;DR

Architecture        ModernBERT-large (28 layers, 1024 hidden, 16 heads)
Parameters          395M
Context length      8192 tokens (RoPE base 160K)
Attention           Sliding window 256 + global every 2nd layer, FlashAttention 2
Tokenizer           BPE, 50,304 vocab, Latin-only, cased
Pretraining tokens  66B BCMS tokens, 22 sources
Compute             64× A100-64GB on Leonardo HPC, ~10h wall clock
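
To sanity-check the table against the shipped checkpoint, the config can be inspected directly (a minimal sketch; attribute names assume the standard ModernBERT config schema in transformers):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("permitt/galton-modernbertic-large")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 28, 1024, 16
print(cfg.max_position_embeddings)                                      # 8192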

Why this model exists

The de facto encoder for BCMS has been classla/bcms-bertic since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope. Insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases).

ModernBERTić ports the ModernBERT recipe to BCMS:

  • 8K native context instead of 512, via RoPE
  • FlashAttention 2 + unpadding for ~3.5× faster inference on identical hardware
  • Alternating attention (sliding window 256 + full attention every 2nd layer) for O(n) cost in the sliding-window layers (see the cost sketch after this list)
  • Latin-only, BCMS-native tokenizer with ~31% fewer tokens per character than mmBERT's multilingual SentencePiece
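
To make the attention savings concrete, a back-of-the-envelope sketch (illustrative arithmetic only, not a profiled benchmark):

# Per-layer attention cost scales with (query positions × attended positions).
n, window, layers = 8192, 256, 28
global_layers = layers // 2              # full attention every 2nd layer
local_layers = layers - global_layers    # sliding-window layers
full_cost = layers * n * n               # all-global baseline: n^2 pairs per layer
mixed_cost = global_layers * n * n + local_layers * n * window
print(f"mixed/full attention cost at 8K tokens: {mixed_cost / full_cost:.2f}")  # ~0.52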

Results: SuperGLUE Serbian edition

Evaluation from BalkanBench v1.0: 5 random seeds per cell, with means reported on the website and standard deviations in the leaderboard UI.

Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: balkanbench.com/leaderboard.

Quickstart

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
 
model_id = "permitt/galton-modernbertic-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",   # falls back to sdpa if FA2 unavailable
    torch_dtype=torch.bfloat16,
).to("cuda")
 
text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
 
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # expected completion: "Podgorica" (actual top-1 may vary)
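
Equivalently, the fill-mask pipeline wraps the same tokenize-predict-decode steps:

from transformers import pipeline

fill = pipeline("fill-mask", model="permitt/galton-modernbertic-large", device=0)
print(fill("Glavni grad Crne Gore je [MASK].")[0]["token_str"])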

Long context

# 8192 tokens supported natively, no positional interpolation needed
tokenizer.model_max_length = 8192
very_long_document = "..."  # any BCMS document; input beyond 8192 tokens is truncated
inputs = tokenizer(very_long_document, return_tensors="pt", truncation=True).to("cuda")

Fine-tuning

from transformers import AutoModelForSequenceClassification
 
model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-large",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here; a sketch follows the hyperparameter table below

Recommended hyperparameters (from our SuperGLUE-SR sweeps):

Task type                          Learning rate   Batch size   Epochs
Sequence classification            2e-5 to 5e-5    16-32        3-5
Token classification (NER, POS)    3e-5            32           5-10
Long-context tasks (>512 tok)      1e-5 to 3e-5    8-16         3-5
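
As a starting point, a minimal Trainer sketch using the sequence-classification row above (the datasets are placeholders, not artifacts shipped with this model):

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="modernbertic-seqcls",
    learning_rate=2e-5,               # sequence-classification range above
    per_device_train_batch_size=16,
    num_train_epochs=3,
    bf16=True,                        # matches pretraining precision
)
trainer = Trainer(
    model=model,                      # from the fine-tuning snippet above
    args=args,
    train_dataset=train_dataset,      # placeholder: your tokenized train split
    eval_dataset=eval_dataset,        # placeholder: your tokenized eval split
)
trainer.train()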

Tokenizer

Tokenizer       Vocab      Tokens/character   OOV rate
ModernBERTić    50,304     0.229              0.000%
BERTić          32,000     0.242              0.006%
XLM-R-BERTić    250,002    0.274              0.008%
mmBERT          256,000    0.334              0.000%

Measured on 55.8M characters of held-out BCMS text.
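
The tokens-per-character figure is straightforward to reproduce on any sample (a sketch; the 55.8M-character held-out corpus is not distributed with the model):

# Fewer tokens per character means more of the 8192-token budget goes to content.
sample = "Zagreb je glavni grad Hrvatske, a Beograd glavni grad Srbije."
ids = tokenizer(sample, add_special_tokens=False).input_ids
print(len(ids) / len(sample))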

The vocabulary is Latin-only; Cyrillic input should be transliterated upstream. Cased input is preferred: uncasing reduces tokenizer efficiency by ~14%.
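
One way to handle that transliteration step upstream (a sketch assuming the third-party cyrtranslit package, which is not a dependency of this model):

import cyrtranslit

# Serbian Cyrillic → Latin before tokenization; raw Cyrillic would otherwise
# fall back to byte-level encoding (see Limitations below).
latin = cyrtranslit.to_latin("Београд је главни град Србије.", "sr")
print(latin)  # Beograd je glavni grad Srbije.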

Pretraining

  • Corpus: 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), MinHash LSH cross-source deduplication at 0.8 Jaccard threshold.
  • Objective: Masked Language Modeling, 30% masking ratio (a collator sketch follows this list).
  • Optimizer: AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase.
  • Batch: 4096 sequences global, kept constant across GPU counts (strong scaling).
  • Precision: bfloat16.
  • Framework: MosaicML Composer + FlexBERT. MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit.
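
The masking objective corresponds to the standard Hugging Face masked-LM collator at a 30% ratio (a sketch of the objective only; the actual pretraining ran on Composer + FlexBERT, not this collator):

from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,     # tokenizer from the Quickstart above
    mlm=True,
    mlm_probability=0.3,     # 30% masking, vs. the original BERT's 15%
)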

Intended uses and limitations

Intended uses. Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon).

Out of scope. This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.

Limitations.

  • Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
  • Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Heavy code, conversational chat, or highly technical scientific text are underrepresented.
  • Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.

Production note

ModernBERTić powers production features at Recrewty, an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back.

Citation

@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-large},
  note   = {Recrewty, EU-funded grant}
}

Acknowledgments

This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

Standing on the shoulders of:

  • Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
  • The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
  • MosaicML / Databricks for Composer and the MDS streaming format.
  • HuggingFace for the model hub, datasets, and tokenizers library.
  • JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
  • EuroHPC and the Leonardo consortium for compute access.

