# johnandru/ko-vdr-preview
Korean visual document retrieval — 6 MTEB multimodal tasks (text→image).
A LoRA fine-tune of `Qwen/Qwen3-VL-Embedding-2B` trained on a mixed Korean/English VDR corpus with hard negatives mined by `Qwen3-VL-Embedding-8B`. Supports Matryoshka embeddings down to 128 dimensions (default: 2048).
## Summary of Findings
- **Significant improvement over 2B:** `ko-vdr-preview` shows a substantial performance uplift over the `Qwen3-VL-2B` baseline (e.g., ~0.48 vs. ~0.35 avg nDCG@10).
- **Closing the gap with 8B:** performance is remarkably close to the `Qwen3-VL-8B` model, offering near-state-of-the-art accuracy with much greater efficiency.
## Usage

### Install Dependencies
```bash
pip install -U "sentence-transformers>=5.4.1"
```
### Python code
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("johnandru/ko-vdr-preview")
# Run inference
queries = [
    # "Determine whether the average monthly per-capita non-statutory welfare
    # cost of firms with 30+ regular employees is higher than that of firms
    # with 10-29 employees."
    '30인 이상 상용근로자를 보유한 기업의 1인당 평균 월별 법정외 복지비용이 10~29인 규모 기업보다 높은지 판단해 주세요'
]
documents = [  # document pages as image files
    'ko-vdr-public/3818.png',
    'ko-vdr-public/7753.png',
    'ko-vdr-public/3760.png',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 2048) (3, 2048)
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```
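For retrieval, the top-scoring page per query falls out of the similarity matrix directly (continuing from the snippet above):

```python
# Continuing from the snippet above: pick the best-scoring page per query.
best_idx = similarities.argmax(dim=1)
for query, idx in zip(queries, best_idx.tolist()):
    print(f"{query[:40]}... -> {documents[idx]}")
```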
### Matryoshka truncation

The model supports shortened embeddings via Matryoshka training. Supported dimensions: 2048 (default), 1536, 1024, 768, 512, 256, 128.
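For example, loading with `truncate_dim` yields 512-dimensional embeddings (a minimal sketch; the query below is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Load with embeddings truncated to 512 dimensions
model = SentenceTransformer("johnandru/ko-vdr-preview", truncate_dim=512)

query_embeddings = model.encode_query(["예시 질의"])  # placeholder query
print(query_embeddings.shape)  # (1, 512)
```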
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Embedding-2B |
| Fine-tuning method | LoRA (r=32, alpha=32, no dropout) |
| LoRA target modules | q_proj, k_proj, v_proj, up_proj, down_proj, gate_proj |
| Embedding dimension | 2048 (Matryoshka: 1536 / 1024 / 768 / 512 / 256 / 128) |
| Precision | bfloat16 |
| Attention | Flash Attention 2 |
| Max image pixels | 1280 × 28 × 28 |
| Framework | sentence-transformers==5.4.1, peft>=0.19.1 |
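A loading sketch matching the settings above (assumes `model_kwargs` is forwarded to the underlying transformers model, which is standard sentence-transformers behavior; `max_pixels` is assumed to come from the bundled processor config rather than being set here):

```python
import torch
from sentence_transformers import SentenceTransformer

# bf16 + Flash Attention 2, matching the table above.
model = SentenceTransformer(
    "johnandru/ko-vdr-preview",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",
    },
)
```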
## Training

### Data

Training used a multi-source Korean/English VDR dataset with hard negatives mined offline:
| Source | Language | Type |
|---|---|---|
| NomaDamas/ko-vdr-train-public-v2.0 | Korean | Query–page pairs |
| whybe-choi/ko-vdr-train-private-v0.1 | Korean | Query–page pairs |
| vidore/colpali_train_set | English | Query–page pairs |
| tomaarsen/llamaindex-vdr-en-train-preprocessed | English | Query–page pairs |
| Ko/En text retrieval corpus | Korean + English | Text pairs |
Hard negatives were mined with `Qwen/Qwen3-VL-Embedding-8B` using `absolute_margin=0.05` and 7 negatives per pair (top sampling).
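This step corresponds roughly to sentence-transformers' `mine_hard_negatives` utility. Below is a sketch that assumes the corpus is a `datasets.Dataset` of (query, positive page) rows with placeholder data; the actual offline pipeline may differ:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# The 8B model is used only as the mining teacher.
miner = SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B")

# Placeholder layout: one (query, positive page) pair per row.
pairs = Dataset.from_dict({
    "anchor": ["예시 질의"],
    "positive": ["ko-vdr-public/0001.png"],
})

mined = mine_hard_negatives(
    pairs,
    miner,
    num_negatives=7,          # 7 negatives per pair
    absolute_margin=0.05,     # negatives must score >= 0.05 below the positive
    sampling_strategy="top",  # take the hardest qualifying candidates
)
```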
### Loss

`MatryoshkaLoss(SelfGuideCachedMultipleNegativesRankingLoss)`: InfoNCE with cosine similarity (`scale=20`), cached mini-batches (`mini_batch_size=4`), and Matryoshka multi-granularity weighting.
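In stock sentence-transformers terms, the setup looks approximately like the sketch below, substituting the built-in `CachedMultipleNegativesRankingLoss` for the custom self-guided variant:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    CachedMultipleNegativesRankingLoss,
    MatryoshkaLoss,
)

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# InfoNCE with cosine similarity; gradient caching decouples the
# effective batch size from GPU memory.
inner_loss = CachedMultipleNegativesRankingLoss(
    model,
    scale=20.0,         # temperature on cosine similarity
    mini_batch_size=4,  # cached mini-batch size
)

# Matryoshka weighting across all supported embedding dimensions.
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[2048, 1536, 1024, 768, 512, 256, 128],
)
```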
## Evaluation

### Task abbreviations
| Short | MTEB task |
|---|---|
| SDS-T2IT | SDSKoPubVDRT2ITRetrieval |
| SDS-T2I | SDSKoPubVDRT2IRetrieval |
| KV-Cyber | KoVidore2CybersecurityRetrieval |
| KV-Econ | KoVidore2EconomicRetrieval |
| KV-Energy | KoVidore2EnergyRetrieval |
| KV-Hr | KoVidore2HrRetrieval |
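The scores below can be reproduced with the `mteb` package (a sketch; assumes these task names are available in your installed `mteb` version and that the sentence-transformers model is accepted directly):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("johnandru/ko-vdr-preview")

# The six text->image tasks from the abbreviation table above.
tasks = mteb.get_tasks(tasks=[
    "SDSKoPubVDRT2ITRetrieval",
    "SDSKoPubVDRT2IRetrieval",
    "KoVidore2CybersecurityRetrieval",
    "KoVidore2EconomicRetrieval",
    "KoVidore2EnergyRetrieval",
    "KoVidore2HrRetrieval",
])

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/ko-vdr-preview")
```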
### Results - nDCG@10
| rank | model_name | SDS-T2IT_nDCG@10 | SDS-T2I_nDCG@10 | KV-Cyber_nDCG@10 | KV-Econ_nDCG@10 | KV-Energy_nDCG@10 | KV-Hr_nDCG@10 | avg_nDCG@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.6999 | 0.6136 | 0.6857 | 0.2008 | 0.5415 | 0.2661 | 0.5013 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.6732 | 0.5623 | 0.6540 | 0.2139 | 0.5061 | 0.2975 | 0.4845 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.6605 | 0.2923 | 0.5359 | 0.1246 | 0.3565 | 0.1498 | 0.3533 |
### Results - Recall@10
| rank | model_name | SDS-T2IT_Recall@10 | SDS-T2I_Recall@10 | KV-Cyber_Recall@10 | KV-Econ_Recall@10 | KV-Energy_Recall@10 | KV-Hr_Recall@10 | avg_Recall@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.9033 | 0.7817 | 0.7527 | 0.2975 | 0.6059 | 0.3433 | 0.6141 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.8533 | 0.7500 | 0.7538 | 0.2868 | 0.5940 | 0.3847 | 0.6038 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.8650 | 0.4317 | 0.6012 | 0.1858 | 0.4166 | 0.1962 | 0.4494 |
## Notes

- All Qwen3-VL-Embedding family models were loaded with `max_pixels = 1280 * 28 * 28`, bf16, and Flash Attention 2.
- Prompt usage: Qwen3-VL-Embedding 2B / 8B and our LoRA fine-tune all use the training prompt `"Represent the user's input."` (matching `train.py`).
- The LoRA fine-tune uses a `peft 0.19.1` workaround in `loader.py` to inject `lora_B` weights (transformers 5.5.4 silently dropped them on `from_pretrained` for headless models; see PR huggingface/transformers#45428).