johnandru/ko-vdr-preview
Korean visual document retrieval — 6 MTEB multimodal tasks (text→image).

A LoRA fine-tune of Qwen/Qwen3-VL-Embedding-2B trained on a mixed Korean/English VDR corpus with hard negatives mined by Qwen3-VL-Embedding-8B. Supports Matryoshka embeddings down to 128 dimensions (default: 2048).

Summary of Findings

  • Significant improvement over the 2B baseline: ko-vdr-preview delivers a large uplift over Qwen/Qwen3-VL-Embedding-2B (avg nDCG@10 0.4845 vs 0.3533).
  • Closing the gap with 8B: performance is close to Qwen/Qwen3-VL-Embedding-8B (avg nDCG@10 0.4845 vs 0.5013), at a fraction of the inference cost.

Usage

Install Dependencies

pip install -U "sentence-transformers>=5.4.1"
Then you can load the model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("johnandru/ko-vdr-preview")

# Run inference
queries = [
    # "Please determine whether the average monthly non-statutory welfare cost per
    # employee at firms with 30 or more regular workers is higher than at firms
    # with 10-29 workers."
    '30인 이상 상용근로자를 보유한 기업의 1인당 평균 월별 법정외 복지비용이 10~29인 규모 기업보다 높은지 판단해 주세요'
]
documents = [
    'ko-vdr-public/3818.png',
    'ko-vdr-public/7753.png',
    'ko-vdr-public/3760.png'
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```

Matryoshka truncation

The model supports shortened embeddings via Matryoshka training. Supported dimensions: 2048, 1536, 1024, 768, 512, 256, and 128.

```python
model = SentenceTransformer("johnandru/ko-vdr-preview", truncate_dim=512)
```
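Conceptually, Matryoshka truncation amounts to slicing the leading dimensions of a full embedding and re-normalizing. A minimal numpy sketch of that idea (illustrative only, not the library's internals):

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-renormalize each row."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for real model output: 3 document embeddings at the full 2048 dims.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 2048)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

short = truncate_embeddings(full, 128)
print(short.shape)  # (3, 128)
```

Because the model was trained with Matryoshka weighting, these truncated prefixes remain usable embeddings, trading a small accuracy loss for 16x smaller vectors at 128 dims.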

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Embedding-2B |
| Fine-tuning method | LoRA (r=32, alpha=32, no dropout) |
| LoRA target modules | q_proj, k_proj, v_proj, up_proj, down_proj, gate_proj |
| Embedding dimension | 2048 (Matryoshka: 1536 / 1024 / 768 / 512 / 256 / 128) |
| Precision | bfloat16 |
| Attention | Flash Attention 2 |
| Max image pixels | 1280 × 28 × 28 |
| Framework | sentence-transformers==5.4.1, peft>=0.19.1 |
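The adapter setup in the table can be expressed as a peft `LoraConfig`. This is a sketch reconstructed from the listed hyperparameters, not the published training script:

```python
from peft import LoraConfig

# Hyperparameters taken from the Model Details table above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "gate_proj"],
    task_type="FEATURE_EXTRACTION",  # embedding model, no LM head
)
```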

Training

Data

Training used a multi-source Korean/English VDR dataset with hard negatives mined offline:

| Source | Language | Type |
|---|---|---|
| NomaDamas/ko-vdr-train-public-v2.0 | Korean | Query–page pairs |
| whybe-choi/ko-vdr-train-private-v0.1 | Korean | Query–page pairs |
| vidore/colpali_train_set | English | Query–page pairs |
| tomaarsen/llamaindex-vdr-en-train-preprocessed | English | Query–page pairs |
| Ko/En text retrieval corpus | Korean + English | Text pairs |

Hard negatives were mined with Qwen/Qwen3-VL-Embedding-8B using absolute_margin=0.05 and 7 negatives per pair (top sampling).
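The mining rule can be sketched in plain numpy: a candidate qualifies as a hard negative only if its similarity to the query is at least `absolute_margin` below the positive's, and "top" sampling keeps the highest-scoring survivors. This illustrates the filtering logic only; the actual pipeline used Qwen/Qwen3-VL-Embedding-8B as the scoring model:

```python
import numpy as np

def mine_hard_negatives(query_emb, pos_emb, cand_embs,
                        absolute_margin=0.05, num_negatives=7):
    """Top-sampling hard-negative mining over normalized embeddings."""
    pos_sim = float(query_emb @ pos_emb)
    cand_sims = cand_embs @ query_emb
    # Filter: a hard negative must score at least `absolute_margin`
    # below the positive, to exclude likely false negatives.
    eligible = np.where(cand_sims <= pos_sim - absolute_margin)[0]
    # "top" sampling: rank eligible candidates by similarity, descending.
    ranked = eligible[np.argsort(-cand_sims[eligible])]
    return ranked[:num_negatives]

# Toy normalized embeddings standing in for real model output.
rng = np.random.default_rng(42)
def unit(v): return v / np.linalg.norm(v)
query = unit(rng.normal(size=64))
positive = unit(query + 0.1 * rng.normal(size=64))  # close to the query
candidates = np.stack([unit(rng.normal(size=64)) for _ in range(50)])

negatives = mine_hard_negatives(query, positive, candidates)
print(len(negatives))  # up to 7 indices of the hardest eligible negatives
```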

Loss

MatryoshkaLoss(SelfGuideCachedMultipleNegativesRankingLoss): InfoNCE with cosine similarity (scale=20), cached mini-batches (mini_batch_size=4), and Matryoshka multi-granularity weighting.
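The combined objective can be approximated in a few lines: an in-batch InfoNCE (multiple-negatives ranking) loss over scaled cosine similarities, averaged across the Matryoshka dimensions. A simplified numpy sketch without the caching or self-guiding machinery:

```python
import numpy as np

def info_nce(query_embs, doc_embs, scale=20.0):
    """Multiple-negatives ranking loss: each query's positive is the
    same-index document; all other in-batch documents are negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)  # (batch, batch)
    # Cross-entropy with the diagonal as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_loss(query_embs, doc_embs,
                    dims=(2048, 1536, 1024, 768, 512, 256, 128)):
    """Average the InfoNCE loss over truncated embedding prefixes."""
    return float(np.mean([info_nce(query_embs[:, :d], doc_embs[:, :d])
                          for d in dims]))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 2048))
d = q + 0.05 * rng.normal(size=(4, 2048))  # matched pairs, slightly perturbed
print(matryoshka_loss(q, d))  # small loss: positives dominate the batch
```

Training every prefix jointly is what makes the truncated dimensions listed above usable at inference time.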

Evaluation

Task abbreviations

| Short | MTEB task |
|---|---|
| SDS-T2IT | SDSKoPubVDRT2ITRetrieval |
| SDS-T2I | SDSKoPubVDRT2IRetrieval |
| KV-Cyber | KoVidore2CybersecurityRetrieval |
| KV-Econ | KoVidore2EconomicRetrieval |
| KV-Energy | KoVidore2EnergyRetrieval |
| KV-Hr | KoVidore2HrRetrieval |

Results - nDCG@10

| Rank | Model | SDS-T2IT | SDS-T2I | KV-Cyber | KV-Econ | KV-Energy | KV-Hr | Avg |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.6999 | 0.6136 | 0.6857 | 0.2008 | 0.5415 | 0.2661 | 0.5013 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.6732 | 0.5623 | 0.6540 | 0.2139 | 0.5061 | 0.2975 | 0.4845 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.6605 | 0.2923 | 0.5359 | 0.1246 | 0.3565 | 0.1498 | 0.3533 |

Results - Recall@10

| Rank | Model | SDS-T2IT | SDS-T2I | KV-Cyber | KV-Econ | KV-Energy | KV-Hr | Avg |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.9033 | 0.7817 | 0.7527 | 0.2975 | 0.6059 | 0.3433 | 0.6141 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.8533 | 0.7500 | 0.7538 | 0.2868 | 0.5940 | 0.3847 | 0.6038 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.8650 | 0.4317 | 0.6012 | 0.1858 | 0.4166 | 0.1962 | 0.4494 |

Notes

  • All Qwen3-VL-Embedding family models were loaded with max_pixels = 1280 * 28 * 28, bfloat16, and Flash Attention 2.
  • Prompt usage: the Qwen3-VL-Embedding 2B/8B baselines and our LoRA fine-tune all use the training prompt "Represent the user's input." (matching train.py).
  • The LoRA fine-tune uses a peft 0.19.1 workaround in loader.py to inject lora_B weights, because transformers 5.5.4 silently dropped them in from_pretrained for headless models (see PR huggingface/transformers#45428).