# johnandru/ko-vdr-preview
Korean visual document retrieval — 6 MTEB multimodal tasks (text→image).
A LoRA fine-tune of `Qwen/Qwen3-VL-Embedding-2B` trained on a mixed Korean/English VDR corpus with hard negatives mined by `Qwen3-VL-Embedding-8B`. Supports Matryoshka embeddings down to 128 dimensions (default: 2048).
## Summary of Findings
- **Significant improvement over 2B:** `ko-vdr-preview` shows a substantial performance uplift over the `Qwen3-VL-2B` baseline (e.g., ~0.48 vs. ~0.35 avg nDCG@10).
- **Closing the gap with 8B:** performance is remarkably close to the `Qwen3-VL-8B` model, offering near-state-of-the-art accuracy with much greater efficiency.
## Usage

### Install Dependencies
```bash
pip install -U "sentence-transformers>=5.4.1"
```
### Python code
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("johnandru/ko-vdr-preview")
# Run inference
queries = [
    # "Determine whether the average monthly per-capita non-statutory welfare
    # cost of firms with 30+ regular employees is higher than that of firms
    # with 10-29 employees."
    '30인 이상 상용근로자를 보유한 기업의 1인당 평균 월별 법정외 복지비용이 10~29인 규모 기업보다 높은지 판단해 주세요'
]
documents = [  # document pages as image files
    'ko-vdr-public/3818.png',
    'ko-vdr-public/7753.png',
    'ko-vdr-public/3760.png',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 2048) (3, 2048)
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```
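For retrieval, the top-scoring page per query falls out of the similarity matrix directly (continuing from the snippet above):

```python
# Continuing from the snippet above: pick the best-scoring page per query.
best_idx = similarities.argmax(dim=1)
for query, idx in zip(queries, best_idx.tolist()):
    print(f"{query[:40]}... -> {documents[idx]}")
```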
### Matryoshka truncation

The model supports shortened embeddings via Matryoshka training. Supported dimensions: 2048 (default), 1536, 1024, 768, 512, 256, 128.
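For example, loading with `truncate_dim` yields 512-dimensional embeddings (a minimal sketch; the query below is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Load with embeddings truncated to 512 dimensions
model = SentenceTransformer("johnandru/ko-vdr-preview", truncate_dim=512)

query_embeddings = model.encode_query(["예시 질의"])  # placeholder query
print(query_embeddings.shape)  # (1, 512)
```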
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Embedding-2B |
| Fine-tuning method | LoRA (r=32, alpha=32, no dropout) |
| LoRA target modules | q_proj, k_proj, v_proj, up_proj, down_proj, gate_proj |
| Embedding dimension | 2048 (Matryoshka: 1536 / 1024 / 768 / 512 / 256 / 128) |
| Precision | bfloat16 |
| Attention | Flash Attention 2 |
| Max image pixels | 1280 × 28 × 28 |
| Framework | sentence-transformers==5.4.1, peft>=0.19.1 |
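A loading sketch matching the settings above (assumes `model_kwargs` is forwarded to the underlying transformers model, which is standard sentence-transformers behavior; `max_pixels` is assumed to come from the bundled processor config rather than being set here):

```python
import torch
from sentence_transformers import SentenceTransformer

# bf16 + Flash Attention 2, matching the table above.
model = SentenceTransformer(
    "johnandru/ko-vdr-preview",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",
    },
)
```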
## Training

### Data

Training used a multi-source Korean/English VDR dataset with hard negatives mined offline:
| Source | Language | Type |
|---|---|---|
| NomaDamas/ko-vdr-train-public-v2.0 | Korean | Query–page pairs |
| whybe-choi/ko-vdr-train-private-v0.1 | Korean | Query–page pairs |
| vidore/colpali_train_set | English | Query–page pairs |
| tomaarsen/llamaindex-vdr-en-train-preprocessed | English | Query–page pairs |
| Ko/En text retrieval corpus | Korean + English | Text pairs |
Hard negatives were mined with `Qwen/Qwen3-VL-Embedding-8B` using `absolute_margin=0.05` and 7 negatives per pair (top sampling).
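This step corresponds roughly to sentence-transformers' `mine_hard_negatives` utility. Below is a sketch that assumes the corpus is a `datasets.Dataset` of (query, positive page) rows with placeholder data; the actual offline pipeline may differ:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# The 8B model is used only as the mining teacher.
miner = SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B")

# Placeholder layout: one (query, positive page) pair per row.
pairs = Dataset.from_dict({
    "anchor": ["예시 질의"],
    "positive": ["ko-vdr-public/0001.png"],
})

mined = mine_hard_negatives(
    pairs,
    miner,
    num_negatives=7,          # 7 negatives per pair
    absolute_margin=0.05,     # negatives must score >= 0.05 below the positive
    sampling_strategy="top",  # take the hardest qualifying candidates
)
```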
### Loss

`MatryoshkaLoss(SelfGuideCachedMultipleNegativesRankingLoss)`: InfoNCE with cosine similarity (`scale=20`), cached mini-batches (`mini_batch_size=4`), and Matryoshka multi-granularity weighting.
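In stock sentence-transformers terms, the setup looks approximately like the sketch below, substituting the built-in `CachedMultipleNegativesRankingLoss` for the custom self-guided variant:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    CachedMultipleNegativesRankingLoss,
    MatryoshkaLoss,
)

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# InfoNCE with cosine similarity; gradient caching decouples the
# effective batch size from GPU memory.
inner_loss = CachedMultipleNegativesRankingLoss(
    model,
    scale=20.0,         # temperature on cosine similarity
    mini_batch_size=4,  # cached mini-batch size
)

# Matryoshka weighting across all supported embedding dimensions.
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[2048, 1536, 1024, 768, 512, 256, 128],
)
```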
## Evaluation

### Task abbreviations
| Short | MTEB task |
|---|---|
| SDS-T2IT | SDSKoPubVDRT2ITRetrieval |
| SDS-T2I | SDSKoPubVDRT2IRetrieval |
| KV-Cyber | KoVidore2CybersecurityRetrieval |
| KV-Econ | KoVidore2EconomicRetrieval |
| KV-Energy | KoVidore2EnergyRetrieval |
| KV-Hr | KoVidore2HrRetrieval |
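The scores below can be reproduced with the `mteb` package (a sketch; assumes these task names are available in your installed `mteb` version and that the sentence-transformers model is accepted directly):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("johnandru/ko-vdr-preview")

# The six text->image tasks from the abbreviation table above.
tasks = mteb.get_tasks(tasks=[
    "SDSKoPubVDRT2ITRetrieval",
    "SDSKoPubVDRT2IRetrieval",
    "KoVidore2CybersecurityRetrieval",
    "KoVidore2EconomicRetrieval",
    "KoVidore2EnergyRetrieval",
    "KoVidore2HrRetrieval",
])

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/ko-vdr-preview")
```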
### Results - nDCG@10
| rank | model_name | SDS-T2IT_nDCG@10 | SDS-T2I_nDCG@10 | KV-Cyber_nDCG@10 | KV-Econ_nDCG@10 | KV-Energy_nDCG@10 | KV-Hr_nDCG@10 | avg_nDCG@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.6999 | 0.6136 | 0.6857 | 0.2008 | 0.5415 | 0.2661 | 0.5013 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.6732 | 0.5623 | 0.6540 | 0.2139 | 0.5061 | 0.2975 | 0.4845 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.6605 | 0.2923 | 0.5359 | 0.1246 | 0.3565 | 0.1498 | 0.3533 |
### Results - Recall@10
| rank | model_name | SDS-T2IT_Recall@10 | SDS-T2I_Recall@10 | KV-Cyber_Recall@10 | KV-Econ_Recall@10 | KV-Energy_Recall@10 | KV-Hr_Recall@10 | avg_Recall@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.9033 | 0.7817 | 0.7527 | 0.2975 | 0.6059 | 0.3433 | 0.6141 |
| 2 | (Ours) johnandru/ko-vdr-preview | 0.8533 | 0.7500 | 0.7538 | 0.2868 | 0.5940 | 0.3847 | 0.6038 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.8650 | 0.4317 | 0.6012 | 0.1858 | 0.4166 | 0.1962 | 0.4494 |
## Notes

- All Qwen3-VL-Embedding family models were loaded with `max_pixels = 1280 * 28 * 28`, bf16, and Flash Attention 2.
- Prompt usage: Qwen3-VL-Embedding 2B / 8B and our LoRA fine-tune all use the training prompt `"Represent the user's input."` (matching `train.py`).
- The LoRA fine-tune uses a `peft 0.19.1` workaround in `loader.py` to inject `lora_B` weights (transformers 5.5.4 silently dropped them on `from_pretrained` for headless models; see PR huggingface/transformers#45428).