# VietLegal-Harrier-0.6B
A Vietnamese legal domain embedding model fine-tuned from microsoft/harrier-oss-v1-0.6b (600M params, 1024-dim).
Achieves NDCG@10 = 0.7813 on the Zalo AI Legal Text Retrieval benchmark, outperforming all baselines including our previous vietlegal-e5 model.
## Benchmark Results
Evaluated on MTEB ZacLegalTextRetrieval (61.4K corpus documents, 818 test queries).
| Model | Params | Dim | NDCG@10 | MRR@10 | Recall@10 |
|---|---|---|---|---|---|
| mainguyen9/vietlegal-harrier-0.6b | 600M | 1024 | 0.7813 | 0.7303 | 0.9321 |
| mainguyen9/vietlegal-e5 (mE5-large) | 560M | 1024 | 0.7310 | 0.6770 | 0.8972 |
| mainguyen9/vietlegal-harrier-270m | 270M | 1024 | 0.7174 | 0.6636 | 0.8864 |
| microsoft/harrier-oss-v1-0.6b | 600M | 1024 | 0.7210 | - | - |
| intfloat/multilingual-e5-large | 560M | 1024 | 0.6660 | - | - |
| bkai-foundation-models/vietnamese-bi-encoder | 135M | 768 | 0.6160 | - | - |
| contextboxai/halong_embedding | 278M | 768 | 0.6009 | - | - |
Key highlights:
- +5.0 points NDCG@10 over vietlegal-e5 (mE5-large baseline): 0.7813 vs 0.7310
- +5.3 points MRR@10: 0.7303 vs 0.6770
- +3.5 points Recall@10: 0.9321 vs 0.8972
- +6.0 points over original Harrier-0.6b (0.7813 vs 0.7210)
- Best result among the models evaluated on this Vietnamese legal text retrieval benchmark
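For reference, the NDCG@10 and MRR@10 figures above follow the standard definitions. A minimal sketch on a toy relevance list (`ndcg_at_k` and `mrr_at_k` are illustrative helpers, not the MTEB evaluation code):

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k for a list of relevance labels in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr_at_k(ranked_rels, k=10):
    """Reciprocal rank of the first relevant result within the top k."""
    for i, rel in enumerate(ranked_rels[:k]):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

# Single relevant passage retrieved at rank 3:
rels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(ndcg_at_k(rels))  # 0.5
print(mrr_at_k(rels))   # 0.3333333333333333
```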
## Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mainguyen9/vietlegal-harrier-0.6b")

# Harrier uses instruction-prefixed queries; passages are encoded as-is.
queries = [
    "Instruct: Given a Vietnamese legal question, retrieve relevant legal passages that answer the question\nQuery: Thu tuc dang ky kinh doanh gom nhung buoc nao?"
]  # "What steps does the business registration procedure include?"
passages = [
    "Dieu 27. Trinh tu, thu tuc dang ky doanh nghiep..."
]  # "Article 27. Order and procedures for enterprise registration..."

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Embeddings are L2-normalized by the model's Normalize() module,
# so the dot product equals cosine similarity.
similarity = q_emb @ p_emb.T
```
## Model Details
- Model Type: Sentence Transformer
- Base model: microsoft/harrier-oss-v1-0.6b
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Language: Vietnamese
- License: Apache 2.0
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: Qwen3Model
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_lasttoken': True})
  (2): Normalize()
)
```
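The `pooling_mode_lasttoken` step above takes the hidden state of the last non-padding token as the sentence embedding, which `Normalize()` then L2-normalizes. A minimal NumPy sketch with toy tensors (`last_token_pool` is an illustrative helper, not the library's code):

```python
import numpy as np

def last_token_pool(hidden, mask):
    """hidden: (batch, seq, dim) token states; mask: (batch, seq), 1 = real token."""
    last_idx = mask.sum(axis=1) - 1                        # position of last real token
    pooled = hidden[np.arange(hidden.shape[0]), last_idx]  # (batch, dim)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)  # Normalize() step

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))   # 2 sequences, 4 tokens, 8-dim toy states
mask = np.array([[1, 1, 1, 0],        # first sequence ends in one padding token
                 [1, 1, 1, 1]])
emb = last_token_pool(hidden, mask)
print(emb.shape)  # (2, 8)
```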
## Training

### Training Data
- Legal documents: 518K Vietnamese legal documents from th1nhng0/vietnamese-legal-documents
- Query-passage pairs: 507K pairs from phamson02/large-vi-legal-queries
### Training Pipeline

```
Stage 1: Data Preparation
  | 518K docs -> ~500K chunks (article-aware segmentation)
  v
Stage 2: Contrastive Fine-tuning (Round 1)
  | MultipleNegativesRankingLoss
  v
Stage 3: Hard Negative Mining
  | FAISS retrieval -> mine ranks 50-100 as hard negatives
  v
Stage 4: Multi-task Blending (Final)
  | 70% retrieval + 20% classification + 10% STS
  -> Final model (NDCG@10 = 0.7813)
```
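Stage 3 above can be illustrated with a brute-force version of the mining step: NumPy dot products stand in for the FAISS index, and the embeddings are random toy data, not the real corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 64))          # 5 toy query embeddings
corpus = rng.normal(size=(200, 64))   # 200 toy passage embeddings
q /= np.linalg.norm(q, axis=1, keepdims=True)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

scores = q @ corpus.T                  # cosine similarity (inner product of unit vectors)
ranked = np.argsort(-scores, axis=1)   # passage ids sorted best-first per query
hard_negatives = ranked[:, 50:100]     # keep ranks 50-100: similar, but not top-ranked
print(hard_negatives.shape)  # (5, 50)
```

Taking ranks 50-100 rather than the very top results reduces the chance of labeling unjudged-but-relevant passages as negatives.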
### Training Hyperparameters
- Learning rate: 3e-6
- Batch size: 256
- Epochs: 1 (multitask stage)
- Warmup: 10%
- Scheduler: Cosine
- Precision: bf16
- Hardware: NVIDIA H100 80GB
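The schedule listed above (cosine decay with 10% warmup from a 3e-6 peak) can be sketched as follows; `lr_at` is a hypothetical helper and the step count is illustrative:

```python
import math

def lr_at(step, total_steps, base_lr=3e-6, warmup_frac=0.10):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total = 1000                # illustrative step count
print(lr_at(50, total))     # mid-warmup: half of base_lr
print(lr_at(100, total))    # warmup done: peak 3e-6
print(lr_at(total, total))  # end of training: decayed to ~0
```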
## Citation

```bibtex
@misc{vietlegal-harrier,
  title={VietLegal-Harrier-0.6B: Vietnamese Legal Domain Embedding Model},
  author={Nguyen, Mai},
  year={2026},
  url={https://huggingface.co/mainguyen9/vietlegal-harrier-0.6b}
}
```
## Acknowledgements
- Base model: microsoft/harrier-oss-v1-0.6b
- Training framework: sentence-transformers
- Benchmark: Zalo AI Legal Text Retrieval