VietLegal-Harrier-0.6B

A Vietnamese legal-domain embedding model fine-tuned from microsoft/harrier-oss-v1-0.6b (600M parameters, 1024-dimensional embeddings).

It achieves NDCG@10 = 0.7813 on the Zalo AI Legal Text Retrieval benchmark, outperforming all evaluated baselines, including our previous vietlegal-e5 model.

Benchmark Results

Evaluated on MTEB ZacLegalTextRetrieval (61.4K corpus documents, 818 test queries).

Model                                          Params  Dim   NDCG@10  MRR@10  Recall@10
mainguyen9/vietlegal-harrier-0.6b              600M    1024  0.7813   0.7303  0.9321
mainguyen9/vietlegal-e5 (mE5-large)            560M    1024  0.7310   0.6770  0.8972
mainguyen9/vietlegal-harrier-270m              270M    1024  0.7174   0.6636  0.8864
microsoft/harrier-oss-v1-0.6b                  600M    1024  0.7210   -       -
intfloat/multilingual-e5-large                 560M    1024  0.6660   -       -
bkai-foundation-models/vietnamese-bi-encoder   135M    768   0.6160   -       -
contextboxai/halong_embedding                  278M    768   0.6009   -       -

Key highlights:

  • +5.0 points NDCG@10 over vietlegal-e5 (mE5-large baseline): 0.7813 vs 0.7310
  • +5.3 points MRR@10: 0.7303 vs 0.6770
  • +3.5 points Recall@10: 0.9321 vs 0.8972
  • +6.0 points over original Harrier-0.6b (0.7813 vs 0.7210)
  • State-of-the-art among the models evaluated on this Vietnamese legal retrieval benchmark

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mainguyen9/vietlegal-harrier-0.6b")

# Harrier expects instruction-prefixed queries; passages are encoded without a prefix.
# Query: "What steps does the business registration procedure consist of?"
queries = ["Instruct: Given a Vietnamese legal question, retrieve relevant legal passages that answer the question\nQuery: Thủ tục đăng ký kinh doanh gồm những bước nào?"]
# Passage: "Article 27. Order and procedures for enterprise registration..."
passages = ["Điều 27. Trình tự, thủ tục đăng ký doanh nghiệp..."]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Embeddings are L2-normalized by the model's final Normalize() module,
# so the dot product is a cosine similarity.
similarity = q_emb @ p_emb.T
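
Because the embeddings are unit-length, on sentence-transformers >= 3.0 you can equivalently use the built-in helper, which applies the model's configured cosine similarity function:

scores = model.similarity(q_emb, p_emb)  # (num_queries, num_passages) score matrix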

Model Details

  • Model Type: Sentence Transformer
  • Base model: microsoft/harrier-oss-v1-0.6b
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Language: Vietnamese
  • License: Apache 2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: Qwen3Model
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_lasttoken': True})
  (2): Normalize()
)
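
The pooling_mode_lasttoken entry means each text is represented by the final hidden state of its last non-padding token rather than a mean over tokens. A minimal sketch of that operation, assuming right-padding and a standard attention_mask:

import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Index of the last non-padding token per sequence (right padding assumed).
    last_positions = attention_mask.sum(dim=1) - 1
    batch_indices = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_indices, last_positions]  # (batch, 1024)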

Training

Training Data

518K Vietnamese legal documents, segmented into roughly 500K article-aware chunks (see Stage 1 of the pipeline below).

Training Pipeline

Stage 1: Data Preparation
|  518K docs -> ~500K chunks (article-aware segmentation)
|
Stage 2: Contrastive Fine-tuning (Round 1)
|  MultipleNegativesRankingLoss (sketch below)
|
Stage 3: Hard Negative Mining
|  FAISS retrieval -> mine ranks 50-100 as hard negatives (sketch below)
|
Stage 4: Multi-task Blending (Final)
|  70% retrieval + 20% classification + 10% STS (sketch after the hyperparameters)
|  -> Final model (NDCG@10 = 0.7813)
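
A minimal sketch of the Stage 2 contrastive step, assuming sentence-transformers >= 3.0 and a hypothetical (anchor, positive) pair dataset; under MultipleNegativesRankingLoss, every other positive in the batch serves as an in-batch negative:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("microsoft/harrier-oss-v1-0.6b")
pairs = Dataset.from_dict({
    "anchor":   ["Instruct: ...\nQuery: ..."],  # instruction-prefixed legal questions
    "positive": ["Điều ..."],                   # the legal passages that answer them
})

loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives
trainer = SentenceTransformerTrainer(model=model, train_dataset=pairs, loss=loss)
trainer.train()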
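
And a sketch of the Stage 3 mining step (assuming faiss-cpu; corpus_texts, train_queries, and gold_ids are illustrative names): embed the chunk corpus, retrieve the top 100 chunks per training query, and keep ranks 50-100, minus any gold passages, as hard negatives for the next round:

import numpy as np
import faiss

corpus_emb = model.encode(corpus_texts, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product == cosine on unit vectors
index.add(corpus_emb)

query_emb = model.encode(train_queries, normalize_embeddings=True).astype(np.float32)
_, ranked = index.search(query_emb, 100)        # (num_queries, 100) corpus row indices

# Ranks 50-100 are "close but not top-ranked": hard enough to be informative,
# far enough to be unlikely false negatives.
hard_negatives = [
    [corpus_texts[j] for j in ids[50:100] if j not in gold_ids[qi]]
    for qi, ids in enumerate(ranked)
]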

Training Hyperparameters

  • Learning rate: 3e-6
  • Batch size: 256
  • Epochs: 1 (multitask stage)
  • Warmup: 10%
  • Scheduler: Cosine
  • Precision: bf16
  • Hardware: NVIDIA H100 80GB
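
A sketch of the Stage 4 multi-task blend with the hyperparameters above, assuming sentence-transformers >= 3.0 and three hypothetical task datasets; the SoftmaxLoss/CoSENTLoss choices for the classification and STS tasks are assumptions, and with the PROPORTIONAL multi-dataset sampler the 70/20/10 mix comes from sizing the datasets in that ratio:

from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import MultiDatasetBatchSamplers

train_datasets = {                # dataset sizes in a ~70/20/10 ratio
    "retrieval": retrieval_ds,    # (anchor, positive, hard_negative) triplets
    "classification": cls_ds,     # (sentence_a, sentence_b, label)
    "sts": sts_ds,                # (sentence_a, sentence_b, score in [0, 1])
}
task_losses = {
    "retrieval": losses.MultipleNegativesRankingLoss(model),
    "classification": losses.SoftmaxLoss(model, sentence_embedding_dimension=1024, num_labels=3),  # label count hypothetical
    "sts": losses.CoSENTLoss(model),
}

args = SentenceTransformerTrainingArguments(
    output_dir="vietlegal-harrier-0.6b-final",
    per_device_train_batch_size=256,
    learning_rate=3e-6,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    num_train_epochs=1,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
)
trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_datasets, loss=task_losses
)
trainer.train()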

Citation

@misc{vietlegal-harrier,
  title={VietLegal-Harrier-0.6B: Vietnamese Legal Domain Embedding Model},
  author={Nguyen, Mai},
  year={2026},
  url={https://huggingface.co/mainguyen9/vietlegal-harrier-0.6b}
}
