# Vietnamese Legal Embedding Model

## Model Details

### Model Description
This model is a fine-tuned version of `huyydangg/DEk21_hcmute_embedding`, specialized for the Vietnamese legal domain.
It has been trained to improve information retrieval over legal documents, outperforming both the base model and a traditional lexical baseline (BM25) on the Zalo Legal Retrieval benchmark.
- Model type: Sentence Transformer (bi-encoder)
- Language: Vietnamese
- Finetuned from model: `huyydangg/DEk21_hcmute_embedding`
- Intended domain: Legal, Law, Jurisprudence (Vietnam)
## Uses

### Direct Use

This model is designed for semantic search and question answering in the legal domain. It can be used to:
- Retrieve relevant legal articles based on a natural language query.
- Calculate similarity between legal queries and documents.
- Rerank search results for legal chatbots.
### Out-of-Scope Use
- This model is not a generative LLM (it does not generate text).
- It may not perform optimally on general-domain text (news, social media) compared to general-purpose embedding models, as it is biased towards legal terminology.
## How to Get Started with the Model

You can use this model with the `sentence-transformers` library:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Load the model
model = SentenceTransformer("Quockhanh05/Vietnam_legal_embeddings")

# 2. Define queries and documents
queries = [
    "Hành vi nào bị coi là tội trộm cắp tài sản theo Bộ luật Hình sự?"
]
documents = [
    "Người nào chiếm đoạt tài sản của người khác một cách lén lút thì bị coi là phạm tội trộm cắp tài sản.",
    "Nếu giá trị tài sản chiếm đoạt từ 2 triệu đồng trở lên hoặc dưới 2 triệu nhưng đã bị xử phạt hành chính trước đó thì sẽ bị truy cứu trách nhiệm hình sự.",
    "Các hành vi gây rối trật tự công cộng, chống người thi hành công vụ cũng là tội phạm theo quy định của Bộ luật Hình sự."
]

# 3. Encode queries and documents into dense vectors
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# 4. Compute similarity scores (shape: n_queries x n_documents)
scores = cosine_similarity(query_embeddings, doc_embeddings)
print(scores)
```
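For the reranking use case mentioned under Direct Use, reranking amounts to sorting candidate passages by their similarity score. A minimal pure-Python sketch (the passage texts and scores below are hypothetical placeholders, standing in for cosine scores computed as above):

```python
def rerank(documents, scores, top_k=3):
    """Sort candidate documents by descending similarity score and keep the top k."""
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in order[:top_k]]

# Hypothetical scores for three candidate passages
docs = ["passage A", "passage B", "passage C"]
scores = [0.41, 0.83, 0.57]
print(rerank(docs, scores, top_k=2))  # [('passage B', 0.83), ('passage C', 0.57)]
```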
## Training Details

### Training Data

The model was fine-tuned on a combination of Vietnamese legal question-answering datasets. These datasets provide pairs of questions and relevant legal passages, enabling the model to learn semantic relationships in Vietnamese legal texts.
## Evaluation

### Testing Data & Metrics

The model was evaluated on the `another-symato/VMTEB-Zalo-legel-retrieval-wseg` dataset using the `InformationRetrievalEvaluator`.

Metrics used:
- NDCG@10 (Normalized Discounted Cumulative Gain at rank 10): measures overall ranking quality, rewarding relevant documents placed near the top of the list.
- MRR@10 (Mean Reciprocal Rank at rank 10): measures how highly the first relevant document is ranked.
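For intuition, both metrics can be computed by hand for a single query with binary relevance. A minimal self-contained sketch (the document IDs are toy examples, not taken from the benchmark):

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant_ids), k) + 1)
    )
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: the single relevant document "d2" appears at rank 2
ranked = ["d1", "d2", "d3"]
relevant = {"d2"}
print(mrr_at_k(ranked, relevant))   # 0.5
print(ndcg_at_k(ranked, relevant))  # 1/log2(3) ≈ 0.6309
```

In a full evaluation these per-query values are averaged over all queries, which is what the evaluator reports.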
### Benchmark Results

The table below compares this model against other state-of-the-art Vietnamese embedding models and baselines, sorted by NDCG@10.
| Model | Type | NDCG@10 | MRR@10 | NDCG@3 | MRR@3 |
|---|---|---|---|---|---|
| AITeamVN/Vietnamese_Embedding | Dense | 0.8650 | 0.8334 | 0.8427 | 0.8221 |
| bkai-foundation-models/vietnamese-bi-encoder | Hybrid | 0.8469 | 0.8068 | 0.8272 | 0.7992 |
| bkai-foundation-models/vietnamese-bi-encoder | Dense | 0.8396 | 0.8096 | 0.8141 | 0.7966 |
| BAAI/bge-m3 | Dense | 0.8170 | 0.7803 | 0.7841 | 0.7633 |
| BAAI/bge-m3 | Hybrid | 0.8120 | 0.7713 | 0.7752 | 0.7477 |
| Quockhanh05/Vietnam_legal_embeddings (this model) | Dense | 0.8020 | 0.7557 | 0.7653 | 0.7370 |
| huyydangg/DEk21_hcmute_embedding (Base) | Dense | 0.7851 | 0.7411 | 0.7522 | 0.7247 |
| hiieu/halong_embedding | Hybrid | 0.7792 | 0.7320 | 0.7363 | 0.7104 |
| dangvantuan/vietnamese-embedding | Dense | 0.7634 | 0.7189 | 0.7190 | 0.6964 |
| BM25 | Lexical | 0.7616 | 0.7157 | 0.7281 | 0.6995 |
| VoVanPhuc/sup-SimCSE-VietNamese-phobert-base | Hybrid | 0.7339 | 0.6770 | 0.6885 | 0.6602 |
### Analysis

- Improvement over base: Fine-tuning yielded a gain of roughly 1.7 NDCG@10 points over the base model (`huyydangg/DEk21_hcmute_embedding`), demonstrating the effectiveness of the training data for the legal domain.
- Competitive performance: The model outperforms popular general-purpose models such as `hiieu/halong_embedding` and `dangvantuan/vietnamese-embedding`, as well as the strong lexical baseline BM25.
- Comparison to SOTA: While slightly behind larger models such as `AITeamVN/Vietnamese_Embedding` and `BAAI/bge-m3`, this model offers a specialized, efficient alternative that is significantly better than its pre-trained starting point.
## Citation

If you use this model, please cite the following:
```bibtex
@misc{vietnam_legal_embeddings,
  author       = {Trần Quốc Khánh},
  title        = {Vietnamese Legal Embedding Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Quockhanh05/Vietnam_legal_embeddings}},
}
```
```bibtex
@misc{dek21_hcmute_embedding,
  title     = {DEk21_hcmute_embedding: A Vietnamese Text Embedding},
  author    = {Quang Huy},
  year      = {2025},
  publisher = {Hugging Face},
}
```