Vietnamese Legal Embedding Model

Model Details

Model Description

This model is a fine-tuned version of huyydangg/DEk21_hcmute_embedding, specialized for the Vietnamese Legal domain.

It has been trained to improve information retrieval over legal documents, outperforming both the base model and the traditional lexical baseline (BM25) on the Zalo Legal Retrieval benchmark.

  • Model type: Sentence Transformer (Bi-encoder)
  • Model size: ~0.1B parameters (F32 weights)
  • Language: Vietnamese
  • Finetuned from model: huyydangg/DEk21_hcmute_embedding
  • Intended Domain: Legal, Law, Jurisprudence (Vietnam)

Uses

Direct Use

This model is designed for Semantic Search and Question Answering in the legal domain. It can be used to:

  • Retrieve relevant legal articles based on a natural language query.
  • Calculate similarity between legal queries and documents.
  • Rerank search results for legal chatbots.
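
Conceptually, retrieval with a bi-encoder reduces to a nearest-neighbour search over precomputed document embeddings. The sketch below illustrates this with toy vectors standing in for real model outputs; the function name `top_k` is just for illustration.

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    """Return indices of the k most similar documents by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each doc to the query
    return np.argsort(-scores)[:k], scores

# Toy 3-dimensional embeddings standing in for real model outputs.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0],   # very similar to the query
    [0.0, 1.0, 0.0],   # orthogonal
    [0.7, 0.0, 0.7],   # somewhat similar
])
idx, scores = top_k(query, docs, k=2)
print(idx)  # indices of the two most similar documents, best first
```

In practice the document embeddings are computed once and stored (e.g. in a vector index), so each query costs one forward pass plus one similarity search.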

Out-of-Scope Use

  • This model is not a generative LLM (it does not generate text).
  • It may not perform optimally on general-domain text (news, social media) compared to general-purpose embedding models, as it is biased towards legal terminology.

How to Get Started with the Model

You can use this model with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Load the model
model = SentenceTransformer("Quockhanh05/Vietnam_legal_embeddings")

# 2. Define queries and documents
queries = [
    # "Which acts are considered the crime of property theft under the Penal Code?"
    "Hành vi nào bị coi là tội trộm cắp tài sản theo Bộ luật Hình sự?"
]

documents = [
    # "A person who covertly appropriates another person's property commits the crime of theft."
    "Người nào chiếm đoạt tài sản của người khác một cách lén lút thì bị coi là phạm tội trộm cắp tài sản.",
    # "If the appropriated property is worth 2 million VND or more, or less than 2 million but the person was previously sanctioned administratively, criminal liability applies."
    "Nếu giá trị tài sản chiếm đoạt từ 2 triệu đồng trở lên hoặc dưới 2 triệu nhưng đã bị xử phạt hành chính trước đó thì sẽ bị truy cứu trách nhiệm hình sự.",
    # "Acts of disturbing public order or resisting officers on duty are also crimes under the Penal Code."
    "Các hành vi gây rối trật tự công cộng, chống người thi hành công vụ cũng là tội phạm theo quy định của Bộ luật Hình sự."
]

# 3. Encode
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# 4. Compute similarity scores (shape: [num_queries, num_documents])
scores = cosine_similarity(query_embeddings, doc_embeddings)

print(scores)
```

Training Details

Training Data

The model was fine-tuned on a combination of Vietnamese legal question-answering datasets.

These datasets provide pairs of questions and relevant legal passages, enabling the model to learn semantic relationships in Vietnamese legal texts.
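
The model card does not state the exact training objective, but a common way to fine-tune a bi-encoder on (question, relevant-passage) pairs is in-batch contrastive learning: each question should score its own passage higher than every other passage in the batch. A minimal numpy sketch of that loss (an assumption, not the author's confirmed recipe):

```python
import numpy as np

def in_batch_contrastive_loss(q_embs, p_embs, scale=20.0):
    """Multiple-negatives-style loss: question i's positive is passage i;
    all other passages in the batch act as negatives."""
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    p = p_embs / np.linalg.norm(p_embs, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                 # [batch, batch] similarity matrix
    # Softmax cross-entropy with the diagonal (matching pairs) as the correct class.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: each question embedding is already close to its own passage embedding.
qs = np.array([[1.0, 0.0], [0.0, 1.0]])
ps = np.array([[0.9, 0.1], [0.1, 0.9]])
print(in_batch_contrastive_loss(qs, ps))   # low loss: positives already rank first
```

Minimizing this loss pulls each question toward its relevant passage and pushes it away from unrelated passages, which is exactly the behaviour a retrieval model needs.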

Evaluation

Testing Data & Metrics

The model was evaluated on the another-symato/VMTEB-Zalo-legel-retrieval-wseg dataset using the sentence-transformers InformationRetrievalEvaluator.

Metrics used:

  • NDCG@10 (Normalized Discounted Cumulative Gain at rank 10): Measures the quality of the ranking.
  • MRR@10 (Mean Reciprocal Rank at rank 10): Measures how high the first relevant document appears.
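
For concreteness, both metrics can be computed from the ranked list of document ids returned for each query. The helpers below are an illustrative binary-relevance implementation, not the evaluator's actual code:

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# One query whose single relevant document is ranked second:
print(mrr_at_k(["d3", "d1", "d7"], {"d1"}))   # 0.5
print(ndcg_at_k(["d3", "d1", "d7"], {"d1"}))  # 1/log2(3) ≈ 0.631
```

The benchmark numbers below are these per-query scores averaged over the whole test set.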

Benchmark Results

The table below shows the performance of this model compared to other state-of-the-art Vietnamese embedding models and baselines. The models are sorted by NDCG@10.

| Model | Type | NDCG@10 | MRR@10 | NDCG@3 | MRR@3 |
|---|---|---|---|---|---|
| AITeamVN/Vietnamese_Embedding | Dense | 0.8650 | 0.8334 | 0.8427 | 0.8221 |
| bkai-foundation-models/vietnamese-bi-encoder | Hybrid | 0.8469 | 0.8068 | 0.8272 | 0.7992 |
| bkai-foundation-models/vietnamese-bi-encoder | Dense | 0.8396 | 0.8096 | 0.8141 | 0.7966 |
| BAAI/bge-m3 | Dense | 0.8170 | 0.7803 | 0.7841 | 0.7633 |
| BAAI/bge-m3 | Hybrid | 0.8120 | 0.7713 | 0.7752 | 0.7477 |
| Quockhanh05/Vietnam_legal_embeddings (this model) | Dense | 0.8020 | 0.7557 | 0.7653 | 0.7370 |
| huyydangg/DEk21_hcmute_embedding (Base) | Dense | 0.7851 | 0.7411 | 0.7522 | 0.7247 |
| hiieu/halong_embedding | Hybrid | 0.7792 | 0.7320 | 0.7363 | 0.7104 |
| dangvantuan/vietnamese-embedding | Dense | 0.7634 | 0.7189 | 0.7190 | 0.6964 |
| BM25 | Lexical | 0.7616 | 0.7157 | 0.7281 | 0.6995 |
| VoVanPhuc/sup-SimCSE-VietNamese-phobert-base | Hybrid | 0.7339 | 0.6770 | 0.6885 | 0.6602 |

Analysis

  • Significant improvement: Fine-tuning yielded a ~1.7-point gain in NDCG@10 over the base model huyydangg/DEk21_hcmute_embedding (0.7851 → 0.8020), demonstrating the effectiveness of the training data for the legal domain.
  • Competitive performance: The model outperforms popular general-purpose models such as hiieu/halong_embedding and dangvantuan/vietnamese-embedding, as well as the strong lexical baseline BM25.
  • Comparison to SOTA: While it trails larger models such as AITeamVN/Vietnamese_Embedding and BAAI/bge-m3, this model offers a specialized, efficient alternative that is significantly better than its pre-trained starting point.

Citation

If you use this model, please cite the following:

@misc{vietnam_legal_embeddings_2025,
  author       = {Trần Quốc Khánh},
  title        = {Vietnamese Legal Embedding Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Quockhanh05/Vietnam_legal_embeddings}},
}

@misc{dek21_hcmute_embedding_2025,
  author    = {Quang Huy},
  title     = {DEk21_hcmute_embedding: A Vietnamese Text Embedding},
  year      = {2025},
  publisher = {Hugging Face},
}
