code-daemon-embed-v1
A small, fast code-embedding model for semantic code search and code retrieval. It maps short code units — function and method bodies, signatures, docstrings — and short natural-language queries into a shared 768-dim vector space.
It is specialized for short code, not long documents: the maximum sequence length is 128 tokens, trading long-context capability for high throughput and strong quality on short units.
- 768-dim embeddings, Matryoshka (MRL) — truncatable to 512 / 256 dims with graceful decay.
- ~54.5M params — XLM-RoBERTa architecture, 4 layers / 768 hidden, a code-oriented 32k SentencePiece vocab.
- Mean pooling fused into the graph: the output is already pooled ([batch, 768]); just L2-normalize.
- Trained at sequence length 128 (length buckets s / m / l = seq 40 / 64 / 128).
Where it's good — and where it isn't
Measured on CoIR (NDCG@10, full corpora). Use it for:
Strong:
- Code → related code — finding similar or duplicate implementations (its best relative area).
- Natural-language → code — docstring or description → function (strongest on Python).
- Short question → code — "how / where does X…".
- NL → SQL and NL instruction → code.
Weak / out of scope:
- Long documents — hard 128-token cap; longer inputs are truncated. This is not a long-context retriever.
- Noisy / ambiguous NL→code (hard, under-specified queries) — mid quality.
- General English prose (medical / financial / news) — the code-specialized 32k vocab trades general-text coverage for code. Multilingual text works as a fallback, not a specialty.
Embed queries and documents the same way — no instruction prefix. For smaller indexes, truncate to 256 or 512 dims (MRL) before normalizing.
How it was made
Knowledge-distilled (embedding regression) from Alibaba-NLP/gte-modernbert-base
(Apache-2.0, ~150M, a strong general + code retriever on a ModernBERT backbone). The student is a fresh,
shallow-wide XLM-R encoder trained from scratch on the teacher's passage embeddings over a ~30M-sample
code + text corpus, with a custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon
rather than prose). The shallow-wide 4-layer / 768-hidden shape keeps inference cheap while distilling at
the teacher's full 768-dim width.
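Embedding regression means the student is trained to reproduce the teacher's vector for each input directly, rather than learning from contrastive pairs. The exact objective used for this model is not stated here; the sketch below assumes a squared-error loss on L2-normalized vectors, and `regression_loss` is a hypothetical name used only for illustration.

```python
import numpy as np

def regression_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared L2 distance between unit-normalized student and teacher rows.

    ASSUMPTION: the model card does not state the actual training loss;
    this is one common form of embedding regression.
    """
    def unit(x: np.ndarray) -> np.ndarray:
        # Guard against all-zero rows when normalizing.
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

    diff = unit(student) - unit(teacher)   # [B, 768]
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

For unit vectors this quantity equals 2 × (1 − cosine similarity), so minimizing it pulls each student vector toward the teacher's direction.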
Built for speed
- Short context by design — max 128 tokens, no long-document path, so the engines avoid a wide dynamic shape range.
- Rectangular TensorRT profiles — each length bucket is a fixed shape (min == opt == max), one optimal kernel set per bucket: s = batch 64 × seq 40 · m = batch 128 × seq 64 · l = batch 256 × seq 128.
- INT8 (W8A16) weights; mean-pool + projection + L2-norm fused into the graph (one pass → [B, 768]).
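Because each engine accepts exactly one shape, a caller has to route every input to a bucket and pad it to that bucket's sequence length. How the author's runtime does this is not described; the following is a minimal sketch under that assumption, with `BUCKETS`, `pick_bucket` and `pad_to_bucket` as hypothetical names.

```python
# Bucket shapes as listed above: name -> (batch size, sequence length).
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(n_tokens: int) -> str:
    """Return the smallest bucket whose sequence length fits n_tokens."""
    for name, (_, seq_len) in BUCKETS.items():  # dict keeps s, m, l order
        if n_tokens <= seq_len:
            return name
    return "l"  # anything longer is truncated to the 128-token cap

def pad_to_bucket(ids: list[int], bucket: str, pad_id: int = 0) -> list[int]:
    """Truncate or right-pad one token-id list to the bucket's fixed length.

    Note: plain truncation drops a trailing eos; truncate before adding
    bos/eos if that matters for your pipeline.
    """
    seq_len = BUCKETS[bucket][1]
    return ids[:seq_len] + [pad_id] * max(0, seq_len - len(ids))
```

A full batch for bucket `m` would then be 128 such rows of 64 ids each; partially filled batches need padding rows as well, since min == opt == max.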
Usage (standalone ONNX)
The FP32 model.onnx is bundled. Tokenize with the bundled sentencepiece.bpe.model and run the session; the
output is already mean-pooled to [B, 768], so only L2 normalization remains:
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Special token ids: pad=0 unk=1 bos=2 eos=3
sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    # Wrap each sequence as bos … eos, truncated so the total fits max_len.
    ids = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]
    L = max(len(x) for x in ids)
    # Right-pad to the longest sequence in the batch (pad=0).
    inp = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64)
    mask = (inp != 0).astype(np.int64)
    # The graph output is already mean-pooled: [B, 768].
    out = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]
    out = out[:, :mrl_dim]  # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
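Once vectors are L2-normalized, retrieval reduces to a dot product. The sketch below shows a brute-force top-k search over an in-memory matrix; `top_k` is a hypothetical helper, not part of this repo, and a real index would likely use an ANN library instead.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Rank documents by cosine similarity to one query.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scores = doc_vecs @ query_vec        # [N]
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]
```

With the `embed` function above, this could be called as `top_k(embed([query])[0], embed(snippets))`, since queries and documents are embedded the same way.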
What's in this repo — ready-to-run compiled engines
Pre-compiled engines, named per runtime × GPU arch × OS × length-bucket — pick the one matching your runtime and hardware; no compilation needed.
- TensorRT (*.engine): NVIDIA, INT8 W8A16, named code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
- TVM (*_tvm_vulkan.{dll,so}): Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
- OpenVINO (*.xml + *.bin): Intel CPU / iGPU / NPU, per bucket.
- Metal (*_tvm_metal.*): Apple Silicon (macOS), per bucket.
- Tokenizer: sentencepiece.bpe.model (specials at pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + tokenizer_config.json.
- ONNX source: model.onnx (+ model.onnx.data) FP32 and model_int8qdt.onnx (INT8 W8A16).
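The TensorRT naming pattern above can be turned into a small lookup so that code picks the right file for a given bucket, OS and GPU architecture. This is a sketch built only from the pattern as listed; `trt_engine_name` is a hypothetical helper.

```python
def trt_engine_name(bucket: str, os_tag: str, sm: int) -> str:
    """Build a TensorRT engine filename from the naming pattern listed above."""
    if bucket not in ("s", "m", "l"):
        raise ValueError(f"unknown bucket: {bucket}")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError(f"unknown OS tag: {os_tag}")
    if sm not in (86, 89, 120):
        raise ValueError(f"no prebuilt engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"
```

The `sm` value is the GPU's compute capability written without the dot (8.6 becomes 86, 12.0 becomes 120).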
Evaluation — CoIR (NDCG@10, full corpora)
Retrieval quality on the CoIR tasks that match this model's design (short code + retrieval). Four of CoIR's ten tasks — code↔code translation, multi-turn dialogue, long problem-statements — exceed the 128-token / retrieval scope and are not shown.
| CoIR task | NDCG@10 | Pattern |
|---|---|---|
| codesearchnet (6-lang avg) | 73.17 | docstring / NL → code |
| stackoverflow-qa | 62.70 | short question → code |
| synthetic-text2sql | 61.32 | NL → SQL |
| codesearchnet-ccr (6-lang avg) | 57.30 | code → related code |
| codefeedback-st | 56.38 | NL instruction → code |
| cosqa | 35.51 | NL question → code (noisy / hard) |
| Average | 57.73 | |
Per language, codesearchnet (NL→code): python 88.82, java 75.78, php 73.71, go 73.49, ruby 64.10, js 63.09.
Per language, codesearchnet-ccr (code→code): ruby 65.67, java 62.87, js 61.05, python 55.88, go 51.53, php 46.77.
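For readers unfamiliar with the metric, NDCG@10 scores the top ten results of each query by discounted gain, normalized by the best achievable ordering; the table reports the mean over queries, times 100. The sketch below is a per-query version with linear gain. It is illustrative only and is not the CoIR evaluation harness; `ndcg_at_k` is a hypothetical name.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    relevance:  dict of doc id -> graded gain (ids not present count as 0).
    """
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A single relevant document at rank 1 gives 1.0; the same document at rank 2 gives 1 / log2(3), roughly 0.63.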
Binary (1-bit sign) vectors retain 94.4% of the float NDCG before any rescore — the embeddings are clean by sign, so a Hamming (XOR + popcount) index over 1-bit codes gives a ~32× memory / search win at a small quality cost.
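The 1-bit path described above can be sketched with NumPy: keep only the sign of each dimension, pack eight dimensions per byte, and compare by XOR plus popcount. This is an illustration, not the repo's index implementation; `to_bits` and `hamming_search` are hypothetical names, and popcount is done here by unpacking bits rather than with a hardware instruction.

```python
import numpy as np

def to_bits(vecs: np.ndarray) -> np.ndarray:
    """Sign-binarize float vectors and pack 8 dims per byte: [N, D] -> [N, D // 8]."""
    return np.packbits(vecs > 0, axis=1)

def hamming_search(query_bits: np.ndarray, doc_bits: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k documents with the smallest Hamming distance."""
    xor = np.bitwise_xor(doc_bits, query_bits)       # differing bits, still packed
    dist = np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per row
    return np.argsort(dist, kind="stable")[:k]
```

A 768-dim float32 vector takes 3,072 bytes and its packed sign code takes 96, which is where the ~32× figure comes from. The candidates returned by the Hamming pass can then be rescored with the float vectors if the remaining quality gap matters.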
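For scale, the 1.5B-parameter bge-code-v1 scores 81.77 on full CoIR; this is a 54.5M-parameter model (about 27× smaller) tuned for short-code retrieval. Note that the 57.73 average above covers six of the ten CoIR tasks, so the two numbers are not directly comparable.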
Performance (embeddings / sec)
| Backend | Hardware | Throughput |
|---|---|---|
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | ~24,000 emb/s |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO — all 3 in parallel | iGPU + NPU + CPU concurrently | ~1,290 emb/s |
The combined OpenVINO figure comes from genuine concurrent multi-device execution: three workers (iGPU,
NPU, CPU) embed different batches at the same time. The ~1,290 emb/s total is somewhat below the
~1,529 emb/s sum of the three standalone figures, so the devices do not scale perfectly when run together.
This is not OpenVINO's AUTO mode (which picks a single device per inference). Measured on a Core Ultra
(Lunar Lake) laptop; the TensorRT figure is on the bucketed batch path.
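The concurrent pattern described above can be sketched as one worker thread per device, all pulling from a shared queue of batches. How the author's runner is implemented is not stated; this is a generic sketch in which `run_workers` is a hypothetical name and each entry of `embed_fns` stands in for a callable wrapping a model compiled for one device. It assumes the inference call releases the GIL, as native runtimes typically do.

```python
import queue
import threading

def run_workers(batches, embed_fns):
    """Feed batches to several device workers concurrently.

    batches:   list of input batches.
    embed_fns: dict of device label -> callable(batch) returning embeddings.
    Each worker takes the next batch as soon as it is free, so faster
    devices naturally process more batches.
    """
    todo = queue.Queue()
    for i, batch in enumerate(batches):
        todo.put((i, batch))
    results = [None] * len(batches)

    def worker(fn):
        while True:
            try:
                i, batch = todo.get_nowait()
            except queue.Empty:
                return
            results[i] = fn(batch)  # each index is written by exactly one worker

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in embed_fns.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Results come back in the original batch order regardless of which device handled which batch.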
License & training data
Released under the MIT license.
The teacher (Alibaba-NLP/gte-modernbert-base) is Apache-2.0, and the XLM-R architecture is MIT. As is
standard practice for distilled embedding models, the weights are released under MIT. For
transparency, the training corpus the teacher embedded includes:
| Dataset | License note |
|---|---|
| Fsoft-AIC/the-vault-function (code) | dataset MIT; underlying code has mixed upstream provenance |
| unicamp-dl/mmarco (EN/RU retrieval) | MS MARCO-derived → non-commercial research terms |
| sentence-transformers/all-nli | SNLI (CC BY-SA 4.0) + MultiNLI |
| sentence-transformers/gooaq | Apache-2.0 |
| jinaai/negation-dataset | see source repo |
⚠️ If your use requires strict training-data-license compliance, note that mMARCO derives from MS MARCO (non-commercial). Whether a distilled model inherits dataset-use terms is legally unsettled; this is not legal advice. A data-clean variant can be retrained without the mMARCO splits if needed.
Attribution
Distilled from Alibaba-NLP/gte-modernbert-base (Apache-2.0). Backbone: XLM-RoBERTa (MIT).