code-daemon-embed-v1

A small, fast code-embedding model for semantic code search and code retrieval. It maps short code units — function and method bodies, signatures, docstrings — and short natural-language queries into a shared 768-dim vector space.

It is specialized for short code, not long documents: the maximum sequence length is 128 tokens, trading long-context capability for high throughput and strong quality on short units.

768-dim embeddings, Matryoshka (MRL) — truncatable to 512 / 256 dims with graceful decay.
~54.5M params — XLM-RoBERTa architecture, 4 layers / 768 hidden, a code-oriented 32k SentencePiece vocab.
Mean pooling fused into the graph — the output is already pooled ([batch, 768]); just L2-normalize.
Trained at sequence length 128 (length buckets s / m / l = seq 40 / 64 / 128).

Where it's good — and where it isn't

Measured on CoIR (NDCG@10, full corpora). Use it for:

Strong:

Code → related code — finding similar or duplicate implementations (its best relative area).
Natural-language → code — docstring or description → function (strongest on Python).
Short question → code — "how / where does X…".
NL → SQL and NL instruction → code.

Weak / out of scope:

Long documents — hard 128-token cap; longer inputs are truncated. This is not a long-context retriever.
Noisy / ambiguous NL→code (hard, under-specified queries) — mid quality.
General English prose (medical / financial / news) — the code-specialized 32k vocab trades general-text coverage for code. Multilingual text works as a fallback, not a specialty.

Embed queries and documents the same way — no instruction prefix. For smaller indexes, truncate to 256 or 512 dims (MRL) before normalizing.

How it was made

Knowledge-distilled (embedding regression) from Alibaba-NLP/gte-modernbert-base (Apache-2.0, ~150M, a strong general + code retriever on a ModernBERT backbone). The student is a fresh, shallow-wide XLM-R encoder trained from scratch on the teacher's passage embeddings over a ~30M-sample code + text corpus, with a custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose). The shallow-wide 4-layer / 768-hidden shape keeps inference cheap while distilling at the teacher's full 768-dim width.
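
The training objective is described here only as "embedding regression", so the following is a rough sketch of what such a loss can look like, not the recipe used for this model. It assumes the student is pushed toward the frozen teacher embedding by squared Euclidean distance between unit vectors; the function name and the choice of distance are illustrative assumptions.

```python
import numpy as np

def embedding_regression_loss(student, teacher):
    """Mean squared distance between L2-normalized student and teacher rows.

    Hypothetical objective for illustration only. For unit vectors this
    equals 2 - 2 * cosine_similarity, so it is 0 when directions match.
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return float(np.mean(np.sum((s - t) ** 2, axis=1)))
```

Under this assumption the teacher embeddings can be computed once over the corpus and cached, so the teacher never has to run during student training.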

Built for speed

Short context by design — max 128 tokens, no long-document path, so the engines avoid a wide dynamic shape range.
Rectangular TensorRT profiles — each length bucket is a fixed shape (min == opt == max), one optimal kernel set per bucket: s = batch 64 × seq 40 · m = batch 128 × seq 64 · l = batch 256 × seq 128.
INT8 (W8A16) weights; mean-pool + projection + L2-norm fused into the graph (one pass → [B, 768]).
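
Because every engine has one fixed shape, a caller has to route each input to a bucket and pad it to that bucket's sequence length. A minimal sketch of that routing, assuming the policy is "smallest bucket that fits" (the helper names `pick_bucket` and `pad_to_bucket` are hypothetical, and the actual daemon may route differently):

```python
# Bucket table copied from the profiles above: name -> (batch, seq).
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens.

    Assumed policy for illustration; inputs longer than 128 tokens land
    in "l" and are truncated when padded.
    """
    for name, (_, seq) in sorted(BUCKETS.items(), key=lambda kv: kv[1][1]):
        if n_tokens <= seq:
            return name
    return "l"

def pad_to_bucket(ids, bucket, pad_id=0):
    """Truncate or right-pad one token-id list to the bucket's fixed length."""
    seq = BUCKETS[bucket][1]
    ids = ids[:seq]
    return ids + [pad_id] * (seq - len(ids))
```

A full batch for bucket `m` would then be 128 such rows of 64 ids each; a partially filled batch needs all-padding rows to reach the fixed batch size.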

Usage (standalone ONNX)

The FP32 model.onnx is bundled. Tokenize with the bundled sentencepiece.bpe.model and run the session; the output is already mean-pooled to [B, 768], so the only post-processing is L2 normalization:

import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    ids  = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]          # bos … eos
    L    = max(len(x) for x in ids)
    inp  = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64) # pad=0
    mask = (inp != 0).astype(np.int64)
    out  = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]   # already mean-pooled [B,768]
    out  = out[:, :mrl_dim]                                                # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
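
Since the vectors returned by `embed` are unit length, retrieval reduces to a dot product. A minimal brute-force search sketch (the `top_k` helper is an illustrative name, not part of this repo; a real deployment would likely use a vector index instead):

```python
import numpy as np

def top_k(query_vecs, doc_vecs, k=5):
    """Return (indices, scores) of the k most similar documents per query.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = query_vecs @ doc_vecs.T                 # [Q, D] cosine similarities
    k = min(k, doc_vecs.shape[0])
    idx = np.argsort(-scores, axis=1)[:, :k]         # best-first document indices
    return idx, np.take_along_axis(scores, idx, axis=1)
```

With the function above, usage would look like `idx, scores = top_k(embed(["parse a json file"]), embed(snippets))`, where `snippets` is your own list of code strings; queries and documents go through the same `embed` call with no prefix.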

What's in this repo — ready-to-run compiled engines

Pre-compiled engines, named per runtime × GPU arch × OS × length-bucket — pick the one matching your runtime and hardware; no compilation needed.

TensorRT *.engine — NVIDIA, INT8 W8A16: code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
TVM *_tvm_vulkan.{dll,so} — Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
OpenVINO *.xml + *.bin — Intel CPU / iGPU / NPU, per bucket.
Metal *_tvm_metal.* — Apple Silicon (macOS), per bucket.
Tokenizer — sentencepiece.bpe.model (specials at pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + tokenizer_config.json.
ONNX source — model.onnx (+ model.onnx.data) FP32 and model_int8qdt.onnx (INT8 W8A16).
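
The TensorRT naming pattern above can be turned into a small lookup helper. This is a sketch only: `trt_engine_name` is a hypothetical function, and how you detect the OS tag and compute capability (for example 8.9 becoming 89) is left to the caller.

```python
def trt_engine_name(bucket, os_tag, sm):
    """Build a TensorRT engine filename from the naming pattern listed above."""
    if bucket not in ("s", "m", "l"):
        raise ValueError(f"unknown bucket: {bucket!r}")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError(f"unknown OS tag: {os_tag!r}")
    if sm not in (86, 89, 120):
        raise ValueError(f"no prebuilt engine for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"
```

Raising on an unlisted architecture makes it explicit when to fall back to the Vulkan, OpenVINO, Metal, or ONNX artifacts instead.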

Evaluation — CoIR (NDCG@10, full corpora)

Retrieval quality on the CoIR tasks that match this model's design (short code + retrieval). Four of CoIR's ten tasks — code↔code translation, multi-turn dialogue, long problem-statements — exceed the 128-token / retrieval scope and are not shown.

CoIR task	NDCG@10	Pattern
codesearchnet (6-lang avg)	73.17	docstring / NL → code
stackoverflow-qa	62.70	short question → code
synthetic-text2sql	61.32	NL → SQL
codesearchnet-ccr (6-lang avg)	57.30	code → related code
codefeedback-st	56.38	NL instruction → code
cosqa	35.51	NL question → code (noisy / hard)
Average	57.73

Per language — codesearchnet (NL→code): python 88.82, java 75.78, php 73.71, go 73.49, ruby 64.10, js 63.09. Per language — codesearchnet-ccr (code→code): ruby 65.67, java 62.87, js 61.05, python 55.88, go 51.53, php 46.77.
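
For readers unfamiliar with the metric, here is a minimal per-query NDCG@10 sketch using the linear-gain formulation. It is meant only to show what the numbers measure; the scores above come from the CoIR benchmark's own tooling, which remains the reference, and the table reports the value multiplied by 100 and averaged over queries.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    relevance:  dict mapping each relevant document id to a gain > 0.
    """
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single relevant document, the score is 1.0 when it is ranked first and drops to 1 / log2(3), about 0.63, when it is ranked second.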

Binary (1-bit sign) vectors retain 94.4% of the float NDCG before any rescore — the embeddings are clean by sign, so a Hamming (XOR + popcount) index over 1-bit codes gives a ~32× memory / search win at a small quality cost.
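
The sign-quantization and Hamming search described above can be sketched with NumPy alone. The helper names are illustrative, and the byte lookup table stands in for a hardware popcount instruction that a production index would use.

```python
import numpy as np

# Popcount lookup for every byte value 0..255.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def to_binary(vecs):
    """Pack the sign of each dimension into bits: [N, 768] floats -> [N, 96] uint8."""
    return np.packbits(vecs > 0, axis=1)

def hamming_top_k(query_code, doc_codes, k=10):
    """Return indices and Hamming distances of the k codes closest to one query code."""
    xor = np.bitwise_xor(doc_codes, query_code)         # differing bits, byte-packed
    dist = _POPCOUNT[xor].sum(axis=1, dtype=np.int64)   # popcount per document
    idx = np.argsort(dist, kind="stable")[:k]
    return idx, dist[idx]
```

Each 768-dim float32 vector (3,072 bytes) becomes 96 bytes, which is where the ~32× figure comes from. A common follow-up, hinted at by "before any rescore", is to re-rank the Hamming candidates with the full float vectors.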

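For scale, the 1.5B-parameter bge-code-v1 scores 81.77 on full CoIR; this is a 54.5M model (about 27× smaller) tuned for short-code retrieval. The 57.73 average above covers six of CoIR's ten tasks, so the two figures are not directly comparable.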

Performance (embeddings / sec)

Backend	Hardware	Throughput
TensorRT INT8	NVIDIA RTX 5060 (sm_120)	~24,000 emb/s
OpenVINO INT4	Intel iGPU (Xe2, Lunar Lake)	~580 emb/s
OpenVINO INT4	Intel NPU (NPU4)	~574 emb/s
OpenVINO INT8	Intel CPU (Core Ultra)	~375 emb/s
OpenVINO — all 3 in parallel	iGPU + NPU + CPU concurrently	~1,290 emb/s

The combined OpenVINO figure is genuine concurrent multi-device execution: three workers (iGPU, NPU, CPU) embed different batches at the same time. The ~1,290 emb/s total is about 84% of the ~1,529 emb/s sum of the three standalone figures, so scaling is close to additive but not perfectly so. This is not OpenVINO's AUTO mode (which picks a single device per inference). Measured on a Core Ultra (Lunar Lake) laptop; the TensorRT figure is on the bucketed batch path.
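
The worker pattern described above can be sketched with a shared queue and one thread per device. This is an assumption-laden outline, not the daemon's actual code: `run_workers` is a hypothetical name, and each entry in `embed_fns` stands in for a callable that wraps a model compiled for one device. It also assumes the inference call releases the GIL, as native runtimes typically do.

```python
import queue
import threading

def run_workers(batches, embed_fns):
    """Embed batches concurrently, one thread per device-bound callable.

    embed_fns maps a device label to a callable(batch) -> embeddings.
    Returns results in the original batch order.
    """
    todo = queue.Queue()
    for item in enumerate(batches):
        todo.put(item)
    results = [None] * len(batches)

    def worker(fn):
        while True:
            try:
                i, batch = todo.get_nowait()
            except queue.Empty:
                return                      # queue drained, this device is done
            results[i] = fn(batch)

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in embed_fns.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Pulling from a shared queue lets faster devices take more batches, which matters when per-device throughput differs as much as in the table.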

License & training data

Released under the MIT license.

The teacher (Alibaba-NLP/gte-modernbert-base) is Apache-2.0, and the XLM-R architecture is MIT. As is standard practice for distilled embedding models, the weights are released under MIT. For transparency, the training corpus the teacher embedded includes:

Dataset	License note
`Fsoft-AIC/the-vault-function` (code)	dataset MIT; underlying code has mixed upstream provenance
`unicamp-dl/mmarco` (EN/RU retrieval)	MS MARCO-derived → non-commercial research terms
`sentence-transformers/all-nli`	SNLI (CC BY-SA 4.0) + MultiNLI
`sentence-transformers/gooaq`	Apache-2.0
`jinaai/negation-dataset`	see source repo

⚠️ If your use requires strict training-data-license compliance, note that mMARCO derives from MS MARCO (non-commercial). Whether a distilled model inherits dataset-use terms is legally unsettled; this is not legal advice. A data-clean variant can be retrained without the mMARCO splits if needed.

Attribution

Distilled from Alibaba-NLP/gte-modernbert-base (Apache-2.0). Backbone: XLM-RoBERTa (MIT).
