code-daemon-embed-v1

A small, fast code-embedding model for semantic code search and code retrieval. It maps short code units — function and method bodies, signatures, docstrings — and short natural-language queries into a shared 768-dim vector space.

It is specialized for short code, not long documents: the maximum sequence length is 128 tokens, trading long-context capability for high throughput and strong quality on short units.

  • 768-dim embeddings, Matryoshka (MRL) — truncatable to 512 / 256 dims with graceful decay.
  • ~54.5M params — XLM-RoBERTa architecture, 4 layers / 768 hidden, a code-oriented 32k SentencePiece vocab.
  • Mean pooling fused into the graph — the output is already pooled ([batch, 768]); just L2-normalize.
  • Trained at sequence length 128 (length buckets s / m / l = seq 40 / 64 / 128).
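
For intuition, the pooling step that the graph performs internally can be written out in NumPy. This is a minimal sketch of masked mean pooling in general, assuming token-level hidden states of shape [batch, seq, dim]. The function name `masked_mean_pool` is illustrative, and nothing like it needs to be called when using the bundled model.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden:         float array [batch, seq, dim]
    attention_mask: int array   [batch, seq], 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)   # [batch, seq, 1]
    summed = (hidden * mask).sum(axis=1)                     # [batch, dim]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # avoid divide-by-zero
    return summed / counts

hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])    # last token is padding
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)                      # averages the first two tokens only
```

Because padding is masked out, the result does not depend on how far a short input was padded.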

Where it's good — and where it isn't

The strengths and weaknesses below are based on CoIR results (NDCG@10, full corpora).

Strong:

  • Code → related code — finding similar or duplicate implementations (its best relative area).
  • Natural-language → code — docstring or description → function (strongest on Python).
  • Short question → code — "how / where does X…".
  • NL → SQL and NL instruction → code.

Weak / out of scope:

  • Long documents — hard 128-token cap; longer inputs are truncated. This is not a long-context retriever.
  • Noisy / ambiguous NL→code (hard, under-specified queries) — mid quality.
  • General English prose (medical / financial / news) — the code-specialized 32k vocab trades general-text coverage for code. Multilingual text works as a fallback, not a specialty.

Embed queries and documents the same way — no instruction prefix. For smaller indexes, truncate to 256 or 512 dims (MRL) before normalizing.
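
The truncate-then-normalize order can be sketched as follows. This is an illustrative example rather than part of the released code: `mrl_truncate` and `rank` are hypothetical helper names, and random vectors stand in for the raw [B, 768] model outputs.

```python
import numpy as np

def mrl_truncate(vectors, dim=256):
    """Keep the first `dim` components, then L2-normalize each row."""
    cut = np.asarray(vectors, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.maximum(norms, 1e-12)      # guard against all-zero rows

def rank(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity, best first."""
    scores = doc_vecs @ query_vec              # dot product == cosine for unit vectors
    return np.argsort(-scores)

# Random stand-ins for the raw [B, 768] model outputs.
rng = np.random.default_rng(0)
docs_raw = rng.normal(size=(5, 768))
query_raw = docs_raw[3] + 0.01 * rng.normal(size=768)   # near-duplicate of document 3

docs = mrl_truncate(docs_raw, dim=256)
query = mrl_truncate(query_raw[None, :], dim=256)[0]
order = rank(query, docs)                      # document 3 should rank first
```

Normalizing after the cut matters: a prefix of a unit vector is generally no longer unit length, so dot products would stop being cosine similarities.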

How it was made

Knowledge-distilled (embedding regression) from Alibaba-NLP/gte-modernbert-base (Apache-2.0, ~150M, a strong general + code retriever on a ModernBERT backbone). The student is a fresh, shallow-wide XLM-R encoder trained from scratch on the teacher's passage embeddings over a ~30M-sample code + text corpus, with a custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose). The shallow-wide 4-layer / 768-hidden shape keeps inference cheap while distilling at the teacher's full 768-dim width.

Built for speed

  • Short context by design — max 128 tokens, no long-document path, so the engines avoid a wide dynamic shape range.
  • Rectangular TensorRT profiles — each length bucket is a fixed shape (min == opt == max), one optimal kernel set per bucket: s = batch 64 × seq 40 · m = batch 128 × seq 64 · l = batch 256 × seq 128.
  • INT8 (W8A16) weights; mean-pool + projection + L2-norm fused into the graph (one pass → [B, 768]).
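
Since each engine accepts exactly one shape, a caller has to route every input to a bucket and pad it to that bucket's length. The sketch below shows one plausible way to do that. The routing rule (smallest bucket that fits) and the helper names `pick_bucket` and `pad_to_bucket` are assumptions for illustration, not the project's actual dispatch code.

```python
# Bucket shapes as listed above: name -> (batch size, sequence length).
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(num_tokens):
    """Return the smallest bucket whose sequence length fits `num_tokens`.

    Inputs longer than 128 tokens fall into "l" and must be truncated first.
    """
    for name, (_, seq_len) in BUCKETS.items():   # dicts keep insertion order: s, m, l
        if num_tokens <= seq_len:
            return name
    return "l"

def pad_to_bucket(token_ids, bucket, pad_id=0):
    """Truncate or right-pad one sequence to the bucket's fixed length."""
    seq_len = BUCKETS[bucket][1]
    clipped = token_ids[:seq_len]
    return clipped + [pad_id] * (seq_len - len(clipped))

ids = list(range(2, 52))                 # a 50-token input
bucket = pick_bucket(len(ids))           # 50 > 40, so it goes to "m"
padded = pad_to_bucket(ids, bucket)      # length 64
```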

Usage (standalone ONNX)

The FP32 model.onnx is bundled. Tokenize with the bundled sentencepiece.bpe.model and run the session; the output is already mean-pooled to [B, 768], so only L2 normalization remains:

import onnxruntime as ort, sentencepiece as spm, numpy as np

sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")  # pad=0 unk=1 bos=2 eos=3
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    ids  = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]          # bos … eos
    L    = max(len(x) for x in ids)
    inp  = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64) # pad=0
    mask = (inp != 0).astype(np.int64)
    out  = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]   # already mean-pooled [B,768]
    out  = out[:, :mrl_dim]                                                # MRL truncation (768/512/256)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

What's in this repo — ready-to-run compiled engines

Pre-compiled engines, named per runtime × GPU arch × OS × length-bucket — pick the one matching your runtime and hardware; no compilation needed.

  • TensorRT *.engine — NVIDIA, INT8 W8A16: code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
  • TVM *_tvm_vulkan.{dll,so} — Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
  • OpenVINO *.xml + *.bin — Intel CPU / iGPU / NPU, per bucket.
  • Metal *_tvm_metal.* — Apple Silicon (macOS), per bucket.
  • Tokenizer: sentencepiece.bpe.model (specials at pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + tokenizer_config.json.
  • ONNX source: model.onnx (+ model.onnx.data) FP32 and model_int8qdt.onnx (INT8 W8A16).
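
The TensorRT naming pattern above can be turned into a small lookup helper. This is a sketch only: `trt_engine_name` is a hypothetical function, and the allowed values are copied from the pattern in the list rather than read from the repository.

```python
BUCKETS = ("s", "m", "l")
PLATFORMS = ("win_x64", "linux_x64")
SM_ARCHS = (86, 89, 120)

def trt_engine_name(bucket, platform, sm):
    """Build a TensorRT engine file name following the pattern listed above."""
    if bucket not in BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    if sm not in SM_ARCHS:
        raise ValueError(f"no prebuilt engine listed for sm_{sm}")
    return f"code-daemon-embed-v1-{bucket}_{platform}_trt_sm_{sm}.engine"

name = trt_engine_name("m", "linux_x64", 89)
```

Rejecting unlisted architectures up front gives a clearer error than a missing-file failure later.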

Evaluation — CoIR (NDCG@10, full corpora)

Retrieval quality on the CoIR tasks that match this model's design (short code + retrieval). Four of CoIR's ten tasks — code↔code translation, multi-turn dialogue, long problem-statements — exceed the 128-token / retrieval scope and are not shown.

| CoIR task | NDCG@10 | Pattern |
| --- | --- | --- |
| codesearchnet (6-lang avg) | 73.17 | docstring / NL → code |
| stackoverflow-qa | 62.70 | short question → code |
| synthetic-text2sql | 61.32 | NL → SQL |
| codesearchnet-ccr (6-lang avg) | 57.30 | code → related code |
| codefeedback-st | 56.38 | NL instruction → code |
| cosqa | 35.51 | NL question → code (noisy / hard) |
| Average | 57.73 | |

Per language — codesearchnet (NL→code): python 88.82, java 75.78, php 73.71, go 73.49, ruby 64.10, js 63.09. Per language — codesearchnet-ccr (code→code): ruby 65.67, java 62.87, js 61.05, python 55.88, go 51.53, php 46.77.
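
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It uses linear gain and is not the CoIR evaluation harness. `ndcg_at_k` and the document ids are illustrative, and the function returns values in the 0 to 1 range, whereas the figures above are scaled to 0 to 100 and averaged over queries.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them
    relevance:  dict mapping document id -> graded relevance (missing = 0)
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

qrels = {"f1": 1}                                   # one relevant function for this query
top_hit = ndcg_at_k(["f1", "f2", "f3"], qrels)      # relevant result at rank 1
second = ndcg_at_k(["f2", "f1", "f3"], qrels)       # relevant result at rank 2
missed = ndcg_at_k(["f2", "f3", "f4"], qrels)       # relevant result not retrieved
```

The logarithmic discount means that placing the right function second rather than first already costs more than a third of the score for that query.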

Binary (1-bit sign) vectors retain 94.4% of the float NDCG before any rescore — the embeddings are clean by sign, so a Hamming (XOR + popcount) index over 1-bit codes gives a ~32× memory / search win at a small quality cost.
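
The 1-bit scheme can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the author's indexing code: `to_binary` and `hamming` are hypothetical helpers, the threshold is simply "greater than zero", and random vectors stand in for real embeddings.

```python
import numpy as np

def to_binary(vectors):
    """Keep only the sign of each component and pack 8 signs per byte."""
    bits = (np.asarray(vectors) > 0).astype(np.uint8)    # [N, 768] of 0/1
    return np.packbits(bits, axis=1)                      # [N, 96] uint8

def hamming(query_code, doc_codes):
    """Hamming distance from one packed code to every row of `doc_codes`."""
    xor = np.bitwise_xor(doc_codes, query_code)           # differing bits set to 1
    return np.unpackbits(xor, axis=1).sum(axis=1)         # popcount per row

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 768)).astype(np.float32)       # stand-ins for real embeddings
query = docs[2] + 0.05 * rng.normal(size=768)             # close to document 2

codes = to_binary(docs)                                   # 96 bytes per vector vs 3072 for float32
dist = hamming(to_binary(query[None, :])[0], codes)
best = int(np.argmin(dist))                               # expected to be document 2
```

A common follow-up, referred to above as a rescore, is to take the top candidates by Hamming distance and re-rank them with the float vectors.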

For scale, the 1.5B-parameter bge-code-v1 scores 81.77 on full CoIR — this is a 54.5M model (27× smaller) tuned for short-code retrieval.

Performance (embeddings / sec)

| Backend | Hardware | Throughput |
| --- | --- | --- |
| TensorRT INT8 | NVIDIA RTX 5060 (sm_120) | ~24,000 emb/s |
| OpenVINO INT4 | Intel iGPU (Xe2, Lunar Lake) | ~580 emb/s |
| OpenVINO INT4 | Intel NPU (NPU4) | ~574 emb/s |
| OpenVINO INT8 | Intel CPU (Core Ultra) | ~375 emb/s |
| OpenVINO, all 3 in parallel | iGPU + NPU + CPU concurrently | ~1,290 emb/s |

The combined OpenVINO figure is genuine concurrent multi-device execution: three workers (iGPU, NPU, CPU) embed different batches at the same time and their throughputs combine. The measured ~1,290 emb/s is about 84% of the ~1,529 emb/s sum of the three single-device figures, so scaling is close to additive but not perfectly so. This is not OpenVINO's AUTO mode (which picks a single device per inference). Measured on a Core Ultra (Lunar Lake) laptop; the TensorRT figure is on the bucketed batch path.
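
The worker pattern described here can be sketched with the standard library alone. The example below does not use the OpenVINO API: `run_workers` is a hypothetical helper, and `fake_embed` is a placeholder for a callable that would wrap a model compiled for one device.

```python
import queue
import threading

def run_workers(batches, devices):
    """Embed `batches` using one thread per device, all pulling from one queue.

    devices: dict mapping a device name to a callable that embeds one batch.
    Returns results in the original batch order.
    """
    work = queue.Queue()
    for item in enumerate(batches):
        work.put(item)
    results = [None] * len(batches)

    def worker(embed_fn):
        while True:
            try:
                index, batch = work.get_nowait()
            except queue.Empty:
                return                        # queue drained, worker exits
            results[index] = embed_fn(batch)  # each index is written by exactly one thread

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in devices.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Placeholder embedders; a real setup would wrap one compiled model per device.
def fake_embed(batch):
    return [len(text) for text in batch]

devices = {"GPU": fake_embed, "NPU": fake_embed, "CPU": fake_embed}
out = run_workers([["a", "bb"], ["ccc"], ["dddd", "e"]], devices)
```

With a shared queue, faster devices return for new work sooner and so take a larger share of the batches, which is one simple way to balance devices of unequal speed.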

License & training data

Released under the MIT license.

The teacher (Alibaba-NLP/gte-modernbert-base) is Apache-2.0, and the XLM-R architecture is MIT. As is standard practice for distilled embedding models, the weights are released under MIT. For transparency, the training corpus the teacher embedded includes:

| Dataset | License note |
| --- | --- |
| Fsoft-AIC/the-vault-function (code) | dataset MIT; underlying code has mixed upstream provenance |
| unicamp-dl/mmarco (EN/RU retrieval) | MS MARCO-derived → non-commercial research terms |
| sentence-transformers/all-nli | SNLI (CC BY-SA 4.0) + MultiNLI |
| sentence-transformers/gooaq | Apache-2.0 |
| jinaai/negation-dataset | see source repo |

⚠️ If your use requires strict training-data-license compliance, note that mMARCO derives from MS MARCO (non-commercial). Whether a distilled model inherits dataset-use terms is legally unsettled; this is not legal advice. A data-clean variant can be retrained without the mMARCO splits if needed.

Attribution

Distilled from Alibaba-NLP/gte-modernbert-base (Apache-2.0). Backbone: XLM-RoBERTa (MIT).
