codelion (Asankhaya Sharma)

reacted to their post with 🤗👀🚀🔥 16 days ago

SPROG-9M: a 9.37M-parameter model trained from scratch to solve GSM8K-style math without using an LLM at inference.

The model, codelion/sprog-9m, predicts symbolic programs over number slots, then a deterministic executor does the arithmetic. With a simple verifier, it reaches ~11.8% on GSM8K test.
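To make the "program over number slots, then executor" idea concrete, here is a minimal sketch. The program format (`op`, slot references like `n0`, intermediate references like `t0`) and the helper names are assumptions for illustration, not the actual SPROG-9M format; the model's job would be to predict the `program` list, which is hard-coded here.

```python
import operator
import re
from fractions import Fraction

# Hypothetical op set; the real model's vocabulary of operations may differ.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}

def extract_slots(problem: str) -> list[Fraction]:
    """Collect the numbers in the problem text, in order, as slots n0, n1, ..."""
    return [Fraction(m) for m in re.findall(r"\d+(?:\.\d+)?", problem)]

def execute(program: list[tuple[str, str, str]], slots: list[Fraction]) -> Fraction:
    """Run each (op, a, b) step; 'nK' reads slot K, 'tK' reads the result of step K."""
    temps: list[Fraction] = []

    def resolve(ref: str) -> Fraction:
        kind, idx = ref[0], int(ref[1:])
        return slots[idx] if kind == "n" else temps[idx]

    for op, a, b in program:
        temps.append(OPS[op](resolve(a), resolve(b)))
    return temps[-1]

problem = "Tom has 3 boxes with 12 apples each. He gives away 5 apples. How many are left?"
slots = extract_slots(problem)                          # n0=3, n1=12, n2=5
program = [("mul", "n0", "n1"), ("sub", "t0", "n2")]    # stand-in for a model prediction
answer = execute(program, slots)
print(answer)
```

Because the executor is deterministic, the arithmetic itself cannot go wrong; any error has to come from the predicted program or the slot extraction.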

We also released the dataset: codelion/gsm8k-synth, 117K validated synthetic GSM8K-style problems.

Tiny model, no pretraining, no LLM at inference, runs on a laptop.

replied to their post about 2 months ago

Thank you!

reacted to their post with 😎🤗👀🚀🔥 about 2 months ago

Inspired by the Nemotron Diffusion recipe, check out dhara-250m: a 250M experimental language model that supports three decoding modes from one set of weights: autoregressive, block-diffusion, and self-speculation.

It is small, easy to try, and meant for exploring diffusion-style decoding and latency tradeoffs in compact LMs.

Model: codelion/dhara-250m

Try the chat demo here: codelion/dhara-chat

replied to their post 4 months ago

Thank you 🫡 !

reacted to their post with 🤗👀🚀🔥 4 months ago

Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
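As a quick sanity check on those numbers, the sketch below recomputes the total token count and the implied average throughput. It uses only the figures quoted above; since the wall-clock time is given as roughly 78 hours, the throughput is an estimate.

```python
dataset_tokens = 10.2e9      # size of sutra-10B, from the post
epochs = 3
wall_clock_hours = 78        # approximate, as stated in the post

total_tokens = dataset_tokens * epochs
tokens_per_second = total_tokens / (wall_clock_hours * 3600)

print(f"total tokens: {total_tokens / 1e9:.1f}B")
print(f"average throughput: {tokens_per_second:,.0f} tokens/s")
```

That works out to roughly 109K tokens per second on average for the single A10.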

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.

reacted to their post with ❤️👍 6 months ago

Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM - a Memory-Augmented Language Model.

Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.

The article covers:

- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens
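A toy sketch of the key insight, under the assumption that a HashHop-style context is a set of `key -> value` hash pairs and the task is to follow a chain for a given number of hops. This is not the article's solver; the function names and sample hashes are made up for illustration. The point is that once each hash string maps to exactly one token id, retrieval becomes an exact lookup instead of a fuzzy match over sub-word pieces.

```python
def build_index(pairs: list[tuple[str, str]]) -> tuple[dict[str, int], dict[int, int]]:
    """Assign one token id per distinct hash and record key_id -> value_id links."""
    vocab: dict[str, int] = {}
    next_id: dict[int, int] = {}

    def token_id(h: str) -> int:
        # Each whole hash string is a single vocabulary entry.
        return vocab.setdefault(h, len(vocab))

    for key, value in pairs:
        key_id = token_id(key)
        value_id = token_id(value)
        next_id[key_id] = value_id
    return vocab, next_id

def follow(vocab: dict[str, int], next_id: dict[int, int], start: str, hops: int) -> str:
    """Follow the chain from `start` for `hops` exact lookups."""
    id_to_hash = {i: h for h, i in vocab.items()}
    current = vocab[start]
    for _ in range(hops):
        current = next_id[current]
    return id_to_hash[current]

# Hypothetical context; order of pairs does not matter.
pairs = [("Qm2", "zT7"), ("aX9", "Qm2"), ("zT7", "Lp4")]
vocab, next_id = build_index(pairs)
result = follow(vocab, next_id, "aX9", hops=2)
print(result)
```

Each hop costs one dictionary lookup regardless of how many pairs are in the context, which is the intuition behind the claim that context length stops being the limiting factor.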

Read the full article: https://huggingface.co/blog/codelion/reverse-engineering-magic-hashhop

Try the model: codelion/malm-165m

Code: https://github.com/codelion/hash-hop
