Ant-5M

Ant-5M is an ultra-compact, 4.73-million-parameter Llama-style transformer model. It was created as a personal engineering exercise to understand the fundamentals of custom tokenization and the pre-training loop on consumer-grade hardware, and in particular to check that a custom-trained tokenizer behaves sensibly.

⚠️ Critical Disclaimer: Manage Your Expectations

To be completely honest, the generation results are not satisfying, and you should not expect anything practical from this model. Due to severe capacity limits and dataset noise, the model frequently falls into infinite degenerate repetition loops (e.g., repeating "Sciences" or URL fragments like "://"). It is being released purely as an open-source log of a first-time training experiment, not as a functional assistant.

Pre-training Performance Metrics

The model was pre-trained for 10,000 steps (approximately 1.05 epochs) over a 24-hour window, logging a final validation loss of 4.20 and a word perplexity of 66.47.
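
As a rough sanity check on the two figures above, perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below assumes the reported loss is a natural-log average, which this card does not state; `perplexity_from_loss` is an illustrative helper, not part of the training code.

```python
import math

def perplexity_from_loss(mean_loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (natural log) into perplexity."""
    return math.exp(mean_loss_nats)

reported_loss = 4.20  # final validation loss quoted above (rounded to two decimals)
print(f"exp({reported_loss}) = {perplexity_from_loss(reported_loss):.2f}")
```

Under that assumption the rounded loss maps to a perplexity in the mid-60s, the same range as the reported 66.47; the small gap is consistent with the loss having been rounded.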

Technical Architecture Post-Mortem: Structural Breakdown of the 5M Parameter Model Collapse

This section analyzes the architectural bottlenecks in the configuration of the 5M parameter model (GODELEV/Ant-5M). While the dataset (GODELEV/Archaea-74M) has been validated as clean and structurally sound, this specific model suffered an architectural collapse because standard Llama hyper-parameters were scaled down proportionally without adjusting their internal ratios.

1. Vocabulary Embedding Bottleneck and Representation Choking

Configuration Setup: hidden_size: 128, vocab_size: 16384, tie_word_embeddings: true

Because tie_word_embeddings is set to true, the model's input token embedding layer and its final output language modeling head share a single weight matrix. The parameter count for this tied structure is calculated as:

$\text{Parameters} = \text{Vocab Size} \times \text{Hidden Size} = 16{,}384 \times 128 = 2{,}097{,}152$

Out of the total 4.73 million parameters allocated to this model, approximately 44.3% of its entire capacity is consumed solely by the token embedding table. This leaves roughly 2.63 million parameters distributed across all 11 transformer layers to calculate attention, sequence logic, and contextual dependencies.
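
The arithmetic behind this split can be reproduced in a few lines. This is a minimal sketch that takes the rounded 4.73M headline figure as the total, so the remainder is approximate.

```python
vocab_size = 16_384
hidden_size = 128
total_params = 4_730_000  # headline figure from this card (rounded)

# The tied embedding / LM head is a single vocab_size x hidden_size matrix.
embedding_params = vocab_size * hidden_size
remaining_params = total_params - embedding_params
embedding_share = embedding_params / total_params

print(f"tied embedding parameters: {embedding_params:,}")
print(f"share of total:            {embedding_share:.1%}")
print(f"left for the layers:       {remaining_params:,}")
```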

Squeezing a 16,384-token vocabulary into a 128-dimensional hidden space creates a representation choke point. The vector space lacks the geometric dimensionality needed to separate semantic concepts cleanly. As a result, distinct words overlap closely in vector space, and the model falls back on high-frequency training tokens like "Sciences" or "://" as the cheapest way to minimize its training loss.

2. Microscopic Attention Head Dimension and RoPE Distortion

Configuration Setup: hidden_size: 128, num_attention_heads: 8

The dimension of an individual attention head ($d_{\text{head}}$) dictates the fidelity of the Query and Key vector calculations. It is derived using the following formula:

$d_{\text{head}} = \frac{\text{hidden\_size}}{\text{num\_attention\_heads}} = \frac{128}{8} = 16$

In production-grade Large Language Models, the attention head dimension is typically 64 or 128 channels. Reducing $d_{\text{head}}$ to 16 leaves each Query-Key dot product with very little room to discriminate between tokens and severely degrades the self-attention mechanism.

Specifically, Rotary Positional Embeddings (RoPE) encode position by rotating pairs of dimensions, with each pair rotating at its own frequency. With only 16 dimensions available per head, there are just 8 such pairs, so RoPE has little resolution with which to distinguish positions across a long context. The attention heads lose track of where they are in the sequence after generating a few tokens, which triggers the repetition loops.
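
To make the head-dimension argument concrete, the sketch below lists the rotation frequencies RoPE would use for a 16-dimensional head and, for comparison, a 64-dimensional one. It assumes the commonly used base of 10,000 (this card does not state the model's RoPE base), and `rope_inverse_frequencies` is an illustrative helper rather than code from the training run.

```python
def rope_inverse_frequencies(head_dim: int, base: float = 10_000.0) -> list[float]:
    """Return one rotation frequency per pair of dimensions: base ** (-2i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("RoPE rotates dimensions in pairs, so head_dim must be even")
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

for dim in (16, 64):
    freqs = rope_inverse_frequencies(dim)
    print(f"head_dim={dim}: {len(freqs)} rotation pairs, "
          f"fastest={freqs[0]:.4f}, slowest={freqs[-1]:.6f}")
```

Under this assumption a 16-dimensional head gets 8 rotation pairs, against 32 for a 64-dimensional head.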

3. Excessive Key-Value Query Grouping

Configuration Setup: num_attention_heads: 8, num_key_value_heads: 2

Grouped-Query Attention (GQA) is an optimization technique designed to reduce memory bandwidth overhead in massive networks by forcing multiple query heads to share a single key-value projection head. In this configuration, the query-to-key-value ratio is:

$\text{GQA Ratio} = \frac{\text{num\_attention\_heads}}{\text{num\_key\_value\_heads}} = \frac{8}{2} = 4 \quad (\text{i.e. } 4{:}1)$

While a 4:1 GQA grouping is highly efficient for models with 7 billion parameters or more, it causes severe information degradation at the 5M parameter scale. Forcing four separate attention computational paths to share a single, cramped key-value tensor within an already narrow 128-hidden-size highway introduces massive loss of context. The model lacks the capacity to store unique context patterns, leading to structural flattening and degenerate repetitions.
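
The effect of the grouping on the projection weights can be counted directly. The sketch below assumes bias-free Q, K, V and O projections, as is usual for Llama-style blocks; `attention_params` is an illustrative helper, not taken from the model's code.

```python
def attention_params(hidden_size: int, num_heads: int, num_kv_heads: int) -> dict[str, int]:
    """Count bias-free projection weights for one attention block."""
    head_dim = hidden_size // num_heads
    kv_width = num_kv_heads * head_dim  # output width of the K and V projections
    return {
        "q_proj": hidden_size * hidden_size,
        "k_proj": hidden_size * kv_width,
        "v_proj": hidden_size * kv_width,
        "o_proj": hidden_size * hidden_size,
    }

gqa = attention_params(128, num_heads=8, num_kv_heads=2)
mha = attention_params(128, num_heads=8, num_kv_heads=8)
print("GQA (2 KV heads):", gqa, "total", sum(gqa.values()))
print("MHA (8 KV heads):", mha, "total", sum(mha.values()))
```

With 2 KV heads, the key and value projections each map the 128-wide hidden state down to only 32 channels, and the saving relative to full multi-head attention is 24,576 weights per layer.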

Corrected Architectural Design for Future Iterations

To prevent structural collapse while remaining within the strict 5M parameter hardware envelope, the width of the processing highway must be prioritized over the total layer depth. The following structural modifications are required for the next training iteration:

Expand Hidden Size: Increase hidden_size to 256 to allow proper geometric separation of vocabulary tokens.
Optimize Head Resolution: Lower num_attention_heads to 4. When paired with a 256 hidden size, this yields an optimal head dimension of 64 (256 / 4 = 64), restoring RoPE positioning accuracy.
Remove GQA: Set num_key_value_heads to 4 to match the attention heads. Micro-models need complete, standard Multi-Head Attention (MHA) to avoid information loss.
Rebalance Depth: Reduce num_hidden_layers from 11 down to 7. Allocating parameters to wider vectors rather than deep, narrow stacks is intended to reduce the representation loops.
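
Before committing to these settings, it is worth checking them against the parameter budget. The sketch below is a rough estimate that assumes the vocabulary stays at 16,384 tokens with tied embeddings and that the attention projections are bias-free; MLP and normalization weights are deliberately left out because the card does not give an intermediate size for the new design.

```python
vocab_size = 16_384   # ASSUMPTION: vocabulary unchanged from the current model
hidden_size = 256
num_layers = 7

# Tied input/output embedding, counted once.
embedding_params = vocab_size * hidden_size

# Standard MHA: four bias-free hidden x hidden projections (Q, K, V, O) per layer.
attention_params = num_layers * 4 * hidden_size * hidden_size

subtotal = embedding_params + attention_params
print(f"embedding: {embedding_params:,}")
print(f"attention: {attention_params:,}")
print(f"subtotal before MLP and norm weights: {subtotal:,}")
```

Under these assumptions the subtotal already exceeds 5M before any MLP weights are counted, so the next iteration would probably also need a smaller vocabulary or fewer layers to stay inside the stated envelope.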

Evaluation & Benchmarks (0-Shot)

Task Name	Benchmark	Metric	Score
boolq	BoolQ	acc	54.34%
copa	COPA	acc	54.00%
blimp	BLiMP	acc	50.76%
winogrande	WinoGrande	acc	50.12%
truthfulqa_mc2	TruthfulQA MC2	acc	49.63%
piqa	PIQA	acc_norm	48.86%
openbookqa	OpenBookQA	acc_norm	28.80%
arc_easy	ARC-Easy	acc_norm	26.35%
hellaswag	HellaSwag	acc_norm	26.00%
swag	SWAG	acc_norm	25.77%
arc_challenge	ARC-Challenge	acc_norm	25.77%
mmlu	MMLU	acc	25.01%
race	RACE	acc	21.53%
sciq	SciQ	acc_norm	20.70%
commonsense_qa	CommonsenseQA	acc	19.66%
lambada_openai	LAMBADA	acc	0.00%

ArithMark-2.0 Overall Accuracy: 24.76%
WikiText-2 Word Perplexity: 198,251.90

Hardware & Training Infrastructure

Hardware: Trained on a single NVIDIA T4 GPU (Kaggle Environment).
Training Time: Approximately 6 hours of continuous compute.
Dataset Volume: Trained on roughly 650 million tokens via the GODELEV/Archaea-5M-T dataset.

Technical Architecture Specs

Parameters: 4.73M
Layers: 11
Hidden Size: 128
Attention Heads: 8 (utilizing 2 KV heads for Grouped-Query Attention)
Vocabulary Size: 16,384 (Custom trained tokenizer)
Max Sequence Length: 1,024 tokens
Activation Function: SiLU
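
The specs above can be turned into an approximate parameter count. This is a sketch for a generic bias-free Llama-style decoder with tied embeddings; the MLP intermediate size is not listed in this card, so the value of 512 used here is an assumption, and `llama_style_param_count` is a hypothetical helper.

```python
def llama_style_param_count(vocab_size, hidden_size, num_layers, num_heads,
                            num_kv_heads, intermediate_size, tie_embeddings=True):
    """Estimate parameters for a bias-free Llama-style decoder."""
    head_dim = hidden_size // num_heads
    kv_width = num_kv_heads * head_dim
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_width  # Q, O + K, V
    mlp = 3 * hidden_size * intermediate_size  # gate, up and down projections
    norms = 2 * hidden_size                    # two RMSNorm weight vectors per layer
    embeddings = vocab_size * hidden_size * (1 if tie_embeddings else 2)
    # The trailing hidden_size term is the final RMSNorm before the LM head.
    return embeddings + num_layers * (attention + mlp + norms) + hidden_size

total = llama_style_param_count(
    vocab_size=16_384, hidden_size=128, num_layers=11, num_heads=8,
    num_kv_heads=2,
    intermediate_size=512,  # ASSUMPTION: not listed in the specs above
)
print(f"estimated parameters: {total:,}")
```

With the assumed intermediate size the estimate lands at roughly 4.71M, close to the 4.73M headline figure; a different MLP width would shift the result.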

Sandbox Usage & Mitigation

If you want to poke around the model weights, running standard greedy decoding will immediately break it into endless loops. You must pass heavy repetition constraints and use sampling to extract coherent text structure.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "GODELEV/Ant-5M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Question: What is 2 + 2?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Aggressive penalties are mandatory to suppress gibberish loops
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=2.5  # Prevents 'Sciences' and '://' loop lockups
)
# generate() returns a batch of sequences; decode the first (and only) one
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
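
One way to judge whether a given set of decoding settings actually suppresses the loops is to measure how repetitive the output is. The helper below is a hypothetical sketch (the name `repeated_ngram_fraction` and the choice of bigrams are assumptions, not part of the original experiment): it reports the fraction of n-grams in a token sequence that duplicate an n-gram seen earlier.

```python
from collections import Counter

def repeated_ngram_fraction(tokens: list[str], n: int = 2) -> float:
    """Fraction of n-grams that repeat an earlier n-gram (0.0 means no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first of each distinct n-gram is a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)

looping = "Sciences Sciences Sciences Sciences Sciences".split()
varied = "the model was trained on a single GPU".split()
print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

In practice the token list could come from `tokenizer.convert_ids_to_tokens(outputs[0].tolist())`, and the score can be compared across different `repetition_penalty` or `temperature` settings.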

Safetensors: model size 4.71M params, tensor type F32
