Ant-5M
Ant-5M is an ultra-compact, 4.73-million-parameter Llama-style transformer model. It was created as a personal engineering exercise to understand the fundamentals of custom tokenization and the pre-training loop on consumer-grade hardware, and in particular to validate the health of a custom tokenizer.
⚠️ Critical Disclaimer: Manage Your Expectations
To be completely honest, the generation results are not satisfying, and you should not expect anything practical from this model. Due to severe capacity limits and dataset noise, the model frequently falls into infinite degenerate repetition loops (e.g., repeating "Sciences" or URL fragments like "://"). It is being released purely as an open-source log of a first-time training experiment, not as a functional assistant.
Pre-training Performance Metrics
The model was pre-trained for 10,000 steps (approximately 1.05 epochs) over a 24-hour window, logging a final validation loss of 4.20 and a word perplexity of 66.47.
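As a rough sanity check on the two figures above, token-level perplexity is the exponential of the mean cross-entropy loss. The sketch below assumes the reported validation loss is a per-token cross-entropy in nats, which the source does not state explicitly. A word-level perplexity is normalized by word count instead, so the two numbers need not match exactly.

```python
import math

def perplexity_from_loss(loss_nats: float) -> float:
    """Token-level perplexity is exp(mean cross-entropy in nats)."""
    return math.exp(loss_nats)

val_loss = 4.20  # final validation loss reported above
token_ppl = perplexity_from_loss(val_loss)

print(f"token-level perplexity implied by loss {val_loss}: {token_ppl:.2f}")
```

The implied value lands close to the reported 66.47, which suggests the two metrics were computed on the same validation pass.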
Technical Architecture Post-Mortem: Structural Breakdown of the 5M Parameter Model Collapse
This section provides a rigorous, mathematical analysis of the architectural bottlenecks present in the configuration of the 5M parameter model (GODELEV/Ant-5M). While the dataset (GODELEV/Archaea-74M) has been validated as clean and structurally sound, this specific model suffered from an architectural collapse due to scaling down standard Llama hyper-parameters proportionally without adjusting internal processing ratios.
1. Vocabulary Embedding Bottleneck and Representation Choking
- Configuration Setup: `hidden_size: 128`, `vocab_size: 16384`, `tie_word_embeddings: true`
Because tie_word_embeddings is set to true, the model's input token embedding layer and its final output language modeling head share the same weight matrix. The parameter count for this tied structure is calculated as:
$$P_{\text{embed}} = \text{vocab\_size} \times \text{hidden\_size} = 16384 \times 128 = 2{,}097{,}152$$
Out of the total 4.73 million parameters allocated to this model, approximately 44.3% (about 2.10 million) is consumed solely by the token embedding table. This leaves roughly 2.63 million parameters distributed across all 11 transformer layers to compute attention, sequence logic, and contextual dependencies.
Mapping a 16,384-token vocabulary into a 128-dimensional hidden space creates a representation choke point. The vector space lacks the dimensionality needed to separate semantic concepts cleanly. As a result, distinct words overlap closely in vector space, and the model falls back on high-frequency training tokens like "Sciences" or "://" as the cheapest way to lower its training loss.
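The embedding arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using only the configuration values quoted in this section; the total of 4.73M is the rounded figure from this card, so the resulting share is approximate.

```python
# Values taken from the configuration quoted in this section.
vocab_size = 16_384
hidden_size = 128
total_params = 4_730_000  # 4.73M as reported (rounded), so results are approximate

# With tied embeddings, the input table and the LM head are one shared matrix.
embedding_params = vocab_size * hidden_size
embedding_share = embedding_params / total_params
remaining_params = total_params - embedding_params

print(f"embedding params: {embedding_params:,}")
print(f"share of model:   {embedding_share:.1%}")
print(f"left for layers:  {remaining_params:,}")
```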
2. Microscopic Attention Head Dimension and RoPE Distortion
- Configuration Setup: `hidden_size: 128`, `num_attention_heads: 8`
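The dimension of an individual attention head ($d_{\text{head}}$) dictates the fidelity of the Query and Key vector calculations. It is derived using the following formula:
$$d_{\text{head}} = \frac{\text{hidden\_size}}{\text{num\_attention\_heads}} = \frac{128}{8} = 16$$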
In production-grade Large Language Models, the attention head dimension is typically 64 or 128 channels. Reducing $d_{\text{head}}$ to 16 severely degrades the self-attention mechanism.
Specifically, Rotary Positional Embeddings (RoPE) encode position by rotating pairs of dimensions at different frequencies. With only 16 dimensions available per head, RoPE has just 8 rotation pairs, and therefore 8 frequencies, with which to encode positions over long contexts. The attention heads lose track of where they are in the sequence after generating a few tokens, triggering infinite loops.
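To make the head-dimension argument concrete, the sketch below computes $d_{\text{head}}$ and lists how many RoPE rotation frequencies each head gets, compared with a 64-dimensional head. It uses the standard RoPE inverse-frequency formula; the base of 10,000 is an assumption (the usual Llama default), since this card does not state the model's `rope_theta`.

```python
import math

hidden_size = 128
num_attention_heads = 8
rope_base = 10_000.0  # assumed: common Llama default, not stated in this card

head_dim = hidden_size // num_attention_heads

def rope_inv_freq(dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/dim), one per dimension pair."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]

for dim in (head_dim, 64):
    inv_freq = rope_inv_freq(dim, rope_base)
    # Wavelength in tokens: how many positions one full rotation spans.
    longest = 2 * math.pi / inv_freq[-1]
    print(f"head_dim={dim}: {len(inv_freq)} frequencies, "
          f"longest wavelength ~{longest:,.0f} tokens")
```

A 16-dimensional head has a quarter as many rotation pairs as a 64-dimensional one, so positions are encoded on a much coarser set of frequencies.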
3. Excessive Key-Value Query Grouping
- Configuration Setup: `num_attention_heads: 8`, `num_key_value_heads: 2`
Grouped-Query Attention (GQA) is an optimization technique designed to reduce memory bandwidth overhead in massive networks by forcing multiple query heads to share a single key-value projection head. In this configuration, the query-to-key-value ratio is:
$$\frac{\text{num\_attention\_heads}}{\text{num\_key\_value\_heads}} = \frac{8}{2} = 4 \quad (4{:}1)$$
While a 4:1 GQA grouping is highly efficient for models with 7 billion parameters or more, it causes severe information degradation at the 5M parameter scale. Forcing four query heads to share a single key-value head, each only 16 dimensions wide within an already narrow 128-dimensional hidden size, discards much of the context. The model lacks the capacity to store distinct context patterns, leading to structural flattening and degenerate repetitions.
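The grouping described above can be sketched in plain Python. The example shows which query heads share each key-value head and how much narrower the K/V projections become. It assumes bias-free Q/K/V/O projections, as is common in Llama-style models but not confirmed in this card.

```python
hidden_size = 128
num_attention_heads = 8
num_key_value_heads = 2

head_dim = hidden_size // num_attention_heads
queries_per_kv_head = num_attention_heads // num_key_value_heads

q_width = num_attention_heads * head_dim    # width of the Q (and O) projection
kv_width = num_key_value_heads * head_dim   # width of each of the K and V projections

# Consecutive query heads share one KV head (query head q uses KV head q // group size).
kv_head_for_query = [q // queries_per_kv_head for q in range(num_attention_heads)]

# Per-layer attention weights, assuming no biases: Q, K, V, then O.
gqa_attn_params = hidden_size * (q_width + 2 * kv_width) + q_width * hidden_size
mha_attn_params = 4 * hidden_size * hidden_size

print(f"queries per KV head: {queries_per_kv_head}")
print(f"K/V projection width: {kv_width} (vs {q_width} for Q)")
print(f"query -> KV head map: {kv_head_for_query}")
print(f"attention params per layer: GQA {gqa_attn_params:,} vs MHA {mha_attn_params:,}")
```

At this scale the saving from GQA is small in absolute terms, which is part of why full multi-head attention is a reasonable choice for a micro-model.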
Corrected Architectural Design for Future Iterations
To prevent structural collapse while remaining within the strict 5M parameter hardware envelope, the width of the processing highway must be prioritized over the total layer depth. The following structural modifications are required for the next training iteration:
- Expand Hidden Size: Increase `hidden_size` to 256 to allow proper geometric separation of vocabulary tokens.
- Optimize Head Resolution: Lower `num_attention_heads` to 4. When paired with a 256 hidden size, this yields an optimal head dimension of 64 (256 / 4 = 64), restoring RoPE positioning accuracy.
- Remove GQA: Set `num_key_value_heads` to 4 to match the attention heads. Micro-models need complete, standard Multi-Head Attention (MHA) to avoid information loss.
- Rebalance Depth: Reduce `num_hidden_layers` from 11 down to 7. Allocating parameter weight to wider vectors rather than deep, narrow stacks directly eliminates representation loops.
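A quick budget check for the proposed configuration can be sketched as follows. It assumes the vocabulary stays at 16,384 tokens with tied embeddings and that the attention projections carry no biases; both are assumptions, since the list above does not restate them. The MLP is left out because its intermediate size is not given in this card.

```python
# Proposed values from the list above.
hidden_size = 256
num_attention_heads = 4
num_key_value_heads = 4
num_hidden_layers = 7
vocab_size = 16_384  # assumed unchanged from the current model

head_dim = hidden_size // num_attention_heads
kv_width = num_key_value_heads * head_dim

# Tied embedding table (shared with the LM head).
embedding_params = vocab_size * hidden_size

# Per-layer attention weights, assuming no biases: Q and O, plus K and V.
attn_params_per_layer = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_width
attn_params_total = attn_params_per_layer * num_hidden_layers

# MLP and norm weights are deliberately excluded (intermediate size unknown).
partial_total = embedding_params + attn_params_total

print(f"head_dim: {head_dim}")
print(f"embedding params: {embedding_params:,}")
print(f"attention params (all layers): {attn_params_total:,}")
print(f"embedding + attention only: {partial_total:,}")
```

Under these assumptions the embedding table and attention projections alone come to roughly 6.03M parameters before any MLP weights are counted, so staying near a 5M budget would also require a smaller vocabulary or further reductions elsewhere.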
Evaluation & Benchmarks (0-Shot)
| Task Name | Benchmark | Metric | Score (%) |
|---|---|---|---|
| boolq | BoolQ | acc | 54.34% |
| copa | COPA | acc | 54.00% |
| blimp | BLiMP | acc | 50.76% |
| winogrande | WinoGrande | acc | 50.12% |
| truthfulqa_mc2 | TruthfulQA MC2 | acc | 49.63% |
| piqa | PIQA | acc_norm | 48.86% |
| openbookqa | OpenBookQA | acc_norm | 28.80% |
| arc_easy | ARC-Easy | acc_norm | 26.35% |
| hellaswag | HellaSwag | acc_norm | 26.00% |
| swag | SWAG | acc_norm | 25.77% |
| arc_challenge | ARC-Challenge | acc_norm | 25.77% |
| mmlu | MMLU | acc | 25.01% |
| race | RACE | acc | 21.53% |
| sciq | SciQ | acc_norm | 20.70% |
| commonsense_qa | CommonsenseQA | acc | 19.66% |
| lambada_openai | LAMBADA | acc | 0.00% |
- ArithMark-2.0 Overall Accuracy: 24.76%
- WikiText-2 Word Perplexity: 198,251.90
Hardware & Training Infrastructure
- Hardware: Trained on a single NVIDIA T4 GPU (Kaggle Environment).
- Training Time: Approximately 6 hours of continuous compute.
- Dataset Volume: Trained on roughly 650 million tokens via the `GODELEV/Archaea-5M-T` dataset.
Technical Architecture Specs
- Parameters: 4.73M
- Layers: 11
- Hidden Size: 128
- Attention Heads: 8 (utilizing 2 KV heads for Grouped-Query Attention)
- Vocabulary Size: 16,384 (Custom trained tokenizer)
- Max Sequence Length: 1,024 tokens
- Activation Function: SiLU
Sandbox Usage & Mitigation
If you want to poke around the model weights, running standard greedy decoding will immediately break it into endless loops. You must pass heavy repetition constraints and use sampling to extract coherent text structure.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "GODELEV/Ant-5M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Question: What is 2 + 2?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Aggressive penalties are mandatory to suppress gibberish loops
outputs = model.generate(
    **inputs,
    max_new_tokens=15,
    do_sample=True,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=2.5,  # Prevents 'Sciences' and '://' loop lockups
)

# generate() returns a batch of sequences; decode the first (and only) one
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```