SozKZ Core: Kazakh Language Models
Part of the SozKZ Core collection: base, instruct, and balanced Kazakh language models trained from scratch on Llama (50M–600M), GPT-2, and Pythia architectures.
A 587M-parameter Llama model pretrained from scratch on 9 billion Kazakh tokens. Part of the SozKZ family of Kazakh language models.
**Model family**

| Model | Params | Val BPB | Train loss | Status |
|---|---|---|---|---|
| sozkz-core-llama-150m-kk-base-v1 | 152M | — | — | Released |
| sozkz-core-llama-300m-kk-base-v1 | 325M | 0.781 | 2.848 | Released |
| sozkz-core-llama-600m-kk-base-v1 | 587M | 0.756 | 2.713 | This model |
**Architecture**

| Parameter | Value |
|---|---|
| Architecture | Llama (RMSNorm, RoPE, SwiGLU) |
| Parameters | 587M |
| Hidden size | 1280 |
| Layers | 22 |
| Attention heads | 20 |
| KV heads | 20 (MHA) |
| Intermediate size | 4480 |
| Context length | 1024 |
| Vocab size | 50,257 (GPT-2 BPE, Kazakh) |
| Precision | bfloat16 |
| Tied embeddings | Yes |
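As a sanity check, the 587M figure can be reproduced from the table above. A back-of-the-envelope sketch, assuming standard Llama weight shapes (bias-free projections, SwiGLU MLP, two RMSNorms per block) and tied embeddings:

```python
# Back-of-the-envelope Llama parameter count from the config table above.
hidden, layers, inter, vocab = 1280, 22, 4480, 50257

embed = vocab * hidden                       # token embeddings (tied with LM head)
attn = 4 * hidden * hidden                   # Q, K, V, O projections (MHA, no bias)
mlp = 3 * hidden * inter                     # SwiGLU: gate, up, down projections
norms = 2 * hidden                           # two RMSNorms per block
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")      # → 587.0M, matching the table
```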
**Training**

| Detail | Value |
|---|---|
| Dataset | sozkz-corpus-tokenized-kk-llama50k-v3 |
| Tokens | 9B |
| Hardware | 4x NVIDIA H100 80GB HBM3 |
| Training time | 5.9 hours |
| Throughput | 423K tok/s |
| Optimizer | AdamW (lr=4e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | Cosine with 500-step warmup, min_lr=0.1x |
| Batch size | 32 per GPU x 4 GPUs = 128 |
| Gradient clipping | 1.0 |
| Framework | PyTorch 2.4 + torch.compile + DDP |
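The throughput, token-budget, and wall-clock figures above are mutually consistent; a quick arithmetic check using only the numbers in the table:

```python
tokens = 9e9            # total training tokens
throughput = 423_000    # tokens/s across 4x H100
batch = 128             # sequences per optimizer step (32 per GPU x 4)
seq_len = 1024          # context length

hours = tokens / throughput / 3600
steps = tokens / (batch * seq_len)
print(f"{hours:.1f} h, {steps:,.0f} optimizer steps")  # → 5.9 h, 68,665 optimizer steps
```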
**Evaluation**

| Metric | Value |
|---|---|
| Validation BPB | 0.756 |
| Training loss | 2.713 |
| Peak VRAM | 64.0 GB/GPU |
| Tokens-to-params ratio | 15.3:1 |
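Bits-per-byte and per-token cross-entropy are linked by the tokenizer's compression rate: BPB = loss_nats / (ln 2 · bytes-per-token). A small sketch; the 5.18 bytes/token figure is inferred from the reported numbers, not stated in the card, and assumes the validation loss is close to the reported training loss:

```python
import math

def bits_per_byte(loss_nats: float, bytes_per_token: float) -> float:
    """Convert per-token cross-entropy (in nats) to bits per byte of raw text."""
    return loss_nats / (math.log(2) * bytes_per_token)

# 5.18 bytes/token is the compression rate implied by the table (illustrative).
print(round(bits_per_byte(2.713, 5.18), 3))  # → 0.756
```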
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stukenov/sozkz-core-llama-600m-kk-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Қазақстан — "  # "Kazakhstan is..."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Tokenizer: uses sozkz-core-gpt2-50k-kk-base-v1, a 50K-vocabulary byte-level BPE tokenizer trained on Kazakh text.
```bibtex
@misc{sozkz-llama-600m-kk-2026,
  title={SozKZ Core Llama 600M: Kazakh Language Model},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-base-v1}
}
```
License: MIT (gated access; manual approval required)
Small Language Models for Kazakh: Training, Evaluation, and Scaling
```bibtex
@article{tukenov2026slm,
  title={Small Language Models for Kazakh: Training, Evaluation, and Scaling},
  author={Tukenov, Saken},
  journal={arXiv preprint arXiv:2603.20854},
  year={2026}
}
```