
SozKZ Core Llama 600M — Kazakh Base v1

A 587M parameter Llama model pretrained from scratch on 9 billion Kazakh tokens. Part of the SozKZ family of Kazakh language models.

Model Family

| Model | Params | val_bpb | Train loss | Status |
|---|---|---|---|---|
| sozkz-core-llama-150m-kk-base-v1 | 152M | | | Released |
| sozkz-core-llama-300m-kk-base-v1 | 325M | 0.781 | 2.848 | Released |
| sozkz-core-llama-600m-kk-base-v1 | 587M | 0.756 | 2.713 | This model |

Model Details

| Parameter | Value |
|---|---|
| Architecture | Llama (RMSNorm, RoPE, SwiGLU) |
| Parameters | 587M |
| Hidden size | 1280 |
| Layers | 22 |
| Attention heads | 20 |
| KV heads | 20 (MHA) |
| Intermediate size | 4480 |
| Context length | 1024 |
| Vocab size | 50,257 (GPT-2 BPE, Kazakh) |
| Precision | bfloat16 |
| Tied embeddings | Yes |
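The 587M figure follows directly from the dimensions above. A quick sanity check, assuming the standard Llama layout (four full-width attention projections since KV heads equal attention heads, a three-matrix SwiGLU MLP, two RMSNorms per layer, and a tied input/output embedding):

```python
# Recompute the parameter count from the architecture table.
vocab, hidden, layers, inter = 50_257, 1_280, 22, 4_480

embed = vocab * hidden               # tied input/output embedding
attn = 4 * hidden * hidden           # q, k, v, o projections (full-width: MHA)
mlp = 3 * hidden * inter             # gate, up, down matrices (SwiGLU)
norms = 2 * hidden                   # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm

print(f"{total:,}")  # 587,036,160, i.e. ~587M
```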

Training

| Detail | Value |
|---|---|
| Dataset | sozkz-corpus-tokenized-kk-llama50k-v3 |
| Tokens | 9B |
| Hardware | 4x NVIDIA H100 80GB HBM3 |
| Training time | 5.9 hours |
| Throughput | 423K tok/s |
| Optimizer | AdamW (lr=4e-4, betas=0.9/0.95, wd=0.1) |
| Schedule | Cosine with 500-step warmup, min_lr = 0.1x peak |
| Batch size | 32 per GPU x 4 GPUs = 128 |
| Gradient clipping | 1.0 |
| Framework | PyTorch 2.4 + torch.compile + DDP |
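The numbers in this table are mutually consistent; as a cross-check (derived quantities, not logged values), tokens over throughput reproduces the wall time, and global batch size times context length gives the tokens consumed per optimizer step:

```python
# Cross-check the training-table figures.
tokens = 9e9
throughput = 423_000        # tok/s
global_bs = 32 * 4          # 32 sequences per GPU x 4 GPUs = 128
ctx = 1024

hours = tokens / throughput / 3600
tokens_per_step = global_bs * ctx
steps = tokens / tokens_per_step

print(f"{hours:.1f} h")                              # 5.9 h, matching the table
print(f"{tokens_per_step:,} tok/step, ~{steps:,.0f} steps")  # 131,072 tok/step
```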

Results

| Metric | Value |
|---|---|
| Validation BPB | 0.756 |
| Training loss | 2.713 |
| Peak VRAM | 64.0 GB/GPU |
| Tokens-to-params ratio | 15.3:1 |
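The tokens-to-params ratio is simply total pretraining tokens divided by parameter count:

```python
# 9B pretraining tokens over 587M parameters.
ratio = 9e9 / 587e6
print(f"{ratio:.1f}:1")  # 15.3:1
```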

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stukenov/sozkz-core-llama-600m-kk-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Қазақстан — "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Tokenizer

Uses sozkz-core-gpt2-50k-kk-base-v1, a 50K-vocabulary ByteLevel BPE tokenizer trained on Kazakh text.

Limitations

  • This is a base model, not instruction-tuned: it completes text rather than answering questions
  • Training data is web-scraped Kazakh text (educational sites, Wikipedia, news)
  • Context length is limited to 1024 tokens
  • May generate repetitive or factually incorrect text

Citation

```bibtex
@misc{sozkz-llama-600m-kk-2026,
  title={SozKZ Core Llama 600M: Kazakh Language Model},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-base-v1}
}
```

License

MIT (gated access; manual approval required)

Paper

Small Language Models for Kazakh: Training, Evaluation, and Scaling

```bibtex
@article{tukenov2026slm,
  title={Small Language Models for Kazakh: Training, Evaluation, and Scaling},
  author={Tukenov, Saken},
  journal={arXiv preprint arXiv:2603.20854},
  year={2026}
}
```