master PRO

fantos

·

AI & ML interests

None yet

Recent Activity

liked a model 4 days ago

FINAL-Bench/Ourbox-35B-JGOS-GGUF

liked a Space 4 days ago

FINAL-Bench/VKUE

reacted to SeaWolf-AI's post with 🔥 4 days ago

🔵 VKUE — No GPU? Runs anyway. "Frontier models need a datacenter GPU" rests on a hidden assumption: that the model reads ALL its parameters every token. Decode is memory-bandwidth bound — sweep 34B params/token and an 8 GB card dies at 1–2 tok/s. So we ran ONE 34.7B reasoning model — Ourbox-35B-JGOS, a sparse Mixture-of-Experts — as the identical weights across the whole hardware spectrum. All measured: • B200: 18,057 tok/s (aggregate) • 1× A10G: 126 tok/s • 8 GB laptop (RTX 5060): 20 tok/s • GPU-less CPU: 17 tok/s Why it works: Ourbox holds 34.7B params but only ~3B are active per token (256 experts, top-8). Since decode is bandwidth-bound, a dense 34B moves ~16.7 GB/token while Ourbox moves ~1.45 GB — ~11× less traffic. Put the experts in system RAM, keep attention/router/shared on the GPU, and a 34.7B reasoner runs on an 8 GB laptop — or no GPU at all. Sparsity alone, proven (same laptop, same quant, ~same footprint): Ourbox-35B (A3B) 20.01 tok/s vs Qwen2.5-32B (dense) 5.36 → 3.7× from sparsity alone, ~2× the best dense-32B on any 8 GB machine. Not a toy: GPQA Diamond 86.4% (maj@8). Try it live (same prompt, GPU vs GPU-less CPU, live tok/s). Honest scope: one machine's measurements; the CPU path proves it RUNS without a GPU, not that it beats one. 📝 Article: https://huggingface.co/blog/FINAL-Bench/vkue 🔵 GPU vs CPU demo: https://final-bench-ourbox-35b-vkue-demo.hf.space/ 🔵 CPU-only demo: https://final-bench-ourbox-35b-vkue-cpu.hf.space 📊 VKUE leaderboard: https://huggingface.co/spaces/FINAL-Bench/VKUE 🤗 Model: https://huggingface.co/FINAL-Bench/Ourbox-35B-JGOS-GGUF ⚡ VKAE (speed): https://huggingface.co/spaces/VIDraft/vkae VKUE is the "runs anywhere" side of our serving line; VKAE the "fast on datacenter GPUs" side. VKAE is fast; VKUE is everywhere.

View all activity

Organizations

fantos 's models 9

fantos/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Nov 2, 2025 • 6

fantos/Qwen3-Omni-30B-A3B-Thinking

Any-to-Any • 32B • Updated Nov 2, 2025 • 6

fantos/Ming-flash-omni-Preview

Any-to-Any • 104B • Updated Nov 2, 2025 • 47

fantos/Qwen-Image-Edit-Rapid-AIO

Text-to-Image • Updated Nov 2, 2025 • 1

fantos/GLM-4.6

Text Generation • 357B • Updated Nov 2, 2025 • 6

fantos/neutts-air

Text-to-Speech • 0.7B • Updated Nov 2, 2025 • 8

fantos/PaddleOCR-VL

Image-Text-to-Text • 1.0B • Updated Nov 2, 2025 • 13

fantos/DeepSeek-OCR

Image-Text-to-Text • 3B • Updated Nov 2, 2025 • 7

fantos/QwQ-32B-bnb-4bit

Text Generation • 33B • Updated Mar 20, 2025 • 2 • 63