PackedLLM
~10B total parameters · ~3B active per inference · Routing-of-Experts (RoE) architecture
PackedLLM is a self-contained multi-expert language model system built around a Routing-of-Experts (RoE) mechanism. Rather than mixing expert outputs at the token level inside a shared transformer (Mixture-of-Experts), PackedLLM routes each request, and each stage of a multi-stage reasoning pipeline, to a dedicated, fully independent specialist model. At most one or two experts are active simultaneously, keeping peak memory around 3B parameters regardless of the 10B total footprint.
The system runs entirely on consumer hardware via llama.cpp, persists its full state to a single ZIP checkpoint, and integrates persistent vector memory, sandboxed Python execution, and multi-engine web search as first-class pipeline citizens.
Architecture overview
```text
PackedLLMRunner            ← user-facing shell: load, warmup, lifecycle
     │
PackedLLM (PackedLLM.pt)   ← 9-stage orchestration pipeline
     │
Expert dispatch layer      ← 10 specialist models, one active at a time
     │
┌────┴─────────────────────────────┐
│ GATOR/MemoryBank  CodeBox  Web   │ ← integrated modules
└────┬─────────────────────────────┘
     │
PackedLM (LM.pt)           ← llama.cpp inference engine + ExpertHandles
```
How it differs from MoE
| | Standard MoE (Mixtral, DeepSeek) | PackedLLM RoE |
|---|---|---|
| Routing granularity | Per-token, inside every transformer layer | Per-task and per-pipeline-stage |
| What gets routed | FFN sub-modules sharing one transformer | Separate, fully independent specialist LLMs |
| Parameters active | Top-K experts × FFN size, across all layers | One expert at a time (~3B peak) |
| Router mechanism | Learned linear gating vector | HeadExpert, a full LLM returning JSON |
| Experts share weights? | Yes (all attention layers are always shared) | No (complete independence) |
| Pipeline | Single transformer forward pass | 9-stage: plan → route → execute → synthesize → persona → review |
The closest related work is Composition of Experts (Chai et al., 2024, arXiv:2412.01868), which also routes at the input level to full LLM models. PackedLLM extends this with a multi-stage orchestration pipeline, per-stage retry/detour recovery, integrated memory and execution modules, affective state modelling, and a character persona layer, none of which appear in prior systems of this type.
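To make the contrast with token-level gating concrete, here is a minimal sketch of request-level routing. Everything in it is hypothetical: the `EXPERTS` table, `head_expert_route`, and `Dispatcher` are illustrative stand-ins, and a keyword heuristic replaces the router LLM so the example runs without any model.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for independent specialist models. In a real
# deployment each entry would be a separate model served by llama.cpp.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "math_expert": lambda prompt: f"[math] {prompt}",
    "code_expert": lambda prompt: f"[code] {prompt}",
    "creative_expert": lambda prompt: f"[creative] {prompt}",
}


def head_expert_route(request: str) -> str:
    """Stand-in for the router LLM: returns a JSON routing decision."""
    lowered = request.lower()
    if any(word in lowered for word in ("integral", "interest", "solve")):
        choice = "math_expert"
    elif any(word in lowered for word in ("python", "implement", "function")):
        choice = "code_expert"
    else:
        choice = "creative_expert"
    return json.dumps({"expert": choice})


class Dispatcher:
    """Routes a whole request to one specialist and tracks which is active."""

    def __init__(self) -> None:
        self.active: Optional[str] = None

    def run(self, request: str) -> str:
        decision = json.loads(head_expert_route(request))
        name = decision["expert"]
        if name != self.active:
            # A real system would unload the previous weights and load
            # the new expert here; the sketch only records the switch.
            self.active = name
        return EXPERTS[name](request)
```

The routing decision is made once per request (or per pipeline step) and selects a whole model, which is the property that distinguishes this scheme from per-token FFN gating.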
Pipeline stages
Every call to forward() passes through these stages in order:
| Stage | Expert | Temperature | What it does |
|---|---|---|---|
| 1. Plan goal | HeadExpert | 0.2 | Parse intent, tone, routing flags (needs_web / needs_action / needs_vision) |
| 2. Consult memory | GATOR | n/a | Match registered commands; retrieve relevant memory |
| 3. Build route | HeadExpert | 0.5 | Generate ordered list of expert steps as JSON |
| 4. Execute route | various | per-expert | Dispatch each step; retry / detour / skip on failure |
| 5. Synthesize base | HeadExpert | 1.0 | Combine step outputs into a persona-free prose answer |
| 6. Affective state | AffectExpert | 0.5 | Generate bot emotional + physical state JSON |
| 7. Apply persona | RoleExpert | 1.0 | Rewrite base response in character |
| 8. Review | HeadExpert | 0.0 | Accept / revise / reject; extract memory facts; profile updates |
| 9. Finalize | GATOR | n/a | Write memory, update user/bot profiles |
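Stage 4 is where failure handling lives. The following is a rough sketch of one way a retry / detour / skip loop could be written; the step schema (`expert`, `input`, `fallback`) and the function `execute_route` are assumptions made for illustration, not the project's actual interface.

```python
from typing import Any, Callable, Dict, List


def execute_route(
    steps: List[Dict[str, Any]],
    experts: Dict[str, Callable[[str], str]],
    max_retries: int = 1,
) -> List[Dict[str, Any]]:
    """Run each step: retry the same expert, then detour, then skip."""
    outputs: List[Dict[str, Any]] = []
    for step in steps:
        # Primary expert is tried 1 + max_retries times, then the fallback.
        candidates = [step["expert"]] * (1 + max_retries)
        if step.get("fallback"):
            candidates.append(step["fallback"])
        for name in candidates:
            try:
                result = experts[name](step["input"])
            except Exception:
                continue  # try the next candidate
            outputs.append({"expert": name, "output": result, "skipped": False})
            break
        else:
            # Every candidate failed: record the skip and move on.
            outputs.append({"expert": step["expert"], "output": None, "skipped": True})
    return outputs


def broken_expert(_: str) -> str:
    raise RuntimeError("simulated model failure")


demo_experts = {
    "math_expert": broken_expert,
    "logic_expert": lambda text: f"reasoned: {text}",
}
demo_steps = [
    {"expert": "math_expert", "input": "2 + 2", "fallback": "logic_expert"},
    {"expert": "math_expert", "input": "3 * 3"},
]
demo_result = execute_route(demo_steps, demo_experts)
```

In the demo, the first step detours to `logic_expert` after the primary expert fails twice, and the second step is skipped because no fallback is declared.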
Expert roster
| Expert | Active params | Role | Notes |
|---|---|---|---|
| HeadExpert | ~3B | Orchestrator, router, planner, synthesizer, reviewer | Most-called expert |
| LogicExpert | ~1B | Structured reasoning; deep-think CoT; action planning/repair | raw completion with <think> blocks |
| CodeExpert | ~1B | Python script generation for action pipeline | Temperature 0.0; raw code only, no prose |
| MathExpert | ~1B | Quantitative reasoning | Post-processes CJK spans; deduplicates repeated lines |
| AffectExpert | ~0.5B | Emotional state; step quality evaluation | Used as both emotion classifier and pass/fail judge |
| RoleExpert | ~0.5B | Persona rewriting in character | RP style chat format |
| CreativeExpert | ~1B | Writing and stylistic generation | High temperature defaults (0.9) |
| VisionExpert | ~1B | Multimodal image understanding | CLIP projector; local images → data URI |
| ToolExpert | ~0.5B | Function-call generation | Outputs {"tool_calls": [...]} JSON |
| TranslationExpert | ~300M | Chinese → English | seq2seq, not an LLM; Chinese regex gate |
Total: ~10B · Peak active: ~3B
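The totals can be checked directly from the roster. The snippet below simply sums the approximate per-expert sizes listed in the table; treating peak usage as the single largest expert is an assumption that holds only when one expert is resident at a time.

```python
# Approximate parameter counts from the roster table, in billions.
ROSTER_B = {
    "HeadExpert": 3.0,
    "LogicExpert": 1.0,
    "CodeExpert": 1.0,
    "MathExpert": 1.0,
    "AffectExpert": 0.5,
    "RoleExpert": 0.5,
    "CreativeExpert": 1.0,
    "VisionExpert": 1.0,
    "ToolExpert": 0.5,
    "TranslationExpert": 0.3,
}

total_b = sum(ROSTER_B.values())   # roughly 9.8, rounded to ~10B in the text
peak_b = max(ROSTER_B.values())    # largest single expert (HeadExpert)
```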
Forward modes
Standard (full pipeline)
```python
bot.chat("What is the compound interest on $5000 at 4% over 10 years?")
```
All 9 stages. Memory read/write. Web and action pipelines if needed.
Fast think (minimum latency)
```python
bot.chat("What time is it in Tokyo?", fast_think=True)
```
Skips planning, routing, memory, web, action, affective state, review. HeadExpert answers directly; RoleExpert applies persona if a bot profile exists. Maximum 2 LLM calls.
Deep think (CoT scaffolding)
```python
bot.chat("Design a Python caching decorator with TTL support.", deep_think=True)
```
Before each pipeline stage, LogicExpert generates <think>...</think> blocks scoped to that stage's specific task and output contract. These blocks are prepended to the stage's prompt as if the executing expert had already done that prior reasoning. Blocks are cached within a single forward() call. Translation is excluded (not an LLM).
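The caching and prepending behaviour described above can be sketched as follows. `ThinkCache` and `build_stage_prompt` are hypothetical names, the cache is keyed by stage name as an assumption, and a plain callable stands in for LogicExpert.

```python
from typing import Callable, Dict


class ThinkCache:
    """Caches one <think> block per stage for the lifetime of one call."""

    def __init__(self, logic_expert: Callable[[str, str], str]) -> None:
        self._logic = logic_expert
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of times the reasoning model was invoked

    def block_for(self, stage: str, task: str) -> str:
        if stage not in self._cache:
            self.calls += 1
            reasoning = self._logic(stage, task)
            self._cache[stage] = f"<think>{reasoning}</think>"
        return self._cache[stage]


def build_stage_prompt(cache: ThinkCache, stage: str, task: str, prompt: str) -> str:
    """Prepend the stage-scoped reasoning block to the stage prompt."""
    return cache.block_for(stage, task) + "\n" + prompt
```

Creating a fresh `ThinkCache` per request would give the "cached within a single forward() call" scope; a repeated stage then reuses its block instead of calling the reasoning model again.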
Integrated modules
MemoryBank (GATOR.pt)
A multi-tree semantic store built on PackedTree, a custom embedding + KMeans clustering retrieval structure. Trees: knowledge, conversation, user profiles, bot profiles, commands, assets, telemetry. Hybrid retrieval scoring: 75% semantic similarity + 20% keyword overlap + 5% importance metadata. Embedding model: Jina Embeddings v3 (GGUF, stored inside the checkpoint). GATOR's own action planner uses HeadExpert to decide which memory operations to run. Also contains DesktopControl (OS automation) and CommandRegistry (text-to-action macros).
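The 75 / 20 / 5 weighting can be illustrated with a small scoring function. Only the weights come from the description above; the cosine similarity, the word-set overlap measure, and the assumption that importance is a value in [0, 1] are illustrative choices, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that also occur in the stored text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_score(
    query_vec: Sequence[float],
    doc_vec: Sequence[float],
    query: str,
    text: str,
    importance: float,
) -> float:
    """Weighted sum: 75% semantic + 20% keyword + 5% importance."""
    return (
        0.75 * cosine(query_vec, doc_vec)
        + 0.20 * keyword_overlap(query, text)
        + 0.05 * importance
    )
```

Because semantic similarity dominates, a memory with no shared keywords can still outrank a keyword match whose embedding is far from the query.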
CodeBox (CodeBox.pt)
Persistent Python sandbox with isolated virtual environment management, SHA256-verified asset registry, loader injection (from _codebox_loader import load_asset inside sandboxed code), DAG pipeline runner with $var reference passing between steps, LRU runner cache for expensive models, and hard RAM/CPU kill thresholds enforced by a monitoring thread.
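As an illustration of `$var` reference passing, the sketch below resolves any string argument beginning with `$` to the result of an earlier step. The step schema (`id`, `fn`, `args`) and `run_pipeline` are hypothetical, and the steps are assumed to be listed in dependency order already; a full DAG runner would sort them topologically first.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    steps: List[Dict[str, Any]],
    functions: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Execute steps in order, substituting "$name" args with prior results."""
    results: Dict[str, Any] = {}
    for step in steps:
        args = [
            results[arg[1:]] if isinstance(arg, str) and arg.startswith("$") else arg
            for arg in step["args"]
        ]
        results[step["id"]] = functions[step["fn"]](*args)
    return results


demo_functions = {
    "const": lambda value: value,
    "compound": lambda principal, rate, years: principal * (1 + rate) ** years,
    "interest": lambda amount, principal: amount - principal,
}
demo_steps = [
    {"id": "principal", "fn": "const", "args": [5000]},
    {"id": "amount", "fn": "compound", "args": ["$principal", 0.04, 10]},
    {"id": "earned", "fn": "interest", "args": ["$amount", "$principal"]},
]
demo_results = run_pipeline(demo_steps, demo_functions)
```

The demo mirrors the compound-interest request used elsewhere in this document: each later step consumes earlier outputs by name rather than by position.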
Web (WebSearch.pt)
Three search engines (DuckDuckGo HTML, Google, ResultHunter) with embedding-ranked candidate deduplication. Content extraction tries 10 methods: YouTube transcripts → trafilatura → boilerpy3 → readability → newspaper3k → goose3 → inscriptis → lxml → BeautifulSoup → visible text. PDF via PyMuPDF. Summarization via DistilBART. Runs in a separate spawned process; communicates via multiprocessing.Queue. Serializes safely: live process handles are stripped on save.
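The extraction chain is a first-success fallback pattern. The sketch below shows that pattern with two stand-in extractors rather than the real libraries; `extract_first`, the `min_chars` threshold, and both extractor functions are assumptions for illustration.

```python
import re
from typing import Callable, List, Optional, Tuple

Extractor = Tuple[str, Callable[[str], str]]


def extract_first(
    html: str, extractors: List[Extractor], min_chars: int = 20
) -> Optional[Tuple[str, str]]:
    """Return (method_name, text) from the first extractor that succeeds."""
    for name, extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # this method failed; fall through to the next one
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None


def failing_extractor(_: str) -> str:
    raise ValueError("no transcript available")  # simulated failure


def strip_tags(html: str) -> str:
    """Crude last-resort extractor: drop tags and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


demo_html = "<html><body><p>Solid-state batteries use a solid electrolyte.</p></body></html>"
demo_chain = [("transcript", failing_extractor), ("visible_text", strip_tags)]
demo_extraction = extract_first(demo_html, demo_chain)
```

Ordering the chain from most to least specific means the cheap, lossy methods only run when the higher-quality ones raise or return too little text.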
Usage
Basic
```python
from PackedLLM import PackedLLMRunner

bot = PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice")
print(bot.chat("Explain gradient descent in one paragraph."))
```
Expert shortcuts (bypass full pipeline)
```python
bot.creative("Write a haiku about a robot discovering music.")
bot.code("Implement binary search in Python with comments.")
bot.math("Solve: integral of x² · sin(x) dx")
bot.logic("All A are B. Some B are C. What follows?", mode="deep_then_answer")
bot.translate("人工智能正在改变世界")
bot.web("Latest developments in solid-state batteries?")
bot.action("Compute compound interest on $5000 at 4% over 10 years; save to report.txt")
```
Memory
```python
bot.memory_store("User prefers concise answers under 100 words.")
results = bot.memory_recall("answer preferences", top_k=3)
bot.set_user_profile({"name": "Alice", "expertise": "ML"})
bot.set_bot_profile({"character_card": "You are Pip, a direct and slightly sarcastic assistant."})
```
Lifecycle
```python
bot.unload_expert("vision_expert")  # free VRAM; reloads lazily on next use
bot.reload_expert("code_expert")    # hot-reload after checkpoint update
print(bot.status())                 # full system diagnostic

# Context manager (auto-unload on exit)
with PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice") as bot:
    print(bot.chat("Summarise the Pythagorean theorem."))
```
Checkpointing
PackedLLM.pt is a ZIP archive containing:
- manifest.pt: all metadata, profiles, hardware state, embedded source code
- lm_chunk_N.bin: model weights in 32MB streaming chunks
- mem_chunk_N.bin: GATOR memory store chunks
- web_chunk_N.bin: WebSearch module chunks
- box_chunk_N.bin: CodeBox chunks
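Because the checkpoint is an ordinary ZIP archive, its layout can be inspected with the standard library. The sketch below builds a tiny in-memory archive with the same naming scheme and reassembles one chunk family; joining chunks in ascending numeric order is an assumption about the format, and `read_chunked` is a hypothetical helper, not the project's loader.

```python
import io
import re
import zipfile


def read_chunked(archive: zipfile.ZipFile, prefix: str) -> bytes:
    """Concatenate <prefix>_chunk_N.bin members in ascending order of N."""
    pattern = re.compile(rf"^{re.escape(prefix)}_chunk_(\d+)\.bin$")
    indexed = []
    for name in archive.namelist():
        match = pattern.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    # Sort numerically so that chunk 10 follows chunk 9, not chunk 1.
    return b"".join(archive.read(name) for _, name in sorted(indexed))


# Toy archive standing in for a real checkpoint.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as toy:
    toy.writestr("manifest.pt", b"{}")
    toy.writestr("lm_chunk_10.bin", b"!")
    toy.writestr("lm_chunk_0.bin", b"abc")
    toy.writestr("lm_chunk_2.bin", b"ghi")
    toy.writestr("lm_chunk_1.bin", b"def")
    toy.writestr("mem_chunk_0.bin", b"memory")

with zipfile.ZipFile(buffer) as toy:
    lm_bytes = read_chunked(toy, "lm")
    mem_bytes = read_chunked(toy, "mem")
```

For a real checkpoint, `zipfile.ZipFile("PackedLLM.pt").namelist()` would list the members without loading any weights into memory.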
Hardware
PackedLM detects and uses CUDA, Apple Metal (MPS), WebGPU, or CPU automatically via HardwareProbe. For each expert, _plan_offload() estimates the GGUF file size and computes how many transformer layers can fit in free VRAM (with a 15% safety margin for CUDA, 40% for WebGPU). If VRAM is insufficient for a full offload, layers are split proportionally between GPU and CPU.
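The layer-split calculation can be approximated as follows. This is a sketch under stated assumptions, not the actual `_plan_offload()` code: it treats all layers as equal in size (file size divided by layer count) and interprets the safety margin as a fraction of free VRAM that is held back. The function name `plan_offload` and its parameters are hypothetical.

```python
from typing import Tuple


def plan_offload(
    gguf_bytes: int,
    n_layers: int,
    free_vram_bytes: int,
    safety_margin: float,
) -> Tuple[int, int]:
    """Return (gpu_layers, cpu_layers) for one expert."""
    if n_layers <= 0 or gguf_bytes <= 0:
        return 0, max(n_layers, 0)
    usable = free_vram_bytes * (1.0 - safety_margin)  # VRAM after the margin
    per_layer = gguf_bytes / n_layers                 # assumes uniform layers
    gpu_layers = min(n_layers, int(usable // per_layer))
    return gpu_layers, n_layers - gpu_layers


# A 4 GB model with 32 layers and 2 GB of free VRAM.
cuda_split = plan_offload(4_000_000_000, 32, 2_000_000_000, safety_margin=0.15)
webgpu_split = plan_offload(4_000_000_000, 32, 2_000_000_000, safety_margin=0.40)
```

With identical hardware numbers, the larger WebGPU margin leaves fewer layers on the GPU than the CUDA margin does; with llama.cpp bindings, the GPU layer count would typically be passed as the number of layers to offload.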
Citation
```bibtex
@software{packedllm2026,
  author = {Chance Brownfield},
  title  = {PackedLLM: A Routing-of-Experts System with LLM-Orchestrated Execution Pipeline},
  year   = {2026},
  note   = {RoE architecture: task-level routing to fully independent specialist LLMs.
            Distinct from token-level Mixture-of-Experts.
            Integrates persistent vector memory (GATOR), sandboxed Python execution (CodeBox),
            and multi-engine web search in a 9-stage orchestration pipeline.}
}
```
License
This project is licensed under PackedLicense v1.0.
Free for personal, educational, research, and other non-commercial use.
Commercial use requires prior written authorization.
The GATOR, WebSearchModule, and CodeBox components are protected under this license and may not be extracted, redistributed, or commercially reused without authorization.