PackedLLM

~10B total parameters · ~3B active per inference · Routing-of-Experts (RoE) architecture

PackedLLM is a self-contained multi-expert language model system built around a Routing-of-Experts (RoE) mechanism. Rather than mixing expert outputs at the token level inside a shared transformer (Mixture-of-Experts), PackedLLM routes each request, and each stage of a multi-stage reasoning pipeline, to a dedicated, fully independent specialist model. At most one or two experts are active simultaneously, keeping peak memory around 3B parameters regardless of the 10B total footprint.

The system runs entirely on consumer hardware via llama.cpp, persists its full state to a single ZIP checkpoint, and integrates persistent vector memory, sandboxed Python execution, and multi-engine web search as first-class pipeline citizens.


Architecture overview

PackedLLMRunner          ← user-facing shell: load, warmup, lifecycle
      │
PackedLLM (PackedLLM.pt) ← 9-stage orchestration pipeline
      │
Expert dispatch layer    ← 10 specialist models, one active at a time
      │
 ┌────┴─────────────────────────────┐
 │ GATOR/MemoryBank   CodeBox   Web │  ← integrated modules
 └────┬─────────────────────────────┘
      │
PackedLM (LM.pt)         ← llama.cpp inference engine + ExpertHandles
Figure: PackedLLM Architecture

How it differs from MoE

| | Standard MoE (Mixtral, DeepSeek) | PackedLLM RoE |
|---|---|---|
| Routing granularity | Per-token, inside every transformer layer | Per-task and per-pipeline-stage |
| What gets routed | FFN sub-modules sharing one transformer | Separate, fully independent specialist LLMs |
| Parameters active | Top-K experts × FFN size, across all layers | One expert at a time (~3B peak) |
| Router mechanism | Learned linear gating vector | HeadExpert, a full LLM returning JSON |
| Experts share weights? | Yes (all attention layers are always shared) | No (complete independence) |
| Pipeline | Single transformer forward pass | 9-stage: plan → route → execute → synthesize → persona → review |

The closest related work is Composition of Experts (Chai et al., 2024, arXiv:2412.01868), which also routes at the input level to full LLM models. PackedLLM extends this with a multi-stage orchestration pipeline, per-stage retry/detour recovery, integrated memory and execution modules, affective state modelling, and a character persona layer, none of which appear in prior systems of this type.


Pipeline stages

Every call to forward() passes through these stages in order:

| Stage | Expert | Temperature | What it does |
|---|---|---|---|
| 1. Plan goal | HeadExpert | 0.2 | Parse intent, tone, routing flags (needs_web / needs_action / needs_vision) |
| 2. Consult memory | GATOR | n/a | Match registered commands; retrieve relevant memory |
| 3. Build route | HeadExpert | 0.5 | Generate ordered list of expert steps as JSON |
| 4. Execute route | various | per-expert | Dispatch each step; retry / detour / skip on failure |
| 5. Synthesize base | HeadExpert | 1.0 | Combine step outputs into a persona-free prose answer |
| 6. Affective state | AffectExpert | 0.5 | Generate bot emotional + physical state JSON |
| 7. Apply persona | RoleExpert | 1.0 | Rewrite base response in character |
| 8. Review | HeadExpert | 0.0 | Accept / revise / reject; extract memory facts; profile updates |
| 9. Finalize | GATOR | n/a | Write memory, update user/bot profiles |
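
Stages 3 and 4 can be pictured with a minimal sketch: a JSON route is parsed and each step is dispatched with retry, detour, and skip handling. The route schema (`expert`, `task`, `fallback`), the `execute_route` helper, and the retry count are illustrative assumptions, not PackedLLM's actual format or API.

```python
import json
from typing import Callable, Dict, List

# Hypothetical route format; the real schema emitted by HeadExpert is not documented here.
ROUTE_JSON = """
[
  {"expert": "math_expert", "task": "compute interest", "fallback": "logic_expert"},
  {"expert": "code_expert", "task": "write report script"}
]
"""

def execute_route(route_json: str,
                  experts: Dict[str, Callable[[str], str]],
                  max_retries: int = 1) -> List[dict]:
    """Run each step in order: retry the expert, then detour to a fallback, then skip."""
    results = []
    for step in json.loads(route_json):
        # First attempt plus retries on the primary expert.
        candidates = [step["expert"]] * (1 + max_retries)
        if "fallback" in step:
            candidates.append(step["fallback"])  # detour
        outcome = {"task": step["task"], "expert": None, "output": None, "status": "skipped"}
        for name in candidates:
            try:
                output = experts[name](step["task"])
            except Exception:
                continue  # move on to the next candidate
            outcome.update(expert=name, output=output, status="ok")
            break
        results.append(outcome)
    return results
```

A step whose candidates all fail is recorded as skipped rather than aborting the route, so later stages can still work with the steps that succeeded.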

Expert roster

| Expert | Active params | Role | Notes |
|---|---|---|---|
| HeadExpert | ~3B | Orchestrator, router, planner, synthesizer, reviewer | Most-called expert |
| LogicExpert | ~1B | Structured reasoning; deep-think CoT; action planning/repair | Raw completion with <think> blocks |
| CodeExpert | ~1B | Python script generation for action pipeline | Temperature 0.0; raw code only, no prose |
| MathExpert | ~1B | Quantitative reasoning | Post-processes CJK spans; deduplicates repeated lines |
| AffectExpert | ~0.5B | Emotional state; step quality evaluation | Used as both emotion classifier and pass/fail judge |
| RoleExpert | ~0.5B | Persona rewriting in character | RP style chat format |
| CreativeExpert | ~1B | Writing and stylistic generation | High temperature defaults (0.9) |
| VisionExpert | ~1B | Multimodal image understanding | CLIP projector; local images → data URI |
| ToolExpert | ~0.5B | Function-call generation | Outputs {"tool_calls": [...]} JSON |
| TranslationExpert | ~300M | Chinese → English | seq2seq, not an LLM; Chinese regex gate |

Total: ~10B · Peak active: ~3B
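
The totals can be checked with a few lines of arithmetic. The sizes are the approximate roster figures; treating the peak as the largest single expert assumes experts run strictly one at a time.

```python
# Approximate per-expert sizes in billions of parameters, copied from the roster.
ROSTER_B = {
    "HeadExpert": 3.0,
    "LogicExpert": 1.0,
    "CodeExpert": 1.0,
    "MathExpert": 1.0,
    "AffectExpert": 0.5,
    "RoleExpert": 0.5,
    "CreativeExpert": 1.0,
    "VisionExpert": 1.0,
    "ToolExpert": 0.5,
    "TranslationExpert": 0.3,
}

total_b = sum(ROSTER_B.values())
# Assumption: one expert is resident at a time, so the peak is the largest expert.
peak_b = max(ROSTER_B.values())
print(f"total ~{total_b:.1f}B, peak active ~{peak_b:.1f}B")
```

The roster sums to about 9.8B, which the headline figure rounds to ~10B.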


Forward modes

Standard (full pipeline)

bot.chat("What is the compound interest on $5000 at 4% over 10 years?")

All 9 stages. Memory read/write. Web and action pipelines if needed.

Fast think (minimum latency)

bot.chat("What time is it in Tokyo?", fast_think=True)

Skips planning, routing, memory, web, action, affective state, review. HeadExpert answers directly; RoleExpert applies persona if a bot profile exists. Maximum 2 LLM calls.

Deep think (CoT scaffolding)

bot.chat("Design a Python caching decorator with TTL support.", deep_think=True)

Before each pipeline stage, LogicExpert generates <think>...</think> blocks scoped to that stage's specific task and output contract. These blocks are prepended to the stage's prompt as if the executing expert had already done that prior reasoning. Blocks are cached within a single forward() call. Translation is excluded (not an LLM).
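
A minimal sketch of the prepend-and-cache behaviour described above. The `make_deep_think` helper, the per-stage cache key, and the prompt layout are assumptions for illustration; `logic_expert` stands in for any callable that returns reasoning text.

```python
from typing import Callable, Dict

def make_deep_think(logic_expert: Callable[[str], str]) -> Callable[[str, str, str], str]:
    """Return a helper that prepends cached <think> blocks to stage prompts."""
    cache: Dict[str, str] = {}  # in this sketch, one cache per forward() call

    def with_thinking(stage: str, contract: str, prompt: str) -> str:
        if stage not in cache:
            # Ask for reasoning scoped to this stage's task and output contract.
            reasoning = logic_expert(f"Stage: {stage}\nOutput contract: {contract}")
            cache[stage] = f"<think>{reasoning}</think>"
        # The executing expert sees the block as if it had already reasoned.
        return f"{cache[stage]}\n{prompt}"

    return with_thinking
```

Creating a fresh helper at the start of each forward() call keeps the cache from leaking reasoning between unrelated requests.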


Integrated modules

MemoryBank (GATOR.pt)

A multi-tree semantic store built on PackedTree, a custom embedding + KMeans clustering retrieval structure. Trees: knowledge, conversation, user profiles, bot profiles, commands, assets, telemetry. Hybrid retrieval scoring: 75% semantic similarity + 20% keyword overlap + 5% importance metadata. Embedding model: Jina Embeddings v3 (GGUF, stored inside the checkpoint). GATOR's own action planner uses HeadExpert to decide which memory operations to run. Also contains DesktopControl (OS automation) and CommandRegistry (text-to-action macros).
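
The 75/20/5 weighting can be sketched as follows. Only the weights come from the description above. Using cosine similarity for the semantic term, the fraction of matched query words for keyword overlap, and an importance value in [0, 1] are assumptions, as are all function names.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query words that also occur in the text."""
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(query_vec: Sequence[float], doc_vec: Sequence[float],
                 query: str, text: str, importance: float) -> float:
    """Weighted sum: 75% semantic + 20% keyword + 5% importance."""
    return (0.75 * cosine(query_vec, doc_vec)
            + 0.20 * keyword_overlap(query, text)
            + 0.05 * importance)
```

With these weights a strong keyword match alone contributes at most 0.20, so embedding similarity dominates the ranking.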

CodeBox (CodeBox.pt)

Persistent Python sandbox with isolated virtual environment management, SHA256-verified asset registry, loader injection (from _codebox_loader import load_asset inside sandboxed code), DAG pipeline runner with $var reference passing between steps, LRU runner cache for expensive models, and hard RAM/CPU kill thresholds enforced by a monitoring thread.
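
The idea of `$var` reference passing between DAG steps can be illustrated with the standard-library `graphlib` module (Python 3.9+). The step format (`fn` plus `args`) and the `run_pipeline` helper are hypothetical; CodeBox's real pipeline schema is not shown in this document.

```python
from graphlib import TopologicalSorter
from typing import Any, Dict

def _is_ref(arg: Any) -> bool:
    """True for string arguments of the form "$step_name"."""
    return isinstance(arg, str) and arg.startswith("$")

def run_pipeline(steps: Dict[str, dict]) -> Dict[str, Any]:
    """Run steps in dependency order, replacing "$name" args with earlier results."""
    # A step depends on every step it references with a "$" prefix.
    deps = {name: {a[1:] for a in spec["args"] if _is_ref(a)}
            for name, spec in steps.items()}
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(deps).static_order():
        spec = steps[name]
        args = [results[a[1:]] if _is_ref(a) else a for a in spec["args"]]
        results[name] = spec["fn"](*args)
    return results
```

`TopologicalSorter` raises `CycleError` on circular references, which gives a cheap validity check before any step runs.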

Web (WebSearch.pt)

Three search engines (DuckDuckGo HTML, Google, ResultHunter) with embedding-ranked candidate deduplication. Content extraction tries 10 methods: YouTube transcripts → trafilatura → boilerpy3 → readability → newspaper3k → goose3 → inscriptis → lxml → BeautifulSoup → visible text. PDF via PyMuPDF. Summarization via DistilBART. Runs in a separate spawned process; communicates via multiprocessing.Queue. Serializes safely: live process handles are stripped on save.
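
The extraction chain is a first-success fallback: each method is tried in order until one returns usable text. The sketch below shows only that control flow. The `extract_first` helper, the `min_chars` threshold, and the regex stand-in extractor are assumptions, and no real extraction library is called.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], Optional[str]]

def strip_tags(html: str) -> str:
    """Crude stand-in for a last-resort "visible text" extractor."""
    return re.sub(r"<[^>]+>", " ", html)

def extract_first(html: str,
                  extractors: Iterable[Tuple[str, Extractor]],
                  min_chars: int = 20) -> Tuple[Optional[str], str]:
    """Return (method_name, text) from the first extractor that yields enough text."""
    for name, fn in extractors:
        try:
            text = fn(html)
        except Exception:
            continue  # a failing extractor must not stop the chain
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, ""
```

Ordering the list from most to least precise means the crude fallbacks only run when the better extractors fail or return too little.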


Usage

Basic

from PackedLLM import PackedLLMRunner

bot = PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice")
print(bot.chat("Explain gradient descent in one paragraph."))

Expert shortcuts (bypass full pipeline)

bot.creative("Write a haiku about a robot discovering music.")
bot.code("Implement binary search in Python with comments.")
bot.math("Solve: integral of x² · sin(x) dx")
bot.logic("All A are B. Some B are C. What follows?", mode="deep_then_answer")
bot.translate("人工智能正在改变世界")
bot.web("Latest developments in solid-state batteries?")
bot.action("Compute compound interest on $5000 at 4% over 10 years; save to report.txt")

Memory

bot.memory_store("User prefers concise answers under 100 words.")
results = bot.memory_recall("answer preferences", top_k=3)
bot.set_user_profile({"name": "Alice", "expertise": "ML"})
bot.set_bot_profile({"character_card": "You are Pip, a direct and slightly sarcastic assistant."})

Lifecycle

bot.unload_expert("vision_expert")   # free VRAM; reloads lazily on next use
bot.reload_expert("code_expert")     # hot-reload after checkpoint update
print(bot.status())                  # full system diagnostic

# Context manager (auto-unload on exit)
with PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice") as bot:
    print(bot.chat("Summarise the Pythagorean theorem."))

Checkpointing

PackedLLM.pt is a ZIP archive containing:

  • manifest.pt: all metadata, profiles, hardware state, embedded source code
  • lm_chunk_N.bin: model weights in 32 MB streaming chunks
  • mem_chunk_N.bin β€” GATOR memory store chunks
  • web_chunk_N.bin β€” WebSearch module chunks
  • box_chunk_N.bin β€” CodeBox chunks
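
A minimal sketch of how a large payload could be split into numbered chunks inside a ZIP archive and reassembled, using only the standard-library `zipfile` module. The helper names and zero-based chunk numbering are assumptions, and the manifest format is not covered.

```python
import io
import zipfile

CHUNK_SIZE = 32 * 1024 * 1024  # 32 MB, matching the chunk size described above

def write_chunks(zf: zipfile.ZipFile, prefix: str, payload: bytes,
                 chunk_size: int = CHUNK_SIZE) -> int:
    """Write payload as <prefix>_chunk_<N>.bin entries; return the chunk count."""
    count = 0
    for offset in range(0, len(payload), chunk_size):
        zf.writestr(f"{prefix}_chunk_{count}.bin", payload[offset:offset + chunk_size])
        count += 1
    return count

def read_chunks(zf: zipfile.ZipFile, prefix: str) -> bytes:
    """Reassemble a payload from its chunks, ordered by numeric index."""
    head, tail = f"{prefix}_chunk_", ".bin"
    names = [n for n in zf.namelist() if n.startswith(head) and n.endswith(tail)]
    # Sort numerically so that chunk_10 follows chunk_9 rather than chunk_1.
    names.sort(key=lambda n: int(n[len(head):-len(tail)]))
    return b"".join(zf.read(n) for n in names)
```

Usage with an in-memory archive:

```python
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    write_chunks(zf, "lm", b"example weights", chunk_size=4)
with zipfile.ZipFile(buf) as zf:
    restored = read_chunks(zf, "lm")
```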

Hardware

PackedLM detects and uses CUDA, Apple Metal (MPS), WebGPU, or CPU automatically via HardwareProbe. For each expert, _plan_offload() estimates the GGUF file size and computes how many transformer layers can fit in free VRAM (with a 15% safety margin for CUDA, 40% for WebGPU). If VRAM is insufficient for a full offload, layers are split proportionally between GPU and CPU.
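
The offload estimate can be sketched as simple arithmetic. This is not the real `_plan_offload()`: the `plan_offload` function below, its signature, and the assumption that every layer takes an equal share of the GGUF file are illustrative only.

```python
def plan_offload(gguf_bytes: int, n_layers: int, free_vram_bytes: int,
                 safety_margin: float = 0.15) -> int:
    """Estimate how many transformer layers fit on the GPU; the rest stay on CPU."""
    # Keep part of free VRAM in reserve (15% for CUDA, 40% for WebGPU per the text).
    usable = free_vram_bytes * (1.0 - safety_margin)
    # Assumption: each layer takes an equal share of the file size.
    per_layer = gguf_bytes / n_layers
    return max(0, min(n_layers, int(usable // per_layer)))
```

For a 2 GiB model with 32 layers and 1 GiB of free VRAM, a 15% margin places 13 layers on the GPU, while a 40% margin places 9.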


Citation

@software{packedllm2026,
  author    = {Chance Brownfield},
  title     = {PackedLLM: A Routing-of-Experts System with LLM-Orchestrated Execution Pipeline},
  year      = {2026},
  note      = {RoE architecture: task-level routing to fully independent specialist LLMs.
               Distinct from token-level Mixture-of-Experts.
               Integrates persistent vector memory (GATOR), sandboxed Python execution (CodeBox),
               and multi-engine web search in a 9-stage orchestration pipeline.}
}

License

This project is licensed under PackedLicense v1.0.

Free for personal, educational, research, and other non-commercial use.

Commercial use requires prior written authorization.

The GATOR, WebSearchModule, and CodeBox components are protected under this license and may not be extracted, redistributed, or commercially reused without authorization.
