PackedLLM

~10B total parameters · ~3B active per inference · Routing-of-Experts (RoE) architecture

PackedLLM is a self-contained multi-expert language model system built around a Routing-of-Experts (RoE) mechanism. Rather than mixing expert outputs at the token level inside a shared transformer (Mixture-of-Experts), PackedLLM routes each request — and each stage of a multi-stage reasoning pipeline — to a dedicated, fully independent specialist model. At most one or two experts are active simultaneously, keeping peak memory around 3B parameters regardless of the 10B total footprint.

The system runs entirely on consumer hardware via llama.cpp, persists its full state to a single ZIP checkpoint, and integrates persistent vector memory, sandboxed Python execution, and multi-engine web search as first-class pipeline citizens.

Architecture overview

PackedLLMRunner          ← user-facing shell: load, warmup, lifecycle
      │
PackedLLM (PackedLLM.pt) ← 9-stage orchestration pipeline
      │
Expert dispatch layer    ← 10 specialist models, one active at a time
      │
 ┌────┴─────────────────────────────┐
 │ GATOR/MemoryBank   CodeBox   Web │  ← integrated modules
 └────┬─────────────────────────────┘
      │
PackedLM (LM.pt)         ← llama.cpp inference engine + ExpertHandles

How it differs from MoE

| | Standard MoE (Mixtral, DeepSeek) | PackedLLM RoE |
|---|---|---|
| Routing granularity | Per-token, inside every transformer layer | Per-task and per-pipeline-stage |
| What gets routed | FFN sub-modules sharing one transformer | Separate, fully independent specialist LLMs |
| Parameters active | Top-K experts × FFN size, across all layers | One expert at a time (~3B peak) |
| Router mechanism | Learned linear gating vector | `HeadExpert`, a full LLM returning JSON |
| Experts share weights? | Yes (all attention layers are always shared) | No (complete independence) |
| Pipeline | Single transformer forward pass | 9-stage: plan → route → execute → synthesize → persona → review |
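
The request-level routing in the right-hand column can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_router` stands in for a router LLM that returns JSON, the `EXPERTS` table and the `{"expert": ...}` schema are hypothetical, and none of it is PackedLLM's actual code.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-ins for fully independent specialist models.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "math_expert": lambda prompt: f"[math] {prompt}",
    "code_expert": lambda prompt: f"[code] {prompt}",
    "head_expert": lambda prompt: f"[head] {prompt}",
}


def toy_router(prompt: str) -> str:
    """Stand-in for a router LLM: returns a JSON string naming one expert."""
    lowered = prompt.lower()
    if any(word in lowered for word in ("integral", "interest", "solve")):
        choice = "math_expert"
    elif "python" in lowered or "implement" in lowered:
        choice = "code_expert"
    else:
        choice = "head_expert"
    return json.dumps({"expert": choice})


def route_request(prompt: str) -> str:
    """Route the whole request once, then run exactly one expert on it."""
    decision = json.loads(toy_router(prompt))
    name = decision.get("expert", "head_expert")
    expert = EXPERTS.get(name, EXPERTS["head_expert"])  # unknown names fall back
    return expert(prompt)
```

The routing decision is made once per request rather than once per token, so only the chosen expert has to be resident while it runs.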

The closest related work is Composition of Experts (Chai et al., 2024, arXiv:2412.01868), which also routes at the input level to full LLM models. PackedLLM extends this with a multi-stage orchestration pipeline, per-stage retry/detour recovery, integrated memory and execution modules, affective state modelling, and a character persona layer — none of which appear in prior systems of this type.

Pipeline stages

In standard mode, every call to forward() passes through these stages in order:

| Stage | Expert | Temperature | What it does |
|---|---|---|---|
| 1. Plan goal | HeadExpert | 0.2 | Parse intent, tone, routing flags (needs_web / needs_action / needs_vision) |
| 2. Consult memory | GATOR | n/a | Match registered commands; retrieve relevant memory |
| 3. Build route | HeadExpert | 0.5 | Generate ordered list of expert steps as JSON |
| 4. Execute route | various | per-expert | Dispatch each step; retry / detour / skip on failure |
| 5. Synthesize base | HeadExpert | 1.0 | Combine step outputs into a persona-free prose answer |
| 6. Affective state | AffectExpert | 0.5 | Generate bot emotional + physical state JSON |
| 7. Apply persona | RoleExpert | 1.0 | Rewrite base response in character |
| 8. Review | HeadExpert | 0.0 | Accept / revise / reject; extract memory facts; profile updates |
| 9. Finalize | GATOR | n/a | Write memory, update user/bot profiles |
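
Stage 4's retry / detour / skip behaviour could look roughly like the sketch below. This is one plausible reading, not the project's implementation: the step schema (`expert`, `task`), the `detours` mapping, and the `max_retries` parameter are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Expert = Callable[[str], str]


def execute_route(
    steps: List[Dict[str, str]],
    experts: Dict[str, Expert],
    detours: Dict[str, str],
    max_retries: int = 1,
) -> List[Dict[str, Optional[str]]]:
    """Run each step; retry on failure, then try a detour expert, else skip."""
    results: List[Dict[str, Optional[str]]] = []
    for step in steps:
        name, task = step["expert"], step["task"]
        # Primary expert first, then its detour expert if one is registered.
        candidates = [name]
        if name in detours:
            candidates.append(detours[name])
        status, output = "skipped", None
        for position, candidate in enumerate(candidates):
            succeeded = False
            for _ in range(max_retries + 1):  # first attempt plus retries
                try:
                    output = experts[candidate](task)
                    succeeded = True
                    break
                except Exception:
                    continue
            if succeeded:
                status = "ok" if position == 0 else "detour"
                break
        results.append({"expert": name, "status": status, "output": output})
    return results
```

A skipped step is recorded with `output=None` rather than aborting the route, so a later synthesis stage can still work with whatever steps succeeded.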

Expert roster

| Expert | Active params | Role | Notes |
|---|---|---|---|
| HeadExpert | ~3B | Orchestrator, router, planner, synthesizer, reviewer | Most-called expert |
| LogicExpert | ~1B | Structured reasoning; deep-think CoT; action planning/repair | Raw completion with `<think>` blocks |
| CodeExpert | ~1B | Python script generation for action pipeline | Temperature 0.0; raw code only, no prose |
| MathExpert | ~1B | Quantitative reasoning | Post-processes CJK spans; deduplicates repeated lines |
| AffectExpert | ~0.5B | Emotional state; step quality evaluation | Used as both emotion classifier and pass/fail judge |
| RoleExpert | ~0.5B | Persona rewriting in character | RP-style chat format |
| CreativeExpert | ~1B | Writing and stylistic generation | High temperature default (0.9) |
| VisionExpert | ~1B | Multimodal image understanding | CLIP projector; local images → data URI |
| ToolExpert | ~0.5B | Function-call generation | Outputs `{"tool_calls": [...]}` JSON |
| TranslationExpert | ~300M | Chinese → English | seq2seq model, not an LLM; Chinese regex gate |

Total: ~10B · Peak active: ~3B
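
To illustrate how a bounded number of resident experts keeps the peak footprint below the total, here is a minimal sketch of a lazily loading pool with least-recently-used eviction. The `ExpertPool` class, its `max_loaded` parameter, and the loader callables are hypothetical; the real system's loading strategy may differ.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, List


class ExpertPool:
    """Keeps at most `max_loaded` experts resident; loads the rest on demand."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]], max_loaded: int = 1) -> None:
        self._loaders = loaders
        self._max_loaded = max_loaded
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        # Evict least recently used experts until there is room for one more.
        while len(self._loaded) >= self._max_loaded:
            self._loaded.popitem(last=False)
        model = self._loaders[name]()  # lazy load on first use
        self._loaded[name] = model
        return model

    def loaded(self) -> List[str]:
        return list(self._loaded)
```

With `max_loaded=1`, requesting a second expert drops the first before loading, so resident memory is bounded by the largest single expert rather than by the sum of the roster.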

Forward modes

Standard (full pipeline)

bot.chat("What is the compound interest on $5000 at 4% over 10 years?")

Runs all 9 stages with memory read/write. The web and action pipelines run only when needed.

Fast think (minimum latency)

bot.chat("What time is it in Tokyo?", fast_think=True)

Skips planning, routing, memory, web, action, affective state, review. HeadExpert answers directly; RoleExpert applies persona if a bot profile exists. Maximum 2 LLM calls.

Deep think (CoT scaffolding)

bot.chat("Design a Python caching decorator with TTL support.", deep_think=True)

Before each pipeline stage, LogicExpert generates <think>...</think> blocks scoped to that stage's specific task and output contract. These blocks are prepended to the stage's prompt as if the executing expert had already done that prior reasoning. Blocks are cached within a single forward() call. Translation is excluded (not an LLM).
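
The caching and prepending described above can be sketched as follows. This is an assumption-laden illustration: `ThinkCache`, `build_stage_prompt`, and the way the stage name and output contract are passed to the reasoner are hypothetical, and the reasoner callable merely stands in for LogicExpert.

```python
from typing import Callable, Dict


class ThinkCache:
    """Cache of <think> blocks for one forward() call, keyed by stage name."""

    def __init__(self, reasoner: Callable[[str], str]) -> None:
        self._reasoner = reasoner  # stand-in for the reasoning expert
        self._blocks: Dict[str, str] = {}

    def block_for(self, stage: str, contract: str) -> str:
        # Generate the reasoning once per stage, then reuse it.
        if stage not in self._blocks:
            reasoning = self._reasoner(f"Stage: {stage}\nOutput contract: {contract}")
            self._blocks[stage] = f"<think>{reasoning}</think>"
        return self._blocks[stage]


def build_stage_prompt(cache: ThinkCache, stage: str, contract: str, prompt: str) -> str:
    """Prepend the stage's cached <think> block to the stage prompt."""
    return f"{cache.block_for(stage, contract)}\n{prompt}"
```

Creating a fresh `ThinkCache` at the start of each call would scope the cache to that call, which matches the "cached within a single forward() call" behaviour described above.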

Integrated modules

MemoryBank (`GATOR.pt`)

A multi-tree semantic store built on PackedTree — a custom embedding + KMeans clustering retrieval structure. Trees: knowledge, conversation, user profiles, bot profiles, commands, assets, telemetry. Hybrid retrieval scoring: 75% semantic similarity + 20% keyword overlap + 5% importance metadata. Embedding model: Jina Embeddings v3 (GGUF, stored inside the checkpoint). GATOR's own action planner uses HeadExpert to decide which memory operations to run. Also contains DesktopControl (OS automation) and CommandRegistry (text-to-action macros).
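
The 75/20/5 weighting can be written out as a small scoring function. Only the three weights come from the description above; the choice of cosine similarity, the definition of keyword overlap as the fraction of query terms found in the document, and an importance value in [0, 1] are assumptions made for this sketch.

```python
import math
from typing import Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query_terms: Set[str], doc_terms: Set[str]) -> float:
    """Fraction of query terms that also occur in the document (assumed metric)."""
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def hybrid_score(
    query_vec: Sequence[float],
    doc_vec: Sequence[float],
    query_terms: Set[str],
    doc_terms: Set[str],
    importance: float,
) -> float:
    """0.75 * semantic + 0.20 * keyword + 0.05 * importance (assumed in [0, 1])."""
    return (
        0.75 * cosine(query_vec, doc_vec)
        + 0.20 * keyword_overlap(query_terms, doc_terms)
        + 0.05 * importance
    )
```

Because the weights sum to 1, a perfect match on all three signals scores 1.0, and the small importance term mostly acts as a tie-breaker between otherwise similar entries.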

CodeBox (`CodeBox.pt`)

Persistent Python sandbox with isolated virtual environment management, SHA256-verified asset registry, loader injection (from _codebox_loader import load_asset inside sandboxed code), DAG pipeline runner with $var reference passing between steps, LRU runner cache for expensive models, and hard RAM/CPU kill thresholds enforced by a monitoring thread.
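
The `$var` reference passing between pipeline steps might work along these lines. The step schema (`fn`, `args`), the `run_dag` name, and the rule that each step is keyed by its output variable are hypothetical; ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which may not be what CodeBox uses.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


def _refs(args: List[Any]) -> set:
    """Names of variables referenced as "$name" in a step's arguments."""
    return {a[1:] for a in args if isinstance(a, str) and a.startswith("$")}


def run_dag(
    steps: Dict[str, Dict[str, Any]],
    functions: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Run steps in dependency order; each step's result is stored under its key."""
    # Map each step to the set of steps it depends on.
    graph = {name: _refs(spec.get("args", [])) for name, spec in steps.items()}
    env: Dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        spec = steps[name]
        # Replace "$var" arguments with the outputs of earlier steps.
        args = [
            env[a[1:]] if isinstance(a, str) and a.startswith("$") else a
            for a in spec.get("args", [])
        ]
        env[name] = functions[spec["fn"]](*args)
    return env
```

Steps can be declared in any order, since the sorter derives execution order from the references. A reference to an undeclared variable raises `KeyError`, and a cyclic reference raises `graphlib.CycleError`.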

Web (`WebSearch.pt`)

Three search engines (DuckDuckGo HTML, Google, ResultHunter) with embedding-ranked candidate deduplication. Content extraction tries 10 methods: YouTube transcripts → trafilatura → boilerpy3 → readability → newspaper3k → goose3 → inscriptis → lxml → BeautifulSoup → visible text. PDF via PyMuPDF. Summarization via DistilBART. Runs in a separate spawned process; communicates via multiprocessing.Queue. Serializes safely — live process handles are stripped on save.
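
The ordered extraction chain can be sketched as a try-in-order loop. To stay self-contained, the example registers only a standard-library visible-text extractor as the last resort; the `extract_first` helper, the `min_chars` threshold, and the success criterion are assumptions, and the third-party extractors named above would be plugged in ahead of it.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


class _VisibleText(HTMLParser):
    """Collects text nodes while ignoring <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_first(
    html: str,
    extractors: List[Tuple[str, Extractor]],
    min_chars: int = 20,
) -> Tuple[Optional[str], str]:
    """Return (method name, text) from the first extractor yielding enough text."""
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            continue  # a failing extractor just hands over to the next one
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, ""
```

Returning the name of the method that succeeded makes it easy to log which extractor handled each page.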

Usage

Basic

from PackedLLM import PackedLLMRunner

bot = PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice")
print(bot.chat("Explain gradient descent in one paragraph."))

Expert shortcuts (bypass full pipeline)

bot.creative("Write a haiku about a robot discovering music.")
bot.code("Implement binary search in Python with comments.")
bot.math("Solve: integral of x² · sin(x) dx")
bot.logic("All A are B. Some B are C. What follows?", mode="deep_then_answer")
bot.translate("人工智能正在改变世界")
bot.web("Latest developments in solid-state batteries?")
bot.action("Compute compound interest on $5000 at 4% over 10 years; save to report.txt")

Memory

bot.memory_store("User prefers concise answers under 100 words.")
results = bot.memory_recall("answer preferences", top_k=3)
bot.set_user_profile({"name": "Alice", "expertise": "ML"})
bot.set_bot_profile({"character_card": "You are Pip, a direct and slightly sarcastic assistant."})

Lifecycle

bot.unload_expert("vision_expert")   # free VRAM; reloads lazily on next use
bot.reload_expert("code_expert")     # hot-reload after checkpoint update
print(bot.status())                  # full system diagnostic

# Context manager (auto-unload on exit)
with PackedLLMRunner("PackedLLM.pt", bot_id="pip", user_id="alice") as bot:
    print(bot.chat("Summarise the Pythagorean theorem."))

Checkpointing

PackedLLM.pt is a ZIP archive containing:

manifest.pt — all metadata, profiles, hardware state, embedded source code
lm_chunk_N.bin — model weights in 32MB streaming chunks
mem_chunk_N.bin — GATOR memory store chunks
web_chunk_N.bin — WebSearch module chunks
box_chunk_N.bin — CodeBox chunks
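
A chunked ZIP layout like this can be written and read with the standard `zipfile` module. The sketch below is not the project's serializer: `write_chunked` and `read_chunked` are hypothetical helpers, chunk numbering is assumed to start at 0, and 32 MB is assumed to mean 32 × 1024 × 1024 bytes.

```python
import io
import zipfile

CHUNK_SIZE = 32 * 1024 * 1024  # assumed interpretation of "32MB"


def write_chunked(zf: zipfile.ZipFile, prefix: str, payload: bytes,
                  chunk_size: int = CHUNK_SIZE) -> int:
    """Store payload as <prefix>_chunk_<N>.bin members; returns the chunk count."""
    count = 0
    for offset in range(0, len(payload), chunk_size):
        zf.writestr(f"{prefix}_chunk_{count}.bin", payload[offset:offset + chunk_size])
        count += 1
    return count


def read_chunked(zf: zipfile.ZipFile, prefix: str) -> bytes:
    """Reassemble a payload by reading numbered chunks until one is missing."""
    names = set(zf.namelist())
    parts = []
    index = 0
    while f"{prefix}_chunk_{index}.bin" in names:
        parts.append(zf.read(f"{prefix}_chunk_{index}.bin"))
        index += 1
    return b"".join(parts)


# Usage: round-trip a small payload through an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    write_chunked(archive, "lm", b"example weights", chunk_size=4)
with zipfile.ZipFile(buffer) as archive:
    restored = read_chunked(archive, "lm")
```

Fixed-size chunks mean a reader never has to hold more than one chunk in memory while streaming weights to their destination.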

Hardware

PackedLM detects and uses CUDA, Apple Metal (MPS), WebGPU, or CPU automatically via HardwareProbe. For each expert, _plan_offload() estimates the GGUF file size and computes how many transformer layers can fit in free VRAM (with a 15% safety margin for CUDA, 40% for WebGPU). If VRAM is insufficient for a full offload, layers are split proportionally between GPU and CPU.
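
The offload arithmetic can be approximated as below. This is a simplified sketch, not `_plan_offload()` itself: the `plan_gpu_layers` name is hypothetical, and it assumes every layer is the same size and ignores KV-cache and scratch-buffer memory.

```python
def plan_gpu_layers(
    gguf_bytes: int,
    n_layers: int,
    free_vram_bytes: int,
    safety_margin: float,
) -> int:
    """Estimate how many transformer layers fit in VRAM after a safety margin.

    Assumes the GGUF file size is spread evenly across layers.
    """
    if gguf_bytes <= 0 or n_layers <= 0:
        return 0
    usable = free_vram_bytes * (1.0 - safety_margin)
    per_layer = gguf_bytes / n_layers
    return max(0, min(n_layers, int(usable // per_layer)))


GIB = 1024 ** 3
# A 4 GiB model with 32 layers and 2 GiB of free VRAM:
cuda_layers = plan_gpu_layers(4 * GIB, 32, 2 * GIB, safety_margin=0.15)
webgpu_layers = plan_gpu_layers(4 * GIB, 32, 2 * GIB, safety_margin=0.40)
```

In this example the 15% margin leaves room for 13 of the 32 layers on the GPU and the 40% margin for 9, with the remaining layers staying on the CPU. The result is the kind of value that llama.cpp bindings typically accept as a GPU layer count.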

Citation

@software{packedllm2026,
  author    = {Chance Brownfield},
  title     = {PackedLLM: A Routing-of-Experts System with LLM-Orchestrated Execution Pipeline},
  year      = {2026},
  note      = {RoE architecture: task-level routing to fully independent specialist LLMs.
               Distinct from token-level Mixture-of-Experts.
               Integrates persistent vector memory (GATOR), sandboxed Python execution (CodeBox),
               and multi-engine web search in a 9-stage orchestration pipeline.}
}

License

This project is licensed under PackedLicense v1.0.

Free for personal, educational, research, and other non-commercial use.

Commercial use requires prior written authorization.

The GATOR, WebSearchModule, and CodeBox components are protected under this license and may not be extracted, redistributed, or commercially reused without authorization.


Paper for HiMind/PackedLLM

Composition of Experts: A Modular Compound AI System Leveraging Large Language Models

Paper • 2412.01868 • Published Dec 2, 2024