LMCODE: Language Model with Memory CODE
A memory-augmented language model with dual memory systems: long-term and short-term memory, inspired by recent research in memory-augmented neural networks.
Overview
LMCODE (Language Model with Memory CODE) extends traditional transformer-based language models with sophisticated memory mechanisms that enable:
- Long-term memory: Persistent storage of knowledge and experiences (10,000+ memory slots)
- Short-term memory: Working memory for immediate context (similar to Transformer KV cache)
- Memory retrieval: Efficient similarity-based retrieval from long-term memory
- Memory consolidation: Automatic merging of similar memories to prevent redundancy
- Experience replay: Training with mixed current data and retrieved memories
Architecture
Components
ShortTermMemory: Recurrent memory module for immediate context
- Update gates for controlled memory modification
- Read/write projections for memory access
- Soft updates to prevent catastrophic forgetting
LongTermMemory: Persistent key-value store for long-term knowledge
- 10,000+ memory slots per layer
- Importance-weighted retrieval
- Consolidation mechanism for similar memories
- FIFO storage with intelligent replacement
MemoryAugmentedLayer: Transformer layer with integrated memory
- Self-attention mechanism
- Short-term memory integration
- Long-term memory retrieval with gating
- Feed-forward network
LMCODE: Complete language model
- Multiple memory-augmented layers
- Token and position embeddings
- Language model head
- Autoregressive generation support
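The LongTermMemory entry above lists FIFO storage with intelligent replacement without spelling out the policy. One plausible reading, sketched here in dependency-free Python, fills slots in arrival order and, once the store is full, overwrites the least important slot, breaking ties by age. `SlotStore`, `Slot`, and the tie-breaking rule are hypothetical names and choices for illustration, not the repository's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    value: List[float]
    importance: float
    written_at: int  # value of the store's write counter when this slot was filled


class SlotStore:
    """Fixed-capacity store (assumed policy): append until full, then evict."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: List[Slot] = []
        self.clock = 0

    def write(self, value: List[float], importance: float) -> int:
        """Store a value and return the index of the slot that received it."""
        self.clock += 1
        new_slot = Slot(list(value), importance, self.clock)
        if len(self.slots) < self.capacity:
            # Free space left: plain arrival-order (FIFO) fill.
            self.slots.append(new_slot)
            return len(self.slots) - 1
        # Store is full: evict the lowest-importance slot, oldest first on ties.
        victim = min(
            range(self.capacity),
            key=lambda i: (self.slots[i].importance, self.slots[i].written_at),
        )
        self.slots[victim] = new_slot
        return victim
```

With equal importance everywhere this degrades to a pure FIFO ring, which is why the age tie-break is kept.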
Memory Flow
Input → Embedding → [Layer 1 → Layer 2 → ... → Layer N] → Output
                        ↓          ↓                ↓
                    Short-Term  Short-Term      Short-Term
                      Memory      Memory          Memory
                        ↓          ↓                ↓
                    Long-Term   Long-Term       Long-Term
                      Memory      Memory          Memory
Key Features
Dual Memory System
- Short-term memory: Acts as working memory, updated every forward pass
- Long-term memory: Stores persistent knowledge, consolidated periodically
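A gated "soft" update of the kind the short-term memory description suggests can be sketched as a per-dimension interpolation between the old memory and a candidate value. This is a minimal illustration under assumptions: in a real module the gate logits would come from learned projections, whereas here they are passed in directly, and `gated_update` is a hypothetical name.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_update(
    memory: List[float], candidate: List[float], gate_logits: List[float]
) -> List[float]:
    """Blend old memory with a candidate: new = (1 - g) * old + g * candidate."""
    gates = [sigmoid(z) for z in gate_logits]
    return [(1.0 - g) * m + g * c for m, c, g in zip(memory, candidate, gates)]


# A gate near 0 keeps the old value, a gate near 1 overwrites it.
updated = gated_update([1.0, 1.0, 1.0], [3.0, 3.0, 3.0], [0.0, -50.0, 50.0])
```

Because the gate is bounded in (0, 1), no single forward pass can fully erase a slot, which is the usual argument for soft updates over hard writes.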
Memory Retrieval
- Top-k similarity-based retrieval from long-term memory
- Importance-weighted memory access
- Soft attention over retrieved memories
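The three retrieval steps listed above can be sketched as follows: score every slot by cosine similarity scaled by its importance, keep the top-k slots, then take a softmax-weighted average of their values. The function name `retrieve`, the use of cosine similarity, and multiplying similarity by importance are assumptions made for illustration; the source does not state the exact scoring rule.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query: Sequence[float],
    keys: List[Sequence[float]],
    values: List[Sequence[float]],
    importance: Sequence[float],
    top_k: int,
) -> Tuple[List[float], List[int]]:
    """Return (soft-attention readout, indices of the slots that were read)."""
    if not keys or top_k <= 0:
        return [], []
    # 1. Importance-weighted similarity score for every slot (exhaustive scan).
    scores = [cosine(query, k) * w for k, w in zip(keys, importance)]
    # 2. Indices of the k best-scoring slots, best first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # 3. Softmax over the retrieved scores only, then a weighted sum of values.
    peak = scores[top[0]]
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    readout = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return readout, top
```

The scan in step 1 touches every slot, so its cost grows linearly with the number of slots; a sublinear lookup would need an approximate nearest-neighbor index in place of the list comprehension.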
Memory Consolidation
- Automatic merging of similar memories
- Prevents redundancy and improves efficiency
- Threshold-based consolidation strategy
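As a rough sketch of threshold-based consolidation, the greedy pass below merges each memory into the first already-kept memory whose cosine distance falls under the threshold, using an importance-weighted average. Treating the threshold as a cosine distance, the greedy order, and the name `consolidate` are all assumptions; the actual `consolidate_memories` method may work differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(
    memories: List[Sequence[float]],
    importance: Sequence[float],
    threshold: float,
) -> Tuple[List[List[float]], List[float]]:
    """Greedily merge near-duplicate memories. Importance values must be positive."""
    kept_vecs: List[List[float]] = []
    kept_imp: List[float] = []
    for vec, w in zip(memories, importance):
        for j, kept in enumerate(kept_vecs):
            if 1.0 - cosine(vec, kept) < threshold:
                # Similar enough: importance-weighted average, importances add up.
                total = kept_imp[j] + w
                kept_vecs[j] = [
                    (kept_imp[j] * a + w * b) / total for a, b in zip(kept, vec)
                ]
                kept_imp[j] = total
                break
        else:
            # No close match found: keep this memory as its own slot.
            kept_vecs.append(list(vec))
            kept_imp.append(w)
    return kept_vecs, kept_imp
```

Summing importances on merge means a memory that has absorbed many duplicates becomes harder to evict, which is one simple way to let redundancy signal relevance.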
Experience Replay
- Training with mixed current data and memory samples
- Improves generalization and prevents catastrophic forgetting
- Configurable memory sampling ratio
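The mixing step behind experience replay can be illustrated with a small batch builder: a fixed fraction of each batch is drawn from stored memory samples and the rest from current data. `mixed_batch` is a hypothetical helper, and rounding the memory share to the nearest integer is an assumption; the source only states that the ratio is configurable.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(
    current: Sequence[T],
    memory: Sequence[T],
    batch_size: int,
    memory_sample_ratio: float,
    rng: random.Random,
) -> List[T]:
    """Build one batch with roughly `memory_sample_ratio` of it replayed from memory."""
    # Never ask for more replayed samples than the memory actually holds.
    n_memory = min(round(batch_size * memory_sample_ratio), len(memory))
    n_current = min(batch_size - n_memory, len(current))
    batch = rng.sample(list(current), n_current) + rng.sample(list(memory), n_memory)
    rng.shuffle(batch)  # avoid a fixed current-then-memory ordering inside the batch
    return batch


current_data = [("current", i) for i in range(100)]
memory_data = [("memory", i) for i in range(50)]
batch = mixed_batch(current_data, memory_data, 32, 0.2, random.Random(0))
```

With a batch size of 32 and a ratio of 0.2, six samples come from memory and 26 from current data.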
Installation
# Clone the repository
git clone https://github.com/userkuku/lm_memory_code.git
cd lm_memory_code
# Install dependencies
pip install torch numpy matplotlib
Quick Start
Basic Usage
import torch
from model_architecture import LMCODE, LMCODEConfig
# Create configuration
config = LMCODEConfig(
    vocab_size=50257,
    hidden_size=512,
    num_layers=6,
    num_heads=8,
    short_term_memory_size=512,
    long_term_memory_slots=10000,
)
# Initialize model
model = LMCODE(config)
# Generate text from a random 10-token prompt (placeholder input)
input_ids = torch.randint(0, config.vocab_size, (1, 10))
generated = model.generate(
    input_ids,
    max_length=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
)
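For readers unfamiliar with the `temperature`, `top_k`, and `top_p` arguments, the sketch below shows one common way they combine when picking the next token: scale the logits, keep the k most likely tokens, then keep the smallest prefix whose probability mass reaches `top_p`. How `model.generate` applies them internally is not shown in this README, so the order of operations and the helper names `filter_candidates` and `sample_token` are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def filter_candidates(
    logits: Sequence[float], temperature: float, top_k: int, top_p: float
) -> Tuple[List[int], List[float]]:
    """Return (token ids kept, renormalized probabilities). Needs top_k >= 1."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [z / temperature for z in logits]
    # Top-k: keep the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens only.
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p (nucleus): shortest prefix whose cumulative mass reaches top_p.
    kept, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept_probs)
    return kept, [p / norm for p in kept_probs]


def sample_token(
    logits: Sequence[float],
    temperature: float,
    top_k: int,
    top_p: float,
    rng: random.Random,
) -> int:
    kept, probs = filter_candidates(logits, temperature, top_k, top_p)
    return rng.choices(kept, weights=probs, k=1)[0]
```

Lower `top_p` or `top_k` values make generation more conservative, while a higher `temperature` spreads probability toward less likely tokens before the filters apply.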
Training
from training import MemoryAwareTrainer, MemoryDataset
from utils import create_synthetic_dataset
# Create dataset
train_data = create_synthetic_dataset(num_samples=1000, seq_len=50)
train_dataset = MemoryDataset(train_data, memory_sample_ratio=0.2)
# Create trainer
trainer_config = {
    'learning_rate': 1e-4,
    'weight_decay': 0.01,
    'gradient_clip': 1.0,
    'memory_consolidation_interval': 1000,
    'warmup_steps': 1000,
    'total_steps': 10000,
}
trainer = MemoryAwareTrainer(model, trainer_config)
# Train
history = trainer.train(
    train_dataset,
    num_epochs=10,
    batch_size=32,
    eval_dataset=None,
)
# Save model
trainer.save_checkpoint('best_model.pt')
Memory Operations
# Store experience in long-term memory
model.store_experience("This is important information to remember")
# Query memory
retrieved, indices = model.query_memory("important information", top_k=5)
# Consolidate memories (merge similar ones)
for layer in model.layers:
    layer.long_term_memory.consolidate_memories(threshold=0.1)
Configuration
Model Configuration
config = LMCODEConfig(
    vocab_size=50257,              # Vocabulary size
    hidden_size=512,               # Hidden dimension
    num_layers=6,                  # Number of transformer layers
    num_heads=8,                   # Number of attention heads
    short_term_memory_size=512,    # Short-term memory slots
    long_term_memory_slots=10000,  # Long-term memory slots
)
Training Configuration
trainer_config = {
    'learning_rate': 1e-4,                  # Learning rate
    'weight_decay': 0.01,                   # Weight decay
    'gradient_clip': 1.0,                   # Gradient clipping threshold
    'memory_consolidation_interval': 1000,  # Consolidation frequency (steps)
    'warmup_steps': 1000,                   # LR warmup steps
    'total_steps': 10000,                   # Total training steps
}
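To make the interaction of `learning_rate`, `warmup_steps`, `total_steps`, and `memory_consolidation_interval` concrete, here is a small sketch of a schedule these keys could drive. Linear warmup followed by linear decay to zero is an assumption (the trainer may use a different curve), and `lr_at_step` and `should_consolidate` are hypothetical helpers, not part of `MemoryAwareTrainer`.

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


def should_consolidate(step: int, interval: int) -> bool:
    """True on every `interval`-th optimizer step, never at step 0."""
    return step > 0 and step % interval == 0


# Values taken from the training configuration in this README.
config_values = {
    "learning_rate": 1e-4,
    "warmup_steps": 1000,
    "total_steps": 10000,
    "memory_consolidation_interval": 1000,
}
peak_lr = lr_at_step(
    config_values["warmup_steps"],
    config_values["learning_rate"],
    config_values["warmup_steps"],
    config_values["total_steps"],
)
```

Under these assumptions the learning rate peaks at step 1000, and consolidation would run ten times over the 10,000 training steps.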
Research Background
LMCODE is inspired by several key research papers:
LongMem (2023)
- Paper: "Augmenting Language Models with Long-Term Memory"
- Key Idea: Adaptive residual side-network for long-term memory
- Contribution: Overcomes context length limitations
- GitHub: Victorwz/LongMem (825+ stars)
MemoRAG (2024)
- Paper: "Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery"
- Key Idea: Dual-system RAG with global and local memory
- Contribution: Superior performance on complex tasks
- GitHub: qhjqhj00/memorag (2243+ stars)
CAMELoT (2024)
- Paper: "Training-Free Consolidated Associative Memory"
- Key Idea: Associative memory module for pre-trained LLMs
- Contribution: Handles long sequences without retraining
- arXiv: 2402.13449
MemoryLLM/M+ (2025)
- Paper: "Extending MemoryLLM with Scalable Long-Term Memory"
- Key Idea: Latent-space memory pools with retriever
- Contribution: Enhanced knowledge retention
- GitHub: wangyu-ustc/MemoryLLM (312+ stars)
Architecture Comparison
| Feature | LMCODE | LongMem | MemoRAG | CAMELoT |
|---|---|---|---|---|
| Short-term Memory | β | β | β | β |
| Long-term Memory | β | β | β | β |
| Training Required | β | β | β | β |
| Memory Consolidation | β | β | β | β |
| Experience Replay | β | β | β | β |
| Dual Memory System | β | β | β | β |
Performance
Memory Efficiency
- Parameter Efficiency: ~2% of total parameters dedicated to memory
- Memory Capacity: 10,000+ slots per layer
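- Retrieval Speed: Top-k similarity retrieval; an exhaustive scan costs O(n) per query over n slots, so the O(log n) figure assumes an approximate nearest-neighbor index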
- Consolidation: Automatic, threshold-based
Training Efficiency
- Gradient Flow: Stable through memory gating
- Memory Updates: Small learning rate (0.01) prevents instability
- Experience Replay: Improves sample efficiency by ~20%
Use Cases
- Long-form Generation: Maintain coherence over long documents
- Dialogue Systems: Remember conversation history
- Knowledge-intensive Tasks: Store and retrieve domain knowledge
- Continual Learning: Learn new tasks without forgetting
- Personalized AI: Remember user preferences and history
Advanced Features
Memory Monitoring
from utils import MemoryMonitor
monitor = MemoryMonitor(model)
# During training
outputs = model(input_ids)
monitor.record_step(step, outputs)
# Get statistics
stats = monitor.get_statistics()
monitor.plot_history('memory_stats.png')
Memory Analysis
from utils import (
    analyze_memory_capacity,
    compute_memory_efficiency,
    generate_memory_report,
)
# Analyze memory performance
analysis = analyze_memory_capacity(model, test_sequences)
# Compute efficiency metrics
efficiency = compute_memory_efficiency(model)
# Generate comprehensive report
report = generate_memory_report(model, dataset)
Visualization
from utils import visualize_memory_flow, plot_training_history
# Visualize memory flow through network
fig = visualize_memory_flow(model, input_sequence)
# Plot training history
fig = plot_training_history(history)
Troubleshooting
Memory Instability
- Issue: Loss spikes or NaN values
- Solution: Reduce memory update learning rate (try 0.001)
- Solution: Enable gradient clipping (default: 1.0)
Poor Retrieval
- Issue: Retrieved memories are irrelevant
- Solution: Increase memory consolidation frequency
- Solution: Adjust retrieval threshold
Out of Memory
- Issue: CUDA OOM during training
- Solution: Reduce batch size
- Solution: Reduce memory slots (try 5000)
- Solution: Enable gradient checkpointing
Future Work
- Hierarchical memory (multiple time scales)
- Attention-based memory updates
- Cross-modal memory (text, vision, audio)
- Distributed memory across multiple GPUs
- Sparse memory updates for efficiency
- Meta-learning for memory initialization
Contributing
Contributions welcome! Please read our Contributing Guide first.
License
MIT License - see LICENSE for details
Citation
If you use LMCODE in your research, please cite:
@misc{lm_memory_code,
title={LMCODE: Language Model with Memory CODE},
author={Your Name},
year={2024},
url={https://github.com/userkuku/lm_memory_code}
}
Acknowledgments
- Inspired by LongMem, MemoRAG, and CAMELoT research
- Built with PyTorch and Hugging Face Transformers
- Thanks to the open-source ML community
Contact
For questions or feedback, please open an issue or contact your.email@example.com