Attention
• Selective Attention Improves Transformer (arXiv:2410.02703)
• Differential Transformer (arXiv:2410.05258)
• TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention (arXiv:2410.05076)
• SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs (arXiv:2410.13276)
• Star Attention: Efficient LLM Inference over Long Sequences (arXiv:2411.17116)
• KV Shifting Attention Enhances Language Modeling (arXiv:2411.19574)
• Entropy-Guided Attention for Private LLMs (arXiv:2501.03489)
• Not All Language Model Features Are Linear (arXiv:2405.14860)
• Your Transformer is Secretly Linear (arXiv:2405.12250)
• MiniMax-01: Scaling Foundation Models with Lightning Attention (arXiv:2501.08313)
• Tensor Product Attention Is All You Need (arXiv:2501.06425)
• Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models (arXiv:2501.13629)
• TransMLA: Multi-head Latent Attention Is All You Need (arXiv:2502.07864)
• Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (arXiv:2502.11089)
• Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information (arXiv:2502.14258)
• How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads (arXiv:2505.15865)
• Learning to Skip the Middle Layers of Transformers (arXiv:2506.21103)
• Limitations of Normalization in Attention Mechanism (arXiv:2508.17821)
• Native Hybrid Attention for Efficient Sequence Modeling (arXiv:2510.07019)
• Attention Sinks in Diffusion Language Models (arXiv:2510.15731)
• Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning (arXiv:2510.19338)
• arXiv:2510.23052
• Kimi Linear: An Expressive, Efficient Attention Architecture (arXiv:2510.26692)
• Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis (arXiv:2601.21709)
• arXiv:2603.15031
• Mixture-of-Depths Attention (arXiv:2603.15619)