Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published 6 days ago
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents Paper • 2605.07630 • Published 6 days ago
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published 14 days ago • 42
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published 14 days ago • 42
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Paper • 2604.16029 • Published 27 days ago • 23
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Paper • 2604.16029 • Published 27 days ago • 23
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published Apr 13 • 65
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration Paper • 2602.05400 • Published Feb 5 • 353
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry Paper • 2512.11558 • Published Dec 12, 2025 • 45
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling Paper • 2510.04204 • Published Oct 5, 2025 • 21
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling Paper • 2510.04204 • Published Oct 5, 2025 • 21 • 2
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling Paper • 2510.04204 • Published Oct 5, 2025 • 21