Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents Paper • 2604.06132 • Published 7 days ago • 113 • 5
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning Paper • 2604.05404 • Published 7 days ago • 41 • 4
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published 8 days ago • 231 • 8
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision Paper • 2604.04934 • Published 8 days ago • 42 • 5
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models Paper • 2604.04707 • Published 8 days ago • 200 • 12
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation Paper • 2604.03922 • Published 9 days ago • 53 • 4