Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation Paper • 2506.19352 • Published Jun 24, 2025
"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration Paper • 2605.21363 • Published 3 days ago • 2
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models Paper • 2511.22787 • Published Nov 27, 2025 • 10
Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues Paper • 2510.19028 • Published Oct 21, 2025 • 8
Diffusion Models Through a Global Lens: Are They Culturally Inclusive? Paper • 2502.08914 • Published Feb 13, 2025
When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts Paper • 2503.16826 • Published Mar 21, 2025
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language Paper • 2505.14395 • Published May 20, 2025 • 6
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31, 2025 • 8
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models Paper • 2410.17578 • Published Oct 23, 2024 • 1
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation Paper • 2412.10424 • Published Dec 10, 2024 • 2
Uncovering Factor Level Preferences to Improve Human-Model Alignment Paper • 2410.06965 • Published Oct 9, 2024
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages Paper • 2406.09948 • Published Jun 14, 2024 • 2