Grayson D.

GraysonD-XYZ

https://GraysonD.XYZ

AI & ML interests

None yet

Recent Activity

reacted to salma-remyx's post with 🚀 3 days ago

In that benchmark comparison, do you even have the sample size to distinguish two models, or are you making decisions based on statistical noise?

"Resolution Diagnostics for Paired LLM Evaluation" offers a simple check: a per-pair resolution ratio q = N/N* that flags when a displayed ranking sits below the resolution floor regardless of p-value.

arXiv: https://arxiv.org/abs/2605.30315v1

Outrider automatically matched this paper to our fork of lm-evaluation-harness and opened a PR implementing the diagnostic.

Configure the action to find new methods tailored to your repo: https://github.com/remyxai/outrider
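As a rough illustration of the ratio q = N/N* mentioned in the post, here is a minimal sketch. The post does not define N*, so this is not the paper's method: it assumes N* is the number of paired items a two-sided normal-approximation test would need to detect the observed mean per-item difference at a chosen significance level and power. The function names `required_pairs` and `resolution_ratio`, and the defaults `alpha=0.05` and `power=0.8`, are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def required_pairs(diffs, alpha=0.05, power=0.8):
    """Assumed stand-in for N*: pairs needed to detect the observed mean
    paired difference (two-sided test, normal approximation).
    This definition is an assumption, not taken from the paper."""
    d = mean(diffs)
    s = stdev(diffs)  # sample standard deviation of per-item differences
    if d == 0:
        return math.inf  # no observed effect: no finite sample resolves it
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n_star = math.ceil(((z_alpha + z_power) * s / abs(d)) ** 2)
    return max(1, n_star)  # guard against zero when every difference is identical


def resolution_ratio(diffs, alpha=0.05, power=0.8):
    """q = N / N*, where N is the number of paired items actually evaluated."""
    return len(diffs) / required_pairs(diffs, alpha, power)


# Per-item score differences (model A minus model B) on 10 shared items.
diffs = [1, 0, 0, -1, 0, 1, 0, 0, 0, 0]
q = resolution_ratio(diffs)
print(f"q = {q:.3f}", "(below the floor)" if q < 1 else "(resolved)")
```

Under this assumed definition, q below 1 means the evaluation has fewer paired items than it would need to separate the two models, so the displayed ranking should not be read as a real difference.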

Organizations

None yet

models 0

None public yet

datasets 0

None public yet