Grayson D.
GraysonD-XYZ
AI & ML interests
None yet
Recent Activity
reacted to salma-remyx's post 3 days ago
In that benchmark comparison, do you even have the sample size to distinguish two models, or are you making decisions based on statistical noise?
"Resolution Diagnostics for Paired LLM Evaluation" offers a simple check: a per-pair resolution ratio q = N/N* that flags when a displayed ranking sits below the resolution floor regardless of p-value.
arXiv: https://arxiv.org/abs/2605.30315v1
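To make the ratio q = N/N* concrete, here is a minimal Python sketch. The post does not say how N* is defined, so this sketch assumes it is the sample size a standard normal-approximation paired power calculation would require to resolve the observed mean difference; the paper's own definition may differ. The function name `resolution_ratio` and the `alpha` and `power` defaults are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def resolution_ratio(scores_a, scores_b, alpha=0.05, power=0.8):
    """Return q = N / N* for paired per-item scores of two models.

    ASSUMPTION: N* is taken to be the sample size needed by a two-sided
    paired z-test to detect the observed mean difference at the given
    alpha and power. This is not confirmed as the paper's definition.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired evaluation needs equal-length score lists")
    if len(scores_a) < 2:
        raise ValueError("need at least two paired items")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    delta = mean(diffs)

    # No observed difference: no finite sample could resolve it.
    if delta == 0:
        return 0.0

    sd = stdev(diffs)
    # Every item differs by the same nonzero amount: trivially resolved.
    if sd == 0:
        return math.inf

    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_star = math.ceil((z * sd / abs(delta)) ** 2)
    return n / n_star
```

Under this assumed definition, q below 1 means the benchmark has fewer paired items than would be needed to resolve the observed gap, so the displayed ranking should be treated as unresolved.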
Outrider automatically matched this paper to our fork of lm-evaluation-harness and opened a PR implementing the diagnostic.
Configure the action to find new methods tailored to your repo: https://github.com/remyxai/outrider
Organizations
None yet