The race for dominance in generative AI has pitted private companies against open-source projects, with benchmark performance often serving as the scoreboard. A recent study from Cornell University shows that these rankings can be manipulated with only a few hundred votes, casting doubt on the validity of such evaluations. The researchers analyzed Chatbot Arena, a platform that crowdsources votes on AI model performance, and demonstrated that its rankings can be influenced with relatively little effort. Such manipulation could misrepresent the abilities of AI models, raising ethical questions in the field.
"When we talk about large language models, their performance on benchmarks is very important... which makes some startups motivated to get or manipulate the benchmark."
"We just need to take hundreds of new votes to improve a single ranking position. The technique is very simple."
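Chatbot Arena aggregates pairwise votes into Elo-style ratings, which helps explain why a few hundred one-sided votes can move a model up the leaderboard. The sketch below is a minimal illustration, not the study's actual method: the K-factor, starting ratings, and two-model matchup are all hypothetical assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """Apply one Elo update; score_a is 1 if A wins the vote, 0 if A loses.

    k=4.0 is an assumed small online K-factor, not Chatbot Arena's
    actual configuration.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# Hypothetical leaderboard: the target model starts tied with a rival.
target, rival = 1200.0, 1200.0

# An attacker casts a few hundred votes, always preferring the target.
for _ in range(300):
    target, rival = elo_update(target, rival, score_a=1.0)

print(f"target rating after 300 rigged votes: {target:.1f}")
print(f"rival rating:                         {rival:.1f}")
```

Because each win shifts rating from the loser to the winner, a burst of consistent votes opens a gap large enough to change ranking positions, even though organic traffic involves many more voters overall.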