
"The AI industry has become adept at measuring itself. Benchmarks improve, model scores rise, and every new release arrives with a list of metrics meant to signal progress. And yet, somewhere between the lab and real life, something keeps slipping. Which model actually better to use? Which answers would a human trust? Which system would you put in front of customers, employees, or citizens and feel comfortable standing behind it?"
"LMArena's answer was simple and radical: stop scoring models in isolation. On its platform, users submit a prompt and receive two anonymized responses. No branding. No model names. Just answers. Then the user picks the better one, or neither. One vote. One comparison. Repeated millions of times. The result isn't a definitive "best," but a living signal of human preference , how people respond to tone, clarity, verbosity and real-w"
Benchmarks and standardized tests drove measured performance up but failed to capture real-world behavior as models optimized for the tests and converged. Meanwhile, AI moved into everyday workflows, where trust and reliability matter more than benchmark scores for customer-facing and professional use. LMArena collects anonymized pairwise comparisons: users see two unlabeled responses to the same prompt and choose which they prefer, or indicate that neither suffices. Millions of these judgments create a continuous, living signal of human preference across tone, clarity, verbosity, and practical usefulness. Investors committed $150 million at a $1.7 billion valuation in a round led by Felicis and UC Investments.
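How do millions of individual votes become a ranking? LMArena's published methodology is based on statistical preference models in the Bradley-Terry family; the sketch below illustrates the same general idea with a simpler Elo-style update. The model names, starting rating, and K-factor are illustrative assumptions, not LMArena's actual parameters or implementation.

```python
from collections import defaultdict

K = 32  # illustrative K-factor (assumption, not LMArena's setting)

# Every model starts at an arbitrary baseline rating of 1000.
ratings = defaultdict(lambda: 1000.0)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic (Elo) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(model_a: str, model_b: str, outcome: float) -> None:
    """Update ratings from one anonymized pairwise vote.

    outcome: 1.0 if the user preferred A, 0.0 if B, 0.5 for "neither".
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)          # winner moves up
    ratings[model_b] += K * (e_a - outcome)          # loser moves down

# Example: three votes between two hypothetical models
record_vote("model-x", "model-y", 1.0)   # user preferred model-x
record_vote("model-x", "model-y", 0.0)   # user preferred model-y
record_vote("model-x", "model-y", 0.5)   # neither sufficed

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.1f}")
```

Each vote nudges the winner's rating up and the loser's down in proportion to how surprising the outcome was, so the leaderboard stabilizes as votes accumulate, without any model ever being scored in isolation.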