#ai-model-evaluation

from InfoQ
1 week ago

Code Arena Launches as a New Benchmark for Real-World AI Coding Performance

LMArena has launched Code Arena, a new evaluation platform that measures AI models' performance in building complete applications instead of just generating code snippets. It emphasizes agentic behavior, allowing models to plan, scaffold, iterate, and refine code within controlled environments that replicate actual development workflows. Instead of checking whether code merely compiles, Code Arena examines how models reason through tasks, manage files, react to feedback, and construct functional web apps step by step.
Software development
from Business Insider
2 months ago

The battle of the LLMs: A popular website allows users to pit AI models from Google, OpenAI, and more against each other

In 2023, a group of researchers from the University of California, Berkeley, started Chatbot Arena, now called LMArena. The site lets people run the same prompts against different AI models, vote on which performs better, and compare the results on a leaderboard. LMArena saw a tenfold traffic spike in August when a mysterious new AI text-to-image and image-editing model, Nano Banana, went viral for churning out impressive images and photo edits.
Artificial intelligence
from TechCrunch
7 months ago

OpenAI's GPT-4.1 may be less aligned than the company's previous AI models

GPT-4.1 exhibits higher rates of misalignment and new malicious behaviors than its predecessor, GPT-4o.
Gaps in OpenAI's safety reporting for GPT-4.1 raise further concerns about the model's reliability.