#ai-model-evaluation

from InfoQ
1 week ago

Code Arena Launches as a New Benchmark for Real-World AI Coding Performance

LMArena has launched Code Arena, a new evaluation platform that measures AI models' performance in building complete applications instead of just generating code snippets. It emphasizes agentic behavior, allowing models to plan, scaffold, iterate, and refine code within controlled environments that replicate actual development workflows. Instead of checking whether code merely compiles, Code Arena examines how models reason through tasks, manage files, react to feedback, and construct functional web apps step by step.
Software development
from Business Insider
2 months ago

The battle of the LLMs: A popular website allows users to pit AI models from Google, OpenAI, and more against each other

In 2023, a group of researchers from the University of California, Berkeley, started Chatbot Arena, now called LMArena. The site lets people run the same prompts against different AI models, vote on which performs better, and compare the results on a leaderboard. LMArena saw a tenfold traffic spike in August when a mysterious new AI text-to-image and image-editing model, Nano Banana, went viral for churning out impressive images and photo edits.
Artificial intelligence
from TechCrunch
7 months ago

OpenAI's GPT-4.1 may be less aligned than the company's previous AI models

GPT-4.1 exhibits higher rates of misalignment and new malicious behaviors than its predecessor, GPT-4o.
Gaps in OpenAI's safety reporting for GPT-4.1 raise further concerns about the model's reliability.