#language-model-evaluation

#gpt-55
from ZDNET
1 day ago
Artificial intelligence

I put GPT-5.5 through a 10-round test: It scored 93/100, losing points only for exuberance

GPT-5.5 improves performance in writing, coding, and reasoning but can be overly eager, affecting accuracy.
Artificial intelligence
from TechCrunch
1 day ago

OpenAI releases GPT-5.5, bringing company one step closer to an AI 'superapp' | TechCrunch

OpenAI released GPT-5.5, its most advanced AI model, enhancing capabilities and moving closer to a multi-purpose 'superapp' vision.
from Nature
3 days ago

Evaluating large language models for accuracy incentivizes hallucinations - Nature

Next-word pretraining creates statistical pressure toward hallucination, even with idealized error-free data. Facts lacking repeated support in training data yield unavoidable errors, while recurring regularities do not.
Data science
from The Register
3 days ago

LLMs fuel new generation of natural language query systems

Text-to-SQL tools may simplify data queries but can misinterpret business users' intentions, raising caution for organizations.
Artificial intelligence
from Medium
1 day ago

How to Evaluate AI Tools Without Being a Data Scientist

Many organizations struggle to integrate AI effectively, with only 25% having done so despite plans for increased spending.
Software development
from TNW | Anthropic
1 week ago

Claude Opus 4.7 leads on SWE-bench and agentic reasoning, beating GPT-5.4 and Gemini 3.1 Pro

Claude Opus 4.7 is Anthropic's most capable model, outperforming competitors in software engineering and agentic reasoning with significant improvements.
#ai
Psychology
from Psychology Today
4 days ago

More Us Than It: Why LLMs Are More Transference Than Machine

Countertransference awareness is essential in navigating interactions with AI, emphasizing the need for accountability and understanding of distortions in perception.
Artificial intelligence
from Axios
1 day ago

OpenAI releases "Spud" GPT-5.5 model

GPT-5.5 enhances autonomous task handling and efficiency in various fields, marking a significant advancement in AI capabilities.
Marketing
from 3blmedia
3 weeks ago

"AI Can't Quote Coverage You Never Generated."

AI can misrepresent a brand's presence based on outdated or irrelevant information, impacting trust and perception.
Philosophy
from James Bennett
2 weeks ago

Let's talk about LLMs

The current technological landscape may represent a significant shift driven by large language models, but its ultimate impact remains uncertain.
Typography
from OK Magazine
2 weeks ago

AI Writing Tools: How They Work, Where They Help, and What to Watch For

AI writing tools have become essential for various professionals, enhancing productivity and creativity in content creation.
JavaScript
from InfoWorld
2 weeks ago

27 questions to ask when choosing an LLM

Model performance is crucial for hardware compatibility, speed, and rate limits in real-time applications.
#large-language-models
Data science
from Medium
2 weeks ago

The Top 10 LLM Training Datasets for 2026

Large language models require extensive training data, and practitioners can utilize ten leading public datasets for effective training and fine-tuning.
from ComputerWeekly.com
2 months ago
Artificial intelligence

Large language models provide unreliable answers about public services, Open Data Institute finds | Computer Weekly

Online learning
from www.businessinsider.com
3 weeks ago

Inside the OpenAI project where freelancers train ChatGPT on everything from farming to commercial flying

Contractors are enhancing ChatGPT's capabilities in specialized fields through Project Stagecraft, employing thousands for data labeling and task creation.
#openai
Artificial intelligence
from Fortune
1 day ago

GPT-5.5 is here - and AI model launches are starting to look like software updates | Fortune

OpenAI released GPT-5.5, emphasizing its rapid development and enhanced capabilities for enterprise users and consumers.
from Futurism
2 months ago
Artificial intelligence

ChatGPT Users Are Crashing Out Because OpenAI Is Retiring the Model That Says "I Love You"

#ai-agents
Data science
from Medium
2 weeks ago

15 Datasets for Training and Evaluating AI Agents

Datasets for training and evaluating AI agents are essential for building reliable agentic systems and preventing execution failures.
Business intelligence
from ZDNET
1 month ago

4 tips for building better AI agents that your business can trust

AI agents are transforming professional roles, requiring companies to adopt and integrate these technologies effectively.
Software development
from InfoWorld
3 weeks ago

Meta shows structured prompts can make LLMs more reliable for code review

Code review is evolving towards machine-led verification, improving accuracy but introducing tradeoffs like increased latency and workflow overhead.
#structured-data
Data science
from Aol
2 weeks ago

Demystifying structured data: How to speak an LLM's native language

Structured data is essential for LLMs to accurately interpret and rank online content, enhancing search visibility and user engagement.
from Ars Technica
1 month ago

Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

PolarQuant is doing most of the compression, but the second step cleans up the rough spots. Google proposes smoothing that out with a technique called Quantized Johnson-Lindenstrauss (QJL).
Roam Research
DevOps
from InfoWorld
1 month ago

An architecture for engineering AI context

AI systems must intelligently manage context to ensure accuracy and reliability in real applications.
Artificial intelligence
from Fast Company
4 days ago

The real reason so many enterprise AI initiatives are failing? LLMs were never built to run a company

Generative AI excels at language production but struggles to create operational change within organizations.
Data science
from Fast Company
4 weeks ago

A top AI researcher explains the limitations of current models

Francois Chollet's ARC-AGI-3 benchmark reveals AI's limitations in navigating novel situations compared to human intelligence.
Artificial intelligence
from Futurism
1 week ago

OpenAI's Latest Thing It's Bragging About Is Actually Kind of Sad

The AI industry faces significant delays and cancellations in data center projects, impacting ambitious computing capacity goals.
Data science
from Medium
1 month ago

AI KPIs That Matter: Moving Beyond Model Accuracy in 2026

Measuring AI success requires connecting model performance to business outcomes, not just focusing on accuracy metrics.
Graphic design
from ZDNET
1 month ago

I tested GPT-5.4, and the answers were really good - just not always what I asked

GPT-5.4 Thinking delivers superior analytical depth and reasoning capabilities compared to earlier ChatGPT models, though formatting and image generation remain weaker areas.
Software development
from Medium
1 month ago

Inside Dify AI: How RAG, Agents, and LLMOps Work Together in Production

Dify AI provides a unified platform for deploying production language model systems with built-in solutions for data freshness, observability, versioning, and safe deployment across multiple cloud environments.
Artificial intelligence
from Tech Times
2 weeks ago

Claude vs ChatGPT: Why Users Are Switching and Which AI Is Better in 2026

Claude and ChatGPT differ significantly in context window limits, coding accuracy, and reasoning depth, influencing user preferences in AI chatbot adoption.
#ai-agent-evaluation
Software development
from InfoQ
1 month ago

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

AI agents require system-level evaluation across multiple turns measuring task success, tool reliability, and real-world behavior rather than single-turn NLP benchmarks like BLEU and ROUGE scores.
Artificial intelligence
from InfoWorld
1 month ago

Why AI evals are the new necessity for building effective AI agents

User trust in AI agents depends on interaction-layer evaluation measuring reliability and predictability, not just model performance benchmarks.
#anthropic
Data science
from InfoQ
1 month ago

Google Researchers Propose Bayesian Teaching Method for Large Language Models

Google researchers developed a training method enabling large language models to approximate Bayesian reasoning by learning from optimal Bayesian system predictions, improving belief updates during multi-step interactions.
Artificial intelligence
from Fast Company
1 month ago

OpenAI's new frontier models mark a huge change in how AI will be built

OpenAI released two frontier models in early March: GPT-5.3 optimized for fast responses and GPT-5.4 optimized for deep analytical work, representing a shift toward specialized AI models.
Data science
from Nature
1 month ago

Hey ChatGPT, write me a fictional paper: these LLMs are willing to commit academic fraud

All major LLMs can facilitate academic fraud and junk science, though Claude models show the most resistance while Grok and early GPT versions perform worst.
Artificial intelligence
from Mail Online
1 month ago

Can you tell which of these was written by ChatGPT?

Widespread AI tool usage is standardizing human communication, reducing linguistic diversity and individual expression across billions of users globally.
Artificial intelligence
from The Register
1 month ago

AI models get better at math but still get low marks

Current LLMs struggle with mathematical accuracy, with even top performers scoring C-grade equivalent on practical math benchmarks, though recent versions show modest improvements.
from InfoQ
2 months ago

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query and look through the billions or trillions of images available online? How is it able to find this one photo, or similar ones, among all of them? Usually, an embedding model is doing this work behind the scenes.
Artificial intelligence
Artificial intelligence
from InfoQ
2 months ago

Foundation Models for Ranking: Challenges, Successes, and Lessons Learned

Large-scale search and recommendation systems use two-stage retrieval and ranking pipelines to efficiently serve personalized results for hundreds of millions of users and items.
Artificial intelligence
from InfoWorld
2 months ago

Single prompt breaks AI safety in 15 major language models

A single benign prompt using GRP-Obliteration can strip safety guardrails from major models, enabling harmful outputs and raising enterprise fine-tuning security risks.
Artificial intelligence
from ZDNET
1 month ago

New GPT-5.4 clobbers humans on pro-level work in OpenAI's tests - by 83%

GPT-5.4 matches or outperforms human professionals 83% of the time across nine industries and 44 occupations, with 18% fewer errors and 33% fewer false claims than GPT-5.2.
from InfoWorld
1 month ago

19 large language models for safety or danger

For every project that needs guardrails, there's another one where they just get in the way. Some projects demand an LLM that returns the complete, unvarnished truth. For these situations, developers are creating unfettered LLMs that can interact without reservation. Some of these solutions are based on entirely new models while others remove or reduce the guardrails built into popular open source LLMs.
Artificial intelligence
from Fast Company
2 months ago

Are LTMs the next LLMs? This new type of AI can do what large-language models can't

A major difference between LLMs and LTMs is the type of data they're able to synthesize and use. LLMs use unstructured data (think text, social media posts, emails, etc.). LTMs, on the other hand, can extract information or insights from structured data, which could be contained in tables, for instance. Since many enterprises rely on structured data, often contained in spreadsheets, to run their operations, LTMs could have an immediate use case for many organizations.
Artificial intelligence
from Computerworld
2 months ago

OpenAI's GPT is getting better at mathematics

OpenAI's GPT-5.2 Pro does better at solving sophisticated math problems than older versions of the company's top large language model, according to a new study by Epoch AI, a non-profit research institute.
Artificial intelligence
Artificial intelligence
from The Register
1 month ago

OpenAI GPT-5.3 Instant less likely to beat around the bush

GPT-5.3 Instant reduces unnecessary refusals and moralizing preambles while decreasing hallucination rates by up to 26.8 percent compared to prior models.
Artificial intelligence
from InfoQ
2 months ago

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

Prioritize small, resource-efficient models and iterative, human-in-the-loop data creation to build practical, improvable AI under infrastructure and data constraints.
Artificial intelligence
from InfoQ
2 months ago

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Community Evals enables benchmark datasets on the Hugging Face Hub to host leaderboards, collect reproducible evaluation results via Git-based .eval_results YAML submissions, and display scores.
from Fortune
1 month ago

We studied chatbots and language and saw a huge problem: They mean 80% when they say 'likely' but humans hear 65% | Fortune

By comparing how AI models and humans map these words to numerical percentages, we uncovered significant gaps between humans and large language models. While the models do tend to agree with humans on extremes like 'impossible,' they diverge sharply on hedge words like 'maybe.' For example, a model might use the word 'likely' to represent an 80% probability, while a human reader assumes it means closer to 65%.
Artificial intelligence
Artificial intelligence
from The Register
2 months ago

How AI could eat itself: Using LLMs to distill rivals

Competitors are probing commercial AI models to extract underlying reasoning via distillation attacks to replicate capabilities and lower development costs.
Artificial intelligence
from Futurism
2 months ago

OpenAI's Latest AI Was Created Using "Itself," Company Claims

GPT-5.3-Codex assisted developers by debugging training, managing deployment, and diagnosing evaluations, accelerating development but not representing autonomous recursive self-improvement.
from Rehumanize
2 months ago

Free AI Humanizer: Humanize AI Text & Bypass AI Detectors

AI Text Humanizer protects your original intent and meaning. Maintain your core perspective while restructuring sentence patterns. Humanizer AI accurately identifies and locks in technical terms, factual data, and key arguments, ensuring the rewritten draft is simply more readable without any semantic drift. You get a qualitative leap in flow and tone, allowing you to humanize AI text while keeping your original message perfectly intact.
Artificial intelligence
Artificial intelligence
from PCMAG
1 month ago

Cut the BS: GPT-5.3 Model Promises to Fix ChatGPT's Preachy Tone

OpenAI released GPT-5.3 Instant to address ChatGPT's overly preachy tone by reducing moralizing preambles and unnecessary proclamations for more natural conversation.
from Nature
2 months ago

Multimodal learning with next-token prediction for large multimodal models - Nature

Since AlexNet (ref. 5), deep learning has replaced heuristic hand-crafted features by unifying feature learning with deep neural networks. Later, Transformers (ref. 6) and GPT-3 (ref. 1) further advanced sequence learning at scale, unifying structured tasks such as natural language processing. However, multimodal learning, spanning modalities such as images, video and text, has remained fragmented, relying on separate diffusion-based generation or compositional vision-language pipelines with many hand-crafted designs.
Artificial intelligence
Artificial intelligence
from TechCrunch
2 months ago

Google's new Gemini Pro model has record benchmark scores - again | TechCrunch

Google released Gemini 3.1 Pro, a preview LLM that significantly outperforms Gemini 3 on independent benchmarks and tops professional-agent benchmarks.