#llm-evalkit

Artificial intelligence
fromFuturism
1 day ago

OpenAI's Latest Thing It's Bragging About Is Actually Kind of Sad

The AI industry faces significant delays and cancellations in data center projects, impacting ambitious computing capacity goals.
#large-language-models
Data science
fromMedium
4 days ago

The Top 10 LLM Training Datasets for 2026

Large language models require extensive training data, and practitioners can utilize ten leading public datasets for effective training and fine-tuning.
fromComputerWeekly.com
1 month ago
Artificial intelligence

Large language models provide unreliable answers about public services, Open Data Institute finds | Computer Weekly

Philosophy
fromJames Bennett
4 days ago

Let's talk about LLMs

The current technological landscape may represent a significant shift driven by large language models, but its ultimate impact remains uncertain.
JavaScript
fromInfoWorld
1 week ago

27 questions to ask when choosing an LLM

Model performance is crucial for hardware compatibility, speed, and rate limits in real-time applications.
Typography
fromOK Magazine
5 days ago

AI Writing Tools: How They Work, Where They Help, and What to Watch For

AI writing tools have become essential for various professionals, enhancing productivity and creativity in content creation.
fromInfoWorld
5 days ago

The winners and losers of AI coding

Legacy software, often described as 'big balls of mud,' has accumulated over decades, becoming difficult to maintain and understand. These systems rely on extensive teams to function, despite their outdated technology.
Software development
#ai-models
Artificial intelligence
fromTheregister
1 day ago

The AI divide putting open weights models in spotlight

Open weights AI models are evolving from research projects to serious enterprise products, highlighting a growing divide between enterprise and frontier AI.
Psychology
fromLesswrong
1 week ago

A Mirror Test For LLMs - LessWrong

A new measure of LLM self-awareness is proposed, but current models ultimately fall short in demonstrating true self-awareness.
#ai-agents
Data science
fromMedium
1 week ago

15 Datasets for Training and Evaluating AI Agents

Datasets for training and evaluating AI agents are essential for building reliable agentic systems and preventing execution failures.
fromInfoWorld
2 months ago
Artificial intelligence

Researchers reveal flaws in AI agent benchmarking

Benchmarking for AI agents favors models that perform well on tests but fail in real-world use, requiring evaluation reforms emphasizing realistic tasks, goals, and environments.
fromZDNET
2 months ago
Artificial intelligence

Is your AI agent up to the task? 3 ways to determine when to delegate

AI agents should be managed as an adjunct workforce, using management skills to decide which tasks to automate versus retain for humans.
Business intelligence
fromZDNET
3 weeks ago

4 tips for building better AI agents that your business can trust

AI agents are transforming professional roles, requiring companies to adopt and integrate these technologies effectively.
Online learning
fromwww.businessinsider.com
1 week ago

Inside the OpenAI project where freelancers train ChatGPT on everything from farming to commercial flying

Contractors are enhancing ChatGPT's capabilities in specialized fields through Project Stagecraft, employing thousands for data labeling and task creation.
Software development
fromInfoWorld
1 week ago

Meta shows structured prompts can make LLMs more reliable for code review

Code review is evolving towards machine-led verification, improving accuracy but introducing tradeoffs like increased latency and workflow overhead.
#structured-data
Data science
fromAol
1 week ago

Demystifying structured data: How to speak an LLM's native language

Structured data is essential for LLMs to accurately interpret and rank online content, enhancing search visibility and user engagement.
Gadgets
fromTheregister
2 weeks ago

HP stuffs OpenAI LLM into new laptops in bid for small biz

HP IQ is a new AI collaboration tool from HP designed to enhance productivity in business laptops.
Artificial intelligence
fromFast Company
4 days ago

Did Anthropic just soft-launch the scariest AI model yet?

Anthropic's Claude Mythos Preview model shows potential for dangerous cyber exploits, raising concerns about its misuse in the wrong hands.
#ai
Data science
fromInfoQ
1 week ago

Context Engineering with Adi Polak

Context engineering moves beyond prompt engineering to enhance AI systems by adapting language and practices for better model interaction.
#ollama
Artificial intelligence
fromTech Times
5 days ago

Claude vs ChatGPT: Why Users Are Switching and Which AI Is Better in 2026

Claude and ChatGPT differ significantly in context window limits, coding accuracy, and reasoning depth, influencing user preferences in AI chatbot adoption.
Software development
fromMedium
2 weeks ago

The Verifier-Compiler Loop: Turning Human Preferences into Production Agent Judgment

Production failures arise from compounded small errors in long workflows, not just isolated prompt failures.
#openai
Artificial intelligence
fromThe Verge
5 days ago

The vibes are off at OpenAI

OpenAI faces instability despite significant funding and brand recognition, with recent controversies and project discontinuations raising questions about its future.
fromFuturism
2 months ago
Artificial intelligence

ChatGPT Users Are Crashing Out Because OpenAI Is Retiring the Model That Says "I Love You"

Artificial intelligence
fromFuturism
5 days ago

Analysis Finds That Google's AI Overviews Are Providing Misinformation at a Scale Possibly Unprecedented in the History of Human Civilization

Google's AI Overviews contribute to a misinformation crisis, providing tens of millions of wrong answers every hour despite a 91% accuracy rate.
#llm-safety
Information security
fromInfoWorld
1 month ago

19 large language models redefining AI safety-and danger

Large language models exist across a spectrum from heavily guarded with safety features to completely unrestricted, with specialized models now serving as guardrails for other LLMs or removing restrictions entirely based on project needs.
fromNature
2 months ago
Artificial intelligence

Training large language models on narrow tasks can lead to broad misalignment - Nature

Graphic design
fromZDNET
1 month ago

I tested GPT-5.4, and the answers were really good - just not always what I asked

GPT-5.4 Thinking delivers superior analytical depth and reasoning capabilities compared to earlier ChatGPT models, though formatting and image generation remain weaker areas.
fromTechzine Global
6 days ago

Meta is developing open-source versions of its next frontier AI models

Meta is working on two proprietary frontier models: Avocado, a large language model, and Mango, a multimedia file generator. The open-source variants are expected to be made available at a later date.
Artificial intelligence
Software development
fromMedium
3 weeks ago

Inside Dify AI: How RAG, Agents, and LLMOps Work Together in Production

Dify AI provides a unified platform for deploying production language model systems with built-in solutions for data freshness, observability, versioning, and safe deployment across multiple cloud environments.
Data science
fromInfoQ
1 month ago

Google Researchers Propose Bayesian Teaching Method for Large Language Models

Google researchers developed a training method enabling large language models to approximate Bayesian reasoning by learning from optimal Bayesian system predictions, improving belief updates during multi-step interactions.
#ai-agent-evaluation
Software development
fromInfoQ
3 weeks ago

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

AI agents require system-level evaluation across multiple turns measuring task success, tool reliability, and real-world behavior rather than single-turn NLP benchmarks like BLEU and ROUGE scores.
Artificial intelligence
fromInfoWorld
3 weeks ago

Why AI evals are the new necessity for building effective AI agents

User trust in AI agents depends on interaction-layer evaluation measuring reliability and predictability, not just model performance benchmarks.
fromInfoQ
1 month ago
Artificial intelligence

Microsoft Open Sources Evals for Agent Interop Starter Kit to Benchmark Enterprise AI Agents

#anthropic
fromInfoWorld
2 months ago
Information security

Three vulnerabilities in Anthropic Git MCP Server could let attackers tamper with LLMs

Software development
fromInfoWorld
4 weeks ago

How to build an AI agent that actually works

Successful agents embed intelligence within structured workflows at specific decision points rather than operating autonomously, combining deterministic processes with reasoning models where judgment is needed.
Software development
fromInfoQ
1 month ago

The Oil and Water Moment in AI Architecture

Software architecture is transitioning to AI architecture, requiring architects to manage the coexistence of deterministic systems with non-deterministic AI behavior while shifting from tool-centric to intent-centric thinking.
Artificial intelligence
fromFast Company
3 weeks ago

OpenAI's new frontier models mark a huge change in how AI will be built

OpenAI released two frontier models in early March: GPT-5.3 optimized for fast responses and GPT-5.4 optimized for deep analytical work, representing a shift toward specialized AI models.
Artificial intelligence
fromGadget Review
3 weeks ago

ChatGPT Vs. Gemini: One $20 Plan Completely Destroys The Other

ChatGPT excels in text-based content creation and reasoning, while Gemini Advanced dominates multimedia processing and multitasking across diverse data types.
Artificial intelligence
fromMail Online
1 month ago

Can you tell which of these was written by ChatGPT?

Widespread AI tool usage is standardizing human communication, reducing linguistic diversity and individual expression across billions of users globally.
Artificial intelligence
fromZDNET
1 month ago

New GPT-5.4 clobbers humans on pro-level work in OpenAI's tests - by 83%

GPT-5.4 matches or outperforms human professionals 83% of the time across nine industries and 44 occupations, with 18% fewer errors and 33% fewer false claims than GPT-5.2.
Artificial intelligence
fromTheregister
1 month ago

OpenAI GPT-5.3 Instant less likely to beat around the bush

GPT-5.3 Instant reduces unnecessary refusals and moralizing preambles while decreasing hallucination rates by up to 26.8 percent compared to prior models.
Artificial intelligence
fromTechCrunch
1 month ago

ChatGPT's new GPT-5.3 Instant model will stop telling you to calm down | TechCrunch

OpenAI's GPT-5.3 Instant reduces condescending tone and unnecessary reassurance phrases that frustrated users in previous versions.
Artificial intelligence
fromPCMAG
1 month ago

Cut the BS: GPT-5.3 Model Promises to Fix ChatGPT's Preachy Tone

OpenAI released GPT-5.3 Instant to address ChatGPT's overly preachy tone by reducing moralizing preambles and unnecessary proclamations for more natural conversation.
Artificial intelligence
fromTheregister
1 month ago

AI models get better at math but still get low marks

Current LLMs struggle with mathematical accuracy, with even top performers scoring C-grade equivalent on practical math benchmarks, though recent versions show modest improvements.
Artificial intelligence
fromInfoQ
1 month ago

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Community Evals enables benchmark datasets on the Hugging Face Hub to host leaderboards, collect reproducible evaluation results via Git-based .eval_results YAML submissions, and display scores.
Artificial intelligence
fromInfoQ
2 months ago

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

Prioritize small, resource-efficient models and iterative, human-in-the-loop data creation to build practical, improvable AI under infrastructure and data constraints.
fromInfoQ
1 month ago

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query, look for images in the billions, trillions of images that are available online? How is it able to find this one or similar photos from all that? Usually, there is an embedding model that is doing this work behind the hood.
Artificial intelligence
fromFast Company
2 months ago

Are LTMs the next LLMs? This new type of AI can do what large-language models can't

A major difference between LLMs and LTMs is the type of data they're able to synthesize and use. LLMs use unstructured data-think text, social media posts, emails, etc. LTMs, on the other hand, can extract information or insights from structured data, which could be contained in tables, for instance. Since many enterprises rely on structured data, often contained in spreadsheets, to run their operations, LTMs could have an immediate use case for many organizations.
Artificial intelligence
fromArs Technica
2 months ago

Has Gemini surpassed ChatGPT? We put the AI models to the test.

For this test, we're comparing the default models that both OpenAI and Google present to users who don't pay for a regular subscription- ChatGPT 5.2 for OpenAI and Gemini 3.2 Fast for Google. While other models might be more powerful, we felt this test best recreates the AI experience as it would work for the vast majority of Siri users, who don't pay to subscribe to either company's services.
Artificial intelligence
Artificial intelligence
fromTheregister
1 month ago

How AI could eat itself: Using LLMs to distill rivals

Competitors are probing commercial AI models to extract underlying reasoning via distillation attacks to replicate capabilities and lower development costs.
fromComputerworld
2 months ago

OpenAI's GPT is getting better at mathematics

OpenAI's GPT-5.2 Pro does better at solving sophisticated math problems than older versions of the company's top large language model, according to a new study by Epoch AI, a non-profit research institute.
Artificial intelligence
Artificial intelligence
fromInfoWorld
2 months ago

First look: Run LLMs locally with LM Studio

LM Studio provides integrated model discovery, in-app download and management, memory-aware filtering, and configurable inference settings for CPU threads and GPU layer offload.
fromInfoQ
2 months ago

Hugging Face Releases FineTranslations, a Trillion-Token Multilingual Parallel Text Dataset

The dataset was created by translating non-English content from the FineWeb2 corpus into English using Gemma3 27B, with the full data generation pipeline designed to be reproducible and publicly documented. The dataset is primarily intended to improve machine translation, particularly in the English→X direction, where performance remains weaker for many lower-resource languages. By starting from text originally written in non-English languages and translating it into English, FineTranslations provides large-scale parallel data suitable for fine-tuning existing translation models.
Artificial intelligence
Artificial intelligence
fromInfoQ
2 months ago

Foundation Models for Ranking: Challenges, Successes, and Lessons Learned

Large-scale search and recommendation systems use two-stage retrieval and ranking pipelines to efficiently serve personalized results for hundreds of millions of users and items.
fromRehumanize
1 month ago

Free AI Humanizer: Humanize AI Text & Bypass AI Detectors

AI Text Humanizer Protects Your Original Intent and Meaning Maintain your core perspective while restructuring sentence patterns. Humanizer ai accurately identifies and locks in technical terms, factual data, and key arguments, ensuring the rewritten draft is simply more readable without any semantic drift. You get a qualitative leap in flow and tone, allowing you to humanize ai text while keeping your original message perfectly intact.
Artificial intelligence
Artificial intelligence
fromInfoQ
2 months ago

MIT's Recursive Language Models Improve Performance on Long-Context Tasks

Recursive Language Models enable LLMs to handle inputs up to 100x longer by using a programming environment and recursive code to decompose and preprocess prompts.
Artificial intelligence
fromInfoWorld
2 months ago

Single prompt breaks AI safety in 15 major language models

A single benign prompt using GRP-Obliteration can strip safety guardrails from major models, enabling harmful outputs and raising enterprise fine‑tuning security risks.
fromTechCrunch
2 months ago

Tiny startup Arcee AI built a 400B open source LLM from scratch to best Meta's Llama | TechCrunch

But tiny 30-person startup Arcee AI disagrees. The company just released a truly and permanently open (Apache license) general-purpose, foundation model called Trinity, and Arcee claims that at 400B parameters, it is among the largest open-source foundation models ever trained and released by a U.S. company. Arcee says Trinity compares to Meta's Llama 4 Maverick 400B, and Z.ai GLM-4.5, a high-performing open-source model from China's Tsinghua University, according to benchmark tests conducted using base models (very little post training).
Artificial intelligence
fromInfoQ
2 months ago

Open Responses Specification Enables Unified Agentic LLM Workflows

OpenAI has released Open Responses, an open specification to standardize agentic AI workflows and reduce API fragmentation. Supported by partners like Hugging Face and Vercel and local inference providers, the spec introduces unified standards for agentic loops, reasoning visibility, and internal versus external tool execution. It aims to enable developers to easily switch between proprietary models and open-source models without rewriting integration code.
Artificial intelligence
fromTheregister
2 months ago

LLMs need companion bots to check work, keep them honest

Sikka is a towering figure in AI. He has a PhD in the subject from Stanford, where his student advisor was John McCarthy, the man who in 1955 coined the term "artificial intelligence." Lessons Sikka learned from McCarthy inspired him to team up with his son and write a study, "Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models," which was published in July.
Artificial intelligence
fromThe Verge
2 months ago

ChatGPT's deep research tool adds a built-in document viewer so you can read its reports

OpenAI is updating ChatGPT's deep research tool with a full-screen viewer that you can use to scroll through and navigate to specific areas of its AI-generated reports. As shown in a video shared by OpenAI, the built-in viewer allows you to open ChatGPT's reports in a window separate from your chat, while showing a table of contents on the left side of the screen, and a list of sources on the right.
Artificial intelligence
fromTechzine Global
2 months ago

ABBYY Vantage 3.0 integrates with generative AI and LLMs

Process AI is the integration of AI and ML (optionally with natural language processing (NLP) and computer vision, including optical character recognition (OCR), in one platform) into business workflows with the aim of automating tasks that require human-like judgment. Also straightforward to define, document AI (occasionally known as intelligent document processing) is a set of technologies designed to enable enterprise applications to ingest, interpret and contextually understand documents with human-like judgment.
Artificial intelligence
fromTheregister
2 months ago

OpenAI will try to guess your age before ChatGPT gets spicy

sensitive or potentially harmful content.
Artificial intelligence