Large language models provide unreliable answers about public services, Open Data Institute finds | Computer Weekly
Briefly

"Drawing on more than 22,000 LLM prompts designed to reflect the kind of questions people would ask artificial intelligence (AI)-powered chatbots, such as, "How do I apply for universal credit?", the data raises concerns about whether chatbots can be trusted to give accurate information about government services. The publication of the research follows the UK government's announcement of partnerships with Meta and Anthropic at the end of January 2026 to develop AI-powered assistants for navigating public services."
""If language models are to be used safely in citizen-facing services, we need to understand where the technology can be trusted and where it cannot," said Elena Simperl, the ODI's director of research. Responses from models - including Anthropic's Claude-4.5-Haiku, Google's Gemini-3-Flash and OpenAI's ChatGPT-4o - were compared directly with official government sources. The results showed many correct answers, but also a significant variation in quality, particularly for specialised or less-common queries."
"Chatbots also often provided lengthy responses that buried key facts or extended beyond the information available on government websites, increasing the risk of inaccuracy. Meta's Llama 3.1 8B stated that a court order is essential to add an ex-partner's name to a child's birth certificate. If followed, this advice would lead to unnecessary stress and financial cost. ChatGPT-OSS-20B incorrectly advised that a person caring for a child whose parents have died is only eligible for Guardian's Allowance if they are the guardian of a child who has died."
The Open Data Institute evaluated large language models using more than 22,000 prompts that mirror citizens' questions about public services. Models including Anthropic's Claude-4.5-Haiku, Google's Gemini-3-Flash and OpenAI's ChatGPT-4o were compared directly against official government sources. Many responses were correct, but quality varied notably for specialised or less common queries. The chatbots rarely admitted uncertainty, attempting to answer every question even when their answers were incomplete or wrong. Lengthy replies often buried key facts or went beyond the official information, increasing the risk of inaccuracy. Cited errors included incorrect advice on birth-certificate changes and Guardian's Allowance eligibility.
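
The study's core method, checking a chatbot's answer against the official government text for the same question, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the query_model stub, the sample prompt and key facts, and the keyword-coverage scoring are hypothetical, not the ODI's actual evaluation harness.

# Minimal sketch of a grounded-answer check: does the model's reply
# mention the key facts taken from the official GOV.UK page?
# All names and sample data here are illustrative assumptions.

def query_model(prompt: str) -> str:
    """Stand-in for a real chatbot API call (hypothetical stub)."""
    return "You can apply for Universal Credit online through GOV.UK."

# Each entry pairs a citizen-style prompt with key facts drawn from the
# official page (sample data, not the study's 22,000 prompts).
GOLD = [
    {
        "prompt": "How do I apply for universal credit?",
        "key_facts": ["apply online", "gov.uk"],
    },
]

def contains_key_facts(answer: str, key_facts: list[str]) -> float:
    """Fraction of official key facts the model's answer mentions verbatim."""
    answer_lower = answer.lower()
    hits = sum(1 for fact in key_facts if fact.lower() in answer_lower)
    return hits / len(key_facts)

if __name__ == "__main__":
    for case in GOLD:
        answer = query_model(case["prompt"])
        score = contains_key_facts(answer, case["key_facts"])
        print(f"{case['prompt']!r}: coverage {score:.0%}")

Note that verbatim matching is deliberately crude: the stub's answer paraphrases "apply online", so it scores only 50% coverage. That limitation echoes the article's point that lengthy, rephrased answers can bury key facts, which is why a serious evaluation would need semantic comparison against the official source rather than string matching.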
Read at ComputerWeekly.com