AI is ready to take over Python programming, but not much else
Briefly

"Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
"The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%."
"The benchmark contains 310 work environments across 52 professional domains including coding, crystallography, genealogy and music sheet notation. Each environment consists of real documents totaling around 15K tokens in length, and five to 10 complex editing tasks that a user might ask an LLM to perform."
"Tests of how well 19 large language models (LLMs) perform complicated multi-step tasks have shown that they are both error-prone and, in many cases, unreliable."
Nineteen large language models were tested on complicated multi-step tasks that simulate knowledge-worker workflows. The benchmark, DELEGATE-52, covers 310 work environments across 52 professional domains such as coding, crystallography, genealogy, and music sheet notation. Each environment uses real documents totaling about 15K tokens and includes five to 10 complex editing tasks. Results show models are error-prone and often unreliable, producing sparse but severe mistakes that silently corrupt documents. Errors compound over long interactions, with frontier models losing about 25% of document content over 20 delegated interactions and all models showing about 50% average degradation.
Read at InfoWorld