
"The UK AI Security Institute (AISI) has found that frontier models are quickly becoming more efficient when asked to do some cybersecurity work. AISI measures this with its "time window benchmark for cybersecurity," which estimates how much work an AI can do compared to a human. Using the benchmark could lead to findings such as Claude Sonnet 4.5 can do what a human cybersecurity expert can do in 16 minutes about 80 percent of the time, given a budget of 2.5m tokens."
"AISI has found the human-comparable task time - 16 minutes in this instance - is growing, fast. If tokens flowed freely instead of being arbitrarily capped, AI models might do better still. In February 2026, AISI internally reduced the expected task time doubling period from 8 to 4.7 months, based on progress made since late 2024."
""In February 2026, we estimated that frontier models' 80 percent-reliability cyber time horizon had doubled every 4.7 months since reasoning models emerged in late 2024, given a 2.5M token limit," the AISI said in a post on Wednesday. "This was around half our November 2025 doubling time estimate, which was 8 months for both 50 percent and 80 percent reliability. Claude Mythos Preview and GPT-5.5 have since significantly outperformed this trend.""
"The recalculated doubling time estimate, given what Mythos Preview and GPT-5.5 can do, is even shorter than 4.7 months. AISA does not cite a specific value but the organization points to similar time horizon estimates based on measurements of a broader skillset, software engineering, made by non-profit AI research house METR. "Their results imply a consistent doubling time of 4.2 months on software tasks since late 2024," AISI said, noting that with the latest Mythos Pre"
Read at The Register