
""How and why general-purpose AI models acquire new capabilities and behave in certain ways is often difficult to predict, even for developers. An 'evaluation gap' means that benchmark results alone cannot reliably predict real-world utility or risk,""
""Whether current safeguards will be sufficiently effective for more capable systems is unclear," it adds. "Together, these gaps define the limits of what any current assessment can confidently claim.""
"It further notes that while general-purpose AI capabilities have improved in the past year through "inference-time scaling" (a technique that allows models to use more computing power to generate intermediate steps before giving a final answer), the overall picture remains "jagged", with leading systems excelling at some difficult tasks while failing at simpler ones."
The overall trajectory of general-purpose AI systems remains deeply uncertain, even as wider deployment produces new empirical evidence about impacts. A broad set of risks spans effects on jobs, human autonomy, the environment, system malfunctions and malicious use. Benchmarking limitations create an evaluation gap that prevents reliable prediction of real-world utility or risk, and systematic data on the prevalence and severity of most AI-related harms remain limited. Recent capability gains via inference-time scaling coexist with a jagged performance profile: some systems excel at hard tasks while failing at simpler ones. Whether current safeguards will be sufficient for more capable systems is unclear.
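To make the "inference-time scaling" idea concrete, the sketch below shows the general pattern in toy form: a fixed model is allowed to spend a variable test-time compute budget, here more intermediate reasoning steps and more sampled candidates, before committing to a final answer. This is only an illustration of the concept described in the report; the names (generate_step, score, answer_with_budget) are hypothetical placeholders, not any vendor's API, and no real model is invoked.

```python
# Toy illustration of inference-time scaling: spend more compute at inference
# (longer reasoning traces, more sampled candidates) before returning an answer.
# All functions here are stand-ins for illustration only.
from dataclasses import dataclass
import random


@dataclass
class Answer:
    text: str
    score: float  # stand-in for a verifier's confidence in this candidate


def generate_step(prompt: str, trace: list[str]) -> str:
    """Placeholder for one intermediate reasoning step produced by a model."""
    return f"step {len(trace) + 1}: reasoning about '{prompt[:30]}'"


def score(candidate: str) -> float:
    """Placeholder for a verifier or reward signal used to rank candidates."""
    return random.random()


def answer_with_budget(prompt: str, step_budget: int, samples: int) -> Answer:
    """More steps and more samples = more inference-time compute per query."""
    candidates = []
    for _ in range(samples):
        trace: list[str] = []
        for _ in range(step_budget):  # longer chain of intermediate steps
            trace.append(generate_step(prompt, trace))
        final = f"answer derived from {len(trace)} intermediate steps"
        candidates.append(Answer(final, score(final)))
    # Best-of-n selection over sampled candidates is another axis of test-time compute.
    return max(candidates, key=lambda a: a.score)


if __name__ == "__main__":
    cheap = answer_with_budget("What is 17 * 24?", step_budget=1, samples=1)
    expensive = answer_with_budget("What is 17 * 24?", step_budget=8, samples=4)
    print(cheap)
    print(expensive)
```

The point of the sketch is simply that answer quality can be traded against per-query compute without retraining the model, which is why the report treats inference-time scaling as a driver of recent capability gains rather than a change to the models themselves.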
Read at ComputerWeekly.com