The case for swarm inference, the company says, rests on the observation that frontier AI models often become less accurate when "reasoning" - the process by which models solve complex problems by breaking them into a series of smaller steps. Swarm inference supposedly avoids this problem by collecting responses from multiple smaller models and ranking them by quality to arrive at a better answer. It is also said to be more affordable because it runs on distributed consumer hardware rather than in billion-dollar datacenters.
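As a rough illustration of that ranking step, here is a minimal Python sketch: several stand-in "small models" each produce a candidate answer, a placeholder scoring function rates them, and the top-ranked response wins. The model names and the scoring heuristic are hypothetical; the company's actual method has not been published in this detail.

```python
# Minimal sketch of the swarm-inference idea: collect candidate answers from
# several small models, score each one, and return the highest-ranked response.
# Everything here is an illustrative stand-in, not the company's implementation.

from typing import Callable, List, Tuple

# Stand-ins for small models running on distributed consumer hardware.
# In practice each would be a call to a separately hosted model endpoint.
def model_a(prompt: str) -> str:
    return f"Answer A to: {prompt}"

def model_b(prompt: str) -> str:
    return f"Answer B to: {prompt}"

def model_c(prompt: str) -> str:
    return f"Answer C to: {prompt}"

def score_response(prompt: str, response: str) -> float:
    """Hypothetical quality score; a real system might use a judge model,
    cross-model voting, or agreement-based heuristics instead."""
    return float(len(response))  # placeholder heuristic only

def swarm_infer(prompt: str, models: List[Callable[[str], str]]) -> Tuple[str, float]:
    """Query every model, rank the candidates by score, return the best one."""
    responses = [m(prompt) for m in models]
    scored = [(r, score_response(prompt, r)) for r in responses]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[0]

if __name__ == "__main__":
    best, score = swarm_infer("What is 17 * 24?", [model_a, model_b, model_c])
    print(best, score)
```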
Some lawyers have learnt that the hard way, and have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance - the Chartered Financial Analyst exam - yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).
Even the best artificial intelligence agents are fairly hopeless at online freelance work, according to an experiment that challenges the idea of AI replacing office workers en masse. The Remote Labor Index, a new benchmark developed by researchers at data annotation company Scale AI and the Center for AI Safety (CAIS), a nonprofit, measures the ability of frontier AI models to automate economically valuable work.
But there's a problem with this sort of trick: how do you know the compiler will keep doing it? What happens when the compiler's next release comes out? How can you catch performance regressions? One solution is benchmarking: you measure your code's speed, and if it gets a lot slower, something has gone wrong. This is useful and important if you care about speed. But it's also less localized: a benchmark tells you that something got slower, not necessarily where the regression crept in.
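As a rough illustration of that benchmarking approach, here is a minimal Python sketch that times a stand-in function and flags a regression when it runs much slower than a recorded baseline. The function under test, the baseline figure, and the threshold are all illustrative choices, not taken from the original post.

```python
# Minimal sketch of benchmark-based regression detection: time the code,
# compare against a stored baseline, and fail loudly if it gets much slower.

import timeit

def hot_path(n: int = 10_000) -> int:
    """Stand-in for the code whose speed depends on a compiler trick."""
    return sum(i * i for i in range(n))

BASELINE_SECONDS = 0.02   # recorded on a known-good build (illustrative value)
TOLERANCE = 1.5           # allow 50% noise before flagging a regression

def check_for_regression() -> None:
    # Take the best of several runs to reduce measurement noise.
    elapsed = min(timeit.repeat(lambda: hot_path(), number=50, repeat=5)) / 50
    if elapsed > BASELINE_SECONDS * TOLERANCE:
        raise RuntimeError(
            f"Possible performance regression: {elapsed:.4f}s per call "
            f"vs baseline {BASELINE_SECONDS:.4f}s"
        )
    print(f"OK: {elapsed:.4f}s per call (baseline {BASELINE_SECONDS:.4f}s)")

if __name__ == "__main__":
    check_for_regression()
```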
Going into actual benchmarks, Geekbench 6.4.0 shows just under 2,500 points in single-core and just over 8,700 in multi-core. For comparison, the outgoing Galaxy Tab S10 Ultra with its Dimensity 9300+ scores around 2,200 single-core and 7,500 multi-core. The result is still below the Galaxy S25 Ultra (Snapdragon 8 Elite), which manages around 3,000 single-core and 9,800 multi-core.
Anshul Kundaje sums up his frustration with the use of artificial intelligence in science in three words: "bad benchmarks propagate". He is concerned that researchers make questionable claims about AI models that take months to verify and often turn out to be false because the underlying benchmarks were poorly defined. Flawed benchmarks, eagerly taken up by enthusiastic users, spread misinformation and wrong predictions. Without reliable benchmarks, AI threatens to undermine scientific progress rather than accelerate it.
MiniMax's M1 model stands out for its open-weight reasoning capabilities, scoring strongly across multiple benchmarks, including an impressive 86.0% accuracy on AIME 2024.
Coding agents powered by large language models excel at software engineering tasks, yet evaluating their performance comprehensively across diverse programming languages and real-world scenarios remains a significant challenge.