Apple working to cram massive Gemini model into iPhone to power new Siri

Apple has delayed its AI-enhanced Siri since promising it in 2024, and a deal with Google will bring Gemini to Siri later this year. Apple has long emphasized the privacy benefits of running AI locally, but a report from The Information indicates the Gemini-upgraded Siri will run both on-device and in the cloud. Smartphone chips include AI-focused components such as Apple's Neural Engine, yet the GPUs in most phones can process more AI tokens than the NPUs. Even with faster processing, phones lack the RAM to keep very large models in memory. On-device models are smaller, with at most a few billion parameters, while Google's latest Gemini models reportedly have trillions. On-device models are also quantized to lower precision, which makes them faster but reduces the accuracy of token generation, leaving them less capable than cloud models.
"Apple has long crowed about the privacy value of running AI locally, but a new report suggests that despite Apple's best efforts, the iPhone's Gemini makeover will lean heavily on Google and Nvidia in the cloud. The Information reports that Apple's Gemini-infused Siri will run both on-device and in the cloud, an apparent reversal of its privacy-focused preference for local AI."
"With every new chip announcement, we hear about how the silicon has been optimized for AI; even Apple does this with its focus on Neural Engine upgrades. You may think from the grandiose language that smartphones are equipped to handle beefy AI models, but that's not necessarily the case. In fact, the GPUs in most phones can process more AI tokens than the AI-focused NPUs."
"Components like Apple's Neural Engine are designed for contextual, efficient AI processing. Even if phones had faster AI processing, they lack the RAM to keep enormous models in memory. Even the largest AI models are still middling assistants, and that makes local AI very challenging."
"The AI models that run on phones are physically smaller, featuring at most a few billion parameters. Compare that to Google's latest Gemini models, which have trillions of parameters, The Information reports. On-device AI models are also 'quantized' to run at lower precision, making them faster but affecting the accuracy of token generation."
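To put the RAM constraint in rough numbers, here is a minimal back-of-the-envelope sketch. The parameter counts (3 billion and 1 trillion) are illustrative assumptions standing in for "a few billion" and "trillions"; they are not figures from the report. The estimate covers model weights only and ignores activations, caches, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model sizes chosen for illustration only.
models = {
    "3B on-device model": 3e9,
    "1T cloud model": 1e12,
}

for name, params in models.items():
    for bits in (16, 8, 4):  # common weight precisions, from half precision down to 4-bit
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: {gb:,.1f} GB")
```

Under these assumptions, quantizing a 3-billion-parameter model from 16-bit to 4-bit cuts its weights from about 6 GB to about 1.5 GB, while a trillion-parameter model would still need roughly 500 GB at 4-bit, far more than a phone's RAM.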
Read at Ars Technica