
"The two large Gemma variants, 26B Mixture of Experts and 31B Dense, are designed to run unquantized in bfloat16 format on a single 80GB Nvidia H100 GPU."
"Google claims it has focused on reducing latency to really take advantage of Gemma's local processing, with the 26B Mixture of Experts model activating only 3.8 billion of its 26 billion parameters in inference mode."
"The Effective 2B and Effective 4B models are aimed at mobile devices, designed to maintain low memory usage during inference, running at an effective 2 billion or 4 billion parameters."
"Google touts 'near-zero latency' this time around, ensuring that the new models use less memory and battery than Gemma 3."
Gemma 4 introduces four optimized models for local usage, enhancing performance and reducing latency. The 26B Mixture of Experts and 31B Dense models are designed for high performance on powerful GPUs, while the Effective 2B and Effective 4B models target mobile devices with low memory usage. Google has collaborated with Qualcomm and MediaTek to optimize these models for various devices, ensuring better efficiency and near-zero latency compared to previous versions. The elimination of the custom licensing model aims to alleviate developer frustrations.
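The "activates only 3.8 billion of its 26 billion parameters" claim reflects standard Mixture of Experts behavior: a router selects a small top-k subset of experts per token, so only those experts (plus shared layers) run at inference time. The sketch below illustrates the idea only; the expert count, sizes, top-k, and random router are hypothetical, not Gemma's actual configuration (Gemma's reported figure of 3.8B active out of 26B implies a different split).

```python
# Illustrative MoE parameter accounting: total parameters are the sum
# of all experts plus shared layers, but each token only runs top-k
# experts, so active parameters per token are much smaller.
import random

NUM_EXPERTS = 8            # hypothetical
TOP_K = 2                  # hypothetical
PARAMS_PER_EXPERT = 3e9    # hypothetical
SHARED_PARAMS = 2e9        # hypothetical (embeddings, attention, etc.)

def route(token: str) -> list[int]:
    """Stand-in router: picks top-k experts per token. A real router
    scores experts with a learned gating network; random here."""
    rng = random.Random(hash(token))
    return rng.sample(range(NUM_EXPERTS), TOP_K)

total = SHARED_PARAMS + NUM_EXPERTS * PARAMS_PER_EXPERT
active = SHARED_PARAMS + TOP_K * PARAMS_PER_EXPERT
print(f"total params:       {total / 1e9:.0f}B")
print(f"active per token:   {active / 1e9:.0f}B via experts {route('hello')}")
```

With these made-up numbers the model holds 26B parameters but touches only 8B per token, which is the mechanism behind the latency savings the article describes.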
Read at Ars Technica