Gemma 3n is designed for mobile-first, on-device AI applications and incorporates several techniques to improve efficiency and performance. Per-Layer Embeddings (PLE) reduce accelerator memory requirements: only the core transformer weights need to be loaded into VRAM, while the per-layer embedding parameters can stay on the CPU. The MatFormer architecture nests smaller transformer sub-models inside a larger one, making the model more flexible and efficient to deploy. In the future, Gemma 3n will support elastic inference, allowing dynamic switching between the full model and its sub-models at runtime. KV cache sharing accelerates time-to-first-token for better performance in streaming applications.
Gemma 3n uses Per-Layer Embeddings (PLE) to reduce accelerator memory usage: only the core transformer weights are loaded into VRAM, while the per-layer embedding parameters remain in CPU memory and are computed there, with only the small per-token activations transferred to the accelerator.
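A minimal sketch of this idea, assuming a PyTorch-style setup; the class, dimensions, and device handling below are illustrative and not Gemma 3n's actual implementation:

```python
# Sketch: keep core transformer weights on the accelerator, keep the per-layer
# embedding tables in CPU RAM, and move only the small per-token activation.
import torch
import torch.nn as nn

ACCEL = "cuda" if torch.cuda.is_available() else "cpu"  # fast accelerator memory (VRAM)

class PerLayerEmbedding(nn.Module):
    """Hypothetical per-layer embedding tables that stay in CPU RAM."""
    def __init__(self, vocab_size: int, ple_dim: int, num_layers: int):
        super().__init__()
        # One small embedding table per transformer layer, kept on the CPU.
        self.tables = nn.ModuleList(
            nn.Embedding(vocab_size, ple_dim) for _ in range(num_layers)
        )

    def forward(self, token_ids: torch.Tensor, layer: int) -> torch.Tensor:
        # Lookup happens on CPU; only the resulting activation moves to VRAM.
        return self.tables[layer](token_ids.cpu()).to(ACCEL)

core_layer = nn.Linear(256, 256).to(ACCEL)             # core weights live in VRAM
ple = PerLayerEmbedding(vocab_size=32_000, ple_dim=256, num_layers=1)  # stays on CPU

token_ids = torch.randint(0, 32_000, (1, 8))
hidden = torch.zeros(1, 8, 256, device=ACCEL)
# The per-layer signal is added to the hidden state without the large
# embedding tables ever occupying accelerator memory.
hidden = core_layer(hidden + ple(token_ids, layer=0))
print(hidden.shape)  # torch.Size([1, 8, 256])
```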
The MatFormer (Matryoshka Transformer) architecture nests smaller transformer sub-models inside a larger one, which underpins elastic inference and lets developers extract intermediate-size models to match their deployment constraints.
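A minimal sketch of the general Matryoshka idea as I understand it (not Gemma 3n's code): the feed-forward block is trained so that prefixes of its hidden dimension form smaller, self-contained sub-networks, and at inference time the same weights are simply sliced to the desired width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatFormerFFN(nn.Module):
    """Illustrative feed-forward block whose hidden width can be sliced."""
    def __init__(self, d_model: int = 256, d_ff_full: int = 1024):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff_full)
        self.w_out = nn.Linear(d_ff_full, d_model)

    def forward(self, x: torch.Tensor, d_ff: int) -> torch.Tensor:
        # Use only the first `d_ff` hidden units: the sub-model is literally
        # a slice of the full model's weights.
        h = F.gelu(x @ self.w_in.weight[:d_ff].T + self.w_in.bias[:d_ff])
        return h @ self.w_out.weight[:, :d_ff].T + self.w_out.bias

ffn = MatFormerFFN()
x = torch.randn(1, 8, 256)
full = ffn(x, d_ff=1024)    # full network
small = ffn(x, d_ff=256)    # nested smaller sub-network, same weights
medium = ffn(x, d_ff=512)   # an intermediate width for flexible deployment
print(full.shape, small.shape, medium.shape)
```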
Future updates to Gemma 3n will enable dynamic switching between the full model and its nested sub-models on the fly, adapting capacity to the demands of each task.
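Since this capability is not yet available, the following is a purely hypothetical sketch of what routing requests to different nested widths could look like; the request fields, thresholds, and function names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    prompt: str
    latency_budget_ms: float

def pick_ffn_width(req: InferenceRequest, full_width: int = 1024) -> int:
    # Tight latency budgets route to a smaller nested sub-network;
    # generous budgets use the full model. Thresholds are made up.
    if req.latency_budget_ms < 50:
        return full_width // 4
    if req.latency_budget_ms < 200:
        return full_width // 2
    return full_width

print(pick_ffn_width(InferenceRequest("summarize this", latency_budget_ms=30)))   # 256
print(pick_ffn_width(InferenceRequest("write an essay", latency_budget_ms=500)))  # 1024
```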
KV cache sharing in Gemma 3n accelerates time-to-first-token, which particularly benefits streaming applications and long-context prompts.
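A minimal sketch of the general idea of sharing a KV cache across layers during prefill (my reading of the technique, not Gemma 3n's implementation); the layer counts and projections are stand-ins:

```python
import torch

def prefill_with_shared_kv(hidden: torch.Tensor, num_layers: int = 6,
                           share_from: int = 3) -> list[tuple[torch.Tensor, torch.Tensor]]:
    """Build a per-layer KV cache for a prompt of shape (batch, seq, dim)."""
    d = hidden.size(-1)
    # Hypothetical per-layer projections, stand-ins for real attention weights.
    k_proj = [torch.nn.Linear(d, d) for _ in range(num_layers)]
    v_proj = [torch.nn.Linear(d, d) for _ in range(num_layers)]

    cache: list[tuple[torch.Tensor, torch.Tensor]] = []
    for layer in range(num_layers):
        if layer < share_from:
            # Lower layers compute their own keys/values as usual.
            cache.append((k_proj[layer](hidden), v_proj[layer](hidden)))
        else:
            # Upper layers reuse the cache produced at `share_from - 1`,
            # skipping their own KV computation during prefill, which is
            # where the time-to-first-token saving comes from.
            cache.append(cache[share_from - 1])
    return cache

prompt_hidden = torch.randn(1, 128, 64)   # a long prompt benefits most
kv = prefill_with_shared_kv(prompt_hidden)
print(len(kv), kv[0][0].shape)            # 6 layers, each KV of shape (1, 128, 64)
```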