DeepSeek-V3 from Scratch: Mixture of Experts (MoE) - PyImageSearch
Mixture of Experts (MoE) lets DeepSeek-V3 scale model capacity efficiently: because only a small subset of experts is activated per token, parameter count grows without a proportional increase in computational cost.
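To make the idea concrete, here is a minimal sketch of top-k expert routing, the mechanism that keeps per-token compute roughly constant as experts are added. This is an illustrative toy in NumPy, not DeepSeek-V3's actual implementation; the shapes and the `k=2` choice are assumptions for the example.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Toy top-k Mixture-of-Experts forward pass (illustrative sketch).

    x:         (d,) input vector
    gate_w:    (n_experts, d) router weights
    expert_ws: list of (d, d) expert weight matrices
    Only the top-k experts run, so compute stays roughly constant
    even as the total number of experts (and parameters) grows.
    """
    logits = gate_w @ x                        # one router score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over selected experts only
    # Combine just the chosen experts' outputs, weighted by the router.
    return sum(w * (expert_ws[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, expert_ws, k=2)
print(y.shape)
```

Adding more experts enlarges `expert_ws` (total parameters) while each token still touches only `k` of them, which is the capacity-vs-compute trade the paragraph above describes.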
IBM attributes those improved characteristics vs. larger models to Granite's hybrid architecture, which combines a small number of standard transformer-style attention layers with a majority of Mamba layers, specifically Mamba-2. With 9 Mamba blocks per 1 Transformer block, Granite gets linear scaling with context length for the Mamba portions (vs. quadratic scaling in transformers), while the transformer attention layers preserve local contextual dependencies (important for in-context learning and few-shot prompting).
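The compute trade-off behind that 9:1 ratio can be sketched with a back-of-the-envelope cost model. The constants and units below are hypothetical; the only claim carried over from the text is that Mamba blocks scale linearly and attention blocks quadratically with sequence length.

```python
def hybrid_cost(seq_len, mamba_blocks=9, attn_blocks=1):
    """Rough per-token-group cost of a hybrid stack (arbitrary units).

    Mamba/SSM blocks cost O(L) in sequence length L;
    attention blocks cost O(L^2). Constant factors are omitted.
    """
    return mamba_blocks * seq_len + attn_blocks * seq_len ** 2

# Doubling the context doubles the Mamba share but quadruples the
# attention share, so keeping attention blocks rare dominates at long L.
for L in (1024, 2048, 4096):
    print(L, hybrid_cost(L))
```

With few attention blocks, the quadratic term still eventually dominates, but its coefficient is 9x smaller than in an all-attention stack of the same depth, which is the point of the hybrid design.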
According to Microsoft, MAI-1-preview is an in-house mixture-of-experts model that was pre-trained and post-trained on 15,000 Nvidia H100 GPUs, a more modest infrastructure than the 100,000-GPU H100 clusters reportedly used by some rivals for model development. However, with an eye to ramping up performance, Microsoft AI is now running MAI-1-preview on Nvidia's more powerful GB200 cluster, the company said.