DeepSeek Releases v3.1 Model with Hybrid Reasoning Architecture
Briefly

"DeepSeek has released version V3.1 of its large language model, introducing a hybrid architecture that combines thinking and non-thinking modes in a single system. The thinking mode, named DeepSeek-V3.1-Think, is designed to deliver faster reasoning compared to the previous DeepSeek-R1-0528 model, while maintaining similar response quality. This update also improves tool use and multi-step task execution through additional post-training adjustments."
"The development of DeepSeek-V3.1 builds on the DeepSeek-V3-Base checkpoint and follows a two-phase context extension strategy. The first phase extended the context window to 32,000 tokens using 630 billion tokens of training data. The second phase extended the context further to 128,000 tokens with an additional 209 billion training tokens. This approach enables the model to handle significantly longer input sequences compared to earlier versions."
"Training for V3.1 also adopted FP8 UE8M0 precision for weights and activations. This format provides efficiency benefits and maintains compatibility with microscaling techniques, allowing for more efficient deployment of large-scale models. In terms of size, the full DeepSeek-V3.1 model contains 671 billion total parameters, with approximately 37 billion parameters activated per token, while supporting the extended 128,000-token context length. DeepSeek V3.1 ranks near the top of open-source coding and reasoning benchmarks."
DeepSeek-V3.1 is a hybrid model with thinking and non-thinking modes; the thinking mode, DeepSeek-V3.1-Think, reasons faster than DeepSeek-R1-0528 while preserving response quality, and post-training adjustments improve tool use and multi-step task execution. A two-phase context extension expanded the window to 32,000 tokens using 630 billion training tokens, then to 128,000 tokens with an additional 209 billion tokens, enabling much longer inputs. Training adopted the FP8 UE8M0 format for weights and activations to improve efficiency and maintain compatibility with microscaling. The full model contains 671 billion total parameters, with about 37 billion activated per token, and ranks near the top of open-source coding and reasoning benchmarks while remaining cost-efficient.
Read at InfoQ