
"The most important feature of the new model is called DeepSeek Sparse Attention, an intricate system described in detail in the diagram below. In essence, the system uses a module called a "lightning indexer" to prioritize specific excerpts from the context window. After that, a separate system called a "fine-grained token selection system" chooses specific tokens from within those excerpts to load into the module's limited attention window."
"For long-context operations, the benefits of the system are significant. Preliminary testing by DeepSeek found that the price of a simple API call could be reduced by as much as half in long-context situations. Further testing will be required to build a more robust assessment, but because the model is open-weight and freely available on Hugging Face, it won't be long before third-party tests can assess the claims made in the paper."
"DeepSeek's new model is one of a string of recent breakthroughs tackling the problem of inference costs - essentially, the server costs of operating a pre-trained AI model, as distinct from the cost of training it. In DeepSeek's case, the researchers were looking for ways to make the fundamental transformer architecture operate more efficiently - and finding that there are significant improvements to be made."
DeepSeek released V3.2-exp, an experimental model designed to reduce inference costs for long-context operations. The model implements DeepSeek Sparse Attention, which uses a lightning indexer to prioritize excerpts from the context window and a fine-grained token selection system to load the chosen tokens into a limited attention window. Together, these mechanisms let the model process long contexts while keeping server load low. Preliminary tests indicated that API call costs can fall by as much as half in long-context scenarios. The model's weights are openly available on Hugging Face, and the accompanying paper is linked on GitHub. The work targets inference costs, that is, the server costs of running a pre-trained model as distinct from the cost of training it, and broader transformer efficiency improvements.
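To make the two-stage idea concrete, here is a minimal sketch of sparse attention with a coarse block-scoring pass followed by fine-grained token selection. Everything in it is illustrative: the function name, the mean-key block heuristic standing in for the learned lightning indexer, and the block/token budgets are assumptions for the sake of the example, not DeepSeek's actual design or parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(query, keys, values, block_size=64, top_blocks=4, top_tokens=128):
    """Toy sparse attention: score context blocks cheaply, keep the best
    blocks, then keep the best individual tokens within them."""
    n, d = keys.shape
    # Stage 1 (stand-in for the "lightning indexer"): a cheap per-block
    # relevance score. Here: query dotted with each block's mean key.
    # The real indexer is a learned module, not this heuristic.
    n_blocks = (n + block_size - 1) // block_size
    block_scores = np.array([
        query @ keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    keep_blocks = np.argsort(block_scores)[-top_blocks:]
    # Gather candidate token indices from the selected blocks only.
    candidates = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in keep_blocks
    ])
    # Stage 2 (stand-in for fine-grained token selection): rank the
    # candidate tokens by exact query-key score and keep top_tokens.
    token_scores = keys[candidates] @ query
    keep = candidates[np.argsort(token_scores)[-top_tokens:]]
    # Standard attention, but only over the selected tokens, so the cost
    # scales with top_tokens rather than the full context length n.
    attn = softmax(keys[keep] @ query / np.sqrt(d))
    return attn @ values[keep]

# Usage: a 4096-token context attended to via only ~128 selected tokens.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
out = sparse_attention(q, K, V)
print(out.shape)  # (64,)
```

The point of the two stages is that the attention matmul, the expensive part, touches only top_tokens entries instead of the whole context, which is where the long-context cost savings would come from under these assumptions.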
Read at TechCrunch