In this paper, we present a detailed examination of recurrent neural network architectures, showing that recurrent models, when scaled, can reach efficiency comparable to that of Transformer networks.
Our findings suggest that, with appropriate optimizations and training techniques, recurrent models can be trained efficiently on-device, exceeding common expectations for both speed and performance.
The study illustrates that, although Transformer models dominate in many settings, there are specific tasks where the enhanced long-context modeling of recurrent networks provides measurable advantages.
Through a range of experiments, we demonstrate that incorporating retrieval capabilities into recurrent models substantially improves their performance on next-token prediction tasks.
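To make the notion of a recurrent block concrete, the sketch below implements a generic gated linear recurrence over a token sequence, the kind of building block a recurrent language model stacks in place of self-attention. This is a minimal illustration only: the gating form, shapes, and names (`gated_linear_recurrence`, `w_a`, `w_x`) are assumptions for exposition, not the specific architecture evaluated in this paper.

```python
# Hypothetical sketch of a gated linear recurrence; not the paper's architecture.
import numpy as np

def gated_linear_recurrence(x, w_a, w_x):
    """Scan h_t = a_t * h_{t-1} + (1 - a_t) * (x_t @ w_x) over a
    sequence x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    h = np.zeros(d_model)
    outputs = np.empty_like(x)
    for t in range(seq_len):
        # Input-dependent forget gate in (0, 1); controls how much history is kept.
        a_t = 1.0 / (1.0 + np.exp(-(x[t] @ w_a)))
        h = a_t * h + (1.0 - a_t) * (x[t] @ w_x)
        outputs[t] = h
    return outputs

# Toy usage: a random 16-token sequence with model width 8.
rng = np.random.default_rng(0)
seq = rng.standard_normal((16, 8))
out = gated_linear_recurrence(seq,
                              rng.standard_normal((8, 8)) * 0.1,
                              rng.standard_normal((8, 8)) * 0.1)
print(out.shape)  # (16, 8)
```

Because the state `h` has fixed size regardless of sequence length, a block of this kind avoids the quadratic cost of attention over long contexts, which is the efficiency property the comparison with Transformers rests on.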