Anchor-based Large Language Models: More Experimental Results | HackerNoon
Briefly

Our anchor-based caching method improves inference efficiency over conventional full-caching approaches by storing only the keys/values caches of anchor tokens, achieving acceleration ratios of up to 3.5×.
The testing acceleration ratios we observed, particularly for AnLLM-EP-AnSAN and AnLLM-AC-AnSAN, showed marked improvements across a range of tasks, demonstrating significant potential for more efficient language model inference.
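The core idea behind anchor-based caching can be illustrated with a minimal sketch (not the authors' implementation): after a span of text is processed, only the key/value cache entries at anchor-token positions are retained, and later tokens attend to those compressed entries. The function and variable names below are hypothetical, chosen for illustration.

```python
# Hedged sketch of anchor-only KV-cache pruning, assuming per-token cache
# entries stored as simple Python lists. Real implementations operate on
# attention tensors, but the retention logic is the same.

def prune_kv_cache(keys, values, anchor_positions):
    """Keep only the cache entries at anchor-token positions.

    keys / values: per-position cache entries (one per token).
    anchor_positions: indices of anchor tokens whose caches are retained.
    Returns the reduced (keys, values) lists.
    """
    kept_keys = [keys[i] for i in anchor_positions]
    kept_values = [values[i] for i in anchor_positions]
    return kept_keys, kept_values

# Example: a 6-token sequence where positions 2 and 5 are anchor tokens.
keys = [f"k{i}" for i in range(6)]
values = [f"v{i}" for i in range(6)]
k, v = prune_kv_cache(keys, values, [2, 5])
# The cache shrinks from 6 entries to 2, which is the source of both the
# memory savings and the acceleration reported for AnLLM variants.
```

Because the pruned cache is what subsequent decoding steps attend to, the attainable speed-up grows with the ratio of total tokens to anchor tokens, consistent with the acceleration ratios reported above.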