Our anchor-based caching method improves inference efficiency over conventional full-caching methods by retaining only the key/value caches of anchor tokens, achieving acceleration ratios of up to 3.5×.
The acceleration ratios we observed at test time, particularly for AnLLM-EP-AnSAN and AnLLM-AC-AnSAN, show substantial improvements across tasks, demonstrating significant potential for more efficient language model inference.
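To make the caching idea concrete, the sketch below shows one way anchor-based KV-cache reduction could be realized: after a sequence chunk is processed, only the key/value entries at anchor-token positions are kept, so subsequent steps attend to a much shorter cache. This is a minimal illustration, not the authors' implementation; the tensor layout and the helper name `prune_kv_cache` are assumptions for exposition.

```python
import torch

def prune_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   anchor_positions: torch.Tensor):
    """Keep only the cache entries at anchor-token positions.

    keys / values: (batch, num_heads, seq_len, head_dim)
    anchor_positions: 1-D tensor of indices along seq_len to retain.
    """
    # Select the anchor slices along the sequence dimension (dim=2),
    # discarding all non-anchor key/value entries.
    keys = keys.index_select(dim=2, index=anchor_positions)
    values = values.index_select(dim=2, index=anchor_positions)
    return keys, values

# Toy usage: a 100-token chunk compressed to 3 anchor entries.
batch, heads, seq_len, head_dim = 1, 8, 100, 64
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
anchors = torch.tensor([32, 65, 99])  # hypothetical anchor positions
k_small, v_small = prune_kv_cache(k, v, anchors)
print(k_small.shape)  # torch.Size([1, 8, 3, 64])
```

Under these assumptions, the cache shrinks from `seq_len` entries to the number of anchors, which is the source of the memory and speed gains reported above.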