Once trained, LLMs are deployed as a conditional generation service: tokens are sampled sequentially, each conditioned on all preceding tokens.
To avoid recomputation during this sequential process, the key and value vectors of already-processed tokens are cached, and each token's KV cache entry depends on all of the tokens that precede it.
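To make this caching pattern concrete, the sketch below (a minimal single-head illustration with hypothetical names, in NumPy only, not the implementation of any particular serving system) appends each new token's key and value vectors to a cache and lets the newest token attend over all cached entries, so earlier keys and values are never recomputed.

```python
# Minimal sketch of autoregressive decoding with a KV cache (single head).
import numpy as np

d = 8                                  # hypothetical head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []              # KV cache: grows by one entry per token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # The new token's own K/V are computed once and cached; attention reads
    # the cached K/V of all previous tokens instead of recomputing them.
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                    # context vector for the newest position

# Sequential generation: the cache length equals the number of steps so far.
for t in range(4):
    out = decode_step(rng.standard_normal(d))
    print(f"step {t}: cache length = {len(k_cache)}")
```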