LLM Service & Autoregressive Generation: What This Means
Briefly

Once trained, LLMs are deployed as a conditional generation service: tokens are sampled one at a time, each conditioned on the prompt and on all previously generated tokens.
During this sequential process, the key and value vectors of tokens already processed are cached (the KV cache) so they are not recomputed at every step; each token's cached key and value vectors in turn depend on all the tokens that precede it.
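To make the mechanics concrete, here is a minimal, self-contained NumPy sketch of a single-head attention decode loop with a KV cache. The tiny random-weight "model", the greedy sampling, and all names (decode_step, W_q, k_cache, etc.) are illustrative assumptions, not code from the article; real LLMs use many layers and heads.

```python
# A minimal sketch of autoregressive decoding with a KV cache (NumPy).
# The single-head "model" with random weights is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 100
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_out = rng.normal(size=(d_model, vocab))
embed = rng.normal(size=(vocab, d_model))

def decode_step(token_id, k_cache, v_cache):
    """Process one token: compute its q/k/v, append k/v to the cache,
    and attend over all cached positions (the full prefix)."""
    x = embed[token_id]
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)   # each token's K/V is computed once, then reused
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)      # attention over cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ V
    logits = context @ W_out
    return int(np.argmax(logits))          # greedy sampling for brevity

# Prefill the prompt, then generate one token at a time. Because earlier
# K/V vectors are never recomputed, each step costs O(prefix) attention
# rather than rerunning the model over the whole sequence.
k_cache, v_cache = [], []
prompt = [5, 42, 7]
for t in prompt:
    next_id = decode_step(t, k_cache, v_cache)
for _ in range(5):
    next_id = decode_step(next_id, k_cache, v_cache)
    print(next_id)
```

Note how the cache only ever grows: serving systems must therefore hold K/V memory proportional to every live sequence's length, which is exactly the resource that production LLM services have to manage carefully.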