Batching Techniques for LLMs | HackerNoon
Briefly

Compute utilization when serving LLMs can be improved by batching multiple requests together: the requests share a single copy of the model weights, amortizing the cost of each forward pass. However, naive (static) batching causes significant delays and wasted compute, because every request in a batch must wait for the longest one to finish.
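A minimal sketch of that failure mode, assuming a hypothetical `decode_step` that stands in for one batched forward pass (the `Request` type and all names here are illustrative, not from the article). The whole batch runs until the longest request completes, so short requests hold their slots idle:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def decode_step(batch):
    """Stand-in for one batched forward pass: one new token per unfinished request."""
    for req in batch:
        if len(req.tokens) < req.max_new_tokens:
            req.tokens.append("<tok>")
        # A finished request still occupies its batch slot -- wasted compute.

def serve_static_batch(batch):
    # The longest request dictates how long EVERY request stays in the batch.
    steps = max(r.max_new_tokens for r in batch)
    for _ in range(steps):
        decode_step(batch)
    return batch

# The 5-token request is done after 5 iterations but is held for all 50.
serve_static_batch([Request("hi", 5), Request("write an essay", 50)])
```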
To improve performance in LLM serving, fine-grained batching mechanisms such as cellular batching and iteration-level scheduling have been proposed. These schedule at the granularity of a single model iteration, so finished requests can leave the running batch and newly arrived requests can join it after each iteration, rather than waiting for the entire batch to drain.
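A minimal sketch of iteration-level scheduling in that spirit, reusing the hypothetical `Request` and `decode_step` from the sketch above (the `MAX_BATCH` budget is likewise an assumption). After every iteration the scheduler evicts finished requests and admits waiting ones:

```python
from collections import deque

MAX_BATCH = 4  # hypothetical batch-size budget

def serve_continuous(queue):
    running = []
    while queue or running:
        # Admit waiting requests up to the batch budget -- new arrivals
        # join as soon as a slot frees, not when the whole batch drains.
        while queue and len(running) < MAX_BATCH:
            running.append(queue.popleft())
        decode_step(running)  # one iteration: one token per running request
        # Evict finished requests immediately, freeing their slots.
        running = [r for r in running if len(r.tokens) < r.max_new_tokens]

queue = deque([Request("hi", 5), Request("write an essay", 50), Request("haiku", 12)])
serve_continuous(queue)
```

The key design choice is where the scheduling decision sits: per iteration rather than per batch, which keeps slots busy with useful work instead of padding.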
Read at HackerNoon