In evaluations on the ShareGPT dataset, vLLM sustains 1.7-2.7 times higher request rates than Orca by managing KV-cache memory more efficiently through PagedAttention. The latency measurements also show that once the request rate exceeds system capacity, queue length, and with it latency, grows without bound, underscoring how tightly serving throughput is coupled to memory management.
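To make the memory-management point concrete, here is a minimal sketch of PagedAttention-style block allocation. All names (`BlockManager`, `append_token`, `BLOCK_SIZE`) are hypothetical, not the actual vLLM API: the idea is that KV-cache memory is split into fixed-size blocks, and each sequence holds a block table mapping logical positions to physical blocks, so memory is claimed on demand rather than reserved up front for a maximum length.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


class BlockManager:
    """Toy allocator: fixed-size physical blocks, per-sequence block tables."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks; must preempt a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


mgr = BlockManager(num_blocks=8)
for t in range(20):  # generate 20 tokens for sequence 0
    mgr.append_token(0, t)
# 20 tokens occupy ceil(20/16) = 2 blocks, not a max-length reservation
print(len(mgr.block_tables[0]))  # 2
mgr.free(0)
print(len(mgr.free_blocks))  # 8
```

Because unused capacity is never reserved, more concurrent sequences fit in the same GPU memory, which is the mechanism behind the higher sustainable request rates reported above.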