Old RTX 3090 enough to serve thousands of LLM users
Briefly

A single Nvidia RTX 3090 can serve a modest LLM like Llama 3.1 8B at FP16, handling over 100 concurrent requests and supporting thousands of users.
Backprop contends that because only a small fraction of users make requests at any given moment, a single RTX 3090 can serve thousands of end users.
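That argument is simple arithmetic. A minimal sketch, assuming (hypothetically) that about 5% of users are active at once — the ~100-concurrent-request figure comes from the article, but the active fraction is an illustrative assumption:

```python
def supported_users(max_concurrent_requests: int, active_fraction: float) -> int:
    """Users one server can cover if only a fraction are active at any moment.

    active_fraction is an assumed value, not a figure from the article.
    """
    return int(max_concurrent_requests / active_fraction)

# ~100 concurrent requests with 5% of users active at once
# puts a single RTX 3090 on the order of 2,000 users.
print(supported_users(100, 0.05))  # → 2000
```

The real active fraction depends on the workload; chat-style usage with long idle gaps between prompts is what makes the thousands-of-users claim plausible.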
Although the RTX 3090 is typically considered consumer hardware, Backprop's deployment demonstrates its capability: it delivers 142 teraFLOPS of FP16 performance, suitable for LLM inference.
The RTX 3090's main limitation is memory: with only 24GB, it can't hold larger models like Llama 3 70B, which prompted Backprop to choose a smaller model.
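The memory constraint follows from a rough rule of thumb: FP16 uses 2 bytes per parameter, so weights alone take about 2GB per billion parameters (a sketch that ignores KV cache and activations, which add more):

```python
def fp16_weight_gb(params_billions: float) -> float:
    """Approximate FP16 weight memory: 1e9 params * 2 bytes = 2 GB per billion.

    Ignores KV cache and activation memory, which raise the real requirement.
    """
    return params_billions * 2.0

print(fp16_weight_gb(8))   # 16.0 GB — fits in the 3090's 24GB
print(fp16_weight_gb(70))  # 140.0 GB — far beyond a single 24GB card
```

That is why Llama 3.1 8B fits on one card at FP16 while Llama 3 70B does not, short of heavy quantization or multi-GPU setups.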
Read at The Register