A single Nvidia RTX 3090 can serve a modest LLM such as Llama 3.1 8B at FP16, sustaining over 100 concurrent requests while supporting thousands of users. Backprop contends that, because only a small fraction of users make requests at any given moment, a single RTX 3090 is enough to power thousands of end users.
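The users-versus-concurrency argument is simple arithmetic. A minimal sketch, using the ~100 concurrent requests figure from the benchmark and an assumed 5% active fraction (the fraction of users mid-request at any instant is a hypothetical number, not one Backprop states):

```python
# Back-of-the-envelope user capacity estimate.
# max_concurrent_requests comes from the benchmark; active_fraction is an assumption.
max_concurrent_requests = 100
active_fraction = 0.05  # assume ~5% of users have a request in flight at any moment

supported_users = max_concurrent_requests / active_fraction
print(f"~{supported_users:.0f} users supported")  # ~2000 users
```

With a more conservative 10% active fraction, the same GPU would still cover roughly 1,000 users.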
Though the RTX 3090 is typically considered consumer hardware, Backprop's deployment demonstrates its capability: the card delivers 142 teraFLOPS of FP16 performance, ample for LLM inference workloads.
The RTX 3090's main limitation is memory capacity: with only 24 GB of VRAM, it cannot hold larger models such as Llama 3 70B, which is why Backprop opted for a smaller model.
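The memory constraint can be checked with a rough weights-only calculation (FP16 uses 2 bytes per parameter; KV cache and activations would add further overhead on top of this):

```python
# Weights-only VRAM estimate at FP16 (2 bytes per parameter).
# Ignores KV cache and activation memory, which add to the real requirement.
def weight_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * bytes_per_param  # billions of params * bytes = GB

vram_gb = 24  # RTX 3090
for name, billions in [("Llama 3.1 8B", 8), ("Llama 3 70B", 70)]:
    need = weight_gb(billions)
    verdict = "fits" if need < vram_gb else "does not fit"
    print(f"{name}: ~{need:.0f} GB of weights -> {verdict} in {vram_gb} GB")
```

Llama 3.1 8B needs roughly 16 GB for weights alone, leaving headroom for the KV cache within 24 GB, while Llama 3 70B would need around 140 GB at FP16, far beyond a single 3090.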