Scaling Human Judgment: How Dropbox Uses LLMs to Improve Labeling for RAG Systems
Briefly

"Because there are millions (and, in very large enterprises, billions) of documents in the enterprise search index, Dash can pass along only a small subset of the retrieved documents to the LLM. This makes the quality of search ranking, and the labeled relevance data used to train it, critical to the quality of the final answer."
"To address the limitations of purely human judge-based labelling, which is expensive, slow, and inconsistent, Dropbox introduced a complementary approach in which an LLM generates relevance judgments at scale. This method is cheaper, more consistent, and can easily scale to large document sets."
"This approach, called 'human-calibrated LLM labeling', is straightforward: humans label a small, high-quality dataset, which is later used to calibrate the LLM evaluator."
Document retrieval quality is the critical bottleneck in retrieval-augmented generation (RAG) systems like Dropbox Dash. With millions or billions of enterprise documents, only a small subset can be passed to the LLM, making search ranking quality essential. Dropbox trains its ranking models with supervised learning on labeled query-document pairs, but pure human labeling is expensive, slow, and inconsistent. To overcome these limitations, Dropbox implemented LLM-based relevance evaluation at scale, which is cheaper and more consistent. However, LLM judgments require validation before they are used for training. The solution is human-calibrated LLM labeling: humans label a small, high-quality dataset that is used to calibrate the LLM evaluator, balancing automation with human oversight.
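The calibration step described above boils down to measuring how well LLM-generated relevance judgments agree with a small human-labeled gold set before trusting the LLM at scale. The sketch below illustrates one plausible way to do this, using Cohen's kappa as the chance-corrected agreement metric; the data, the 0.7 agreement bar, and the function names are illustrative assumptions, not Dropbox's actual implementation.

```python
from collections import Counter

def cohens_kappa(human, llm):
    """Chance-corrected agreement (Cohen's kappa) between human gold
    labels and LLM judgments on the same query-document pairs."""
    assert len(human) == len(llm) and human
    n = len(human)
    # Observed agreement: fraction of pairs where both judges match.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement by chance, from each judge's label distribution.
    h_counts, l_counts = Counter(human), Counter(llm)
    expected = sum(
        (h_counts[c] / n) * (l_counts[c] / n)
        for c in set(human) | set(llm)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical calibration set: binary relevance judgments
# (1 = relevant, 0 = not relevant) on ten query-document pairs.
human_gold = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
llm_judged = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

kappa = cohens_kappa(human_gold, llm_judged)
if kappa >= 0.7:  # assumed agreement bar; tune per task
    print(f"kappa={kappa:.2f}: LLM judge passes calibration")
else:
    print(f"kappa={kappa:.2f}: revise prompt/rubric and re-check")
```

Only once agreement with the human gold set clears the chosen bar would the LLM's labels be used to generate training data at scale; otherwise the judging prompt or rubric is revised and re-measured.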
Read at InfoQ