
"Because there are millions (and, in very large enterprises, billions) of documents in the enterprise search index, Dash can pass along only a small subset of the retrieved documents to the LLM. This makes the quality of search ranking, and the labeled relevance data used to train it, critical to the quality of the final answer."
"To address the limitations of purely human judge-based labeling, which is expensive, slow, and inconsistent, Dropbox introduced a complementary approach in which an LLM generates relevance judgments at scale. This method is cheaper, more consistent, and can easily scale to large document sets."
"This approach, called 'human-calibrated LLM labeling', is straightforward: humans label a small, high-quality dataset, which is later used to calibrate the LLM evaluator."
Document retrieval quality is the critical bottleneck in retrieval-augmented generation (RAG) systems like Dropbox Dash. With millions or billions of enterprise documents in the index, only a small subset can be passed to the LLM, making search ranking quality essential. Dropbox trains its ranking models with supervised learning on labeled query-document pairs. Pure human labeling is expensive, slow, and inconsistent, so Dropbox implemented LLM-based relevance evaluation at scale, which is cheaper and more consistent. However, LLM judgments require validation before they can be used for training. The solution, human-calibrated LLM labeling, has humans label a small, high-quality dataset that is then used to calibrate the LLM evaluator, balancing automation with human oversight.
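The calibration step above can be sketched in code. A common way to validate an LLM judge against a human-labeled gold set is to measure raw agreement and a chance-corrected statistic such as Cohen's kappa. The snippet below is a minimal illustration of that idea, with made-up labels and function names; it is not Dropbox's actual pipeline.

```python
# Hypothetical sketch: validating an LLM relevance judge against a small
# human-labeled calibration set, using raw agreement and Cohen's kappa.

def agreement_and_kappa(human, llm):
    """Compare LLM relevance labels against human gold labels.

    Returns (raw agreement, Cohen's kappa). Kappa corrects raw agreement
    for the agreement expected by chance given each judge's label mix.
    """
    assert human and len(human) == len(llm)
    n = len(human)
    labels = sorted(set(human) | set(llm))
    # Raw agreement: fraction of items where the two judges match.
    agree = sum(h == l for h, l in zip(human, llm)) / n
    # Chance agreement from each judge's marginal label frequencies.
    p_e = sum((human.count(c) / n) * (llm.count(c) / n) for c in labels)
    kappa = (agree - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return agree, kappa

# Calibration set: human labels on a small, high-quality sample,
# alongside the LLM judge's labels for the same query-document pairs.
human_labels = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant"]
llm_labels   = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

agree, kappa = agreement_and_kappa(human_labels, llm_labels)
print(f"agreement={agree:.2f} kappa={kappa:.2f}")  # agreement=0.80 kappa=0.62
```

If agreement on the calibration set is high, the LLM judge can be trusted to label the much larger document set; if not, the judging prompt or rubric is revised and re-checked against the same human gold data.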
#retrieval-augmented-generation-rag #llm-based-labeling #document-ranking #human-in-the-loop-ai #enterprise-search
Read at InfoQ